Etymology coding in MW #362
Oh, this defect Unicode.
Can you extract them, please? I would take a look at them.
Not sure I understood, but I can fix at least part of the wrongness of etymologies.
Angl.Sax. words partly marked.
This is not a Unicode defect; in all three places (Russian words) where the c8 occurs in mw_orig.txt, it does indicate c-caron only. It is a defect (wrong encoding) in the digitisation, I would say!
You're yet to persuade Jim to incorporate the Lithuanian words done a few months back, @gasyoun!
@drdhaval2785 I had mentioned this before, but just gave the Lithuanian portion in another thread.
Sure.
it should work the same way as you worked with my file now - just a find/replacement of strings/lines.
let's try to do whatever is possible without bothering @funderburkjim. for major points there should be some consensus anyway!
Kindly give the files as you mention.
yes, I posted there first and then saw this and posted here. That issue can be closed as this will cover it also. I will give the file in a day or two, as it needs to be prepared in the "new format" I myself have introduced now.
however, that Lithuanian issue has one important piece that can still remain, maybe as a separate issue, as it is worth some consideration. And it was also proposed by me! It is to have present-day word forms as tooltips for the archaic words in all those olden works. I am sure quite many of the words would have changed over the century+ period that has elapsed since they got printed in these dictionaries.
incidentally this issue also has the wrong notion of IAST as encompassing all European languages. It seems it went into the nerves and blood of the team, since the beginning!!
seems I have to put a good amount of effort into bringing my old data into the present format of Cologne; too many changes took place in the tagging over the past 5-6 years. just searching with the "lang" tag gave 1200+ lines in my old file, and only about 850 in the present cologne file. thinking of how best to make my life (task) easy!
Don't worry. We will figure out some way to reconcile the differences. But it would be necessary to see your files first.
After carefully looking at Cologne's present file and my old file, noticed some systematic changes.
So I am splitting my data into two parts now, one matching with Cologne's present And would leave it to you to handle the parts appropriately with this above info.
Forgot to mention another point: I had all the etym. portions starting with "[cf." in another line. Cologne data has some in the same line at the end and some in another line. So enough care should be taken to replace only part of the Cologne line while incorporating my data. No full line replacements in such cases.
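A minimal sketch of the partial replacement described above, assuming the etymology portion begins at the first "[cf." on a line and runs to the end of that line. The function name `replace_cf_portion` and the sample strings are hypothetical, not taken from the Cologne data.

```python
def replace_cf_portion(cologne_line, new_etym):
    """Replace only the trailing '[cf. ...' portion of a line.

    Assumption: the etymology portion starts at the first '[cf.' and
    extends to the end of the line. Everything before it is kept as-is.
    """
    start = cologne_line.find("[cf.")
    if start == -1:
        # No etymology portion on this line; return it unchanged.
        return cologne_line
    return cologne_line[:start] + new_etym


# Hypothetical sample line, for illustration only.
updated = replace_cf_portion("entry text [cf. old form]", "[cf. new form]")
```

Lines whose etymology sits on its own line begin with "[cf." and are therefore replaced whole, while lines that carry it at the end keep their leading text.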
Successfully completed splitting my data, first portion aligned to the 825 Two important observations: 1. My file has some Greek words in capital letters, and Cologne has them in small letters. Do not know the reason! Probably something might have happened while I gave the Greek letters data for correction earlier, in Excel form; but even the Greek expert @jmigliori did not identify these!! (He seemed to have referred to the scan/book at some words.) Giving the first portion data as Excel file, for human reading (I wanted to draw attention to two places that I had remarked). @drdhaval2785 can make the text file out of it just by copy/paste, and try to look for the possible means of handling the indicated corrections/differences. Incidentally, as I had extracted the 2. So my file has 96 lines extra, that are yet to be identified with corresponding present Cologne lines.
BTW, column B in the Excel file is hidden; it has the HW part from the Cologne lines. It may be unhidden and used, if needed.
just a thought. as there are just about 1200+ lines, probably it's better to read the present cologne data and correct them directly. this should not take more than 3-4 days. I had already spent more than a day and at least another half-day is needed to align the 2nd part lines from my file. I had looked only at the tagged portions those days and not at other portions like punctuation etc. it will be another reading from my side for those complete lines (and I am more "matured" now in the process), and also it saves Dhaval's time considerably. what would @drdhaval2785 say about this?
this whole exercise that went now may be treated as some "show-off" (I do mean it in that sense, as that work has some flaws) that I did some work those days.
@drdhaval2785 must love it.
Anyway, first let Dhaval have a look at the file sent and try to make a plan about using its data, as that was the original idea agreed upon.
Good idea!
File supplied by AB has been examined and necessary changes have been made.
Good improvement!
e8 noticed by user
User pcipolla submitted a Correction:
Indeed, 22 such cases were found. The '8' in 'e8' was the letter-number coding used by Thomas for adding the 'breve' diacritic.
I also noticed 'u8', 'o8', and 'c8'.
Since we are now comfortable with using Unicode IAST in the digitizations rather than the original letter-number coding, I've changed these to Unicode e-breve, u-breve, o-breve, and c-caron (since there is no c-breve as a single Unicode code point).
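The four replacements named above can be sketched as a small lookup table. This is only an illustration, not the script actually used: the names `AS_TO_UNICODE` and `as_to_unicode` are hypothetical, and the function assumes it is applied only to text already known to be letter-number coded, since a plain substring match would also hit an unrelated "e8" such as a letter followed by a page number.

```python
import re

# Letter-number codes mentioned in this issue and their Unicode targets.
# c8 maps to c-caron because there is no precomposed c-breve code point.
AS_TO_UNICODE = {
    "e8": "\u0115",  # LATIN SMALL LETTER E WITH BREVE
    "u8": "\u016d",  # LATIN SMALL LETTER U WITH BREVE
    "o8": "\u014f",  # LATIN SMALL LETTER O WITH BREVE
    "c8": "\u010d",  # LATIN SMALL LETTER C WITH CARON
}

# One alternation pattern matching any of the four codes.
_AS_PATTERN = re.compile("|".join(re.escape(code) for code in AS_TO_UNICODE))


def as_to_unicode(text):
    """Replace each known letter-number code in text with its Unicode character."""
    return _AS_PATTERN.sub(lambda m: AS_TO_UNICODE[m.group(0)], text)
```

Keeping the table explicit makes it easy to review each mapping against the scan before extending it to other letter-number codes.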
More to be done in Etymologies
I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW.
These characters are coded in the letter-number format, within MW
Current displays change the AS coding.
While these etymology diacritics appear in mw.xml with the letter-number (AS) codings, the displays
generally transcode these to Unicode characters.
For instance, in headword 'pard', the display shows:
But the mw.xml coding still uses letter-numbers:
It would be good to change the coding in mw.xml to Unicode from letter-number in the etymology
sections of mw.
<ls>: abbreviations of literary sources
<as0>: IAST form of Sanskrit words.
It is not known whether the letter-number codings represent the same Unicode characters in these 'Sanskrit words' as they do in the non-Sanskrit words that appear in the etymologies. That's what the warning is about: carrying through the Unicodification in the etymology sections must be done with care.
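One cautious way to act on this warning is to convert letter-number codes only outside the two element types named above. The sketch below assumes that both appear in mw.xml as paired tags of the form `<ls>...</ls>` and `<as0>...</as0>`, which this thread does not confirm; the code table, the names `PROTECTED`, `CODES` and `convert_outside_protected`, and the sample string are all hypothetical.

```python
import re

# Assumption: <ls> and <as0> are paired, non-nested elements.
PROTECTED = re.compile(r"<(ls|as0)\b[^>]*>.*?</\1>", re.DOTALL)

# Illustrative subset of letter-number codes and their Unicode targets.
CODES = {"e8": "\u0115", "u8": "\u016d", "o8": "\u014f", "c8": "\u010d"}
CODE_RE = re.compile("|".join(re.escape(code) for code in CODES))


def _convert(segment):
    """Replace known letter-number codes inside one unprotected segment."""
    return CODE_RE.sub(lambda m: CODES[m.group(0)], segment)


def convert_outside_protected(text):
    """Convert codes everywhere except inside <ls> and <as0> elements."""
    parts = []
    pos = 0
    for match in PROTECTED.finditer(text):
        parts.append(_convert(text[pos:match.start()]))  # text before the element
        parts.append(match.group(0))                     # element copied verbatim
        pos = match.end()
    parts.append(_convert(text[pos:]))                   # text after the last element
    return "".join(parts)


sample = convert_outside_protected("e8 <ls>e8</ls> o8")
```

The protected elements are left untouched, so they can be reviewed and converted separately once it is known what their letter-number codes stand for.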