Etymology coding in MW #362

funderburkjim · 2017-08-20T22:12:05Z

`e8` noticed by user

User pcipolla submitted a Correction:

L=102621.2, hw=na 2
Lat. ne8-   --> Lat. nĕ-
Comment: False scan/encoding: You may wish to check the whole document for just under two dozen
 other instances of "ĕ" falsely coded as "e8". Exempli gratia, for [L38692] and [L38693] sub voce ṛghā 
"e8re8ghant" should read "ĕrĕghant".

Indeed, 22 such cases were found. The '8' in 'e8' was the letter-number coding used by Thomas for adding the 'breve' diacritic.
I also noticed 'u8', 'o8', and 'c8'.

Since we are now comfortable with using Unicode IAST in the digitizations rather than the original letter-number coding, I've changed these to Unicode e-breve, u-breve, o-breve, and c-caron (since there is no c-breve as a single Unicode code point).
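A minimal sketch of the replacement described above, assuming a plain string substitution is enough for the affected strings. The names `BREVE_CODES`, `replace_breve_codes` and `has_precomposed` are hypothetical and are not part of any Cologne script. The second helper uses NFC normalization to check the remark that c-breve has no single code point.

```python
import unicodedata

# Letter-number codes named in this issue and the characters chosen for them.
BREVE_CODES = {
    "e8": "\u0115",  # e with breve
    "u8": "\u016d",  # u with breve
    "o8": "\u014f",  # o with breve
    "c8": "\u010d",  # c with caron (no precomposed c with breve exists)
}


def replace_breve_codes(text: str) -> str:
    """Replace each letter-number code with its Unicode character."""
    for code, char in BREVE_CODES.items():
        text = text.replace(code, char)
    return text


def has_precomposed(base: str, combining: str) -> bool:
    """True if base + combining mark composes to one code point under NFC."""
    return len(unicodedata.normalize("NFC", base + combining)) == 1


print(replace_breve_codes("Lat. ne8-"))
print(has_precomposed("c", "\u0306"), has_precomposed("c", "\u030c"))
```

A blind substitution like this is only safe on strings known to be etymology words; applied to a whole file it could also hit an unrelated letter followed by the digit 8.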

More to be done in Etymologies

I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW.
These characters are coded in the letter-number format within MW.

Current displays change the AS coding.

While these etymology diacritics appear in mw.xml with the letter-number (AS) codings, the displays
generally transcode these to Unicode characters.
For instance, in headword 'pard', the display shows:

(H1) pard [p= 606] : cl.1 A1. (Dhātup.  ii, 28) to break wind downwards Sarasv.  i, 25. 
[cf. Gk. πέρδω ; Lat. pēdo, pōdex ; Lith. pérdżu ; 
Germ. farzen, furzen ; Angl.Sax. feortan ; Eng. fart.] 
[L=119581]

But the mw.xml coding still uses letter-numbers:

<H1><h><hc3>500</hc3><key1>pard</key1><hc1>1</hc1><key2>pard</key2></h><body> 
<vlex type="root"></vlex> <vlex>cl.1 A1.</vlex> <p><ls>Dha1tup._ii_,_28</ls></p> <c>
<to/>to_break_wind_downwards</c> <ls>Sarasv._i_,_25.</ls> <b><c><ab>cf.</ab>_<ab>Gk.
</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c>~<etym>pe1do</etym>~,~<etym>po1dex</etym>~
<c>;_<ab>Lith.</ab></c>~<etym>pe4rdz3u</etym>~<c>;_<ab>Germ.</ab></c>~
<etym>farzen</etym>~,~<etym>furzen</etym>~<c>;_<ab>Angl.Sax.</ab></c>~
<etym>feortan</etym>~<c>;_<ab>Eng.</ab></c>~<etym>fart</etym>.</b> 
</body><tail><mul/> <MW>076691</MW> <pc>606,3</pc> <L>119581</L></tail></H1>

It would be good to change the coding in mw.xml to Unicode from letter-number in the etymology
sections of mw.
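As a rough illustration of what such a change could look like, the sketch below transcodes letter-number codes only inside `<etym>` elements and leaves everything else (for example `<ls>`) untouched. The mapping contains only the four codes that can be read off the 'pard' example (display form versus mw.xml form); a real table would be larger, and the names `ETYM_CODES` and `transcode_etym` are hypothetical.

```python
import re

# Codes inferred from the 'pard' example only; not a complete table.
ETYM_CODES = {
    "e1": "\u0113",  # e with macron, as in pe1do
    "o1": "\u014d",  # o with macron, as in po1dex
    "e4": "\u00e9",  # e with acute, as in pe4rdz3u
    "z3": "\u017c",  # z with dot above, as in pe4rdz3u
}

CODE_RE = re.compile("|".join(map(re.escape, ETYM_CODES)))
ETYM_RE = re.compile(r"<etym>(.*?)</etym>")


def transcode_etym(line: str) -> str:
    """Transcode known codes inside <etym>...</etym>, nothing else."""

    def fix(match: re.Match) -> str:
        inner = CODE_RE.sub(lambda c: ETYM_CODES[c.group(0)], match.group(1))
        return "<etym>" + inner + "</etym>"

    return ETYM_RE.sub(fix, line)


sample = "<ls>Dha1tup._ii_,_28</ls> <etym>pe1do</etym>~,~<etym>pe4rdz3u</etym>"
print(transcode_etym(sample))
```

Restricting the substitution to one tag is the point of the design: the same codes in `<ls>` stay as they are until their meaning there has been checked.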

⚠️ The letter-number scheme also appears in other sections of MW:

<ls> abbreviations of literary sources
<as0> IAST form of Sanskrit words.

It is not known whether the letter-number codings represent the same Unicode characters in these 'Sanskrit words' as they do in the non-Sanskrit words that appear in the etymologies. That's what the
warning is about: Carrying through the Unicodification in the etymology sections must be done with
care.
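One cautious first step would be to inventory which letter-number codes occur under which tag before changing anything. The sketch below is an assumption about how that could be done: it treats any letter followed by a digit as a candidate code, and it assumes `<as0>` wraps its content like `<etym>` and `<ls>` do, which this thread does not confirm. The function name `inventory` is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tags of interest; the <as0>...</as0> form is an assumption.
TAG_RE = re.compile(r"<(etym|ls|as0)>(.*?)</\1>")
# Candidate letter-number code: one ASCII letter followed by one digit.
CODE_RE = re.compile(r"[A-Za-z][0-9]")


def inventory(lines):
    """Count candidate letter-number codes separately for each tag."""
    counts = defaultdict(Counter)
    for line in lines:
        for tag, content in TAG_RE.findall(line):
            counts[tag].update(CODE_RE.findall(content))
    return counts


sample = ["<ls>Dha1tup._ii_,_28</ls> <etym>pe1do</etym> <etym>pe4rdz3u</etym>"]
for tag, codes in inventory(sample).items():
    print(tag, dict(codes))
```

Comparing the per-tag counts would show which codes are shared between Sanskrit and non-Sanskrit contexts and therefore need a decision before transcoding.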


gasyoun · 2017-08-21T04:49:10Z

> c-caron

Oh, this is a Unicode defect.

> I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW

Can you extract them, please? I would keep an eye on them.

> It is not known whether the letter-number codings represent the same Unicode characters in these 'Sanskrit words' as they do in the non-Sanskrit words that appear in the etymologies. That's what the
> warning is about: Carrying through the Unicodification in the etymology sections must be done with
> care.

Not sure I understood, but I can fix at least some of the errors in the etymologies.

gasyoun · 2017-11-09T21:05:29Z

cf. Lat. vir ; Lith. vy4ras ; Goth. wair ; Angl.Sax. wr, wre-wulf ; Eng. werewolf ; Germ. Werwolf, Wergeld. ] [L=203601]

Angl.Sax. words partly marked.

Andhrabharati · 2021-09-02T14:15:35Z

> I also noticed 'u8', 'o8', and 'c8'.
>
> Since we are now comfortable with using Unicode IAST in the digitizations rather than the original letter-number coding, I've changed these to Unicode e-breve, u-breve, o-breve, and c-caron (since there is no c-breve as a single Unicode code point).

> c-caron
>
> Oh, this is a Unicode defect.

This is not a Unicode defect; in all three places (Russian words) where c8 occurs in mw_orig.txt, it does indicate c-caron only. It is a defect (wrong encoding) in the digitisation, I would say!

> I'm sure that there are still other odd characters with diacritics in the 'etymology' sections of entries in MW
>
> Can you extract them, please? I would keep an eye on them.

You have yet to persuade Jim to incorporate the Lithuanian words done a few months back, @gasyoun !

Andhrabharati · 2021-09-06T16:49:40Z

@drdhaval2785
are you interested in cleaning up the [cf.] blocks of etym. words, if I give out my data (lying in my folders for ~6 years now) ?

I had mentioned this before, but gave only the Lithuanian portion, in another thread.

drdhaval2785 · 2021-09-06T17:11:39Z

Sure.
I would work out some way to incorporate the details.

Andhrabharati · 2021-09-06T17:19:11Z

it should work in a similar way to how you handled my file just now: just a find/replace of strings/lines.

Andhrabharati · 2021-09-06T17:21:45Z

let's try to do whatever is possible without bothering @funderburkjim.

for major points there should be some consensus anyway !

drdhaval2785 · 2021-09-07T00:29:54Z

Kindly give the files as you mention.
I see one file in Lithuanian words issue.
But as you are suggesting to give all the language works together, I would wait for it.

Andhrabharati · 2021-09-07T02:28:08Z

yes, I posted there first and then saw this and posted here.

That issue can be closed as this will cover it also.

I will give the file in a day or two, as it needs to be prepared in the "new format" I myself have introduced now.

Andhrabharati · 2021-09-07T02:48:25Z

however, that Lithuanian issue has one important piece that can still remain, maybe as a separate issue, as it is worth some consideration. And it was also proposed by me!

It is to have present-day word forms as tooltips for the archaic words in all those old works. I am sure quite a few of the words have changed over the century-plus that has elapsed since they were printed in these dictionaries.

Andhrabharati · 2021-09-07T02:54:43Z

incidentally, this issue also carries the wrong notion that IAST encompasses all European languages.

It seems it got into the nerves and blood of the team from the very beginning!!

Andhrabharati · 2021-09-07T04:09:00Z

@drdhaval2785

seems I have to put in a good amount of effort to bring my old data into the present Cologne format; too many changes have taken place in the tagging over the past 5-6 years.

just searching with "lang" tag gave 1200+ lines in my old file, and only about 850 in the present cologne file.

thinking of how best to make my life (task) easy!

drdhaval2785 · 2021-09-07T04:12:54Z

Don't worry. We will figure out some way to reconcile the differences. But it would be necessary to see your files first.

Andhrabharati · 2021-09-07T15:17:08Z

After carefully looking at Cologne's present file and my old file, I noticed some systematic changes.

My file has the <lang> tag for all non-Skt. languages (incl. Prakrit!), and Cologne's present data has it only for Greek, Arabic & Persian. Other languages have <etym> instead; I do not recall whether I changed it while working on it, to make it uniform across all languages, or whether it was changed by the Cologne team sometime later.
My file has no accents in Devanagari strings, as we had "removed" them during our conversion in those days for whatever reason.
Also, there are many changes in tagging style and additionally introduced tags in the Cologne data now.

So I am splitting my data into two parts now, one matching with Cologne's present <lang> tags and all others in another part.
Probably I should be able to close this process by tonight.

And would leave it to you to handle the parts appropriately with this above info.

Andhrabharati · 2021-09-07T16:04:19Z

Forgot to mention another point: I had all the etym. portions starting with "[cf." on a separate line. The Cologne data has some on the same line at the end and some on another line.

So enough care should be taken to replace only part of the Cologne line while incorporating my data. No full-line replacements in such cases.
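A minimal sketch of such a partial replacement, assuming the etymology block appears in the line as a plain-text "[cf. ... ]" span with no nested brackets. The actual Cologne markup may differ, and the name `replace_cf_block` is hypothetical.

```python
import re

# First "[cf. ... ]" span in a line; assumes no nested square brackets.
CF_RE = re.compile(r"\[cf\..*?\]")


def replace_cf_block(line: str, corrected: str) -> str:
    """Swap only the '[cf. ...]' span for the corrected one; keep the rest.

    A line without such a span is returned unchanged.
    """
    # A function is used as the replacement so that backslashes in
    # `corrected` are never interpreted as regex escapes.
    return CF_RE.sub(lambda _m: corrected, line, count=1)


old = "to break wind downwards [cf. Lith. pe4rdz3u ; Eng. fart.] [L=119581]"
new_block = "[cf. Lith. p\u00e9rd\u017cu ; Eng. fart.]"
print(replace_cf_block(old, new_block))
```

Text before and after the block, such as the trailing `[L=119581]`, is left exactly as it was in the line.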

Andhrabharati · 2021-09-08T15:02:16Z

Successfully completed splitting my data: the first portion is aligned to the 825 <lang> lines of the present Cologne data, and the second portion was separated out as another file in the process.

Two important observations:

1. My file has some Greek words in Capital letters, and Cologne has them in small letters. Do not know the reason!

Probably something might've happened while I gave the Greek letters data for correction earlier, in Excel form; but even the Greek expert @jmigliori did not identify these!! (He seemed to have referred to the scan/book at some words.)

Giving the first portion data as Excel file, for human reading (I wanted to draw attention to two places that I had remarked).
MW_lang lines.xlsx

@drdhaval2785 can make the text file out of it just by copy/paste, and try to look for the possible means of handling the indicated corrections/differences.
-------------
If Dhaval feels it is alright, I shall give the other 476 lines from my old data having the <lang> tag, aligning them with the <etym> lines of the present Cologne data. [I will wait for his response before taking up this work.]

Incidentally, when I extracted the <etym> strings from the Cologne data and removed the lines having Greek, Arabic and Persian (as they are already covered above), only 380 remained.

2. So my file has 96 extra lines (476 - 380) that are yet to be matched with corresponding present Cologne lines.

Andhrabharati · 2021-09-08T15:36:36Z

BTW, the column B in Excel is hidden, which has the HW part from the Cologne lines.

It may be unhidden and used, if needed.

Andhrabharati · 2021-09-08T17:53:56Z

just a thought.

as there are just about 1200+ lines, it's probably better to read the present Cologne data and correct it directly.

this should not take more than 3-4 days.

I have already spent more than a day, and at least another half-day is needed to align the second-part lines from my file.

I had looked only at the tagged portions in those days and not at other portions such as punctuation.

it will be another reading from my side of those complete lines (and I am more "mature" in the process now), and it also saves Dhaval's time considerably.

what would @drdhaval2785 say about this?

Andhrabharati · 2021-09-08T18:00:04Z

this whole exercise just now may be treated as a bit of "showing off" (I do mean it in that sense, as that work has some flaws) that I did some work in those days.

gasyoun · 2021-09-08T19:32:17Z

> also it saves Dhaval's time considerably.

@drdhaval2785 must love it.

Andhrabharati · 2021-09-08T20:12:31Z

Anyway, first let Dhaval have a look at the file sent and try to make a plan about using its data, as that was the original idea agreed upon.

funderburkjim · 2021-09-11T17:39:20Z

> whatever is possible without bothering @funderburkjim.

Good idea!

drdhaval2785 · 2021-10-03T12:36:20Z

File supplied by AB has been examined and necessary changes have been made.
Repository is at https://github.com/sanskrit-lexicon/MWS/tree/master/CORRECTIONS_issue_362 .
Pending items are being tracked at sanskrit-lexicon/MWS#113, sanskrit-lexicon/MWS#114, sanskrit-lexicon/MWS#115, sanskrit-lexicon/MWS#116, sanskrit-lexicon/MWS#117, sanskrit-lexicon/MWS#118, sanskrit-lexicon/MWS#119.

funderburkjim · 2021-10-03T17:38:57Z

Good improvement!

gasyoun added the Research label Oct 22, 2017

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open

drdhaval2785 added a commit to sanskrit-lexicon/MWS that referenced this issue Oct 3, 2021

init sanskrit-lexicon/CORRECTIONS#362

f42644a

Andhrabharati mentioned this issue Oct 3, 2021

Lithuanian in mAs sanskrit-lexicon/MWS#118

Closed

drdhaval2785 added a commit to sanskrit-lexicon/csl-orig that referenced this issue Oct 3, 2021

corrected sanskrit-lexicon/CORRECTIONS#362

229f74b

drdhaval2785 added a commit to sanskrit-lexicon/csl-devanagari that referenced this issue Oct 3, 2021

started handling sanskrit-lexicon/CORRECTIONS#362

427d6dc
