Indentation / data error for headword categories in MW #1617

drdhaval2785 · 2024-01-15T08:42:57Z

A user, @aumsanskrit, reported the following error:

QUOTE
I have included two screenshots for the first Hierarchy Error that I referenced. In the selected word online, you will see in the left column that the word is “indented” indicating that the online dictionary has placed this word “underneath” the word listed above (as if there is an etymological relationship). However no such direct etymological relation exists in the printed dictionary because the word is actually listed “independently”. Certainly the word is listed directly below on the printed page, but the listing is “independent” and not “Etymologically” related. In the Online version the “indentation” of this particular word in the left-hand column suggests an “Etymological” relationship to the above word which actually does not exist.
UNQUOTE

Website (List display)

Print

Data

<L>240913<pc>1197,1<k1>sAMvAhika<k2>sAMvAhika<e>1
<s>sAMvAhika</s> ¦ <lex>mf(<s>A</s> and <s>I</s>)n.</lex> (<ab>fr.</ab> <s>saM-vAha</s>) <ab>g.</ab> <s>kASy-Adi</s> and <s>guqA<srs/>di</s>.<info lex="m:f#A:f#I:n"/>
<LEND>
<L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>2
<s>sAMvittika</s> ¦ <lex>mfn.</lex> (<ab>fr.</ab> <s>saM-vitti</s>) based on a (mere) feeling or perception, subjective, <ls>Kap.</ls>, <ab>Sch.</ab><info lex="m:f:n"/>
<LEND>

I am not sure why sAMvAhika bears <e>1 and sAMvittika bears <e>2.

This seems to be leading to the error in indentation in the list display.
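
To make the comparison of the two metalines concrete, here is a minimal sketch that pulls the fields out of a metaline. It assumes only the shape seen in the two records quoted above (`<tag>value` pairs on a line starting with `<L>`); the function name `parse_metaline` is made up for illustration and is not part of the CDSL tooling.

```python
import re

def parse_metaline(line):
    """Return the <tag>value pairs of a metaline as a dict, or None otherwise."""
    if not line.startswith("<L>"):
        return None  # body lines and <LEND> are not metalines
    return dict(re.findall(r"<(\w+)>([^<]*)", line))

records = [
    "<L>240913<pc>1197,1<k1>sAMvAhika<k2>sAMvAhika<e>1",
    "<L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>2",
]
for rec in map(parse_metaline, records):
    print(rec["L"], rec["k1"], "e =", rec["e"])
```

The `e` field is the only difference in kind between the two records, which is what the list display picks up.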


gasyoun · 2024-01-15T08:52:56Z

So it's misplaced in the list, sitting one level lower than where it belongs.

funderburkjim · 2024-01-18T18:13:00Z

> I am not sure why sAMvAhika bears <e>1 and sAMvittika bears <e>2.
> This seems to be leading to the error in indentation in the list display.

Yes. sAMvittika should be changed to <e>1. This will solve the indentation problem.

Here are details.

In the mw dictionary, <e>2 in the metaline of mw.txt turns into <H2> in mw.xml, courtesy of make_xml.py:

%if dictlo == 'mw':
 #data = "<H1><h>%s</h><body>%s</body><tail>%s</tail></H1>" % (h,body,tail)
 data = "<h>%s</h><body>%s</body><tail>%s</tail>" % (h,body,tail)
 tag = 'H%s' %hwrec.e         # NOTE: the e value is used here
 data = '<%s>%s</%s>' %(tag,data,tag)
%endif

In the list display, the <H2> in mw.xml is used to generate HTML that shows indentation with periods. See listhierview.php:

 $spcchar = ".";
...
  if (preg_match('/^<H([2])/',$data2,$matches)) {
   $spc="$spcchar";    // NOTE: <H2> gets one preceding period
  }else if(preg_match('/^<H([3])/',$data2,$matches)) {
   $spc="$spcchar$spcchar";
  }else if(preg_match('/^<H([4])/',$data2,$matches)) {
   $spc="$spcchar$spcchar$spcchar";
  }else {
   $spc="";
  }
.....................
  $out1 = "$spc<a  onclick='getWordAlt_keyboard(\"<SA>$key2</SA>\");'><span style='$c'$class><SA>$key2show</SA></span>$hom2</a>$xtraskip<br/>\n";
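
The chain described above (the `e` value becomes an `<H…>` tag, and the tag becomes leading periods) can be sketched end to end in Python. This is a simplified re-statement of the two excerpts, not the actual CDSL code: `wrap_entry` and `indent_prefix` are hypothetical names, and the entry body is a placeholder.

```python
import re

SPC_CHAR = "."  # same marker as $spcchar in the PHP excerpt

def wrap_entry(e, h, body, tail):
    """Wrap an entry in <H{e}>...</H{e}>, as the make_xml.py excerpt does for mw."""
    data = "<h>%s</h><body>%s</body><tail>%s</tail>" % (h, body, tail)
    tag = "H%s" % e
    return "<%s>%s</%s>" % (tag, data, tag)

def indent_prefix(data):
    """Return the leading periods the list display would show for this record."""
    m = re.match(r"<H([234])", data)
    # H2 -> one period, H3 -> two, H4 -> three; anything else -> none
    return SPC_CHAR * (int(m.group(1)) - 1) if m else ""

for e in ("2", "1"):
    xml = wrap_entry(e, "sAMvittika", "(placeholder body)", "")
    print("e =", e, "->", repr(indent_prefix(xml) + "sAMvittika"))
```

Under these assumptions, changing the metaline from `<e>2` to `<e>1` is enough to drop the leading period in the list display.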

drdhaval2785 · 2024-01-24T05:02:31Z

Correction made, but it seems that there are many such cases.
For example, sAMvidya, just beneath sAMvittika, was also marked <e>2.

We should find a programmatic way to detect such errors.

One option would be to locate the old monier.xml digitization, look at its <H1> tags, and check which of those entries were wrongly converted to <e>2.
As I understand it, monier.xml was the earlier source file, and mw.txt was generated from it at some point to make MW consistent with the other CDSL dictionaries.

Same with other categories.

@funderburkjim may like to throw some light on the same.
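
The comparison proposed above could look roughly like the sketch below. It assumes that the hierarchy level of each record has already been extracted from both digitizations into dictionaries keyed by L-number; how to extract them from monier.xml is not shown, since that file's layout is not described in this thread. The function name `level_mismatches` and the sample values are invented purely for illustration.

```python
def level_mismatches(old_levels, new_levels):
    """List (L, old, new) for records present in both sources whose levels differ."""
    shared = old_levels.keys() & new_levels.keys()
    return sorted(
        (L, old_levels[L], new_levels[L])
        for L in shared
        if old_levels[L] != new_levels[L]
    )

# Made-up sample data, NOT taken from any real digitization.
old_levels = {"1001": "1", "1002": "1", "1003": "3"}
new_levels = {"1001": "1", "1002": "2", "1003": "3", "1004": "2"}

for L, old, new in level_mismatches(old_levels, new_levels):
    print("L=%s: old H%s, new H%s" % (L, old, new))
```

Such a check can only surface errors introduced between the two versions; if both carry the same wrong value, nothing is reported.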

funderburkjim · 2024-01-26T03:24:53Z

We need to recall MW's description of H1-H4: https://sanskrit-lexicon.uni-koeln.de/scans/csldev/csldoc/build/dictionaries/prefaces/mwpref/mwpref11.html

Based on MW's criteria:

* 1 sAMvidya should be H2, since the headword is in Roman in print (L=240915).
  * Thus, this needs to be reverted in csl-orig. @drdhaval2785, agree?
* 2 sAMvidya should be H1, since the headword is in Devanagari in print (L=240916).

No NLP-type accuracy test for the H-values comes to mind.

Are there other examples of such errors, in addition to the one found at sAMvittika?

aumsanskrit · 2024-01-26T07:28:35Z

I submitted many such errors through the following webpage:

https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-corrections/app/correction_form.php?dict=MW

I termed all such errors as "Hierarchy" or "Hierarchical" errors.

drdhaval2785 · 2024-01-26T15:39:39Z

Just for the record, I checked the MONIER.ALL file (a very early digitization), and there too the data agrees with mw.txt. So there is no pattern there by which such errors could be found programmatically.

So it seems that manual work lies ahead.

Andhrabharati · 2024-01-26T15:52:22Z

> Based on MW's criteria:
>
> * `1 sAMvidya` should be H2, since the headword is in Roman in print (L=240915),
>   * Thus, this needs to be reverted in csl-orig. @drdhaval2785 Agree?
> * `2 sAMvidya` should be H1, since the headword is in Devanagari in print (L=240916)

@drdhaval2785 / @funderburkjim

I would just like to bring to your attention these two lines of text from mw_orig_utf8.txt:

The only issue I see is that mw.txt was later split further (at @gasyoun's request), to a level a bit too fine to correlate with this old data. [Still, I can see a way ahead, which Jim might come up with quite soon if he puts his mind to the issue.]

Andhrabharati · 2024-01-26T15:56:27Z

It is not out of context here to mention that, in my current work, I am reverting this MW99 split-up to the 'theme' that I had envisaged as applicable to almost all the CDSL works (PWG, pwk, Apte, MW72, MD, MW99, ...).

funderburkjim · 2024-01-26T17:18:07Z

> I submitted many such errors

mw_correctionform.txt in csl-corrections repo shows about 11 of these hierarchy errors. Thanks for mentioning, @aumsanskrit .
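
Since the submitted reports were labelled "Hierarchy" or "Hierarchical", a simple case-insensitive text filter is one way to count them. The sketch below assumes only that mw_correctionform.txt is a plain-text file in which that label appears on the relevant lines; its real layout is not described here, and `hierarchy_reports` and the sample lines are invented for illustration.

```python
import re

HIERARCHY = re.compile(r"hierarch", re.IGNORECASE)

def hierarchy_reports(lines):
    """Return the lines that mention 'Hierarchy' or 'Hierarchical' (any case)."""
    return [ln.rstrip("\n") for ln in lines if HIERARCHY.search(ln)]

# Made-up sample lines standing in for the real file contents.
sample = [
    "entry A: Hierarchy error, headword indented wrongly\n",
    "entry B: typo in citation\n",
    "entry C: hierarchical error in list display\n",
]
print(len(hierarchy_reports(sample)), "hierarchy reports")
```

To run it on the real file, pass an open file object instead of `sample`, for example `hierarchy_reports(open("mw_correctionform.txt", encoding="utf-8"))`.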

Andhrabharati · 2024-01-27T02:53:02Z

On a closer look, even mw_orig_utf8.txt does NOT agree with the print (as far as the Hn numbering is concerned) in too many places.

Andhrabharati · 2024-01-27T13:07:42Z

Beyond the H1 and H2 marking differences discussed above, I have now found that quite a few H2 entries were marked as H3!

Andhrabharati · 2024-01-27T15:22:01Z

I have now come across cases of H3 entries marked as H2 (the reverse of the above)!

drdhaval2785 assigned funderburkjim Jan 15, 2024

drdhaval2785 added the bug label Jan 15, 2024

drdhaval2785 added a commit that referenced this issue Jan 24, 2024

correction of #1617

e075c45

funderburkjim mentioned this issue May 31, 2024

Correction backlog, 1 #1637

Closed
