Indentation / data error for headword categories in MW #1617

Open
drdhaval2785 opened this issue Jan 15, 2024 · 12 comments

@drdhaval2785
Collaborator

A user, @aumsanskrit, reported the following error:

QUOTE
I have included two screenshots for the first Hierarchy Error that I referenced. In the selected word online, you will see in the left column that the word is “indented” indicating that the online dictionary has placed this word “underneath” the word listed above (as if there is an etymological relationship). However no such direct etymological relation exists in the printed dictionary because the word is actually listed “independently”. Certainly the word is listed directly below on the printed page, but the listing is “independent” and not “Etymologically” related. In the Online version the “indentation” of this particular word in the left-hand column suggests an “Etymological” relationship to the above word which actually does not exist.
UNQUOTE

Website (List display)

Hierarchy Error 1

Print

Hierarchy Error 1 (printed)

Data

<L>240913<pc>1197,1<k1>sAMvAhika<k2>sAMvAhika<e>1
<s>sAMvAhika</s> ¦ <lex>mf(<s>A</s> and <s>I</s>)n.</lex> (<ab>fr.</ab> <s>saM-vAha</s>) <ab>g.</ab> <s>kASy-Adi</s> and <s>guqA<srs/>di</s>.<info lex="m:f#A:f#I:n"/>
<LEND>
<L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>2
<s>sAMvittika</s> ¦ <lex>mfn.</lex> (<ab>fr.</ab> <s>saM-vitti</s>) based on a (mere) feeling or perception, subjective, <ls>Kap.</ls>, <ab>Sch.</ab><info lex="m:f:n"/>
<LEND>

I am not sure why sAMvAhika bears <e>1 and sAMvittika bears <e>2.

This seems to be leading to the error in indentation in the list display.
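To see which level each headword carries, the metalines can be read field by field. The following is a minimal sketch, assuming every metaline starts with `<L>` and consists of `<tag>value` pairs as in the two records quoted above; `parse_metaline` is a hypothetical helper, not part of the CDSL tooling.

```python
import re

def parse_metaline(line):
    """Return the <tag>value pairs of a metaline as a dict, or None for other lines."""
    if not line.startswith("<L>"):
        return None  # body lines and <LEND> are not metalines
    return dict(re.findall(r"<(\w+)>([^<]*)", line.strip()))

sample = [
    "<L>240913<pc>1197,1<k1>sAMvAhika<k2>sAMvAhika<e>1",
    "<L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>2",
]
levels = [(rec["k1"], rec["e"]) for rec in map(parse_metaline, sample)]
print(levels)  # [('sAMvAhika', '1'), ('sAMvittika', '2')]
```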

drdhaval2785 added the "bug" label (Something technical isn't working) Jan 15, 2024
@gasyoun
Member

gasyoun commented Jan 15, 2024

So it sits one level lower in the list than where it belongs.

@funderburkjim
Contributor

> I am not sure why sAMvAhika bears <e>1 and sAMvittika bears <e>2.
> This seems to be leading to the error in indentation in the list display.

Yes. sAMvittika should be changed to <e>1. This will solve the indentation problem.
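As an illustration of the fix, here is a minimal sketch of rewriting the level in a metaline. It assumes `<e>` is the last field of the metaline, as in the records quoted in this issue; `set_e_value` is a hypothetical helper, not an existing CDSL script.

```python
import re

def set_e_value(metaline, new_e):
    """Replace the trailing <e> value of a metaline (hypothetical helper)."""
    updated, count = re.subn(r"<e>\w+$", "<e>%s" % new_e, metaline)
    if count != 1:
        raise ValueError("no <e> field at end of metaline: %r" % metaline)
    return updated

before = "<L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>2"
after = set_e_value(before, "1")
print(after)  # <L>240914<pc>1197,1<k1>sAMvittika<k2>sAMvittika<e>1
```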

Here are details.

In the mw dictionary, an <e>2 in the metaline of mw.txt becomes <H2> in mw.xml, via make_xml.py:

%if dictlo == 'mw':
 #data = "<H1><h>%s</h><body>%s</body><tail>%s</tail></H1>" % (h,body,tail)
 data = "<h>%s</h><body>%s</body><tail>%s</tail>" % (h,body,tail)
 tag = 'H%s' % hwrec.e  # NOTE: the <e> value from the metaline is used here
 data = '<%s>%s</%s>' %(tag,data,tag)
%endif
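
The same wrapping step as a standalone function, for anyone who wants to try it outside the template. This is a sketch that mirrors the quoted fragment only; `wrap_record` and its arguments are hypothetical names, and the rest of make_xml.py is not reproduced.

```python
def wrap_record(h, body, tail, e):
    """Wrap a record in an <H{e}> element, mirroring the template fragment above."""
    data = "<h>%s</h><body>%s</body><tail>%s</tail>" % (h, body, tail)
    tag = "H%s" % e  # the metaline's <e> value decides the element name
    return "<%s>%s</%s>" % (tag, data, tag)

# An <e>2 metaline therefore yields an <H2> element in the XML.
print(wrap_record("h", "b", "t", "2"))
```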

In the list display, the <H2> in mw.xml is used to generate HTML that shows indentation with periods. See listhierview.php:

 $spcchar = ".";
...
  if (preg_match('/^<H([2])/',$data2,$matches)) {
   $spc="$spcchar";    // NOTE: <H2> gets one preceding period
  }else if(preg_match('/^<H([3])/',$data2,$matches)) {
   $spc="$spcchar$spcchar";
  }else if(preg_match('/^<H([4])/',$data2,$matches)) {
   $spc="$spcchar$spcchar$spcchar";
  }else {
   $spc="";
  }
.....................
  $out1 = "$spc<a  onclick='getWordAlt_keyboard(\"<SA>$key2</SA>\");'><span style='$c'$class><SA>$key2show</SA></span>$hom2</a>$xtraskip<br/>\n";
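
The indentation rule in the PHP fragment can be restated compactly. The sketch below is a Python paraphrase of that rule only, not the actual display code; `indent_prefix` is a hypothetical name.

```python
import re

SPC_CHAR = "."

def indent_prefix(xml_record):
    """Return the period prefix the list view shows for a record."""
    m = re.match(r"<H([234])", xml_record)
    if not m:
        return ""  # <H1> and anything else: no indentation
    # <H2> gets one period, <H3> two, <H4> three
    return SPC_CHAR * (int(m.group(1)) - 1)
```

With this rule, a record wrongly tagged <H2> instead of <H1> is shown with one leading period, which is the indentation reported in this issue.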

drdhaval2785 added a commit that referenced this issue Jan 24, 2024
@drdhaval2785
Collaborator Author

Correction made, but there seem to be many such cases.
For example, sAMvidya, just beneath sAMvittika, was also marked <e>2.

We should find a programmatic way to detect such errors.

One way would be to locate the old digitization, monier.xml, look at its <H1> tags, and see which of them were wrongly converted to <e>2.
As I understand it, monier.xml was the original file, and mw.txt was generated from it at some point to make MW consistent with the other CDSL dictionaries.
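
A minimal sketch of such a comparison, assuming both digitizations can be reduced to a mapping from record number to level and that they share record numbers. Neither assumption is confirmed in this thread, and the reference values below are made up for illustration. A disagreement would only flag a candidate to check against print.

```python
def level_mismatches(current, reference):
    """Yield (L, current_level, reference_level) where the two sources disagree."""
    for lnum in sorted(current.keys() & reference.keys()):
        if current[lnum] != reference[lnum]:
            yield lnum, current[lnum], reference[lnum]

current = {"240913": "1", "240914": "2"}
reference = {"240913": "1", "240914": "1"}  # made-up reference values
print(list(level_mismatches(current, reference)))
```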

The same applies to the other categories.

@funderburkjim may be able to shed some light on this.

@funderburkjim
Contributor

We need to recall MW's description of H1-H4: https://sanskrit-lexicon.uni-koeln.de/scans/csldev/csldoc/build/dictionaries/prefaces/mwpref/mwpref11.html

Based on MW's criteria:

  • 1 sAMvidya should be H2, since the headword is in Roman in print (L=240915),
    • Thus, this needs to be reverted in csl-orig. @drdhaval2785 Agree?
  • 2 sAMvidya should be H1, since the headword is in Devanagari in print (L=240916)

No NLP-type accuracy test for the H-values comes to mind.

Are there other examples of such errors, in addition to the one found at sAMvittika?

@aumsanskrit
Contributor

I submitted many such errors through the following webpage:

https://www.sanskrit-lexicon.uni-koeln.de/scans/csl-corrections/app/correction_form.php?dict=MW

I termed all such errors as "Hierarchy" or "Hierarchical" errors.

@drdhaval2785
Collaborator Author

Just for the record, I checked the MONIER.ALL file (a very early digitization), and there too the data matches mw.txt. So it is not possible to find a pattern by which such errors could be detected programmatically.

So it looks like manual work lies ahead.

@Andhrabharati

Andhrabharati commented Jan 26, 2024

> Based on MW's criteria:
>
> * `1 sAMvidya` should be H2, since the headword is in Roman in print (L=240915),
>   * Thus, this needs to be reverted in csl-orig. @drdhaval2785 Agree?
> * `2 sAMvidya` should be H1, since the headword is in Devanagari in print (L=240916)

@drdhaval2785 / @funderburkjim

I would just like to draw your attention to these two lines of text from mw_orig_utf8.txt:

image

The only issue I see is that mw.txt was later split further (at @gasyoun's request), to a degree that makes it hard to correlate with this old data. [Still, I can see a way ahead, which Jim might come up with quite soon if he puts his mind to the issue.]

@Andhrabharati

Andhrabharati commented Jan 26, 2024

It is not out of place to mention here that, in my current work, I am reverting this MW99 split-up to the 'theme' that I had envisaged as applicable to almost all the CDSL works (PWG, pwk, Apte, MW72, MD, MW99, ...).

@funderburkjim
Contributor

> I submitted many such errors

mw_correctionform.txt in the csl-corrections repo shows about 11 of these hierarchy errors. Thanks for mentioning it, @aumsanskrit.

@Andhrabharati

On a closer look, even mw_orig_utf8.txt does NOT agree with the print (as far as the Hn numbering is concerned) in too many places.

@Andhrabharati

Besides the H1 and H2 marking differences discussed above, I have now found that quite a few H2 entries were marked as H3!

@Andhrabharati

I have now also come across cases of H3 entries marked as H2 (the reverse of the above)!
