whitelist #5

Open
funderburkjim opened this issue Sep 11, 2016 · 13 comments

Comments

@funderburkjim
Contributor

The whitelist directory contains work aimed at identifying headwords of the various Sanskrit dictionaries that may have spelling errors.

The underlying set of headwords is hwnorm1c.txt, which currently has 385,011 headwords.

The idea of the whitelist approach is to identify words which, on the basis of rules, are probably NOT misspelled. For instance, one such rule for a given spelling is that the word with that spelling appears as a headword in two or more dictionaries. All words satisfying a particular such rule are put into a whitelist file (whitelist0.txt for the rule just described). The latest batches of these whitelist files are in the output/all directory. There are currently 24 such whitelist files.

Then, the headwords whose spellings have as yet no rule to justify the correctness of their spelling are gathered into a graylist.txt file.

Currently, there are 21818 graylisted headwords.
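
A minimal sketch of this partitioning, assuming each input record has the form headword:DICT1,DICT2 (the shape of the sample lines quoted later in this thread) and assuming each headword is credited to the first rule that accepts it. The function names and the one-entry rule list are hypothetical, not the actual whitelist program:

```python
def parse_record(line):
    """Split 'headword:DICT1,DICT2' into (headword, set of dictionary codes)."""
    headword, _, dicts = line.strip().partition(":")
    codes = set(dicts.split(",")) if dicts else set()
    return headword, codes


def rule_two_or_more(headword, dicts):
    """Rule 0: the spelling is a headword in two or more dictionaries."""
    return len(dicts) >= 2


def partition(lines, rules):
    """Apply the rules in order; the first rule that matches claims the headword.

    Returns (whitelists, graylist): whitelists maps a rule code to the
    headwords it justified, graylist holds everything left over.
    """
    whitelists = {code: [] for code, _ in rules}
    graylist = []
    for line in lines:
        headword, dicts = parse_record(line)
        for code, rule in rules:
            if rule(headword, dicts):
                whitelists[code].append(headword)
                break
        else:
            # No rule justified this spelling, so it stays gray.
            graylist.append(headword)
    return whitelists, graylist


sample = ["aMSagaRa:IEG,PD", "paryantIkfta:BHS,MW", "vipay:PW"]
whitelists, graylist = partition(sample, [("0", rule_two_or_more)])
print(whitelists["0"])  # ['aMSagaRa', 'paryantIkfta']
print(graylist)         # ['vipay']
```

Adding a rule would then mean appending another (code, function) pair to the list, and every word the new rule accepts leaves the graylist.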

According to the logic of this whitelist approach, the graylisted words are the most fertile ground for remaining headword spelling errors.

Of course, many of the graylisted words are surely spelled correctly. But, as of yet, we don't have any programmatic (or other) way to distinguish these as correctly spelled.

I hope others will examine these lists, especially the graylist, with an eye to:

  • developing additional rules that would whitelist chunks of these words
  • identifying remaining errors by hand, maybe by focusing on the graylisted words in a particular dictionary.
@funderburkjim
Contributor Author

funderburkjim commented Sep 11, 2016

The latest run of the whitelist program shows these statistics regarding the number of cases whitelisted
by the various rules:

$ sh redo.sh
Recreating auxiliary/special.txt
regenerating graylist.txt and all whitelistX.txt files
385011 records from ../hwnorm1c.txt
187992 headwords coded as 0: In two or more dictionaries
  6357 headwords coded as 1: key1=X+am and X+a is found
  1773 headwords coded as 2: SKD nouns shown in nominative singular
 19238 headwords coded as 3a: prefix of known word
 11095 headwords coded as 3b: suffix of known word
  8823 headwords coded as 0a: special words (icf, foreign, etc.)
  2589 headwords coded as 4a: probable f. nouns ending in 'A'
   618 headwords coded as 4b: inflected form
 74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
  8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
   698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
  9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
  4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
 17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
  3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
    16 headwords coded as 5a: kar<->kf
    57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
   810 headwords coded as 3bcpd: Compound word + suffix
   629 headwords coded as 3acpd: prefix + Compound word
  1103 headwords coded as cpdsandhi1: Compound word with sandhi
   819 headwords coded as 3a1: prefix of known compound
    95 headwords coded as cpdsandhi2: Compound word with sandhi
  2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
 21818 headwords coded as gray: Not yet whitelisted
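
To make one of these rule descriptions concrete, here is a sketch of rule 1 as I read its summary line: a headword of the form X+am is accepted when X+a is itself a known headword. The function name, the notion of a "known" set, and the sample spellings are assumptions for illustration, not the code in the whitelist program:

```python
def rule_am(headword, known):
    """True if headword is X+'am' and X+'a' is a known headword."""
    # Dropping the final 'm' of X+'am' leaves X+'a'.
    return headword.endswith("am") and headword[:-1] in known


# Illustrative SLP1 spellings only; not taken from the whitelist files.
known = {"Pala"}
print(rule_am("Palam", known))  # True: 'Pala' is known
print(rule_am("Palim", known))  # False: does not end in 'am'
```

The counts in the summary add up to 385011, the number of records read, so each headword appears to be credited to exactly one code.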

@gasyoun
Member

gasyoun commented Sep 12, 2016

Wow!

If we exclude PD for now, it's only 5547 words. I hoped it would be more.
1326 of them are from ACC; only someone who knows the valid names of manuscripts can approve those.

I have checked a few

paSU    PW
plIyA   PW
buDi    PW
BAgI    PW
mUrKI   PW

Why plIyA and not pliyA? Is it an SLP1 converter issue (PalIkartavE has the same I issue, but it is ok in livI)?

susu

Why is s. bolded in s.u. [see under = siehe unter], but u. is not? That makes no sense; let's check whether it is the same in other entries.

@drdhaval2785
Contributor

My views regarding a few of the observations on the whitelist approach:

  1. whitelist0 - We should try to ignore the historically similar dictionary pairs from this approach. If SKD and VCP show the same word / YAT and WIL show the same word / PW and PWG show the same word, but no other dictionary shows the same word, they should not be put in whitelist0. They tend to repeat the same mistakes as their predecessors.

@gasyoun
Member

gasyoun commented Sep 12, 2016

try to ignore the historically similar dictionary pairs from this approach

Yes, that was my main concern. Otherwise it's no real whitelisting, only a ghost list.

@funderburkjim
Contributor Author

The 'paired dictionary' observation makes sense as a good refinement.

The graylist file is still a good candidate error list. If we excluded the paired words, graylist would be
INCREASED, not decreased.

There is a tool which may be used to examine paired-dictionary lists.

For example, to generate a list of all headwords which appear ONLY in WIL and YAT,

# Change to the whitelist directory, then run:
python filterdict.py output/all/whitelist0.txt old/wilyat.txt wil yat
# The output is old/wilyat.txt. I put the output in the 'old' subdirectory of whitelist, since
# files in that directory are excluded, due to the way .gitignore for the hwnorm1 repository is set up.

There turn out to be 185 such words (out of 187,992 words in whitelist0).
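
For readers who want the idea without reading filterdict.py, here is a rough sketch of this kind of filter. It assumes each whitelist0 line has the form headword:DICT1,DICT2; the function name is hypothetical, the sample words are placeholders, and the real script may work differently:

```python
def only_in(records, *wanted):
    """Yield headwords whose dictionary set is exactly the wanted codes."""
    target = {code.upper() for code in wanted}
    for line in records:
        headword, _, dicts = line.strip().partition(":")
        # Exact set equality: a word that also occurs elsewhere is skipped.
        if set(dicts.split(",")) == target:
            yield headword


# Placeholder records, not real headwords.
sample = ["wordA:WIL,YAT", "wordB:MW,WIL,YAT", "wordC:WIL"]
print(list(only_in(sample, "wil", "yat")))  # ['wordA']
```

The comparison is on the whole set of dictionaries, so a word found in WIL, YAT and a third dictionary is not reported.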

@funderburkjim
Contributor Author

Why is s. bolded in s.u. [see under = siehe unter], but u. is not?

Here's the reason:

  • In pw.txt, the entry is coded as:

    <H1>100{plIyA}1{plIyA/}¦ •f. •»s.u. #{plI/TA}. PW75184
    

    Note the two instances of the special symbol: •

  • The program which converts pw.txt to pw.xml interprets that special symbol as the beginning of
    grammatical information; and, further, it makes some guess as to the scope of this grammatical information. This results in the following coding of this record in pw.xml:

    <gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
    
  • Finally, the display program (web/webtc/disp.php) marks the text within the <gram> element as
    being in an html <span class='gram'> element, and css renders the gram class as bold.

So, that explains what is going on.

There are 1165 instances in pw.txt of •»s.u. , all of which are presumably rendered as just described.
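
The scope guess can be imitated with a small sketch. This is not the actual pw.txt to pw.xml converter: the two regular expressions and the function name are assumptions, chosen so that the grammatical span runs from the special symbol to the first period, which is enough to reproduce the coding shown above:

```python
import re

# The special symbol, an optional », then letters up to the first period.
GRAM = re.compile(r"•(»?)([A-Za-z]+)\.")
# Sanskrit text coded as #{...} in pw.txt becomes <s>...</s>.
SANSKRIT = re.compile(r"#\{([^}]*)\}")


def convert(text):
    """Wrap the guessed grammatical span in <gram> and Sanskrit text in <s>."""
    text = GRAM.sub(
        lambda m: f'{m.group(1)}<gram n="{m.group(2)}">{m.group(2)}.</gram>',
        text,
    )
    return SANSKRIT.sub(r"<s>\1</s>", text)


body = "•f. •»s.u. #{plI/TA}."
print(convert(body))
# <gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
```

Because the guessed span stops at the first period, only s. lands inside the gram element and u. is left outside, hence the half-bold display.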

@funderburkjim
Contributor Author

Let's continue the discussion of markup •»s.u. in this issue under the PWK repository.

@gasyoun
Member

gasyoun commented Sep 12, 2016

If we excluded the paired words, graylist would be INCREASED, not decreased.

Indeed, but that would make the logic usable. Right now we exclude from the graylist words that are still fishy. A word that YAT took from WIL does not become less fishy. So these pairs:

  • SKD, VCP
  • YAT, WIL
  • PW, PWG
  • MW72, MW

should be counted as one dictionary for whitelisting purposes, or at least marked.
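
A sketch of what "counted as one" could look like, assuming rule 0 is simply "two or more independent sources". The names are hypothetical and this is a proposal, not what the whitelist program currently does:

```python
PAIRS = [("SKD", "VCP"), ("YAT", "WIL"), ("PW", "PWG"), ("MW72", "MW")]

# Map each paired dictionary code to a shared group label.
GROUP = {code: "+".join(pair) for pair in PAIRS for code in pair}


def independent_sources(dicts):
    """Collapse paired dictionaries into one source each."""
    return {GROUP.get(code, code) for code in dicts}


def passes_rule0(dicts):
    """Rule 0 with pairs counted once: two or more independent sources."""
    return len(independent_sources(dicts)) >= 2


print(passes_rule0({"WIL", "YAT"}))        # False: one pair, one source
print(passes_rule0({"WIL", "YAT", "MW"}))  # True: the pair plus MW
print(passes_rule0({"BHS", "MW"}))         # True: unrelated dictionaries
```

To mark the words instead of excluding them, the same grouping could be used to attach a flag to each whitelist0 entry that has only one independent source.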

@gasyoun
Member

gasyoun commented Sep 13, 2016

What I really lack is sample words for each case. Without them, some categories look identical to me, or I do not understand at all what should go there. Maybe stating the rules would help?

187992 headwords coded as 0: In two or more dictionaries

aMSagaRa:IEG,PD
paryantIkfta:BHS,MW
paryavadApayitar:BHS,SCH

6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'

paryavadAtaSrutatA:MW
paryavasA:STC
SatrUccAwanakriyA:ACC

618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf
Should we make it not only kar<->kf, but ar<->f?

57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta

810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted

vitaritar   PW
vituzI  PW
viduzIbruvA PW
viDanI  PW
viDurI  PW
vinAyikI    PW
vinikze PW
vinigaqI    PW
vinDyAy PW
vipaYcay    PW
vipaTay PW
vipay   PW

I would love to tag the graylisted ones. In vipay I see a prefix; it's a verb. Where should it go? Without samples I do not understand the details of the above classification.
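
On the kar<->kf question, a sketch of how the wider ar<->f alternation could be tried. This assumes rule 5a accepts a headword when swapping one occurrence of the two strings gives a known headword; that reading, the function names, and the assumption that vitaritf is a known headword are all mine:

```python
def alternates(headword, a="kar", b="kf"):
    """Yield spellings made by swapping one occurrence of a for b, or b for a."""
    for old, new in ((a, b), (b, a)):
        start = headword.find(old)
        while start != -1:
            yield headword[:start] + new + headword[start + len(old):]
            start = headword.find(old, start + 1)


def rule_alternation(headword, known, a="kar", b="kf"):
    """True if some single-swap alternate of headword is a known headword."""
    return any(alt in known for alt in alternates(headword, a, b))


known = {"vitaritf"}  # assumed known, for illustration only
print(rule_alternation("vitaritar", known))             # False with kar<->kf
print(rule_alternation("vitaritar", known, "ar", "f"))  # True with ar<->f
```

The wider rule would probably whitelist many agent nouns in -tar, but it would also match ar and f in the middle of words, so it may let more real errors through.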

@funderburkjim
Contributor Author

What I really lack is sample words for each case.

For each of the categories in that summary, there is a corresponding file of examples.

For instance, '11095 headwords coded as 3b: suffix of known word'. The corresponding file is

output/all/whitelist3b.txt

@funderburkjim
Contributor Author

Here are 4 files, generated using the 'tool' mentioned above, that contain the words that appear ONLY
in the particular pairs of dictionaries. This is done in response to above requests. I hope someone
learns something useful from these.

@gasyoun
Member

gasyoun commented Sep 13, 2016

someone learns something useful from these

I guess @drdhaval2785 would agree: I would exclude these words from the whitelist, because the two dictionaries in each pair are almost as one.

@funderburkjim
Contributor Author

Are there headwords misspelled in both Wilson and Yates? That should be the focus of attention, it seems to me. Finding such misspellings would be something useful.

Similarly for the other paired dictionaries.
