whitelist #5

funderburkjim · 2016-09-11T22:38:45Z

The whitelist directory contains work aimed at identifying headwords of the various Sanskrit dictionaries that may have spelling errors.

The underlying set of headwords is hwnorm1c.txt, which currently has 385,011 headwords.

The idea of the whitelist approach is to identify words which, on the basis of rules, are probably NOT misspelled. For instance, one such rule for a given spelling is that the word with that spelling appears as a headword in two or more dictionaries. All words satisfying a particular such rule are put into a whitelist file (whitelist0.txt for the rule just described). The latest batches of these whitelist files are in the output/all directory. There are currently 24 such whitelist files.
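As a rough illustration of the "two or more dictionaries" rule, here is a minimal Python sketch. It is not the code used in this repository: the function names and the (headword, dictionary) input shape are assumptions made only for this example. The sample headwords are taken from elsewhere in this thread.

```python
from collections import defaultdict

def dictionaries_by_headword(records):
    """Map each headword spelling to the set of dictionaries that list it.

    `records` is an iterable of (headword, dictionary_code) pairs; this
    input shape is an assumption of the sketch, not the hwnorm1c.txt format.
    """
    found = defaultdict(set)
    for headword, dictionary in records:
        found[headword].add(dictionary)
    return found

def rule0(found):
    """Return the spellings that occur in two or more dictionaries."""
    return sorted(hw for hw, dicts in found.items() if len(dicts) >= 2)

records = [("aMSagaRa", "IEG"), ("aMSagaRa", "PD"), ("plIyA", "PW")]
found = dictionaries_by_headword(records)
whitelist0 = rule0(found)
# Whatever no rule accepts stays on the graylist.
graylist = sorted(set(found) - set(whitelist0))
print(whitelist0, graylist)
```

With only this one rule in play, every spelling it does not accept lands on the graylist; each further rule would remove more words from that remainder.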

Then, the headwords whose spellings have as yet no rule to justify the correctness of their spelling are gathered into a graylist.txt file.

Currently, there are 21818 graylisted headwords.

According to the logic of this whitelist approach, the graylisted words are the most fertile ground for remaining headword spelling errors.

Of course, many of the graylisted words are surely spelled correctly. But, as of yet, we don't have any programmatic (or other) way to distinguish these as correctly spelled.

I hope others will examine these lists, especially the graylist, with an eye to:

developing additional rules that would whitelist chunks of these words
identifying remaining errors by hand, perhaps by focusing on the graylisted words in a particular dictionary.

funderburkjim · 2016-09-11T22:42:08Z

The latest run of the whitelist program shows these statistics regarding the number of cases whitelisted
by the various rules:

$ sh redo.sh
Recreating auxiliary/special.txt
regenerating graylist.txt and all whitelistX.txt files
385011 records from ../hwnorm1c.txt
187992 headwords coded as 0: In two or more dictionaries
  6357 headwords coded as 1: key1=X+am and X+a is found
  1773 headwords coded as 2: SKD nouns shown in nominative singular
 19238 headwords coded as 3a: prefix of known word
 11095 headwords coded as 3b: suffix of known word
  8823 headwords coded as 0a: special words (icf, foreign, etc.)
  2589 headwords coded as 4a: probable f. nouns ending in 'A'
   618 headwords coded as 4b: inflected form
 74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
  8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
   698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
  9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
  4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
 17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
  3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
    16 headwords coded as 5a: kar<->kf
    57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta
   810 headwords coded as 3bcpd: Compound word + suffix
   629 headwords coded as 3acpd: prefix + Compound word
  1103 headwords coded as cpdsandhi1: Compound word with sandhi
   819 headwords coded as 3a1: prefix of known compound
    95 headwords coded as cpdsandhi2: Compound word with sandhi
  2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
 21818 headwords coded as gray: Not yet whitelisted
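The counts in this listing add up to exactly 385,011, so each headword receives exactly one code. One way to get that behaviour is a cascade in which the first matching rule wins. The Python sketch below shows the idea for rule 0 and rule 1 only. It is an assumption about how such a cascade could look, not the actual whitelist program: the function name, the rule order, the reading of "X+a is found" as "X+a is itself a headword", and the three input words are all invented for illustration.

```python
from collections import Counter

def classify(headword, dicts_by_headword):
    """Return the code of the first rule that accepts the headword."""
    # Rule 0: the spelling occurs in two or more dictionaries.
    if len(dicts_by_headword[headword]) >= 2:
        return "0"
    # Rule 1: key1 = X + "am", and X + "a" is also a headword (assumed reading).
    if headword.endswith("am") and headword[:-1] in dicts_by_headword:
        return "1"
    # No rule applies: the word stays on the graylist.
    return "gray"

# Invented input: headword -> set of dictionaries listing it.
headwords = {"deva": {"MW", "PW"}, "devam": {"PW"}, "plIyA": {"PW"}}
codes = {hw: classify(hw, headwords) for hw in headwords}
counts = Counter(codes.values())
for code, n in counts.items():
    print(f"{n:6d} headwords coded as {code}")
```

Because every headword falls through to "gray" if nothing else matches, the per-code counts always sum to the number of input records, as they do in the run above.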

gasyoun · 2016-09-12T10:49:06Z

Wow!

If we exclude PD for now, it's only 5547 words. I hoped it would be more.
1326 are from ACC; only someone who knows the valid names of manuscripts can approve those.

I have checked a few

paSU    PW
plIyA   PW
buDi    PW
BAgI    PW
mUrKI   PW

Why plIyA and not pliyA? Is this an SLP1 converter issue? (PalIkartavE has the same I issue, but livI is OK.)

Why, in s.u. [see under = siehe unter], is s. bolded but u. is not? That makes no sense; let's check whether it is the same in other entries.

drdhaval2785 · 2016-09-12T11:19:05Z

My views regarding a few of the observations on whitelist approach

whitelist0 - We should try to ignore the historically similar dictionary pairs from this approach. If SKD and VCP show the same word / YAT and WIL show the same word / PW and PWG show the same word, but no other dictionary shows the same word, they should not be put in whitelist0. They tend to repeat the same mistakes as their predecessors.

gasyoun · 2016-09-12T16:58:10Z

try to ignore the historically similar dictionary pairs from this approach

Yes, that was my main concern. Otherwise it's no real whitelisting, only a ghost list.

funderburkjim · 2016-09-12T19:55:29Z

The 'paired dictionary' observation makes sense as a good refinement.

The graylist file is still a good candidate error list. If we excluded the paired words, graylist would be
INCREASED, not decreased.

There is a tool which may be used to examine paired-dictionary lists.

For example, to generate a list of all headwords which appear ONLY in WIL and YAT,

# change to the whitelist directory
python filterdict.py output/all/whitelist0.txt old/wilyat.txt wil yat
#The output is old/wilyat.txt.  I put the output in the 'old' subdirectory of whitelist, since
# files in that directory are excluded, due to the way .gitignore for hwnorm1 repository is set up.

There turn out to be 185 such words (out of 187,992 words in whitelist0).
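For readers who want to see the filtering idea without opening filterdict.py, here is a small Python sketch. It is not the code of filterdict.py. It assumes that each whitelist0 line has the form `headword:DICT1,DICT2,...`, as in the sample lines quoted later in this thread; the function name `only_in` is made up for the example.

```python
def only_in(lines, *wanted):
    """Return headwords whose dictionary list is exactly the wanted set.

    Each line is assumed to look like "headword:DICT1,DICT2".
    """
    target = {code.upper() for code in wanted}
    selected = []
    for line in lines:
        headword, _, dict_field = line.strip().partition(":")
        if set(dict_field.split(",")) == target:
            selected.append(headword)
    return selected

sample = [
    "aMSagaRa:IEG,PD",
    "paryantIkfta:BHS,MW",
    "paryavadApayitar:BHS,SCH",
]
print(only_in(sample, "bhs", "mw"))
```

Comparing whole sets rather than testing membership is what makes this an "ONLY in these dictionaries" filter: a word that also appears in a third dictionary is not selected.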

funderburkjim · 2016-09-12T20:24:47Z

Why, in s.u. [see under = siehe unter], is s. bolded but u. is not?

Here's the reason:

In pw.txt, the entry is coded as:
```
<H1>100{plIyA}1{plIyA/}¦ •f. •»s.u. #{plI/TA}. PW75184
```
Note the two instances of the special symbol: •
The program which converts pw.txt to pw.xml interprets that special symbol as the beginning of
grammatical information; and, further, it makes some guess as to the scope of this grammatical information. This results in the following coding of this record in pw.xml:
```
<gram n="f">f.</gram> »<gram n="s">s.</gram>u. <s>plI/TA</s>.
```
Finally, the display program (web/webtc/disp.php) marks the text within the <gram> element as
being in an html <span class='gram'> element, and css renders the gram class as bold.

So, that explains what is going on.
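To make the effect concrete, the Python sketch below reproduces the `<gram>` part of this one record. The converter's real scope-guessing logic is not shown in this thread, so the regular expression here is only a guess that happens to give the same result for this example: it assumes the scope is a run of letters up to the first period, with a leading » kept outside the element. It does not attempt the `#{...}` to `<s>` conversion.

```python
import re

# Assumed scope rule: the marker, an optional », then letters up to the first period.
GRAM = re.compile(r"•(»?)([A-Za-z]+)\.")

def mark_gram(text):
    """Wrap the guessed grammatical abbreviation in a <gram> element."""
    return GRAM.sub(r'\1<gram n="\2">\2.</gram>', text)

source = "•f. •»s.u. #{plI/TA}."
marked = mark_gram(source)
print(marked)
```

Under this assumed rule the scope of the second marker ends at the first period, so only `s.` lands inside `<gram>` and `u.` is left outside, which matches the bold `s.` and plain `u.` seen in the display.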

There are 1165 instances in pw.txt of •»s.u. , all of which are presumably rendered as just described.

funderburkjim · 2016-09-12T20:48:52Z

Let's continue the discussion of markup •»s.u. in this issue under the PWK repository.

gasyoun · 2016-09-12T22:59:40Z

If we excluded the paired words, graylist would be INCREASED, not decreased.

Indeed, but that would make the logic usable. Right now we exclude from the graylist words that are still fishy. A word that YAT took from WIL does not become less fishy. So these pairs:

SKD, VCP
YAT, WIL
PW, PWG
MW72, MW

should each be counted as one dictionary for whitelisting purposes, or at least be marked.
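A possible shape for this refinement, sketched in Python: collapse each historically related pair into one source before counting. This is only an illustration of the proposal, not code from the whitelist program; the names `PAIRS`, `FAMILY`, `independent_sources` and `passes_rule0` are invented for the example.

```python
# The four pairs listed above; each pair is treated as a single source.
PAIRS = [("SKD", "VCP"), ("YAT", "WIL"), ("PW", "PWG"), ("MW72", "MW")]

# Map every member of a pair to one representative code.
FAMILY = {code: pair[0] for pair in PAIRS for code in pair}

def independent_sources(dicts):
    """Collapse paired dictionaries so that each pair counts once."""
    return {FAMILY.get(code, code) for code in dicts}

def passes_rule0(dicts):
    """Rule 0 with pairs merged: two or more independent sources."""
    return len(independent_sources(dicts)) >= 2

print(passes_rule0({"WIL", "YAT"}))       # pair only
print(passes_rule0({"BHS", "MW"}))        # two unrelated dictionaries
print(passes_rule0({"MW72", "MW", "PW"})) # pair plus one other
```

Under this variant a word found only in WIL and YAT would drop out of whitelist0 and, unless another rule accepts it, move to the graylist.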

gasyoun · 2016-09-13T18:16:29Z

What I really lack is sample words for each case. Without them, some of the rules look identical to me, or I do not understand at all what should go there. Maybe a statement of the rules would help?

187992 headwords coded as 0: In two or more dictionaries

aMSagaRa:IEG,PD
paryantIkfta:BHS,MW
paryavadApayitar:BHS,SCH

6357 headwords coded as 1: key1=X+am and X+a is found
1773 headwords coded as 2: SKD nouns shown in nominative singular
19238 headwords coded as 3a: prefix of known word
11095 headwords coded as 3b: suffix of known word
8823 headwords coded as 0a: special words (icf, foreign, etc.)
2589 headwords coded as 4a: probable f. nouns ending in 'A'

paryavadAtaSrutatA:MW
paryavasA:STC
SatrUccAwanakriyA:ACC

618 headwords coded as 4b: inflected form
74009 headwords coded as cpd1: simple compound of 2 parts, first ending in 'aiufeoAIOxs'
8749 headwords coded as cpd1a: simple compound of 2 parts, first ending at 'A,I'
698 headwords coded as cpd1b: simple compound of 2 parts, first ending at 't'
9712 headwords coded as cpd2: simple compound of 3 or more parts, each part ending in 'aiufeo'
4441 headwords coded as cpd2a: simple compound of 3 or more parts, each part ending in 'aiufeoAI'
17696 headwords coded as cpdsrs1: Simple compound with vowel sandhi at 'AIUoeEOvy'
3319 headwords coded as cpdsrs1a: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
16 headwords coded as 5a: kar<->kf
Should we make it not only kar<->kf, but ar<->f?

57 headwords coded as 5b: Ikf, IBU, Ikfta, IBUta

810 headwords coded as 3bcpd: Compound word + suffix
629 headwords coded as 3acpd: prefix + Compound word
1103 headwords coded as cpdsandhi1: Compound word with sandhi
819 headwords coded as 3a1: prefix of known compound
95 headwords coded as cpdsandhi2: Compound word with sandhi
2555 headwords coded as cpdsrs1b: non-Simple compound with vowel sandhi at 'AIUoeEOvy'
21818 headwords coded as gray: Not yet whitelisted

vitaritar   PW
vituzI  PW
viduzIbruvA PW
viDanI  PW
viDurI  PW
vinAyikI    PW
vinikze PW
vinigaqI    PW
vinDyAy PW
vipaYcay    PW
vipaTay PW
vipay   PW

I would love to tag the graylisted ones. In vipay I see a prefix; it's a verb. Where should it go? Without samples I do not understand the details of the above classification.

funderburkjim · 2016-09-13T19:08:02Z

What I really lack is sample words for each case.

For each of the categories in that summary, there is a corresponding file of examples.

For instance '11095 headwords coded as 3b: suffix of known word' . The corresponding file is

output/all/whitelist3b.txt
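To pull a few sample words from every whitelist file at once, something like the following Python sketch could be used. It is a hypothetical helper, not part of the repository; it only assumes that the files are plain text with one entry per line and are named whitelist*.txt in the output/all directory.

```python
from itertools import islice
from pathlib import Path

def sample_words(directory, count=3):
    """Return the first `count` lines of each whitelist*.txt file."""
    samples = {}
    for path in sorted(Path(directory).glob("whitelist*.txt")):
        with path.open(encoding="utf-8") as handle:
            samples[path.name] = [line.rstrip("\n") for line in islice(handle, count)]
    return samples

if __name__ == "__main__":
    # Run from the whitelist directory; prints nothing if no files are found.
    for name, words in sample_words("output/all").items():
        print(name, words)
```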

funderburkjim · 2016-09-13T19:29:31Z

Here are 4 files, generated using the 'tool' mentioned above, that contain the words that appear ONLY
in the particular pairs of dictionaries. This is done in response to above requests. I hope someone
learns something useful from these.

185 lines from output/all/whitelist0.txt written to output/all/wil_yat.txt
310 lines from output/all/whitelist0.txt written to output/all/skd_vcp.txt
2752 lines from output/all/whitelist0.txt written to output/all/pw_pwg.txt
2417 lines from output/all/whitelist0.txt written to output/all/mw72_mw.txt

gasyoun · 2016-09-13T20:30:05Z

Regarding "someone learns something useful from these": I guess @drdhaval2785 would agree that these pair-only words should be excluded from the whitelist, because the two dictionaries in each pair are almost as one.

funderburkjim · 2016-09-13T20:37:26Z

Are there headwords misspelled in both Wilson and Yates? That should be the focus of attention, it seems to me. Finding such misspellings would be something useful.

Similarly for the other paired dictionaries.

funderburkjim mentioned this issue Sep 12, 2016

•»s.u. markup question sanskrit-lexicon/PWK#66

Open

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open
