whitelist #5
Comments
The latest run of the whitelist program shows these statistics regarding the number of cases whitelisted
|
Wow! If we exclude I have checked a few
Why, in s.u. [see under = siehe unter], is s. bolded but u. is not? That makes no sense; let's check whether it is the same in other entries. |
My views regarding a few of the observations on whitelist approach
|
|
The 'paired dictionary' observation makes sense as a good refinement. The graylist file is still a good candidate error list. If we excluded the paired words, graylist would be There is a tool which may be used to examine paired-dictionary lists. For example, to generate a list of all headwords which appear ONLY in WIL and YAT,
There turn out to be 185 such words (out of 187,992 words in whitelist0). |
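To make the "ONLY in WIL and YAT" selection concrete, here is a minimal sketch of that kind of filter. It is not the tool mentioned above; the function name `headwords_only_in` and the in-memory mapping from headword to dictionary codes are assumptions made purely for illustration.

```python
def headwords_only_in(dicts_by_headword, pair):
    """Return headwords whose set of dictionaries is exactly `pair`.

    `dicts_by_headword` maps a headword to the dictionary codes it occurs in
    (a hypothetical in-memory structure, not the project's file format).
    """
    target = frozenset(pair)
    return sorted(
        hw for hw, codes in dicts_by_headword.items()
        if frozenset(codes) == target
    )


# Made-up sample data, for illustration only.
sample = {
    "alpha": {"WIL", "YAT"},        # only in the pair: selected
    "beta": {"WIL", "YAT", "MW"},   # also in a third dictionary: skipped
    "gamma": {"WIL"},               # in just one of the pair: skipped
}
only_wil_yat = headwords_only_in(sample, ("WIL", "YAT"))
print(only_wil_yat)
```

The exact-match test on the set of codes is what distinguishes "only in WIL and YAT" from "in WIL and YAT among others".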
Here's the reason:
So, that explains what is going on. There are 1165 instances in pw.txt of |
Let's continue the discussion of markup •»s.u. in this issue under the PWK repository. |
Indeed, but that would make the logic usable. Right now we exclude from the greylist words that are still fishy. A word that
Should be counted as one for whitelisting needs or at least marked. |
What I really lack is sample words for each case. Otherwise some seem equal or I do not understand at all what should go there. Maybe the statement rules would help?
187992 headwords coded as 0: In two or more dictionaries
6357 headwords coded as 1: key1=X+am and X+a is found
618 headwords coded as 4b: inflected form
810 headwords coded as 3bcpd: Compound word + suffix
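As one reading of the rule "key1=X+am and X+a is found", the following sketch may help. It assumes that "found" means X+a is already among the known (for example, whitelisted) headwords; the function name `am_rule` and the sample words are hypothetical, and the actual program may define the rule differently.

```python
def am_rule(candidates, known):
    """Return candidates of the form X+'am' whose X+'a' form is in `known`.

    Assumption: 'found' means present in `known`; the real program may
    apply a different test.
    """
    return {
        hw for hw in candidates
        if len(hw) > 2 and hw.endswith("am") and hw[:-1] in known
    }


# Made-up sample data, for illustration only.
known = {"deva"}
candidates = ["devam", "Palam", "kim"]
coded_1 = am_rule(candidates, known)
print(sorted(coded_1))
```

Dropping the final "m" from X+am yields X+a, which is why a single slice is enough for the lookup.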
I would love to tag the greylisted ones. In |
For each of the categories in that summary, there is a corresponding file of examples. For instance '11095 headwords coded as 3b: suffix of known word' . The corresponding file is |
Here are 4 files, generated using the 'tool' mentioned above, that contain the words that appear ONLY
|
I guess @drdhaval2785 would agree, I would exclude |
Are there headwords misspelled in both Wilson and Yates? That, it seems to me, should be the focus of attention, since finding such misspellings would be useful. The same applies to the other paired dictionaries. |
The whitelist directory contains work aimed at identifying headwords of the various Sanskrit dictionaries that may have spelling errors.
The underlying set of headwords is hwnorm1c.txt, which currently has 385,011 headwords.
The idea of the whitelist approach is to identify words which, on the basis of rules, are probably NOT misspelled. For instance, one such rule for a given spelling is that the word with that spelling appears as a headword in two or more dictionaries. All words satisfying a particular such rule are put into a whitelist file (whitelist0.txt for the rule just described). The latest batches of these whitelist files are in the output/all directory. There are currently 24 such whitelist files.
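As an illustration of the "two or more dictionaries" rule, here is a minimal sketch. The input shape, (headword, dictionary code) pairs, is an assumption; the real layout of hwnorm1c.txt is not described here, and `build_whitelist0` is a hypothetical name rather than the program's actual function.

```python
from collections import defaultdict


def build_whitelist0(entries):
    """Return headwords that occur in two or more distinct dictionaries.

    `entries` is an iterable of (headword, dictionary_code) pairs, an
    assumed input shape chosen for this sketch.
    """
    dicts_by_headword = defaultdict(set)
    for headword, dict_code in entries:
        dicts_by_headword[headword].add(dict_code)
    return {hw for hw, codes in dicts_by_headword.items() if len(codes) >= 2}


# Made-up sample data, for illustration only.
sample_entries = [
    ("deva", "MW"), ("deva", "PW"),   # two dictionaries: whitelisted
    ("agni", "MW"), ("agni", "MW"),   # same dictionary twice: not whitelisted
    ("devva", "WIL"),                 # single dictionary: not whitelisted
]
whitelist0 = build_whitelist0(sample_entries)
print(sorted(whitelist0))
```

Collecting dictionary codes in a set means that repeated entries within one dictionary do not count as independent confirmation of a spelling.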
Then, the headwords whose spellings have as yet no rule to justify the correctness of their spelling are gathered into a graylist.txt file.
Currently, there are 21818 graylisted headwords.
According to the logic of this whitelist approach, the graylisted words are the most fertile ground for remaining headword spelling errors.
Of course, many of the graylisted words are surely spelled correctly. But, as of yet, we don't have any programmatic (or other) way to distinguish these as correctly spelled.
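The relationship between the whitelist files and the graylist can be sketched as a set difference. This is only an illustration under assumed in-memory inputs; `build_graylist` is a hypothetical name, not the function used to produce graylist.txt.

```python
def build_graylist(all_headwords, whitelists):
    """Return headwords that no whitelist rule has justified, sorted.

    `whitelists` is a list of collections of headwords, one per rule
    (an assumed in-memory stand-in for the whitelist files).
    """
    justified = set().union(*whitelists)
    return sorted(set(all_headwords) - justified)


# Made-up sample data, for illustration only.
all_headwords = ["deva", "devam", "agni", "agnni"]
whitelists = [{"deva", "agni"}, {"devam"}]
graylist = build_graylist(all_headwords, whitelists)
print(graylist)
```

Under this view, adding a new rule can only shrink the graylist, which is why each extra whitelist narrows the set of candidates for manual review.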
I hope others will examine these lists, especially the graylist, with an eye to: