Normalization steps #3
Comments
@gasyoun
But 'M' and 'H' removal is a bit tricky. Do you remember any tricky examples?
https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization/examine4.txt is full of tricky examples. Right now I am keeping them in the examine.txt file. Whatever is found to be OK goes on to the next step.
6k lines is too much. Even a list of 100 feminine words wrongly tagged in MW would take me a month to work through. Where do you state the algorithm? Why
These are the words that are not found in sanhw1.txt after removal of terminal 'm' and 'H'.
The count is now 2456. Not much further improvement can be expected from a computer algorithm; the rest has to be done manually.
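A minimal sketch of the check described above, for anyone following along: strip a terminal 'm' or 'H' from each word and keep the words whose stripped form is still missing from the headword list. This is not the code in hwnorm1.py. The helper names `strip_terminal` and `unmatched` are hypothetical, the toy data is invented, and it assumes the headwords from sanhw1.txt have already been loaded into a Python collection, since the file layout is not described here.

```python
def strip_terminal(word, endings=("m", "H")):
    """Drop one trailing ending from `word` if present (hypothetical helper)."""
    for ending in endings:
        # Leave the word alone if stripping would empty it.
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word


def unmatched(words, known_headwords):
    """Return the words whose stripped form is absent from `known_headwords`."""
    known = set(known_headwords)
    return [w for w in words if strip_terminal(w) not in known]


# Toy data, invented for illustration; not taken from sanhw1.txt.
known = {"deva", "Pala"}
candidates = ["devaH", "Palam", "xyzH"]
print(unmatched(candidates, known))  # ['xyzH']
```

Whatever comes out of such a filter is the list that still needs manual examination.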
So these are the possible additions to the European-style dictionaries, understood. 2.5k is far better. Only a few years away. Have you found at least one wrong entry this way? I guess it's one of the hardest methods.
Right now the output is placed in the normalization subdirectory.
The responsible code is the function countlen() in hwnorm1.py.
Let me document the steps.
Four difference files are generated in the process.
I hope someone will cursorily examine the files and decide whether we are on the right track or not.