Normalization steps #3

Open
drdhaval2785 opened this issue Nov 24, 2015 · 7 comments

Comments

@drdhaval2785
Contributor

Right now the output is placed in the normalization subdirectory.

The responsible code is the function countlen() in hwnorm1.py.

Let me document the steps.

  1. hw1.txt - headwords of sanhw1.txt sorted alphabetically (Python string order, not Sanskrit alphabetical order).
  2. hw2.txt - hw1.txt after normalization of anusvAra ([NYRnm][consonant] -> M[consonant]. Also terminal 'M' converted to 'm')
  3. hw3.txt - hw2.txt after normalization of duplication ( r[consonant][consonant] -> r[consonant] conversion).
  4. hw4.txt - hw3.txt after normalization for 'ant' at end.
  5. hw5.txt - hw4.txt after normalization of terminal 'm' and 'H' ( [aA][mH]$ -> [aA]$ )
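
The anusvAra, duplication and terminal 'm'/'H' steps above can be sketched with regular expressions. This is only a minimal illustration of the rules as written in the list, not the actual code of countlen() in hwnorm1.py. The function names and the SLP1 consonant class are assumptions, and the 'ant' step is left out because its exact rule is not spelled out here.

```python
import re

# Assumed SLP1 consonant set; hwnorm1.py may use a narrower class.
CONS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"

def norm_anusvara(word):
    """hw1 -> hw2: nasal before a consonant becomes M; terminal M becomes m."""
    word = re.sub("[NYRnm](?=[" + CONS + "])", "M", word)
    return re.sub("M$", "m", word)

def norm_duplication(word):
    """hw2 -> hw3: r + doubled consonant becomes r + single consonant."""
    return re.sub("r([" + CONS + r"])\1", r"r\1", word)

def norm_terminal(word):
    """hw4 -> hw5: drop terminal m or H after a or A."""
    return re.sub("([aA])[mH]$", r"\1", word)
```

Applied literally, the nasal rule also fires before semivowels and other nasals, so the consonant class used in practice may well be narrower than the one assumed here.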

There are four difference files generated in the process.

  1. hw1minushw2.txt - hw1.txt entries not found in hw2.txt
  2. hw2minushw3.txt - hw2.txt entries not found in hw3.txt
  3. hw3minushw4.txt - hw3.txt entries not found in hw4.txt
  4. hw4minushw5.txt - hw4.txt entries not found in hw5.txt
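
To illustrate what a "minus" file contains, here is a small sketch. The helper name `minus` and the sample words are hypothetical, and the real files are presumably written by hwnorm1.py in its own way.

```python
def minus(before, after):
    """Return entries of `before` that do not occur in `after`, keeping order."""
    after_set = set(after)
    return [word for word in before if word not in after_set]

# Hypothetical sample: 'rAmaH' was normalized to 'rAma', so it vanishes
# from the later list and shows up in the difference.
hw4 = ["agniH", "rAma", "rAmaH"]
hw5 = ["agniH", "rAma"]
removed = minus(hw4, hw5)
```

Each difference file therefore lists exactly the spellings that a given step changed, which is what needs to be reviewed.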

I hope someone can skim through the files and decide whether we are on the right track.

@gasyoun
Member

gasyoun commented Nov 24, 2015

hw1minushw2.txt - hw1.txt entries not found in hw2.txt - what do you expect it to give? Please illustrate, can't grasp. Brain too weak. I wonder how many hw4.txt words were added after you killed terminal 'm' and 'H', maybe a list with links (with terminal 'm' and 'H' added again for the links to work) would help check the original thesis?

@drdhaval2785
Contributor Author

@gasyoun
These minus files are meant for checking that no undeserving candidate gets normalized away in the process.
I don't see much of an issue in hw1minushw2 and hw2minushw3.
But 'M' and 'H' removal is a bit tricky.
Therefore that list (hw3minushw4.txt) needs to be checked properly.

@gasyoun
Member

gasyoun commented Dec 1, 2015

> But 'M' and 'H' removal is a bit tricky.

Do you remember some tricky example?

@drdhaval2785
Contributor Author

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization/examine4.txt

Full of tricky examples.

Right now I am keeping them in the examine.txt file. Whatever is found OK goes to the next step.
Otherwise it stops here.

@gasyoun
Member

gasyoun commented Dec 1, 2015

6k lines is too much. Even a list of 100 feminine words wrongly tagged in MW would take me a month to work through. Where do you state the algo? Why is yogyatAjYAnasyaSabdaMpratikAraRatAvicAraH fishy? Because of the aH? Are these words included, or should they not be? Please add details.

@drdhaval2785
Contributor Author

> Where do you state the algo?

The file lists the words which, after removal of terminal 'm' and 'H', are not found in sanhw1.txt.
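
A minimal sketch of that check, assuming the headwords are held in a plain list. The function name `needs_examination` and the sample words are hypothetical, and the actual script may differ in detail.

```python
import re

# Terminal m or H after a or A, as in the hw4 -> hw5 step.
TERMINAL = re.compile(r"([aA])[mH]$")

def needs_examination(headwords):
    """Flag words whose stripped form is not itself a known headword."""
    known = set(headwords)
    flagged = []
    for word in headwords:
        stripped = TERMINAL.sub(r"\1", word)
        if stripped != word and stripped not in known:
            flagged.append(word)
    return flagged

# Hypothetical sample: 'rAma' exists, so 'rAmaH' is safe to normalize;
# 'vicAra' does not exist in this list, so 'vicAraH' gets flagged.
sample = ["rAma", "rAmaH", "vicAraH"]
flagged = needs_examination(sample)
```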

> 6k lines is too much.

Now it is down to 2456 lines. I don't expect much further improvement from a computer algorithm; the rest has to be done manually.
We are not in a hurry.
At least we should not normalize words which don't deserve to be normalized.

@gasyoun
Member

gasyoun commented Dec 2, 2015

So these are the possible additions to the European style dictionaries, understood. 2.5k is far better. Only a few years away. Have you found at least one wrong entry this way? I guess it's one of the hardest methods.
