Normalization steps #3

drdhaval2785 · 2015-11-24T10:41:21Z

Right now the output is placed in the normalization subdirectory.

The responsible code is the function countlen() in hwnorm1.py.

Let me document the steps.

hw1.txt - headwords of sanhw1.txt sorted alphabetically (Python sort order, not Sanskrit alphabetical order).
hw2.txt - hw1.txt after normalization of anusvAra ([NYRnm][consonant] -> M[consonant]. Also terminal 'M' converted to 'm')
hw3.txt - hw2.txt after normalization of duplication ( r[consonant][consonant] -> r[consonant] conversion).
hw4.txt - hw3.txt after normalization for 'ant' at end.
hw5.txt - hw4.txt after normalization of terminal 'm' and 'H' ( [aA][mH]$ -> [aA]$ )
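
For reviewers who prefer code to prose, here is a rough Python sketch of steps 2, 3 and 5 as described above. It is not the code of countlen() in hwnorm1.py. The function names, the SLP1 consonant class, and the reading of "duplication" as an identical doubled consonant after 'r' are all assumptions made for illustration. The 'ant' step is left out because the replacement is not stated here.

```python
import re

# Assumption: headwords are in SLP1 and this is the consonant class.
# hwnorm1.py may use a different class.
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def normalize_anusvara(word):
    """hw1 -> hw2: nasal before a consonant becomes M, terminal M becomes m."""
    word = re.sub("[NYRnm](?=[" + CONSONANTS + "])", "M", word)
    return re.sub("M$", "m", word)


def normalize_duplication(word):
    """hw2 -> hw3: r + doubled consonant becomes r + single consonant.

    Assumption: only an identical repeated consonant counts as duplication.
    """
    return re.sub("r([" + CONSONANTS + "])\\1", r"r\1", word)


def normalize_terminal(word):
    """hw4 -> hw5: drop terminal m or H after a or A."""
    return re.sub("([aA])[mH]$", r"\1", word)


def normalize(word):
    """Apply the three sketched steps in order (the 'ant' step is omitted)."""
    word = normalize_anusvara(word)
    word = normalize_duplication(word)
    return normalize_terminal(word)
```

With these definitions, normalize("saNgaH") gives "saMga" and normalize("kArttikaH") gives "kArtika". Note that the steps interact: applying the anusvAra rule literally to "karmma" turns the first 'm' into 'M', so the duplication rule no longer matches afterwards.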

There are four difference files generated in the process.

hw1minushw2.txt - hw1.txt entries not found in hw2.txt
hw2minushw3.txt - hw2.txt entries not found in hw3.txt
hw3minushw4.txt - hw3.txt entries not found in hw4.txt
hw4minushw5.txt - hw4.txt entries not found in hw5.txt
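
Each of these files is a plain set difference between two consecutive headword lists. A minimal sketch, assuming one headword per line in UTF-8 files; the helper names minus() and write_minus() are hypothetical and not taken from hwnorm1.py:

```python
def minus(entries_a, entries_b):
    """Return the entries of entries_a that do not occur in entries_b, in order."""
    seen = set(entries_b)
    return [entry for entry in entries_a if entry not in seen]


def read_entries(path):
    """Read one headword per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_minus(path_a, path_b, path_out):
    """Write the entries of path_a that are missing from path_b to path_out."""
    missing = minus(read_entries(path_a), read_entries(path_b))
    with open(path_out, "w", encoding="utf-8") as handle:
        for entry in missing:
            handle.write(entry + "\n")
    return len(missing)
```

For example, write_minus("hw1.txt", "hw2.txt", "hw1minushw2.txt") would list every headword that step 2 changed into a form not already present in hw1.txt.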

I hope someone will take a quick look at the files and decide whether we are on the right track.


gasyoun · 2015-11-24T18:15:25Z

hw1minushw2.txt - hw1.txt entries not found in hw2.txt - what do you expect it to give? Please illustrate, I can't grasp it. Brain too weak. I wonder how many hw4.txt words were added after you killed terminal 'm' and 'H'. Maybe a list with links (with terminal 'm' and 'H' added again for the links to work) would help check the original thesis?

drdhaval2785 · 2015-12-01T05:43:14Z

@gasyoun
These minus files are there so that we can check that no word which does not deserve normalization gets removed in the process.
I don't see much issue in hw1minushw2 and hw2minushw3.
But 'M' and 'H' removal is a bit tricky.
Therefore that list (hw3minushw4.txt) needs to be checked properly.

gasyoun · 2015-12-01T07:42:29Z

But 'M' and 'H' removal is a bit tricky. - Do you remember some tricky example?

drdhaval2785 · 2015-12-01T08:18:18Z

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization/examine4.txt

Full of tricky examples.

Right now I am keeping them in the examine.txt file. Whatever is found OK goes to the next step.
Otherwise it dies here.

gasyoun · 2015-12-01T18:18:36Z

6k lines is too much. Even a list of 100 feminine words wrongly tagged in MW would take me a month to work through. Where do you state the algo? Why is yogyatAjYAnasyaSabdaMpratikAraRatAvicAraH fishy? Because of aH? Are these words included, or should they not be? Please add details.

drdhaval2785 · 2015-12-02T07:43:07Z

Where do you state the algo?

The list contains the words which are not found in sanhw1.txt after removal of terminal 'm' and 'H'.

6k lines is too much.

Now it is 2456. Not much further improvement is expected from a computer algorithm. The rest is manual only.
We are not in a hurry.
At least we should not normalize words which don't deserve to be normalized.
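
The rule for the examine list, as stated in this comment, can be sketched roughly as follows. This is only one reading of that one-line description, not the code in hwnorm1.py; the function name needs_examination() is hypothetical, and the in-memory list stands in for the contents of sanhw1.txt.

```python
import re


def needs_examination(headwords):
    """Flag words whose form without terminal m/H is not itself a headword."""
    known = set(headwords)
    flagged = []
    for word in headwords:
        # Same rule as the terminal step: [aA][mH]$ -> [aA]$
        stripped = re.sub("([aA])[mH]$", r"\1", word)
        if stripped != word and stripped not in known:
            flagged.append(word)
    return flagged
```

Given the headwords rAma, rAmaH and vicAraH, only vicAraH is flagged, because rAma exists as a headword while vicAra does not. Such flagged words are the ones that need a manual look before they are normalized.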

gasyoun · 2015-12-02T10:35:04Z

So these are the possible additions to the European style dictionaries, understood. 2.5k is far better. Only a few years away. Have you found at least 1 wrong this way? I guess it's one of the hardest methods.

drdhaval2785 added the Documentation label Nov 24, 2015

gasyoun mentioned this issue Mar 17, 2017

fem. singular/plurals should be joined #9

Open

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open
