Progress in alternate headword extraction #7

drdhaval2785 · 2016-01-09T16:29:29Z

Dear all,
https://github.com/sanskrit-lexicon/VCP/tree/master/alternateheadword is the repository where I have been playing with ejf's hw1.py, renamed as hw1_dhaval.py.

With the latest commit, Levenshtein logic has been added to the hw1_dhaval.py code.

Logic -

Separate the words having brackets (for extraction of alternate headwords) - bracketwords.txt
Separate the words having brackets in the middle - midbracket.txt - and at the end - endbracket.txt
Analyse the words and decide whether the bracketed string substitutes the preceding part, the following part, or is indeterminate.
Compare the suggested word with hw1.txt (sanhw1.txt, the headwords-only file).
If the suggested word is found in hw1.txt or matches one of the known patterns (b-v exchange, S-s exchange, etc.), it is stored in validated.txt for future integration into the headword list.
If not, it is stored in nonvalidated.txt for manual examination.
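The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual hw1_dhaval.py code: the function names `classify`, `variants` and `validate` are hypothetical, the headword is assumed to contain exactly one bracket pair, and the b-v and S-s exchanges are applied to every occurrence of the letter in the word.

```python
import re

# Matches prefix(alt)suffix with exactly one bracket pair, e.g. 'zvA(sva)tta'.
BRACKET = re.compile(r'^([^()]*)\(([^()]+)\)([^()]*)$')

def classify(headword):
    """Return 'mid', 'end', or None (no single bracket pair found)."""
    m = BRACKET.match(headword)
    if m is None:
        return None
    # Text after the closing bracket means the bracket sits in the middle.
    return 'mid' if m.group(3) else 'end'

def variants(word):
    """Yield the word itself plus its b/v and S/s exchange variants."""
    yield word
    for old, new in (('b', 'v'), ('v', 'b'), ('S', 's'), ('s', 'S')):
        if old in word:
            yield word.replace(old, new)

def validate(candidate, known_headwords):
    """True if the candidate, or an exchange variant of it, is a known headword."""
    return any(v in known_headwords for v in variants(candidate))

# Toy data for illustration only, not taken from hw1.txt.
known = {'svatta'}
print(classify('zvA(sva)tta'))    # mid
print(validate('Svatta', known))  # True, via the S-s exchange
```

A candidate for which `validate` returns True would go to validated.txt, and everything else to nonvalidated.txt.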

The methods used to make replacements are marked with codes "1" through "6", which can help us locate, at a later stage, the part of hw1_dhaval.py that made these suggestions.
If there is no suggested headword, or no decision regarding the position of the string to be substituted by the bracket string, I have put the code "404". (See nonvalidated.txt)

Plan ahead -

The entries in validated.txt are sure-shot headwords. Incorporate them. There can be a minuscule error rate in this step, but it is worth taking the risk.
The entries in nonvalidated.txt need manual examination. After manual corrections, they can be validated and incorporated into the headword list.
Treatment of endbracket.txt is pending. It should be easier.

drdhaval2785 · 2016-01-09T16:30:54Z

latest stats

1209 entries with brackets in vcphw0.txt
875 entries - validated and stored in validated.txt
334 entries - not conclusively decided and stored in nonvalidated.txt

gasyoun · 2016-01-09T17:01:59Z

Well done, well done. I love it when such practical work is done from India. There has been so much theory but so little effort to present good old books in the form they deserve. @funderburkjim please see https://docs.google.com/document/d/1YYTM2hlDYKPzKv322Cfq0Oa92RohvMzvF5jf8eqIO7w/edit# with my VCP classification.

47046 headwords in the a.txt source file. 1672 words have parentheses, but their meaning is not always the same. They are used in several ways.
Number Legend:
0 - replacement of same number of letters before (default, most popular case)
1 - replacement letters more than initial
2 - replacement letters less than initial
3 - additional letters need to be inserted, not replacement (like all cases above)
5 - as per Gasuns, ok
6 - was 1 word, now 3-4 words

Work similar to this VCP work was done on SCD and AP. Human validation is not finalized, but still.

drdhaval2785 · 2016-01-11T05:23:43Z

@gasyoun raised a question on Skype:
Can't Levenshtein decide what to replace in 'zvA(sva)tta'?
My answer then was: no.
That is because zvA and sva have an edit distance of 2, and sva and tta likewise have an edit distance of 2, so there was no winner.
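The tie can be checked with a plain Levenshtein distance. The snippet below is a minimal sketch of the standard dynamic-programming algorithm, not the code used in hw1_dhaval.py. The function name `levenshtein` and the variable names are chosen here for illustration.

```python
def levenshtein(a, b):
    """Standard edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

before, bracket, after = 'zvA', 'sva', 'tta'
print(levenshtein(before, bracket))  # 2
print(levenshtein(bracket, after))   # 2
```

Both neighbours of the bracket are two edits away from 'sva', so plain edit distance cannot say which side the bracket replaces.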

Then I explored the possibility of finding some mathematical way of measuring the nearness of letters based on the pratyAhAra sUtras.
https://github.com/sanskrit-lexicon/VCP/blob/master/alternateheadword/alphabetdistance.py was generated by that effort.
It checks for nearness of letters in a very crude manner.
Now the stats stand at:

891 validated entries
318 non-validated entries

i.e. 16 new parses were added in this manner, and many suggestions were made for entries which were earlier left as is.

funderburkjim · 2016-01-11T20:50:18Z

Tailoring the notion of edit-distance to take into account knowledge of the Sanskrit alphabet sounds like an interesting idea.
I haven't looked at the alphabetdistance program, but I presume your idea would imply that 'sva' and 'tta' are actually much further apart than are, say, 'wwa' and 'tta' or 'dda' and 'tta'.

Maybe the Sanskrit edit distance could somehow use the varga matrix of characters.
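One way such a varga-based distance might look is sketched below. It assumes SLP1 letters, covers only the 25 stops and nasals, and uses a Manhattan distance on the 5x5 grid. The metric and the names `VARGA`, `POSITION` and `varga_distance` are illustrative assumptions, not something proposed in this thread.

```python
# The 25 stops and nasals in SLP1, one varga (place of articulation) per row.
# Columns: unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal.
VARGA = ['kKgGN',  # velar
         'cCjJY',  # palatal
         'wWqQR',  # retroflex
         'tTdDn',  # dental
         'pPbBm']  # labial

# Map each letter to its (row, column) position in the grid.
POSITION = {ch: (r, c) for r, row in enumerate(VARGA) for c, ch in enumerate(row)}

def varga_distance(x, y):
    """Manhattan distance in the varga grid; None if a letter is outside it."""
    if x not in POSITION or y not in POSITION:
        return None
    (r1, c1), (r2, c2) = POSITION[x], POSITION[y]
    return abs(r1 - r2) + abs(c1 - c2)

print(varga_distance('w', 't'))  # 1: neighbouring vargas, same column
print(varga_distance('d', 't'))  # 2: same varga, two columns apart
print(varga_distance('s', 't'))  # None: sibilants are not in the grid
```

Letters outside the grid (vowels, semivowels, sibilants, h) would need a separate rule, which this sketch leaves open.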

Another possibility might take into account the similarity of the glyphs of the Devanagari characters.

Another possibility might deal with the distance between consonant-vowel glyphs (e.g., the distance between the Devanagari representations of 'XA' and 'Xo', X being some consonant, might be considered less than the Sanskrit alphabetical distance between 'A' and 'o').

Lots of interesting possibilities to explore.

gasyoun · 2016-01-11T22:06:24Z

> varga matrix of characters

That's a wishing well - but a deep one.

> similarity of the glyphs of the Devanagari characters

भ म will do, @drdhaval2785 ?

> Devanagari representation of 'XA' and 'Xo' (X some consonant) might be considered less than the Sanskrit alphabetical distance between 'A' and 'o'.

Not sure I get it with real-life examples.

drdhaval2785 · 2016-01-12T02:30:54Z

The alphabet order used is purely that of the Shiva sUtras.

aAiIuUfFxeoEOhyvrlYmNRnJBGQDjbgqdKPCWTcwtkpSzs

So I guess it takes care of nearness based on the varga matrix.
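A crude positional distance over this alphabet string could look like the sketch below. It is a guess at the general idea and not the contents of alphabetdistance.py: the function name `positional_distance` is hypothetical, and the measure is only defined for equal-length strings whose letters all occur in the alphabet string.

```python
ALPHABET = 'aAiIuUfFxeoEOhyvrlYmNRnJBGQDjbgqdKPCWTcwtkpSzs'
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def positional_distance(a, b):
    """Sum, letter by letter, of the gaps between positions in ALPHABET."""
    if len(a) != len(b):
        raise ValueError('strings must have equal length')
    return sum(abs(INDEX[x] - INDEX[y]) for x, y in zip(a, b))

print(positional_distance('zvA', 'sva'))  # 2
print(positional_distance('sva', 'tta'))  # 30
```

Under this measure 'zvA' is far nearer to 'sva' than 'tta' is, which would break the Levenshtein tie in 'zvA(sva)tta' in favour of replacing the preceding part.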

But glyph nearness remains to be explored.

gasyoun mentioned this issue Jan 11, 2016

2-grams vs MW72, part 1 sanskrit-lexicon/CORRECTIONS#241

Closed

drdhaval2785 added the enhancement label Dec 13, 2020

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open
