Progress in alternate headword extraction #7

Open
drdhaval2785 opened this issue Jan 9, 2016 · 6 comments

Comments

@drdhaval2785
Contributor

Dear all,
https://github.com/sanskrit-lexicon/VCP/tree/master/alternateheadword is the repository where I have been experimenting with ejf's hw1.py, renamed hw1_dhaval.py.

With the latest commit, Levenshtein logic has been added to the hw1_dhaval.py code.

Logic -

  1. Separate words having brackets (for extraction of alternate headwords) - bracketwords.txt
  2. Separate words having brackets in mid - midbracket.txt and end - endbracket.txt
  3. Analyse the words and decide whether the bracketed text substitutes for the preceding part, the following part, or is indeterminate.
  4. Compare the suggested word against hw1.txt (sanhw1.txt, the headwords-only file).
  5. If the suggested word is found in hw1.txt or matches one of the known patterns (b/v exchange, S/s exchange, etc.), it is stored in validated.txt for future integration into the headword list.
  6. If not, store in nonvalidated.txt for manual examination.
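
As a rough illustration of steps 1, 2, 5 and 6 above, here is a minimal Python sketch. It is not the code in hw1_dhaval.py: the regular expression, the function names (`classify`, `is_known_exchange`, `validate`) and the reading of the "known patterns" as a simple v/b and S/s fold are all assumptions made for the example, and only entries with a single bracket are handled.

```python
import re

# One parenthesised group only, e.g. 'zvA(sva)tta' -> ('zvA', 'sva', 'tta').
BRACKET = re.compile(r"^([^()]*)\(([^()]+)\)([^()]*)$")

# Assumed reading of the known patterns: treat v as b and S as s when comparing.
_FOLD = str.maketrans({"v": "b", "S": "s"})


def classify(headword):
    """Return 'mid' or 'end' by bracket position, 'other' for anything else."""
    m = BRACKET.match(headword)
    if m is None:
        return "other"  # no bracket, or a shape this sketch does not handle
    return "mid" if m.group(3) else "end"


def is_known_exchange(base, suggestion):
    """True if the two words differ only by b/v or S/s exchanges."""
    return base != suggestion and base.translate(_FOLD) == suggestion.translate(_FOLD)


def validate(base, suggestion, known_headwords):
    """True -> would go to validated.txt; False -> nonvalidated.txt."""
    return suggestion in known_headwords or is_known_exchange(base, suggestion)


known = {"svatta"}  # stand-in for the headwords read from hw1.txt
print(classify("zvA(sva)tta"))             # mid
print(validate("zvAtta", "svatta", known))  # True
print(validate("bala", "vala", set()))      # True (b/v exchange)
```

Keeping the headword list in a set makes each lookup cheap, which matters little at roughly 1200 bracketed entries but keeps the sketch simple.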

The methods used to make replacements are marked with codes '1' through '6', which can help us locate, at a later stage, the part of hw1_dhaval.py that made a given suggestion.
If there is no suggested headword, or no decision about the position of the string to be substituted by the bracketed string, I have used the code '404' (see nonvalidated.txt).

Plan ahead -

  1. Entries in validated.txt are sure-shot headwords. Incorporate them. There may be a minuscule error rate in this step, but it is worth the risk.
  2. Entries in nonvalidated.txt need manual examination. After manual correction, they can be validated and incorporated into the headword list.
  3. Treatment of endbracket.txt is pending. It should be easier.
@drdhaval2785
Contributor Author

latest stats

1209 entries with brackets in vcphw0.txt
875 entries - validated and stored in validated.txt
334 entries - not conclusively decided and stored in nonvalidated.txt

@gasyoun
Member

gasyoun commented Jan 9, 2016

Well done, well done. I love when such practical work is done from India. There has been so much theory but so little effort to present good old books in the form they deserve. @funderburkjim please see https://docs.google.com/document/d/1YYTM2hlDYKPzKv322Cfq0Oa92RohvMzvF5jf8eqIO7w/edit# with my VCP classification.

47046 headwords in the a.txt source file. 1672 words have parentheses,
but the parentheses do not always mean the same thing. They are used in several ways.
Number Legend:
0 - replacement of same number of letters before (default, most popular case)
1 - replacement letters more than initial
2 - replacement letters less than initial
3 - additional letters need to be inserted, not replacement (like all cases above)
5 - as per Gasuns, ok
6 - was 1 word, now 3-4 words

Work similar to the VCP work was done on SCD and AP. Human validation is not finalized, but still.
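
To make the default case 0 of the legend concrete, here is a small Python sketch that applies it mechanically, together with its mirror image for the letters after the bracket. The function names and the entry 'abcd(xy)ef' are hypothetical, and the sketch assumes a transliteration in which one character is one letter (as in SLP1) and a single bracket per entry.

```python
import re

# One parenthesised group only: before(inside)after
BRACKET = re.compile(r"^([^()]*)\(([^()]+)\)([^()]*)$")


def replace_before(headword):
    """Case 0: bracket text replaces the same number of letters before it."""
    m = BRACKET.match(headword)
    if m is None:
        return None
    before, inside, after = m.groups()
    if len(inside) > len(before):
        return None  # not enough letters before the bracket
    return before[: len(before) - len(inside)] + inside + after


def replace_after(headword):
    """Mirror case: bracket text replaces the same number of letters after it."""
    m = BRACKET.match(headword)
    if m is None:
        return None
    before, inside, after = m.groups()
    if len(inside) > len(after):
        return None  # not enough letters after the bracket
    return before + inside + after[len(inside):]


print(replace_before("abcd(xy)ef"))  # abxyef
print(replace_after("abcd(xy)ef"))   # abcdxy
```

Cases 1 to 3 of the legend cannot be handled this way, because the number of letters to replace is no longer given by the length of the bracket text.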

@drdhaval2785
Copy link
Contributor Author

@gasyoun raised a question on Skype:
Can't Levenshtein decide what to replace in 'zvA(sva)tta'?
My answer then was: no.
'zvA' and 'sva' have an edit distance of 2, and 'sva' and 'tta' also have an edit distance of 2, so there was no winner.
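
The tie can be checked with a plain unit-cost Levenshtein distance. The function below is a generic textbook implementation written for this illustration, not the one used in hw1_dhaval.py.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("zvA", "sva"))  # 2
print(levenshtein("sva", "tta"))  # 2
```

Both comparisons cost two substitutions, so a unit-cost distance alone cannot say whether 'sva' replaces the part before or after the bracket.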

Then I explored the possibility of finding some mathematical way of measuring the nearness of letters based on the pratyAhAra sUtras.
https://github.com/sanskrit-lexicon/VCP/blob/master/alternateheadword/alphabetdistance.py was the result of that effort.
It checks the nearness of letters in a very crude manner.
The stats now stand at:

891 validated entries
318 non-validated entries

i.e. 16 new parses were added this way, and many suggestions were made for entries which were earlier left as is.

@funderburkjim
Contributor

Tailoring the notion of edit-distance to take into account knowledge of the Sanskrit alphabet sounds like an interesting idea.
I haven't looked at the alphabetdistance program, but presume your idea would imply that 'sva' and 'tta' are actually much further apart than are, say, 'wwa' and 'tta' or 'dda' and 'tta'.

Maybe the Sanskrit edit distance could somehow use the varga matrix of characters.
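
One possible reading of the varga-matrix idea, sketched in Python: place the 25 stops and nasals (in SLP1) in a 5x5 grid, with rows for place of articulation and columns for voicing and aspiration, and take the grid distance between two letters. The choice of Manhattan distance and the name `varga_distance` are assumptions for the example, and letters outside the grid (vowels, semivowels, sibilants, h) are simply left undefined.

```python
# Rows: velar, palatal, retroflex, dental, labial (SLP1 transliteration).
# Columns: unvoiced, unvoiced aspirate, voiced, voiced aspirate, nasal.
VARGA = ["kKgGN", "cCjJY", "wWqQR", "tTdDn", "pPbBm"]
POSITION = {ch: (r, c) for r, row in enumerate(VARGA) for c, ch in enumerate(row)}


def varga_distance(x, y):
    """Manhattan distance in the grid; None if a letter is outside the grid."""
    if x not in POSITION or y not in POSITION:
        return None
    (r1, c1), (r2, c2) = POSITION[x], POSITION[y]
    return abs(r1 - r2) + abs(c1 - c2)


print(varga_distance("w", "t"))  # 1  (adjacent rows, same column)
print(varga_distance("d", "t"))  # 2  (same row, two columns apart)
print(varga_distance("s", "t"))  # None (s is not in the grid)
```

A full Sanskrit edit distance would still need a rule for the letters outside the grid, which is where the comparison of 'sva' and 'tta' actually falls.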

Another possibility might take into account the similarity of the glyphs of the Devanagari characters.

Another possibility might deal with the distance between consonant-vowel glyphs (e.g., the distance between the Devanagari representations of 'XA' and 'Xo', X being some consonant, might be considered less than the Sanskrit alphabetical distance between 'A' and 'o').

Lots of interesting possibilities to explore.

@gasyoun
Member

gasyoun commented Jan 11, 2016

> varga matrix of characters

That's a wishing well - but a deep one.

> similarity of the glyphs of the Devanagari characters

भ म will do, @drdhaval2785?

> Devanagari representation of 'XA' and 'Xo' (X some consonant) might be considered less than the Sanskrit alphabetical distance between 'A' and 'o'.

Not sure I get it on real-life examples.

@drdhaval2785
Contributor Author

The alphabet used is purely the shiva sUtras.

aAiIuUfFxeoEOhyvrlYmNRnJBGQDjbgqdKPCWTcwtkpSzs

So I guess it takes care of nearness based on varga matrices.

But glyph nearness remains to be explored.
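
For illustration, here is one way an edit distance could use the ordering above. This is a sketch, not the logic of alphabetdistance.py: the substitution cost (the gap between two letters in the string, scaled to the range 0 to 1), the unit cost for insertions and deletions, and the function names are all assumptions.

```python
# Letter order copied from the comment above (SLP1).
ALPHABET = "aAiIuUfFxeoEOhyvrlYmNRnJBGQDjbgqdKPCWTcwtkpSzs"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def letter_cost(x, y):
    """Substitution cost in [0, 1]: scaled gap between positions in ALPHABET."""
    if x == y:
        return 0.0
    if x not in INDEX or y not in INDEX:
        return 1.0  # unknown letter: fall back to the unit cost
    return abs(INDEX[x] - INDEX[y]) / (len(ALPHABET) - 1)


def weighted_distance(a, b):
    """Levenshtein distance with letter_cost() as the substitution cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                     # deletion
                curr[j - 1] + 1.0,                 # insertion
                prev[j - 1] + letter_cost(ca, cb)  # weighted substitution
            ))
        prev = curr
    return prev[-1]


for left, right in (("zvA", "sva"), ("sva", "tta"), ("wwa", "tta")):
    print(left, right, round(weighted_distance(left, right), 3))
```

Under these assumed costs 'zvA' comes out far closer to 'sva' than 'tta' does, which would break the tie in 'zvA(sva)tta' in favour of replacing the part before the bracket.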
