
Test vidyut.prakriya programmatically against the Siddhanta Kaumudi. #157

Open
akprasad opened this issue Jan 10, 2025 · 9 comments

Labels: good first issue (Good for newcomers), vyakarana (Requires deeper knowledge of Sanskrit grammar)

Comments

@akprasad
Contributor

Here's an important project I'd love help with: find words that vidyut.kosha does not understand so that we can add them to our test cases and improve our output. This would help immensely. It's a very high-impact project that needs only a bit of time and some basic knowledge of Python and vyAkaraNam.

Here's the basic idea:

  • You can get the full text of the SK here: https://github.com/ashtadhyayi-com/data/blob/master/sutraani/kaumudi.txt
  • Use vidyut.lipi or your favorite transliterator to convert this text to SLP1.
  • Split the text into separate padas. Items like तत्रादौ (SLP1 tatrAdO) are hard to separate, but that's fine; for our purposes, we can treat them like a single pada, as we can filter them out later.
  • Use vidyut.kosha to see if each pada is in the kosha. If it isn't, print out the pada and the sutra number it comes from.
  • Output should be a CSV with two columns: the SK sutra number and the "pada" that could not be found. You can use the built-in csv library to handle this.

Once we have this CSV, we can turn it into a spreadsheet and share it with volunteers to mark which are real errors and which are just noise.
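The steps above can be sketched in a few lines of Python. This is a minimal sketch only: the kosha is stubbed out as a plain set (the exact vidyut.kosha lookup call isn't shown in this thread), and the toy sk_text dict stands in for the parsed kaumudi.txt file:

```python
import csv
import re

# Stand-in for vidyut.kosha -- the real script would load and query
# the kosha; here a plain set of SLP1 words plays that role.
kosha = {"tatra", "Bavati"}

# Toy SK text, already transliterated to SLP1, keyed by sutra number.
sk_text = {"1": "tatra Bavati gacCati"}

rows = []
for sutra_num, text in sk_text.items():
    # Split each line into candidate padas on whitespace.
    for pada in re.findall(r"\S+", text):
        if pada not in kosha:
            rows.append((sutra_num, pada))

# Write the misses to a two-column CSV with the built-in csv library.
with open("missing_padas.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sk_sutra_num", "pada"])
    writer.writerows(rows)

print(rows)
```

The real script would swap the kosha set for a loaded vidyut.kosha instance and the sk_text dict for the transliterated SK file.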

Tips:

  • Focus on chapters 8–13 (inclusive) for subantas and 43–58 (inclusive) for tinantas.
  • Example padas are usually offset by dandas, e.g. । लेख्यति ।. So if multiple words appear between dandas, e.g. । द्विधा हि कण्ड्वादयः धातवः प्रातिपदिकानि च ।, these are probably not example padas and are likely just noise. If you are able to, include these padas anyway, perhaps with a third CSV column, is_probably_noisy. I want to make sure we catch everything.
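The danda heuristic might be sketched like this, assuming the text has already been transliterated to SLP1 with the dandas left intact (the sample line and words below are illustrative):

```python
import re

# Illustrative SLP1 line with dandas preserved: one example pada,
# then a multi-word commentary segment.
line = "। leKyati । dviDA hi kaRqvAdayaH DAtavaH prAtipadikAni ca ।"

rows = []
# Segments between dandas: a lone word is likely an example pada;
# several words together are probably commentary, i.e. noise.
for segment in re.split(r"।", line):
    words = segment.split()
    if not words:
        continue
    is_probably_noisy = len(words) > 1
    for word in words:
        rows.append((word, is_probably_noisy))

print(rows)
```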

Please share your early results either here or on our Discord server on the #vidyut channel.

@avinashvarna

Is this along the lines of what you have in mind?

sutra_num,type (inferred from sk_chapter),word,is_probably_noisy
11004,tinanta,DAtvaMSalopanimitte ArDaDAtuke pare iko guRavfdDI na staH iti neha nizeDaH,True
11004,tinanta,tibAdInAmanArDaDAtukatvAt,False
11004,tinanta,totorti,False
11004,tinanta,hali ca <{SK354}> iti dIrGaH,True
11004,tinanta,totUrtaH,False
11004,tinanta,totUrvati,False
11004,tinanta,toTorti,False
11004,tinanta,dodorti,False
11004,tinanta,doDorti,False
11004,tinanta,murCA,False
11004,tinanta,momUrcCIti,False
11004,tinanta,momUrtaH,False
11004,tinanta,momUrCatItyAdi,False
11004,tinanta,ArDaDAtuka iti vizayasaptamI,True
11004,tinanta,tena yaNi vivakzite ajervI,True
11004,tinanta,asya yaNluNnAsti,True
11004,tinanta,lukApahAre vizayatvABAvena vIBAvasyApravftteH,True

@akprasad
Contributor Author

Yes, this is almost exactly it!

If you have time, these small tweaks would make the data basically perfect:

  • the sutra_num is useful, but we care more about the SK number so that we can cross-reference with the SK text.
  • for items like "ArDaDAtuka iti vizayasaptamI", split these into separate words, one per row.
  • also include the words that are present. We can filter these out later, but they give me a basic idea of how we're doing.

Once we have this data, we can turn it into a Google Sheet or similar and ask volunteers to go through these systematically and flag which ones are real errors. +cc @neeleshb .

@avinashvarna

avinashvarna commented Jan 19, 2025

Something like this? https://docs.google.com/spreadsheets/d/1Sa_TI5-C37gRuepdn5Iwt0PTQTRrMYUp7rWiXIwQ2Sg/edit?usp=sharing

FYI - the SK data that you linked to in the issue is indexed by sutra_num, so I thought it might be easiest to keep both sk_num and sutra_num.

@akprasad
Contributor Author

Ah, how beautiful! Yes, this is basically it. I thought the data might be too large for sheets to handle in one sheet, but it's loading smoothly on my end.

Misc notes:

  • for words ending in a visarga, the kosha stores them with their original s/r instead (thus arjunas and svar, not arjunaH and svaH). This is so that users can check sandhi rules, but the downside is this rough edge. So fixing this might affect the results.
  • I'm also seeing some rows with avagrahas ('), which the kosha will never contain.

Otherwise this is already immensely useful. I'm surprised words like corayati are missing, but that does seem to be the case for whatever reason, so I'm off to debug.

@avinashvarna

  1. I could update the script to try s/r forms for words that end in H, and add them as extra columns. I wouldn't want to just remove them, as it could mean that some missing words wouldn't be identified.

  2. Can remove words with avagrahas and non-SLP1 chars. Haven't done much cleaning to be honest.
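For the cleaning in (2), one simple approach is to keep only tokens made of ASCII letters, which drops avagrahas and any non-SLP1 characters in one pass (the word list below is illustrative):

```python
import re

# SLP1 uses only ASCII letters, so a pure [a-zA-Z] filter drops words
# with avagrahas (') and other stray characters like sutra markers.
SLP1_WORD = re.compile(r"[a-zA-Z]+")

words = ["tatra", "so'pi", "arjunaH", "<{SK354}>"]
clean = [w for w in words if SLP1_WORD.fullmatch(w)]
print(clean)  # ['tatra', 'arjunaH']
```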

@akprasad
Contributor Author

(1) I suggest something like: if arjunaH is the SK word, pass if either arjunas or arjunar is in the kosha. The vast majority of these end in -s so I don't think an extra column is necessary.

(2) of course, that's only natural when getting something up and running.

🙏
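A sketch of the fallback in (1), with the kosha again stubbed as a set (the real check would query vidyut.kosha; in_kosha is a hypothetical helper name):

```python
# Stand-in for the kosha: words are stored with their original s/r
# finals, not with a visarga (arjunas, svar -- not arjunaH, svaH).
kosha = {"arjunas", "svar"}

def in_kosha(word: str) -> bool:
    """A visarga-final SK word passes if its -s or -r form is stored."""
    if word in kosha:
        return True
    if word.endswith("H"):
        stem = word[:-1]
        return stem + "s" in kosha or stem + "r" in kosha
    return False

print(in_kosha("arjunaH"), in_kosha("svaH"), in_kosha("rAmaH"))
# True True False
```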

@akprasad
Contributor Author

I found the issue with corayati, and fixing the same bug also fixes totorti, dodorti, doDorti, etc. So with a bit of work, you uncovered a major bug!

I'll continue going through this list. Would love your help rerunning this for a future iteration. Can you attach or share your script here?

@avinashvarna

The idea was yours, so the credit goes to you as well.

Here is the script: https://gist.github.com/avinashvarna/d81f0304f3105206df4691f215da85c2
I updated it to look only at [a-zA-Z']+ tokens and filter out some special characters, as well as to handle words ending in visarga.

I can try to rerun it if I have time, but if not, feel free to use the script linked above.

@avinashvarna

The spreadsheet has been updated. I've moved the previous sheet to .. (old) instead of deleting it, in case you want to compare. If you agree that it's not necessary, feel free to delete it.
