vcp-skd1 comparison, part 2 #10

Open
funderburkjim opened this issue May 21, 2020 · 19 comments

Comments

@funderburkjim
Contributor

This continues the root-matching exercise discussed in #9.

The programs and reports are in the vcp_skd1 directory. To see html files as html in your browser,
you will need to download the raw files and open the downloaded files in your browser.

@funderburkjim
Contributor Author

Why equivalence classes?

In the previous vcp-skd work, discussed in #9, we ran into a limitation.
The example (ref)
VCP उज्झ 8722 = SKD उद्झ 4713 SKD उज्झ 4431
could not be handled properly. Here SKD has two different headword spellings that should both
correspond to the single VCP headword spelling. This left SKD उद्झ as unmatched.

The equivalence class notion aims to handle this and other similar, but not yet identified, cases; and thereby paint a truer picture of the correspondence between verbs in VCP and verbs in SKD.

basic idea

The notion of 'equivalence classes' is a useful concept in many areas of mathematics. For example, integers can be constructed as equivalence classes of pairs of natural numbers (see).

In our situation, we start with the set of entries from the Cologne digitization of a particular dictionary. Each entry has a specific Cologne id, which distinguishes that entry from all other entries in the particular dictionary. And each entry also has a particular headword spelling.

Next, in our study of verbs, we use some method to determine which of the entries of the dictionary corresponds to a verb. This leads to a list of verb entries.

entry equivalence by headword

We have silently assumed that two verb entries from a particular dictionary are equivalent if they
have the same headword spelling. Obviously the author of a particular dictionary had some reason for providing, for some headwords, more than one entry. But these reasons are not systematic and are difficult to infer, even when the two entries are of the same grammatical type (e.g. when both entries are verbs). Thus we have made the reasonable assumption that all (verb) entries with the same headword should be considered equivalent.

In applying this to VCP, we get the equivalence classes of vcp_ecs.txt (and vcp_ecs_deva.txt).

Consider the vcp equivalence classes (vcp_ecs.txt). An entry is a pair (headword,cologne id).
For headword 'aMSa' there is only one verb entry with this spelling, with cologne id=4: aMSa,4.
But note 'aca' appears as the headword for 3 entries: aca,604;aca,605;aca,606.
By searching for the semicolon character, we see that 592 of the VCP equivalence classes have
multiple entries.
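The grouping described above can be sketched in a few lines of Python. This is only an illustration, not the actual program in the vcp_skd1 directory: the function names `headword_classes` and `format_class` are hypothetical, and the sample entries are just the ones quoted above.

```python
def headword_classes(entries):
    """Group (headword, cologne_id) pairs into classes keyed by headword.

    Dicts preserve insertion order, so classes come out in first-seen order.
    """
    classes = {}
    for headword, cologne_id in entries:
        classes.setdefault(headword, []).append((headword, cologne_id))
    return list(classes.values())


def format_class(ec):
    """Render a class in the 'aca,604;aca,605;aca,606' style."""
    return ";".join(f"{hw},{cid}" for hw, cid in ec)


# Sample verb entries taken from the examples in the text above.
entries = [("aMSa", 4), ("aca", 604), ("aca", 605), ("aca", 606)]
classes = headword_classes(entries)
lines = [format_class(ec) for ec in classes]

# A class with more than one entry is exactly a line containing a semicolon.
multi = [line for line in lines if ";" in line]
print(lines)
print(len(multi))
```

Counting the lines that contain a semicolon is the same check as the manual search mentioned above.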

@funderburkjim
Contributor Author

funderburkjim commented May 21, 2020

equivalence classes with different headwords

We are now in a position to modify the equivalence classes for a given dictionary. We do this with a manually edited file, one merged class per line. In the case of SKD, this file is skd_ecs_manual.txt.
Currently this file has just 2 lines:

udJa,4713;ujJa,4431
staBa,40439;stamBa,40448

From skd_verb_filter.txt, we know that there is just one entry
with headword 'udJa' (search term k1=udJa,) and just one entry with headword 'ujJa'.
Now, @Shalu411 determined that we should think of these two entries in SKD as the same.
Thus, we merge the entries for udJa and ujJa to get a new equivalence class udJa,4713;ujJa,4431.

Similarly, I determined that we should think of the entries for 'staBa' and 'stamBa' as equivalent.
Hence the new equivalence class for SKD of staBa,40439;stamBa,40448.

There is also a file for VCP: vcp_ecs_manual.txt. Currently this file is empty, which implies that currently we consider all the distinct verb spellings in VCP to be non-equivalent.

As @Shalu411 continues studying the non-matching VCP and SKD verbs, I anticipate that we will add
several more equivalences for SKD and perhaps a few for VCP.
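As a rough sketch of how such a manual file could be applied (the function names `parse_class` and `merge_manual` are hypothetical, and this is not necessarily how the real program works), each manual line names the headwords whose classes should be merged into one:

```python
def parse_class(line):
    """Parse 'udJa,4713;ujJa,4431' into [('udJa', '4713'), ('ujJa', '4431')]."""
    return [tuple(item.split(",")) for item in line.strip().split(";")]


def merge_manual(classes, manual_lines):
    """Merge the classes whose headwords appear together on one manual line."""
    classes = [list(ec) for ec in classes]
    for line in manual_lines:
        if not line.strip():
            continue  # tolerate blank lines; an empty file changes nothing
        wanted = {hw for hw, _ in parse_class(line)}
        merged, rest = [], []
        for ec in classes:
            if wanted & {hw for hw, _ in ec}:
                merged.extend(ec)
            else:
                rest.append(ec)
        classes = rest + ([merged] if merged else [])
    return classes


# Headword classes for SKD before merging (ids as quoted in this thread).
skd = [
    [("udJa", "4713")],
    [("ujJa", "4431")],
    [("staBa", "40439")],
    [("stamBa", "40448")],
    [("aBra", "1803")],
]
manual = ["udJa,4713;ujJa,4431", "staBa,40439;stamBa,40448"]
merged_skd = merge_manual(skd, manual)
print(merged_skd)
```

With an empty manual file (the current VCP situation) the classes come back unchanged. Note that this sketch appends merged classes at the end, so a real report would still need to sort them.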

@funderburkjim
Contributor Author

funderburkjim commented May 21, 2020

Matching VCP and SKD equivalences manually

The main focus of this research is to match VCP and SKD verbs.
By considering equivalence classes, this research task reduces to matching VCP equivalence classes to SKD equivalence classes.

In my file-naming conventions, I use the term 'map' instead of 'match'. This is because I think
of matching two sets as constructing a functional map between the two sets.

The mapping between vcp equivalence classes and skd equivalence classes is presented in two
report forms. Each form shows all the classes from both dictionaries. The classes are ordered
alphabetically.

short form of matching report

This report form is vcp_skd_ec_map.txt (or vcp_skd_ec_map_deva.txt).

Each line of this report shows a vcp equivalence class and an skd equivalence class,
and asserts that these two classes match.
For example: vcp=ujJa,8722 skd=udJa,4713;ujJa,4431 (*).

When there is no match for a vcp equivalence class, the skd equivalence class shows as '?'.
For example: vcp=uras,10045 skd=? There are 148 of these.

When there is no match for an skd equivalence class, the vcp equivalence class shows as '?'.
For example: vcp=? skd=atwaNa,706. There are 69 of these.

One further annotation is (*). This means that, in the matching, there is some difference in
the headword spelling between VCP and SKD. For example, vcp=amBa,3980 skd=aBa,1638 (*)
There are 72 of these annotations currently.
Note that the 'ujJa' example above also shows this (*) annotation; this is because the skd class also has a headword spelling udJa differing from the vcp headword spelling (ujJa).
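For anyone who wants to recount these categories, here is a small sketch. It assumes every report line has exactly the shape of the examples above (`vcp=... skd=...` with an optional trailing `(*)`); the names `LINE_RE` and `classify` are hypothetical, and the sample lines are the ones quoted in this thread.

```python
import re
from collections import Counter

# Assumed line shape: 'vcp=<class> skd=<class>' plus an optional ' (*)'.
LINE_RE = re.compile(r"^vcp=(\S+)\s+skd=(\S+)(\s+\(\*\))?\s*$")


def classify(line):
    """Return the category of one short-form report line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return "unparsed"
    vcp, skd, star = m.groups()
    if skd == "?":
        return "vcp_only"   # no skd match for this vcp class
    if vcp == "?":
        return "skd_only"   # no vcp match for this skd class
    return "match_spelling_differs" if star else "match_same_spelling"


sample = [
    "vcp=ujJa,8722 skd=udJa,4713;ujJa,4431 (*)",
    "vcp=uras,10045 skd=?",
    "vcp=? skd=atwaNa,706",
    "vcp=amBa,3980 skd=aBa,1638 (*)",
    "vcp=aBra,3722 skd=aBra,1803",
]
counts = Counter(classify(line) for line in sample)
print(counts)
```

Run over the whole of vcp_skd_ec_map.txt, the same counter should reproduce the totals quoted above, provided the file really follows this line shape.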

@funderburkjim
Contributor Author

funderburkjim commented May 21, 2020

long form of matching report.

This is report vcp_skd_ec_verb2.html and the Devanagari version
vcp_skd_ec_verb2_deva.html.
This report is an html document, so you'll need to download the raw form and then open the downloaded html file in your browser.

The long form report contains all the information of the short report.

image

additional information of long report

The long report has some detail from the underlying dictionary entries.

image

Note the ... etc. etc. etc. ... . This means that there is more information in the dictionary for this entry.

@funderburkjim
Contributor Author

funderburkjim commented May 21, 2020

Mapping principle

There are currently two methods of matching -- a 'manual' method and a 'general' method.

The 'manual' method uses a file of headword spelling correspondences: vcp_skd_map.txt.

For example, garba:garbba means that the equivalence class in vcp containing a headword spelled 'garba' is asserted to match the equivalence class in skd containing a headword spelled 'garbba'.

These correspondences were developed mostly by me; I think @Shalu411 found some of them.

The 'general' method uses the rule:
Given an equivalence class ec1 for vcp and an equivalence class ec2 for skd, if there is a
headword spelling 'X' in both ec1 and ec2, then ec1 is asserted to match ec2.
This rule is harder to verbalize than it is to use.
Example: vcp=aBra,3722 skd=aBra,1803. These two equivalence classes match because
the verb spelling 'aBra' is common to both.
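The two methods can be sketched together as follows. This is an illustration under assumptions, not the actual code: the names `headwords` and `match_classes` are hypothetical, trying the manual map before the general rule is my assumption about precedence, and the `amBa:aBa` map line is only inferred from the (*) example in the short report (the one line known from the text is `garba:garbba`).

```python
def headwords(ec):
    """Set of headword spellings occurring in an equivalence class."""
    return {hw for hw, _ in ec}


def match_classes(vcp_classes, skd_classes, manual_map):
    """Return (vcp_class, skd_class_or_None, method) for each vcp class."""
    # Index every skd class by each of its headword spellings.
    skd_index = {}
    for ec in skd_classes:
        for hw in headwords(ec):
            skd_index[hw] = ec
    results = []
    for ec1 in vcp_classes:
        found, method = None, None
        # 'manual' method: a vcp spelling is mapped to an skd spelling.
        for hw in sorted(headwords(ec1)):
            target = manual_map.get(hw)
            if target in skd_index:
                found, method = skd_index[target], "manual"
                break
        # 'general' method: some spelling X occurs in both classes.
        if found is None:
            for hw in sorted(headwords(ec1)):
                if hw in skd_index:
                    found, method = skd_index[hw], "general"
                    break
        results.append((ec1, found, method))
    return results


# Lines in the 'vcp:skd' style; 'amBa:aBa' is an assumed entry.
manual_map = dict(line.split(":") for line in ["garba:garbba", "amBa:aBa"])
vcp = [[("aBra", "3722")], [("amBa", "3980")], [("ujJa", "8722")], [("uras", "10045")]]
skd = [[("aBra", "1803")], [("aBa", "1638")], [("udJa", "4713"), ("ujJa", "4431")]]
results = match_classes(vcp, skd, manual_map)
for ec1, ec2, method in results:
    print(ec1, ec2, method)
```

Here 'ujJa' matches the merged skd class by the general rule, since the spelling 'ujJa' occurs in both classes, while 'uras' stays unmatched.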

@funderburkjim
Contributor Author

funderburkjim commented May 21, 2020

Next steps

I think the next step is to continue the comparison of non-matches, using the two 'ec' mapping reports. This will likely turn up more examples like 'ujJa'. For example, I think 'drA'/'drE' is such an example.

There are also some other kinds of cases which probably should match. It might be
that the anusvara in skd is a digitization error.

vcp=hiqa,48054 skd=?
vcp=? skd=hiqaM,41771
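Candidates of this kind could be listed mechanically. The sketch below is only a heuristic and the name `anusvara_candidates` is hypothetical: it pairs an unmatched vcp headword with an unmatched skd headword that differs only by a final 'M' (the SLP1 letter for anusvara). Each pair would still need to be checked by hand.

```python
def anusvara_candidates(vcp_unmatched, skd_unmatched):
    """Pair unmatched entries whose skd spelling is the vcp spelling plus 'M'."""
    skd_by_stripped = {}
    for hw, cid in skd_unmatched:
        if hw.endswith("M"):
            skd_by_stripped.setdefault(hw[:-1], []).append((hw, cid))
    pairs = []
    for hw, cid in vcp_unmatched:
        for skd_entry in skd_by_stripped.get(hw, []):
            pairs.append(((hw, cid), skd_entry))
    return pairs


# Unmatched entries quoted in this thread.
vcp_unmatched = [("hiqa", "48054"), ("uras", "10045")]
skd_unmatched = [("hiqaM", "41771"), ("atwaNa", "706")]
pairs = anusvara_candidates(vcp_unmatched, skd_unmatched)
print(pairs)
```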

@Shalu411 : the ball is now in your court! Hope I've given you enough material to proceed.

@Shalu411
Collaborator

Shalu411 commented Jun 20, 2020

Hariom Jim
The ball has been in my court for 29 days. I had moved house and there were many things to attend to, so I couldn't attend to this dhatu issue.
Hope these are the files that I should be working upon now on-

This is report vcp_skd_ec_verb2.html and the Devanagari version
vcp_skd_ec_verb2_deva.html
This report is in the form of an html document, so you'll need to download the raw form and then open the download html file in the browser.

@gasyoun
Member

gasyoun commented Jun 21, 2020

Hope these are the files that I should be working upon now on-

So do I, our Bangalore Sanskrit scholar.

@funderburkjim
Contributor Author

@Shalu411 The main report is referred to as the 'long form of matching report', mentioned in
the comment above.

The report file is named 'vcp_skd_ec_verb2_deva.html'. To get this file:

  • First download the file by:
  • Then open the downloaded file with your browser.

Your main task is to find how to resolve the '=?' cases of this report. For example, with
'vcp=अंह, skd=?'

  • Is there some verb entry in skd that we should match to the verb अंह of vcp ? If so, how is
    that skd verb spelled?
  • Or is अंह a verb unique to vcp ?

@Shalu411
Collaborator

Shalu411 commented Dec 17, 2020

Namaste
Have started again. I heard somewhere that it does not matter how many times you fall down, ultimately what matters is-whether you could manage to get up or not! In the similar lines, whether I restart or not matters.
So here is the first interesting case-
Issue-1 - vcp=अट्ट, skd=अट्ट
skd-vcp-1-aTTa

There are two dhatus in vcp and only one in skd in this list. But there are actually two in skd too; the second one got merged into another headword's entry. (Shown in the png file.)
skd-1-aTTa
:)

@Shalu411
Collaborator

Issue 2- vcp=अन्चु, skd=?

vcp k1=अन्चु, L=2185 = skd k1=अन्च, L=1215
skd-vcp-2-aYcu

Compare- VCP अन्चु¦ गतौ अचिवत् ८२ पृ० दृश्यम् VS SKD अन्च¦ उ पूजने । गमने । म्लिष्टोक्तौ ।
This seems to be only a difference in style of presentation: see the ु in VCP and the उ in SKD.
skd-vcp-2-aYcu2

@Shalu411
Collaborator

Off-line issue of Typos-
k1=अन्च, L=2184 अन्च¦ व्य क्तौ चु० The two pieces are actually one word: व्यक्तौ. The space in between is a typo. What should we do with such cases for now?
I see this in the html doc. list vcp_skd_ec_verb2_deva.html. There are some more like it.

@Shalu411
Collaborator

Namaste
The comparison is on- First set of VCP with SKD is going on.
The picks so far-
VCP अन्दोल 2361 = SKD आन्दोल 3499
VCP अन्ध 2362 = SKD अन्धं 1295
VCP अट्ट, L=792 = SKD (merged with previous)अट्ट क तौच्छ्ये । अनादरे ।
VCP अन्चु 2185 = SKD अन्च 1215
VCP इङ् 8100 = SKD इ 4017
VCP उर्द्द 10084 = SKD उर्द 5161
VCP उलड 10105 = SKD ओलड 5661
VCP ऋन्फ 10512 = SKD ऋम्फ 5409
VCP ऋश 10520 = SKD ऋश 5410
VCP कद्ड 11690 = SKD कद्ड 6169
VCP कन 11700 = SKD कन 6176

Others are not found. I am not noting them separately when match is not found.
--Hariom

@gasyoun
Member

gasyoun commented Jan 19, 2021

noting them separately when match is not found.

Right, that's the way to do it.

@funderburkjim
Contributor Author

@Shalu411

Re अटाट्या : Do you suggest that SKD has a print error at अट्ट क तौच्छ्ये -- the error being that this
should start a new headword?

Also, how to translate क तौच्छ्ये ?

@funderburkjim
Contributor Author

VCP अन्ध 2362 = SKD अन्धं 1295

Disagree. SKD अन्धं 1295 is a nominal while VCP अन्ध 2362 is a verb.

VCP अन्ध 2363 and 2388 are nominals.

Comparing texts, I think VCP अन्ध 2363 corresponds to SKD अन्धं 1295

@funderburkjim
Contributor Author

How is the list above derived ?

What is your method?

what things are you looking at (what files and or displays)?

What are the criteria for putting something in the list?

Are all the items supposed to be verbs?

Why is VCP ऋश 10520 = SKD ऋश 5410 in the list (even though the spelling is same in VCP and SKD)?

@Shalu411
Collaborator

Shalu411 commented Jan 22, 2021

Namaste
Disagree. SKD अन्धं 1295 is a nominal while VCP अन्ध 2362 is a verb.
I am extremely sorry! It was wrongly put there. It was actually meant to express the third case here, but it was wrong to put it so!
Do you suggest that SKD has a print error at अट्ट क तौच्छ्ये -- the error being that this should start a new headword?
Yes. It is a new headword. But it is given the same way in the printed book also. (Image attached)
aTTa-SKD

Also, how to translate क तौच्छ्ये ?
It is same as the other dhatu entries. We need not translate it- It is the internal dhatu detail given by the author. We don't take it in headword-dhatu

How is the list above derived ?
From cross-checking each entry back in the dictionary.. both digital form in website and the printed book scan.
What is your method?
I take case by case- one at a time.
Type the word in the search box in advanced mode--> Check around the words up and down--> open the printed book page through adjacent words and double check if the dhatu is around anywhere --> check for the dhatu meaning word or first-form (the ति form) if not found then confirm it as a "no-match".

what things are you looking at (what files and or displays)?
Now I am checking SKD against VCP- So I use SKD

  1. Mostly advanced display with Devanagari or SLP1
  2. Then I also refer to the print-scan by the link provided in the adjacent words (because many words contain same printed page link)

What are the criteria for putting something in the list?
After the above steps, we occasionally find a match in surprising circumstances (as explained in the five criteria). Then, after double-checking against the other details such as the dhatu-meaning word and the first form, I finally declare it a match.
Are all the items supposed to be verbs?
Where? in my list? Or the list which is provided to me? Both- Yes. 101%. Except the silly अन्धं
Why is VCP ऋश 10520 = SKD ऋश 5410 in the list (even though the spelling is same in VCP and SKD)?
I am not sure why it got missed and is given as a no-match even when there is it in the digitized version and the printed book- both. May be it flew away unnoticed?
-शुभमस्तु
Keep smiling. :)

@gasyoun
Member

gasyoun commented Jan 22, 2021

I am not sure why it got missed and is given as a no-match even when there is it in the digitized version and the printed book- both.

Hope @funderburkjim is happy with the answers.
