pUrbb vs. pUrvv #20

Open
funderburkjim opened this issue Mar 18, 2021 · 29 comments

@funderburkjim
Contributor

This is one of those 'b/v' problems that @drdhaval2785 loves.

In vcp2, there are 1303 matches for 'pUrvv' and 1474 matches for 'pUrbb'.

In vac2, there are 2051 matches for 'pUrvv' and 236 matches for 'pUrbb'.

I suggest we change all the 'pUrbb' to 'pUrvv' in both vcp and vac.

This would remove a nice chunk of needless differences.

What do others think?
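
The counting and the proposed substitution could be sketched as below. This is only an illustration: the sample string and the function names are made up here, and the real vcp2/vac2 files would be read from disk instead.

```python
def count_matches(text: str, pattern: str) -> int:
    """Count non-overlapping occurrences of a literal SLP1 pattern."""
    return text.count(pattern)


def unify_purvv(text: str) -> str:
    """Replace every 'pUrbb' with 'pUrvv'."""
    return text.replace("pUrbb", "pUrvv")


# Hypothetical sample text, not taken from either digitization.
sample = "pUrbbapakzaH pUrvvam apUrbbaH"
print(count_matches(sample, "pUrbb"), count_matches(sample, "pUrvv"))  # 2 1
print(unify_purvv(sample))  # pUrvvapakzaH pUrvvam apUrvvaH
```

Running the counts before and after the substitution gives a quick check that nothing else was touched.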

@funderburkjim
Contributor Author

TE lines beginning with anusvara

There are 852 lines of vac2 that begin with 'M', the slp1 version of anusvAra.
None for vcp2.

Suggest we remove those initial 'M' in vac2.
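
A listing of such lines, each with the line before it for context, might be produced roughly like this. The function name and the sample lines are assumptions for illustration; nothing here is the script actually used.

```python
def initial_anusvara_lines(lines):
    """Return (line_number, previous_line, line) for lines starting with SLP1 'M'."""
    hits = []
    for i, line in enumerate(lines):
        if line.startswith("M"):
            previous = lines[i - 1] if i > 0 else ""
            hits.append((i + 1, previous, line))
    return hits


# Hypothetical sample; in practice the lines would come from the vac2 file.
sample_lines = ["cInA-", "MSuka", "patnI"]
for number, previous, line in initial_anusvara_lines(sample_lines):
    print(number, repr(previous), repr(line))
```

Keeping the previous line in the output makes it possible to judge each case by eye before anything is removed.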

@funderburkjim
Contributor Author

funderburkjim commented Mar 18, 2021

Another interesting stat,
There are about 100,000 lines whose adjusted text differs by only 1 edit in the two versions.
First current Example:

000007:1: vcp: <>patnI NIp I (lakzmIH) .
000007:1:  te: patnI +NIp+I (lakzmI). </vkr><page>vp1_039.pdf</page><column>1</column><br/>

[image]
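
One way to test whether two already-adjusted lines differ by a single edit is sketched below. How the "adjusted text" is derived from the raw lines (markup stripping and so on) is not shown in this thread, so the inputs here are assumed to be adjusted already; the function name is hypothetical.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Length of the common prefix.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after position i must agree.
        return a[i + 1:] == b[i + 1:]
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # One insertion/deletion: dropping longer[i] must give the shorter string.
    return longer[i + 1:] == shorter[i:]


print(within_one_edit("(lakzmIH)", "(lakzmI)"))  # True
print(within_one_edit("lakzmIH", "lakzmi"))      # False
```

This avoids computing a full edit distance, which matters little for one pair but adds up over every line pair in the two texts.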

@gasyoun
Member

gasyoun commented Mar 18, 2021

I suggest we change all the 'pUrbb' to 'pUrvv' in both vcp and vac.

This will get us in trouble with @drdhaval2785

There are about 100,000 lines whose adjusted text differs by only 1 edit in the two versions.

So dirt gets taken as a visarga; now that is a huge number. Maybe forget about the idea of correcting it?

@drdhaval2785
Contributor

  1. I agree on changing all pUrbb to pUrvv.
  2. Can you provide a sample where there is M initially, with preceding line for reference?
  3. In the example lakzmIH, visarga is mandatory according to grammar. See SKD, which gives headwords in their inflected forms. You will not find lakzmI, but only lakzmIH. So these single difference lines are actual differences. Let them be examined thoroughly. We can not brush them under the carpet with regex.

@funderburkjim
Contributor Author

  1. a sample with M initially ...

Here's the whole list of 852: filter01.txt

@funderburkjim
Contributor Author

Maybe forget about the idea of correcting it?

The point about that 100,000 is that maybe there is a way to 'automate' some significant
fraction of those differences (where two texts differ in only 1 position).

The lakzmI/lakzmIH example may be idiosyncratic, but perhaps some of the differences
are systematic in the sense that some rule or rules could be used to identify that one of
the spellings is right and the other one is wrong.

@funderburkjim
Contributor Author

candrabindu

In the slp1 coding of vac.txt, candrabindu is represented by the character '~' ; I believe this
is the usual slp1 convention.

However, ~ is not used in vac2 (Tirupati); instead, the candrabindu is represented by 'z'; this
is clearly different from the SLP1 convention that 'z' represents cerebral sibilant; and
vac2 does use 'z' also for the cerebral sibilant.

Thus we should correct vac2 in such cases (changing such 'z' to '~').

There are 842 matches for ~ in vcp.txt.

@drdhaval2785
Contributor

a sample with M initially .

A bird's eye view shows that VAC is correct in the majority of places. cInA-MSuka is correct. cInA-Suka is wrong in VCP.

So, we cannot mechanically change VAC. On the contrary, VCP would require addition of those missing anusvAras.

@drdhaval2785
Contributor

candrabindu

I agree. There is no possibility of ष being confused with candrabindu by any typist. So, we can mechanically convert z to ~ where vcp.txt has ~.
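
A guarded conversion of this kind might look as follows. The sketch assumes the two lines are already aligned character by character (same length); real line pairs would need a proper alignment step first. The function name and the sample strings are invented for illustration.

```python
def restore_candrabindu(vac2_line: str, vcp_line: str) -> str:
    """Change 'z' to '~' in vac2_line only where vcp_line has '~' at the same position."""
    if len(vac2_line) != len(vcp_line):
        # Not position-aligned; leave unchanged for manual review.
        return vac2_line
    return "".join(
        "~" if (char == "z" and ref == "~") else char
        for char, ref in zip(vac2_line, vcp_line)
    )


# Made-up strings: the first 'z' is the sibilant, the second stands for candrabindu.
print(restore_candrabindu("dizwAzl lokAn", "dizwA~l lokAn"))  # dizwA~l lokAn
```

Because a 'z' is changed only where vcp.txt shows '~', every genuine cerebral sibilant stays as it is.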

@drdhaval2785
Contributor

I love the way Jim keeps on identifying low hanging fruits, to reduce labour.

@drdhaval2785
Contributor

NYRnm v/s M

I am not sure about the conventions used by VAC and VCP. But I saw some entries in meld, which were differing in these letters only. E.g. saMKyA and saNKyA.

We can derive some stats to check the tendency of the dictionary, and correct the remaining entries to match them.
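
Such stats could be gathered along these lines. The sketch counts, in SLP1 text, an anusvara (M) before a stop against a class nasal (N, Y, R, n, m) before a stop of its own class; the function name and the sample are assumptions, not an existing script.

```python
import re

# Anusvara followed by any stop consonant.
ANUSVARA_BEFORE_STOP = re.compile(r"M[kKgGcCjJwWqQtTdDpPbB]")
# Class nasal followed by a stop of the same class.
CLASS_NASAL_BEFORE_STOP = re.compile(r"N[kKgG]|Y[cCjJ]|R[wWqQ]|n[tTdD]|m[pPbB]")


def nasal_stats(text: str) -> dict:
    """Count both spellings so the dictionary's tendency can be compared."""
    return {
        "anusvara": len(ANUSVARA_BEFORE_STOP.findall(text)),
        "class_nasal": len(CLASS_NASAL_BEFORE_STOP.findall(text)),
    }


print(nasal_stats("saMKyA saNKyA aNka"))  # {'anusvara': 1, 'class_nasal': 2}
```

Running this on vac and vcp separately would show which convention each text leans towards.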

@drdhaval2785
Contributor

drdhaval2785 commented Mar 19, 2021

duplicated / deduplicated

Check the stats in VCP for rxx versus rx, e.g. karmma and karma. We can align both vac.txt and vcp.txt to the more prevalent convention. That would reduce unnecessary meld diffs.

The relevant portion from paper normalization.pdf is as below.

Convention 2 - Duplication of consonants after ’r’.

Option 2.1
Duplication is done in all cases e.g. पूर्व्व.
Dictionaries: SKD, WIL

Option 2.2
Duplication is not done e.g. पूर्व .
Dictionaries: ACC, AP, AP90, BEN, BHS, BOP, BUR, CAE, CCS, GRA, GST, IEG, INM, KRM, MCI, MD, MW, MW72, PD, PE, PGN, PUI, PW, PWG, SCH, SNP, STC, VCP, VEI

Note– (1) SHS and YAT are inconsistent in this convention. See निर्विघ्न / निर्व्विकल्प in SHS and  दुर्वच / दुर्व्वचस् in YAT. (2) VCP highly leans towards option 2.2, but there are a few inconsistent entries as well e.g. पर्वत and अग्निपर्व्वत.

Therefore, we should remove all duplications after r in VCP and VAC.
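
Removing the duplication could be prototyped roughly as follows, working on SLP1 text. This is a sketch under assumptions: the names are hypothetical, and a pass like this cannot tell an optional doubling from one that belongs to the word, so its output would need review.

```python
import re

# SLP1 consonants other than r itself.
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmylvSzsh"
# Identical doubling after r, e.g. rmm, rvv, rtt.
DOUBLED = re.compile(rf"r([{CONSONANTS}])\1")
# Aspirates double as unaspirated + aspirated, e.g. rdD.
ASPIRATE_PAIRS = re.compile(r"r(kK|gG|cC|jJ|wW|qQ|tT|dD|pP|bB)")


def count_doubled(text: str) -> int:
    """Count consonants written doubled after r."""
    return len(DOUBLED.findall(text)) + len(ASPIRATE_PAIRS.findall(text))


def undouble_after_r(word: str) -> str:
    """Reduce a doubled consonant after r to the single consonant."""
    word = DOUBLED.sub(r"r\1", word)
    return ASPIRATE_PAIRS.sub(lambda m: "r" + m.group(1)[1], word)


for w in ["karmma", "pUrvva", "mUrdDanya", "parvata"]:
    print(w, "->", undouble_after_r(w))
```

Comparing `count_doubled` with the count of single consonants after r gives the prevalence stats mentioned above.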

@gasyoun
Member

gasyoun commented Mar 19, 2021

I love the way Jim keeps on identifying low hanging fruits, to reduce labour.

I guess one could call it lexicographical hell otherwise.

We can derive some stats to check the tendency of the dictionary, and correct the remaining entries to match them.

Right, the nasals.

The relevant portion from paper normalization.pdf is as below.

What other issues of normalization.pdf should be applied inside the dictionaries?

@funderburkjim
Contributor Author

षार्वत्यां -> पार्वत्यां was noticed by a user correction (sanskrit-lexicon/csl-orig#495).

There are several other zArvat and zArvvat possible errors in VCP to be investigated.

@gasyoun
Member

gasyoun commented Mar 21, 2021

several other zArvat and zArvvat possible errors

I would propose that the issue is even wider: षा vs. पा

gasyoun added the bug label Mar 21, 2021
@funderburkjim
Contributor Author

Before we make a blanket change of 'pUrbba' to 'pUrvva' , I would like to know that the scanned images actually have 'bb' -- Can anyone find 5 instances where 'pUrbba' is clearly 'b' ?
The few examples that I've seen are not clearly 'b'. So I'm not sure whether the digitizations 'pUrbba' are actually a feature of the Author's spelling, or whether they are a feature of the digitization of unclear print images.

@Andhrabharati

Andhrabharati commented Mar 22, 2021

Guess the following (from the very beginning pages) are enough for the purpose-

<L>44 <pc>0037,b अकडम

[image: scan of the printed entry]

<L>76 <pc>0039,b अकाल

[image: scan of the printed entry]

<L>151 <pc>0044,b अक्षरन्यास

[image: scan of the printed entry]

<L>174 <pc>0045,b अक्षि

[image: scan of the printed entry]

<L>181 <pc>0046,a अक्षिभ्रुव

[image: scan of the printed entry]

@Andhrabharati

Andhrabharati commented Mar 22, 2021

My remark elsewhere is not just limited to this पूर्ब्ब, but to

Therefore, we should remove all duplications after r in VCP and VAC.

as well.

The Eastern school (of India) of grammar and usage has such doublings throughout the literature in (and from) that region.
Probably it all is due to the Mugdhabodha influence.

One may look at the <L>10577 <pc>1458,b ॡ
where Taranatha specifically talks about the लकारद्वय as per मुग्धबोध.
The HW itself is shown as ल्लृ (instead of ॡ as is the practice elsewhere).
[Probably we could find the वर्णद्वययुत रेफ also somewhere mentioned inside the मुग्धबोध.]

@Andhrabharati

What other issues of normalization.pdf should be applied inside the dictionaries?

My opinion is that any kind of normalisation should be done in another layer (for searching and displaying etc.), but not in the actual "content" of the printed matter.

@gasyoun
Member

gasyoun commented Mar 22, 2021

Probably it all is due to the Mugdhabodha influence.

Interesting thought.

My opinion is that any kind of normalisation should be done in another layer (for searching and displaying etc.), but not in the actual "content" of the printed matter.

We have some data in tags added, that's all for now, I guess.

@Andhrabharati

Andhrabharati commented Aug 8, 2021

I have some additional information now, and thought I should share the same here.

(a) The consonant doubling is not prescribed by Mugdhabodha (Vopadeva), but has been identified by Pāṇini himself.
He has framed a sūtra (अचो रहाभ्यां द्वे P. 8.4.46) saying the consonants after r and h can be optionally doubled. It is his method of covering all the regional practices known at his time.

So the replacement of the double consonant after r and h with a single consonant can be taken as grammatically alright.
But I would still suggest retaining the regional variant forms as seen in the books, while recording the non-doubled form as the alternate form in all such cases. This lets a search catch the words without fail and match them with other dictionaries as well.
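
A lookup layer of this kind, which leaves the printed forms untouched and only derives a search key from them, might be sketched like this. The names `search_key` and `build_index` are hypothetical, the headwords are in SLP1, and only identical doubling after r or h is handled.

```python
import re
from collections import defaultdict

# An identical consonant written twice after r or h (SLP1).
DOUBLED_AFTER_R_H = re.compile(r"([rh])([kKgGNcCjJYwWqQRtTdDnpPbBmylvSzs])\2")


def search_key(headword: str) -> str:
    """Derive the non-doubled alternate form used only for searching."""
    return DOUBLED_AFTER_R_H.sub(r"\1\2", headword)


def build_index(headwords):
    """Map each search key to the headwords exactly as printed."""
    index = defaultdict(list)
    for hw in headwords:
        index[search_key(hw)].append(hw)
    return dict(index)


index = build_index(["pUrvva", "pUrva", "karmma"])
print(index["pUrva"])  # ['pUrvva', 'pUrva']
```

A search for pUrva then finds both printed spellings, while the stored text keeps whatever the book has.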

(b) Now coming to the perpetual ba/va issue.
Seen that Bengali script has no separate character for ba and va (both are represented by a single character ব, u+09AC); but Rev. Yates in his Bengali Grammar says thus-

[image: excerpt from Yates's Bengali Grammar]

With this information, we can safely replace the conjunct b with v [for handling the doubling cases, refer to the above point] in all the Bengal-based works (WIL, YAT, SKD, VCP etc.), when such v forms are seen in other regional texts (like AP or the European ones).
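
The condition "when such v forms are seen in other regional texts" could be enforced in code roughly as follows. This is a sketch under assumptions: `reference_forms` stands for a word list taken from some other dictionary, the function name is made up, and the words are in SLP1.

```python
import re

CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
# A 'b' that sits in a conjunct: directly after or directly before a consonant.
CONJUNCT_B = re.compile(rf"(?<=[{CONSONANTS}])b|b(?=[{CONSONANTS}])")


def prefer_v(word: str, reference_forms: set) -> str:
    """Return the v-spelling only if that spelling is attested in reference_forms."""
    candidate = CONJUNCT_B.sub("v", word)
    if candidate != word and candidate in reference_forms:
        return candidate
    return word


# Hypothetical reference list standing in for another dictionary's headwords.
reference = {"pUrva", "Sabda"}
print(prefer_v("pUrba", reference))  # pUrva
print(prefer_v("Sabda", reference))  # Sabda
```

The attestation check is what keeps a word such as Sabda from being rewritten to a form that no other text has.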

@Andhrabharati

Andhrabharati commented Aug 9, 2021

The issue is still lingering in my mind.

Probably we can do the va/ba replacement (and the reverse case, ba/va as well) in non-conjunct places too, say klIva to klIba, if such are the forms used in texts from other regions.

On the whole, it appears to be not a deliberately different form in Bengali works but just a limitation of their orthography. And then the outsiders took the letters as is, without understanding or knowing the Bengali limitation.

This thus treats the va-ba issue in toto once for all, I believe.

what do you say, @drdhaval2785?

@drdhaval2785
Contributor

This was discussed elsewhere.
I was against changing b/v then.
Thereafter a strong argument was put forth.

  1. For Bengal, there is no difference of b/v. So, they would not notice the change.
  2. For the rest of the world, changing klIva to klIba would make the text more congruent with their expectation.

So there is nothing to be lost, but everything to gain by this b/v change.

I am now convinced that we should make changes, and am making such changes in my VAC VCP comparison work.

@Andhrabharati

Good, and now I also have to take back what I said a few months back, that I cannot be a part of the team's exercise with the change suggested by Jim or you (in one of the issues on Meld usage).

After you finish your comparison work, I would be glad to proofread the VCP text, for the benefit of everyone.

@Andhrabharati

Andhrabharati commented Aug 9, 2021

As I am looking into SKD front pages now, found this piece of info under the section ग्रन्थपरिपाटी (Methodology adopted)-

वर्णमालायां च वर्ग्य-जकारान्तःस्थयकारौ मूर्द्धन्य-णकार-दन्त्यनकारौ वर्ग्यवकारान्तःस्थवकारौ तालव्यशकार-मूर्द्धन्यषकारदन्त्यसकाराः सन्ति ।

एतदखिल-वर्णादि-शब्दानां धातूनाञ्च प्रभेदं कृत्वा सूचीपूर्व्वकं यथास्थानं संस्थापनं कृतवान् । वङ्गदेशे उक्तवर्णानामुच्चारण-भेदाभावः । विशेषतो वकारद्वयस्याकारोच्चारणयोर्भेदो नास्ति पश्चिमादिदेशे वर्त्तते । किन्तु मुग्धबोधटीकायां दुर्गादासविद्यावागीशधृता वकार-भेदिकैकप्राचीनकारिकास्ति । सा यथा, --
“उदूटौ यत्र विद्येते यो वः प्रत्ययसन्धिजः । अन्तःस्थं तं विजानीयात्तदन्यो वर्ग्य उच्यते” ॥
एतत्कारिकया सकलवकार-प्रभेदो न भवति । इति हेतोरहं धातूनां शब्दाकरत्वादोष्ठ्यदन्त्य-वकारादिधातुद्वारा पदसाधनं कृत्वा बहु-प्रयत्नैर्वकारद्वय-भेदं प्रकाशितवान् रेफयुक्तवर्णन्तु रवर्णात् परं शब्द-सूचीमध्ये विन्यस्तवान् ॥

Here, we are cautioned not to change every va/ba-kAra blindly (एतत्कारिकया सकलवकार-प्रभेदो न भवति).

BTW, contextually the वर्ग्यवकारान्तःस्थवकारौ in the above text should not be changed to वर्ग्यबकारान्तःस्थवकारौ (as this has been referred to a few lines later as वकारद्वय-भेदं), though there is no "vargya-va" in the rest of India.

@Andhrabharati

Andhrabharati commented Aug 9, 2021

Here SKD is giving the prevalent practice in Bengal that (j,y), (N,n), (b,v) and (S,z,s) groups [वर्ग्य-जकारान्तःस्थयकारौ मूर्द्धन्य-णकार-दन्त्यनकारौ वर्ग्यवकारान्तःस्थवकारौ तालव्यशकार-मूर्द्धन्यषकारदन्त्यसकाराः] to be without a difference in pronunciation.

Wilson in his dictionary (1st ed., 1819) preface quotes thus-

रलयोर्डलयोस्तद्वज्जययोर्बवयोरपि ।
शसयोर्मनयोश्चान्ते सविसर्गाविसर्गयोः ।
सविन्दुकाविन्दुकयोः स्यादभेदे न कल्पनम् ॥
(ralayor ḍalayos tadvajjayayor bavayorapi |
śasayor manayoś cānte savisargāvisargayoḥ |
savindukāvindukayoḥ syādabhede na kalpanam ||)

“The letters R and L, D and L, J and Y, B and V, Ś and S, M and N, a final visarga or its omission, and a final nasal mark or its omission, are always optional, there being no difference between them.”

Thus Wilson has covered a larger set of regional variations in India than SKD.

Thought @funderburkjim might catch a piece or two (with his interest in "knowing" Skt.) through my posts, which could be of some help in cleaning the CSL texts.
[BTW is it CSL or CDSL that is to be used, to refer to this lexicon project of Cologne? I see both acronyms, somewhere or other on this forum and at the site.]

@Andhrabharati

@drdhaval2785,

With the above information before us, what is your opinion about changing कोष to कोश?

This has been a long pending issue in my mind.
[Now looks like the reason is found.]

@drdhaval2785
Contributor

As both are valid words, I would not change कोष or कोश

drdhaval2785 added a commit to sanskrit-lexicon/csl-orig that referenced this issue Aug 11, 2021
drdhaval2785 added a commit that referenced this issue Aug 11, 2021
@funderburkjim
Contributor Author

May this issue be closed?
