abbreviation preparation #11

Open
funderburkjim opened this issue Oct 28, 2020 · 19 comments

@funderburkjim
Contributor

funderburkjim commented Oct 28, 2020

Via email, a user, James, expressed an interest in identifying the abbreviations in the Vacaspatyam dictionary.

In part, he said:

One thing we could try, and would probably be fairly fruitful.  If from your side you could create a 
list of the abbreviations, then I could see if I can crowdsource the names of the referenced texts  
(the expansions you mention) through our Indology list. 
There's an amazing amount of knowhow on the list, one is continually surprised 
by the depth and breadth of responses to really arcane questions.

This issue is devoted to getting started with this.

The basic problem is that there is no known list of abbreviation expansions for VCP.

@funderburkjim
Contributor Author

Identification of abbreviations in VCP

Abbreviations in VCP may be identified as words ending in the digit '0'.

There are about 4000 distinct such abbreviations in our digitization vcp.txt.
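
The identification rule above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the counts: it assumes the text is in SLP1 (ASCII letters), that an abbreviation is a run of letters immediately followed by '0', and it runs on made-up sample lines rather than on vcp.txt itself.

```python
import re
from collections import Counter

# Assumption: abbreviations are runs of ASCII letters (as in SLP1) ending in '0',
# and the '0' is not the start of a longer number.
ABBREV_RE = re.compile(r"[A-Za-z]+0(?![0-9])")


def count_abbreviations(lines):
    """Return a Counter mapping each abbreviation to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(ABBREV_RE.findall(line))
    return counts


# Made-up sample lines, loosely modelled on the start of a verb entry.
sample = [
    "gama gatO BvA0 para0 aniw",
    "BU sattAyAm BvA0 para0 sew",
]
counts = count_abbreviations(sample)
for abbrev, n in counts.most_common():
    print(abbrev, n)
```

The number of distinct abbreviations is then `len(counts)`. A real run would also have to decide how to handle spacing and spelling variants of the same abbreviation.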

Grammar and Literary Source

Examination of a few examples leads me to think that there are basically 2 types of abbreviations:

  • Grammatical. For example, 'BvA0' for 'BvAdi' (the 'class 1' verbs)
    • Mostly, I think, these occur on the first couple of lines of an entry.
    • For example, with root 'gama', the first two lines are
      गम गतौ भ्वा० पर० अनिट् । गच्छति ऌदित् अगमत् ज-
      गाम जग्मतुः । गन्ता गम्यात् गमिष्यति । गन्ता गमी
      
  • Literary sources. These usually occur later in an entry (after the first two lines)

The 'first-two-lines' rule is just a rough preliminary indication of whether a given abbreviation should
be thought of as grammatical or as a literary source.
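
As a rough sketch of the 'first-two-lines' rule, the following Python counts each abbreviation separately in the first two lines of an entry and in the remaining lines. The entry representation (a list of lines per entry), the sample data, and the majority-vote guess in `guess_kind` are all assumptions made for illustration; the rule itself is only a preliminary indication.

```python
import re
from collections import Counter

# Assumption: SLP1 text, abbreviation = ASCII letters followed by '0'.
ABBREV_RE = re.compile(r"[A-Za-z]+0(?![0-9])")


def split_counts(entries):
    """Count abbreviations in the first two lines vs. later lines of each entry.

    `entries` is a list of entries, each entry being a list of text lines.
    """
    early, late = Counter(), Counter()
    for entry_lines in entries:
        for index, line in enumerate(entry_lines):
            target = early if index < 2 else late
            target.update(ABBREV_RE.findall(line))
    return early, late


def guess_kind(abbrev, early, late):
    """Hypothetical heuristic: call it grammatical if it is mostly seen early."""
    return "grammatical" if early[abbrev] > late[abbrev] else "literary source"


# Made-up entries; 'manu0' and 'BAga0' are placeholder source abbreviations.
entries = [
    ["gama gatO BvA0 para0 aniw", "gacCati agamat", "tatra manu0 udAharaRam"],
    ["BU sattAyAm BvA0 para0", "Bavati aBUt", "manu0 BAga0 ca"],
]
early, late = split_counts(entries)
print(guess_kind("BvA0", early, late))
print(guess_kind("manu0", early, late))
```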

funderburkjim added a commit that referenced this issue Oct 28, 2020
@funderburkjim
Contributor Author

a first display

Please see abbrev0_roman_100.md.

The file name indicates certain details of the display:

  • The Sanskrit words are represented in Roman Unicode (IAST)
    • Other forms could use Devanagari or SLP1
  • The file and its links are written in markdown
    • another option would be html
  • Only abbreviations with at least 100 observed instances are included. There are 154 such abbreviations
    • There are 3900+ distinct abbreviations (at least 1 observed instance)
    • There are 500+ distinct abbreviations with 10 or more instances.

154 abbreviations is a workable number. If we make progress on these most common abbreviations, then
we can later tackle the less common ones.
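
The frequency cut-offs mentioned above (at least 100, 10, or 1 observed instances) amount to a simple filter over a table of counts. A minimal sketch, using invented counts rather than the real VCP figures:

```python
from collections import Counter


def at_least(counts, minimum):
    """Return abbreviations with at least `minimum` instances, most frequent first."""
    selected = [abbrev for abbrev, n in counts.items() if n >= minimum]
    return sorted(selected, key=lambda abbrev: (-counts[abbrev], abbrev))


# Invented counts for illustration only.
counts = Counter({"BvA0": 120, "para0": 100, "manu0": 15, "xyz0": 1})

for threshold in (100, 10, 1):
    print(threshold, len(at_least(counts, threshold)))
```

Lowering the threshold in later rounds would bring the less common abbreviations into the display without changing anything else.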

@funderburkjim
Contributor Author

structure of first display

The display is a table whose columns are:

  • a sequence number
  • the abbreviation
  • the number of occurrences of the abbreviation in the first 2 lines of some entry
  • the number of occurrences of the abbreviation not in the first 2 lines of some entry
  • Up to 5 headword links where the abbreviation occurs in first 2 lines
  • Up to 5 headword links where the abbreviation occurs after the first 2 lines

The links show one of the Cologne displays for the word in the VCP dictionary.
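
For illustration, one row of such a table could be generated as follows. The function names and the `base_url` value are hypothetical (the real links point at a Cologne display whose URL is not given here); only the column layout and the limit of 5 links per cell follow the description above.

```python
def headword_links(headwords, base_url, limit=5):
    """Format up to `limit` headwords as markdown links."""
    return " ".join(f"[{hw}]({base_url}{hw})" for hw in headwords[:limit])


def table_row(seq, abbrev, early_count, late_count, early_hws, late_hws, base_url):
    """Build one markdown table row with the six columns described above."""
    cells = [
        str(seq),
        abbrev,
        str(early_count),
        str(late_count),
        headword_links(early_hws, base_url),
        headword_links(late_hws, base_url),
    ]
    return "|" + "|".join(cells) + "|"


# Placeholder URL and made-up headwords, for illustration only.
BASE_URL = "https://example.org/vcp?key="
row = table_row(1, "BvA0", 6, 0, ["gama", "BU", "a", "b", "c", "d"], [], BASE_URL)
print(row)
```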

@funderburkjim
Contributor Author

I've communicated back by email to James. We'll have to see what else might be needed to
facilitate crowd-sourcing.

@gasyoun
Member

gasyoun commented Oct 30, 2020

The basic problem is that there is no known list of abbreviation expansions for VCP.

There are some hints. Remember the Tirupati edition files? They had some literary sources. abbrev0_roman_100.md is very well done.

154 abbreviations is a workable number. If we make progress on these most common abbreviations, then
we can later tackle the less common ones.

Agree.

@funderburkjim
Contributor Author

As a reminder, vac.txt is our copy of the Tirupati edition.

First glance at vac.txt shows markup with tags such as

  • vkr : vikrama = grammatical information

It would be useful to

  • get a list of all the tags used in Tirupati markup,
  • estimates of what the tags stand for (like 'vkr' stands for 'vikrama') and
    perhaps how the tags could be made use of.
    • Need help from a Sanskrit grammarian here.

I don't see definite markup of the abbreviations in vac.txt.
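
A first inventory of the tags could be gathered with a short script like the one below. It is a sketch under assumptions: that tags are written in angle brackets (as in `<vkr>`), that each has a matching closing form such as `</vkr>`, and that counting opening tags is enough. The sample lines and the `hw` tag are invented; the actual markup in vac.txt may differ.

```python
import re
from collections import Counter

# Assumption: an opening tag is '<', a name starting with a letter, then '>'.
# Closing tags such as '</vkr>' are skipped because '/' is not a letter.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def tag_inventory(lines):
    """Return a Counter of opening-tag names found in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(TAG_RE.findall(line))
    return counts


# Invented sample lines; 'hw' is a hypothetical tag name.
sample = [
    "<hw>gama</hw> <vkr>BvA0 para0</vkr>",
    "<hw>BU</hw> <vkr>BvA0</vkr>",
]
tags = tag_inventory(sample)
for name, n in sorted(tags.items()):
    print(name, n)
```

The resulting list of names is what a Sanskrit grammarian would then need to interpret.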

@Andhrabharati

Andhrabharati commented Mar 15, 2021

First glance at vac.txt shows markup with tags such as

* vkr  :  vikrama  = grammatical information

<vkr> is not for विक्रम; it is for व्याकरण.

@Andhrabharati

Andhrabharati commented Mar 15, 2021

I've communicated back by email to James. We'll have to see what else might be needed to
facilitate crowd-sourcing.

Any result from this crowd-sourcing?

More often than not, our observation is that nothing gets started once the work is "allotted".
----------------

Only abbreviations with at least 100 observed instances are included. There are 154 such abbreviations

* There are 3900+ distinct abbreviations  (at least 1 observed instance)

* There are 500+ distinct abbreviations with 10 or more instances.

May I ask @funderburkjim to post the complete list here (preferably in Devanagari)?

I tried making one such list myself.
VCP abbreviations extracted.txt

1. This shows that many entries have spelling and spacing errors (I removed some of them, but then stopped).
2. Also, quite a few of these could be clubbed together as comp. abbr.s, instead of being kept as separate ones.
3. Many are variant forms of the same "source".
4. And finally, quite a few others lack the trailing '0', either in the text or in the print itself.

@Andhrabharati

@gasyoun
If you can trace your Tirupati CD, can you post a link to download it?
I also purchased the CD, but need to locate it.

@funderburkjim
Contributor Author

post the complete list here (preferably in Devanagari)?

A complete Devanagari list (as a GitHub markdown table) is in two parts:

The one-part form is also prepared, but is too big for GitHub to display properly.

There is also a simpler list, with each abbreviation and its frequency,
at
abbrev0_deva_all.txt.

This should be comparable to VCP.abbreviations.extracted.txt from @Andhrabharati (see a previous comment for link).
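
One way to compare two such lists is with plain set operations. The sketch below assumes both lists have already been read into Python and reduced to the bare abbreviation strings; the file formats are not specified here, so loading is left out, and the sample values are invented.

```python
def compare_lists(ours, theirs):
    """Split two collections of abbreviations into shared and unshared items."""
    ours, theirs = set(ours), set(theirs)
    return {
        "both": sorted(ours & theirs),
        "only_ours": sorted(ours - theirs),
        "only_theirs": sorted(theirs - ours),
    }


# Invented sample data for illustration.
result = compare_lists(
    ["BvA0", "para0", "manu0"],
    ["BvA0", "manu0", "BAga0"],
)
for key, items in result.items():
    print(key, items)
```

Items that appear in only one list are natural candidates for spelling or spacing errors.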

@gasyoun
Member

gasyoun commented Mar 15, 2021

I don't see definite markup of the abbreviations in vac.txt

The Tirupati people are known for poor documentation; it is the same with the Tirupati digital edition of the Ramayana.

If you can trace your Tirupati CD, can you post a link to download it?

The file we have at Cologne is based on the CD Usha sent me. I know nothing else about it.

Analysis done in 2014.

f_WX.txt
Vacaspatyam_15_01_2014_b1.xlsx
Vachaspatyam.xlsx
Vachaspatyam_b3_with_dev.xlsx
Vachaspatyam_b4_without_dev.xlsx
Vachaspatyam_b5_proof_1673.xlsx
Vachaspatyam_b6_proof_1673-06-01-14.xlsx

@funderburkjim
Contributor Author

funderburkjim commented Mar 16, 2021

The file we have at Cologne is based on the CD Usha sent me

The Tirupati vacaspatyam I started with in this repository is vac_input.txt. According to my notes in the readme.org of vcpte-vac,

By some unknown process, Scharf and colleagues reformatted and modified
presumably the same original Tirupati edition of Vacaspatyam.

@gasyoun
Member

gasyoun commented Mar 16, 2021

Scharf and colleagues reformatted and modified presumably the same original Tirupati edition of Vacaspatyam.

Oh, so you believe the two versions have the same source initially.

@funderburkjim
Contributor Author

The Tirupati version I got from Scharf had already been put into SLP1. I don't know what source Peter started with, but since you mentioned the existence of a CD made by Tirupati, it may be that Peter started from that CD.

@gasyoun
Member

gasyoun commented Mar 16, 2021

Tirupati version I got from Scharf had already been put into SLP1

Oh, OK, because when I saw the CD it was in that funny WX encoding. And it contained not only the dictionary file, but several additional files, including the Preface.

@Andhrabharati

Finally done with the first phase of Vacaspatyam corrections, starting mainly with the abbr. markers, in a focused effort over two weeks; the summary is in the file below:

Phase-1 of work on Vacaspatyam.txt

Almost all the abbr.s are resolved now!!

It appears that the present Cologne data has missed the dual/variant forms of the HWs (marked in parentheses in the print), and I have also noticed many errors.

Hence it is desirable to correct the HWs portion (before touching the body portion), which I would like to take up in the next few days.

@gasyoun
Member

gasyoun commented Apr 21, 2021

HWs portion (before touching the body portion)

Yeah, headwords is what comes first. Thanks for the hard work @Andhrabharati

Almost all the abbr.s are resolved now!!

But where to look for them?

@funderburkjim
Contributor Author

@Andhrabharati There are a lot of 'extra' headwords at
https://github.com/sanskrit-lexicon/csl-orig/blob/master/v02/vcp/vcp_hwextra.txt

These probably include many of the 'dual/variant forms' .

@Andhrabharati

Yes @funderburkjim, I've seen this file as well as the Vachaspatyam-Doubles-16.3.15.xlsx file from @gasyoun.

As far as I can see, there are some errors in both files.

So I decided to do it again myself, while looking for HW errors throughout.
