Fresh Look, starting with <is> tag #95

Closed
funderburkjim opened this issue Jul 17, 2023 · 52 comments

@funderburkjim
Contributor

Work initially related to sanskrit-lexicon/CORRECTIONS#419.

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jul 18, 2023
@funderburkjim
Contributor Author

funderburkjim commented Jul 18, 2023

Work in this issue is done in the pwkissues/issue95 directory of this repository.

start with latest pw.txt

@Andhrabharati please start with the latest csl-orig/v02/pw/pw.txt. A few (19) changes were made during development of the transcode script. You could name this file 'temp_pw_0.txt'.

transcode script

The pw_transcode.py script converts pw from one transcoding to another.

First, make the 'pwtranscode' directory the current directory in your terminal.

  • python pw_transcode.py slp1 roman PATH-TO-PW_SLP1.TXT PATH-TO-PW_IAST.TXT
  • python pw_transcode.py roman slp1 PATH-TO-PW_IAST.TXT PATH-TO-PW_SLP1.TXT
  • python pw_transcode.py slp1 deva PATH-TO-PW_SLP1.TXT PATH-TO-PW_DEVA.TXT (mw-style)
  • python pw_transcode.py deva slp1 PATH-TO-PW_DEVA.TXT PATH-TO-PW_SLP1.TXT (mw-style)
  • See comment below
    • python pw_transcode.py slp1 deva1 PATH-TO-PW_SLP1.TXT PATH-TO-PW_DEVA.TXT (pw-style)
    • python pw_transcode.py deva1 slp1 PATH-TO-PW_DEVA.TXT PATH-TO-PW_SLP1.TXT (pw-style)

Note 1: If you convert from slp1 to iast and then (without making changes to the iast version)
immediately convert the iast version back to slp1 (under a differently named file), then that differently
named file and the original file should be identical.

Note 2: Conversion is applied to (a) both the k1 and k2 fields of metaline and (b) the {#X#} elements of the text.
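
Notes 1 and 2 can be illustrated with a minimal sketch. This is not the code of pw_transcode.py; the function name `transcode_line` and the two-letter toy mapping are assumptions made only for illustration, and the real conversion rules live in the transcoder XML files.

```python
import re

def transcode_line(line, convert):
    """Apply `convert` only where Note 2 says conversion happens.

    Metalines (starting with <L>): the k1 and k2 field values.
    All other lines: the content of each {#...#} element.
    """
    if line.startswith("<L>"):
        return re.sub(r"<(k1|k2)>([^<]*)",
                      lambda m: "<" + m.group(1) + ">" + convert(m.group(2)),
                      line)
    return re.sub(r"\{#(.*?)#\}",
                  lambda m: "{#" + convert(m.group(1)) + "#}",
                  line)

# Hypothetical, deliberately tiny invertible mapping (NOT the real slp1/roman rules).
_FWD = str.maketrans({"A": "ā", "S": "ś"})
_BACK = str.maketrans({"ā": "A", "ś": "S"})

def toy_to_roman(text):
    return text.translate(_FWD)

def toy_to_slp1(text):
    return text.translate(_BACK)

def round_trips(lines):
    """Note 1: converting forth and back should reproduce every line."""
    return all(
        transcode_line(transcode_line(line, toy_to_roman), toy_to_slp1) == line
        for line in lines
    )
```

With the real rule files, the same round-trip idea amounts to running the script twice and comparing the final file with the original.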

@Andhrabharati

I had taken the recent pw.txt from csl-orig, for my present working.

Will incorporate the 19 changes done by you now in my file.

And, a big "Thank you" for the conversion scripts.

@Andhrabharati

Noted that 8 of 19 were already changed during my working.

@funderburkjim
Contributor Author

retain line-numbering

As with the work on Gra, request you maintain the line numbering in revisions of pw.txt. Then at the end of the pw revisions, we can remove unneeded blank lines.

@funderburkjim
Contributor Author

@Andhrabharati So I can follow your comments (such as at sanskrit-lexicon/CORRECTIONS#419 (comment)),
why don't you upload a zip of your current pw.txt.

@Andhrabharati

Andhrabharati commented Jul 19, 2023

@funderburkjim

Quite a bit of work still remains to clean up the data before I can give out my prelim. file.

I had only looked at the portions marked as italic; there are quite many places not marked so in the text (but are in italics, in print) that need to be identified.

My present focus is on marking the abbr.s inside italics as well as outside.

Pl. wait for few more days.

Meanwhile, you may start looking/working on BHS, which I had made earlier & recently 'marked' citation numbers after GRA.
Shall I post the same at the csl-devanagari repo (that discussed this point), which you could then take-up in a relevant repo?

@Andhrabharati

Also quite many places are not marked with is-tag!!

@Andhrabharati

Andhrabharati commented Jul 19, 2023

@funderburkjim

Just tried converting the slp1 file at my end, and got

<L>1<pc>1001-1<k1>अ<k2>अ<h>1<e>000
1. {#अ#}¦ Pron. der 3ten Person. Davon {#अस्मै॑ , अस्यै॑ , अस्मा॑त् , अस्या॑स् , अस्य॑ , अस्मि॑न् , अस्या॑म् , आभ्या॑म् , एभि॑स् , आभि॑स् , एभ्य॑स् , आभ्य॑स् , एषा॑म् , आसा॑म् एषु॑ , आसु॑#} {%Diesem , diesem hier%} <ab>u.s.w.</ab> Unbetont <ab>Subst.</ab> {%ihm , ihr%} <ab>u.s.w.</ab> <ab>Vgl.</ab> {#अयम् , अया , इदम् , इम , इयम् , एन , एना#}.
<LEND>

instead of [from my earlier file pw_AB_08.txt]

<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000
1. {#अ#}¦ Pron. der 3ten Person. Davon {#अस्मै꣫, अस्यै꣫, अस्मा꣫त्, अस्या꣫स्, अस्य꣫, अस्मि꣫न्, अस्या꣫म्, आभ्या꣫म्, एभि꣫स्, आभि꣫स्, एभ्य꣫स्, आभ्य꣫स्, एषा꣫म्, आसा꣫म् एषु꣫, आसु꣫#} {%Diesem, diesem hier%} <ab>u. s. w.</ab> Unbetont <ab>Subst.</ab> {%ihm, ihr%} <ab>u. s. w.</ab> <ab>Vgl.</ab> {#अयम्, अया, इदम्, इम, इयम्, एन, एना#}.
<LEND>

The remark is about the Vedic svara conversion.

Probably the underlying "rule" files (in the transcoder folder) are not the ones that we had 'finalised' earlier for the PW group.

Would you pl. check this once?

@funderburkjim
Contributor Author

Would you pl. check this once?

pw-style devanagari accents

@Andhrabharati Yes, you are right regarding conversion.
To get the 'pw' style devanagari accents, you will need to

  • download the 'deva1' versions of transcoding rules
    • pwtranscode/transcoder/slp1_deva1.xml and deva1_slp1.xml
  • Invoke the python pw_transcode.py script with 'deva1' as a parameter, rather than 'deva'.

I think this will solve that problem.


Note: 1 typo noticed, under (slp1) <L>132690<pc>7240-3<k1>svardfS, {#suArdf/Z#} -> {#suArdf/S#}

@funderburkjim
Contributor Author

Shall I post the same at the csl-devanagari repo

Sure, go ahead. I'll take a look.

Also please note that I need a posting of your current pw; so I can respond to the <is n="abbrev">X</is> question you raised.

@Andhrabharati

@funderburkjim

Posted my BHS file at the relevant repo.

@Andhrabharati

Got 9 more abbr. type is-entities, in the non-italic part (while checking the dot-ending words)--

<is n="Acchāvāka">A.</is>
<is n="Iṣṭi">I.</is>
<is n="Kālidāsa">K.</is>
<is n="Magundī">M.</is>
<is n="Tīrtha">T.</is>
<is n="Trigarta">Tr.</is>
<is n="Uṣṇih">U.</is>
<is n="Uttaraphalgunī">Uttaraph.</is>
<is n="Virāj">V.</is>

And, noted that some entries listed in the pwis_mw.txt are in fact typos.
This prompts me to look at the full set of <is>-words now, to "close" the issue.

@Andhrabharati

And, noted that some entries listed in the pwis_mw.txt are in fact typos.
This prompts me to look at the full set of <is>-words now, to "close" the issue.

Just showing an example word on this point, <is>Kānda</is>--

<L>81565<pc>5003-3<k1>maNgalika<k2>maNgalika/<e>100
{#maNgalika/#}¦ (wohl <lex>n.</lex>) <ab>Pl.</ab> vielleicht <ab>Bez.</ab> {%der Lieder des 18ten%} <is>Kānda</is> im <ls>AV.</ls>
<LEND>

image

MW entry for kānda is

image

And, the MW entry for maṅgalika is

image

Finally, the pwk print has this as

image

Thus, we can see that this entry has both a typo error (Kānda) as well as a print error (Kāṇda) in pwk, whereas it should’ve been Kāṇḍa.

@Andhrabharati

Andhrabharati commented Jul 21, 2023

BTW, this above example reminds me of the very initial comments on the ls-entity display of PWG (and pwk) posted by me--

first and next (Note 4)

But it appears that either these posts have skipped Jim's attention, or he didn't see any value in this point.

I feel REALLY bad whenever I see Rv, Av, etc. on CDSL PWG/pwk search results, while the MW display renders them 'appropriately' as RV, AV etc..

@Andhrabharati

Andhrabharati commented Jul 21, 2023

And, noted that some entries listed in the pwis_mw.txt are in fact typos.

The first entry in which I noted this discrepancy between the is-words and the mw-words is dvipa, which occurs 25 times, either by itself (Dvipa 6 times-- all in error) or as part of another word (dvipa 19 times-- all marked as notmw); whereas it should've been Dvīpa or dvīpa respectively.
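
A count of this kind could be reproduced with a short sketch like the one below. It is only an illustration under assumptions: the helper name `is_words` is hypothetical, and it assumes the <is> elements are not nested and do not span lines.

```python
import re
from collections import Counter

# Matches <is>X</is> and <is n="...">X</is>, capturing X.
IS_TAG = re.compile(r"<is(?:\s[^>]*)?>(.*?)</is>")

def is_words(lines):
    """Count every string found inside an <is> element."""
    counts = Counter()
    for line in lines:
        counts.update(IS_TAG.findall(line))
    return counts

def suspects(counts, needle="dvipa"):
    """List the is-strings containing `needle`, ignoring case."""
    return sorted(w for w in counts if needle in w.lower())
```

Running `suspects(is_words(lines))` over the lines of pw.txt would list the candidate strings to check against the print.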

@Andhrabharati

Andhrabharati commented Jul 23, 2023

retain line-numbering

As with the work on Gra, request you maintain the line numbering in revisions of pw.txt. Then at the end of the pw revisions, we can remove unneeded blank lines.

Sorry for having 'violated' this, @funderburkjim !

Rather, I haven't violated but just implemented the style I started in GRA, in this pw as well.

I have started with minimal line-number changes (limited to 'embedding' [Pagexxxx] into other lines), for now; but I have more changes in mind, to prepare this pw in a "standard style" to be followed in the other CDSL works as well.

Feel free to revise any parts of gra9. Go ahead and add entries such as the missed headwords.

… … … …

Friendly reminder - keep a note file as you change; this will be a guide to me of your changes. These notes will be helpful to me in constructing the displays from your revised version. The 'tags' files you provided will also find use when the display programs are revised.

@Andhrabharati The baton is now in your hands for the next leg of this Grassman marathon. Good luck!

Hope you'd allow me a 'free-hand'(!!) here also, as done at GRA recently.

@Andhrabharati

Andhrabharati commented Jul 23, 2023

  1. Any line starts only with one of the 5 types-- <L>, Header, <div, <F> and <LEND>
  2. No blank lines are present within the entry portion; and just a single blank line is present when a new entry starts.

These two were the binding-principles reg. the text-lines that I followed in the pw.txt file, and did the following replacements--

image

and this can be taken as my starting file, [the split lines are marked as ;; split]--

pw_CDSL_0.zip

Is this in compliance with our earlier posts 1 and 2?
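
The two binding principles above can be checked mechanically. The following is a minimal sketch, not an existing CDSL script: the function name `check_layout` is hypothetical, and it assumes that "Header" means the single line immediately following the metaline.

```python
def check_layout(lines):
    """Return (line_number, message) pairs for violations of the two principles."""
    problems = []
    in_entry = False     # True between a metaline and its <LEND>
    after_meta = False   # True when the previous line was the metaline
    blank_run = 0        # consecutive blank lines seen between entries
    for n, line in enumerate(lines, start=1):
        if in_entry:
            if line == "":
                problems.append((n, "blank line inside entry"))
            elif line.startswith("<LEND>"):
                in_entry = False
                blank_run = 0
            elif not after_meta and not line.startswith(("<div", "<F>")):
                # Only the line right after the metaline (the assumed
                # 'Header') may start with arbitrary text.
                problems.append((n, "unexpected line start"))
            after_meta = False
        elif line == "":
            blank_run += 1
            if blank_run > 1:
                problems.append((n, "extra blank line between entries"))
        elif line.startswith("<L>"):
            in_entry = True
            after_meta = True
        else:
            problems.append((n, "text outside entry"))
    return problems
```

An empty result means the file follows both principles as interpreted here; a missing blank line between entries is not flagged by this sketch.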

@Andhrabharati

Andhrabharati commented Jul 23, 2023

Hope you'd allow me a 'free-hand'(!!) here also, as done at GRA recently.

If you have other thoughts, I shall post only the relevant <is and <ab strings, retaining the cdsl text 'form' as is, though it amounts to a minor rework at my end (withholding my current plan).

If you happen to agree (I just hope you would!), then I shall start posting what all I have done so far [having finished the abbr. portion], and my prelim. file.

@funderburkjim
Contributor Author

funderburkjim commented Jul 24, 2023

pw_cdsl_0

These observations based on work in pwkissues/issue95/compare0 directory.

Generation of displays (locally) using pw_CDSL_0 encounters no problems. The generated pw.xml validates with pw.dtd. Great!

A couple of minor observations:
I renamed the file pw_CDSL_0.txt to temp_pw_ab_0.txt

  1. At line 106035 replace <L>55397 with a blank line
  2. When comparing with current pw.txt, I did not see any extra lines corresponding to ';; split'
    • for instance at <L>13120, both the AB version and the current pw.txt have 4 lines. So why the ';; split'?
  3. I found no lines starting with < at second character, e.g. I found no lines starting with image, so the comment above is confusing.

Seems ok to proceed with further revisions.

@Andhrabharati

Andhrabharati commented Jul 24, 2023

2. When comparing with current pw.txt, I did not see any extra lines corresponding to ';; split'

* for instance at <L>13120, both the AB version and the current pw.txt have 4 lines. So why the ';; split'?

@funderburkjim

pw_CDSL_0.txt is not the version that I am working with; it is just regenerated from pw.txt to match the lines with my AB file.

Here is the screenshot comparing the two files--

image

And you may see the split in my AB file at "Mit {#kar#}", breaking the prev. line into two lines.

3. I found no lines starting with < at second character, e.g. I found no lines starting with image, so the comment above is confusing.

There are no lines starting with •<ab etc.

My comment clearly shows the replacement of line starting with <ab getting merged into the prev. line as •<ab
\n<ab -> •<ab (total 102 such replacements)

See the first such occurrence at lines 24204-6 in pw.txt

<div n="1">— 2) Praep. mit
[Page1063-1]
<ab>Acc.</ab>

that get merged in pw_CDSL_0.txt (line 23955) as
<div n="1">— 2) Praep. mit •[Page1063-1] •<ab>Acc.</ab>

These are (almost) all the cases of what I mentioned above as "limited to 'embedding' [Pagexxxx] into other lines".
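
The merge shown above can be sketched as follows. This is an illustrative reconstruction from the single example, not the procedure actually used: the function name `embed_page_lines` is hypothetical, and the rule for which following lines get pulled up (anything not starting with <L>, <div, <F> or <LEND>) is an assumption.

```python
import re

PAGE = re.compile(r"^\[Page[^\]]+\]$")
STRUCTURAL = ("<L>", "<div", "<F>", "<LEND>")

def embed_page_lines(lines):
    """Merge stand-alone [Page...] lines, and a non-structural line
    directly after them, into the preceding line, marking each join with '•'."""
    out = []
    joined_page = False  # previous input line was a [Page...] line
    for line in lines:
        is_page = bool(PAGE.match(line))
        structural = line == "" or line.startswith(STRUCTURAL)
        if out and (is_page or (joined_page and not structural)):
            out[-1] += " •" + line
        else:
            out.append(line)
        joined_page = is_page
    return out
```

Because every join is marked with '•', the original line breaks can be restored later by splitting at that character.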

@Andhrabharati

Generation of displays (locally) using pw_CDSL_0 encounters no problems. The generated pw.xml validates with pw.dtd. Great!

This file has no "real" changes made, except the line mergers at [Pagexxxx].

@funderburkjim
Contributor Author

Got it. Ready for 'real' changes.

@Andhrabharati

Andhrabharati commented Jul 24, 2023

Here is the prelim. file to go through meanwhile (as you had done with my GRA file earlier, without any notes).

pw (AB v1).zip

This can be used to check and workout the abbr. expansions, if nothing else.
[Probably Thomas and/or Felix Rau could be reached out to help in the process.]

I will start posting the notes from tomorrow morning, indicating the various changes that went into the file to get the prelim. file at my end (as of now); I am too tired now.
[It is just past mid-night here.]

@funderburkjim
Contributor Author

compare metalines ab_0 v. ab_1

See results under 'compare1/readme.txt' at 'compare_hw step 2'.
31 of the 34 differences between the metalines in temp_pw_ab_0.txt and temp_pw_ab_1.txt are ab1 corrections to errors in ab0.

The other 3 (marked 'ab1 error?') should be corrected in temp_pw_ab_1.txt.
Also, there is 1 misc. suggested correction.
@Andhrabharati Request you to make these 4 corrections in temp_pw_ab_1. Agree?
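
A metaline comparison of this kind can be sketched as below. This is not the program in the compare1 directory; the function names are hypothetical, and the sketch assumes that every metaline starts with <L>, that the L-number runs up to <pc>, and that L-numbers are unique within a file.

```python
import re

META = re.compile(r"<L>([^<]+)<pc>")

def metalines(lines):
    """Map each L-number to its full metaline."""
    found = {}
    for line in lines:
        m = META.match(line)
        if m:
            found[m.group(1)] = line
    return found

def compare_metalines(old_lines, new_lines):
    """Return (L-number, old metaline, new metaline) where the two differ.

    A metaline present in only one file is reported with None on the other side.
    """
    old, new = metalines(old_lines), metalines(new_lines)
    keys = list(old) + [k for k in new if k not in old]
    return [(k, old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)]
```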

@funderburkjim
Contributor Author

text after <LEND>

in temp_pw_ab_1, 3027 instances with text following <LEND>.
One is <LEND>〉 and the rest are like <LEND>•[Page1003-3].

@Andhrabharati Are these temporary markup?
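
A count like the one above could be obtained with a few lines of Python. This is a sketch only; the function name `lend_trailers` is hypothetical.

```python
def lend_trailers(lines):
    """Return (line_number, trailing_text) for <LEND> lines that carry extra text."""
    tag = "<LEND>"
    return [
        (n, line[len(tag):])
        for n, line in enumerate(lines, start=1)
        if line.startswith(tag) and line != tag
    ]
```

Grouping the trailing texts (for instance, those containing "[Page" versus everything else) separates the page markers from stray characters.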

@Andhrabharati

Andhrabharati commented Jul 26, 2023

The other 3 (marked 'ab1 error?') should be corrected in temp_pw_ab_1.txt.
Also, there is 1 misc. suggested correction.
@Andhrabharati Request you to make these 4 corrections in temp_pw_ab_1. Agree?

ab1 errors? (based on differences between the metalines in the ab0 and ab1 versions)

only ab0: <L>13353<pc>1158-1<k1>Akarika<k2>Akarika<e>100
only ab1: <L>13353<pc>1158-1<k1>AkAraka<k2>AkAraka<e>100
ab1 error? cf pwg, alphabetical order (Note: 'pw print error')

pwk (1158-1) has
image
I had corrected this as per the scan, though I noticed the alpha. order error and the error in the HW [the cited text has Akarika only, and so does the PWG entry]; I thought of changing it in the next round of Header proofing (that would take a longer time!).
Now that you have raised the point, shall correct this now itself.

only ab0: <L>19684<pc>1240-3<k1>upanikzepa<k2>upanikzepa<e>100
only ab1: <L>19684<pc>1240-3<k1>upanikzepa<k2>upanikzepa<e>100ṇ
ab1 error? ṇ

Yes, this letter got here by error.

only ab0: <L>78979<pc>4253-2<k1>Barb<k2>°Barb<e>500
only ab1: <L>78979<pc>4253-2<k1>Barb<k2>*Barb<e>500
ab1 error?

This is to be taken as a print error.
A skt. root cannot and does not have the contraction mark before it; it always occurs on its own. And note the '*' mark at the following variant root BarB.
image

only ab0: <L>120161<pc>7058-2<k1>samaha<k2>samaha.<e>100
only ab1: <L>120161<pc>7058-2<k1>samaha<k2>sa\ma\ha\<e>100
ab1 correction.
additionally, ab1 error?: (add comma) {#praSasta#}, {#saDana#}

Does this pwk (7058-1) snippet answer the point?
image

So in summary, I need to correct only 2 places out of these 4.

@Andhrabharati

One is <LEND>〉

This character got here by error.

and the rest are like <LEND>•[Page1003-3].

@Andhrabharati Are these temporary markup?

Yes, and you had accepted the <LEND> [Pagexxx] lines recently in GRA.

Now that the line-breaks around [Pagexxxx] lines are looked at, we can remove this • character throughout.
[It will be reintroduced in the upcoming major change shortly!!]

@Andhrabharati

Andhrabharati commented Jul 26, 2023

Next, I will start posting the changes made and then this (first-part of) Fresh-look issue can be closed, as it is growing longer.

[I could not do this yesterday, having been engaged in some pressing chores.]

@Andhrabharati

The IAST corrections matter could be continued in the parent issue (PW IAST corrections #419), as this <is> tag issue might be closed after my change notes are posted.

@Andhrabharati

I have started with the simplest point, as mentioned at Space before punctuation marks (reg. PWG, pwk and pwkvn) #855 ,

and the counts now stand thus in my version of pwk--

image

Notes.

  1. All the 7 places reg. full-stop are within the Devanagari slp1 strings denoting the danda (6) or double danda (1).
  2. There are 7 ';;' places now that show the running remarks in the text line, which would get removed (after AB's revision).

@Andhrabharati

Andhrabharati commented Jul 26, 2023

Next point is separating out the <ab- and <is- elements from within italic strings.

Any person closely looking at the print pages, can notice that

  • all the wide-spaced entities, whether Sanskrit [tagged as <is strings, being full word(s) or abbreviated] or non-Sanskrit [tagged as <iw strings], are in "straight face", never in italics (even where it is a single-letter abbr., as at some places).

  • most of the global (generic) abbr.s lie outside the italic strings, except very few (~15).

  • most of the local (generic) abbr.s lie outside the italic strings, except very few (~2).

  • all the in-line abbr.s in the running text lie inside the italics.

With this background, the separation of various text strings from italics has been carried out.

The abbr. counts now stand thus-
image

And here is a summary of the global abbr.s that occur in italics & outside-
image

and the local abbr.s that occur in italics & outside-
image

Finally, the total occurrences apart, the unique abbr. counts are thus-
image

@Andhrabharati

The <is> details in a similar manner cannot be posted yet, as some work is yet to be taken up, as mentioned above.
[Quite a few unique <is> strings listed as being in mw (by Jim earlier) might get corrected; a reduction of over 1000 is estimated.]

@Andhrabharati

Andhrabharati commented Jul 26, 2023

One interesting point noticed is that at some places, the abbr.(s) in print pages are present in expanded form in the text file, most probably done by @maltenth (or who else could it be?) while applying his markups on the typed text [it is highly doubtful that the typists in India would have done this expansion].

Also seen that at many places the marked italic strings are not so in the print; and at far more places the italic strings of the print are present in normal face in the typed text.

There is no way except a full reading wrt the print to "correct" these points completely, I suppose.

@funderburkjim
Contributor Author

ab1: <L>120161<pc>7058-2<k1>samaha<k2>sa\ma\ha\<e>100

I agree that the print has a comma.
But this comma is missing in temp_pw_ab_1:

{#sa\ma\ha\#}¦ <lex>Adv.</lex> {%irgendwie, so oder so%}. Nach <ls>SĀY.</ls> <lex>Adj.</lex> 
(= {#praSasta#} {#saDana#} <ab>u. s. w.</ab>) im <ab>Voc.</ab>
               COMMA MISSING

Doesn't that comma need to be added to pw_1 ?

Accept your point on Barb. Will start a print change file for this and perhaps other future print changes that arise.

@Andhrabharati

But this comma is missing in temp_pw_ab_1

Sorry, I was looking at my current file that has undergone more changes; it has the comma here.

@funderburkjim
Contributor Author

<LEND>[Pagexxx] accepted in GRA

No, this is a case where you did something in GRA that I was not aware of. If I had noticed it, I would have complained.

The <LEND> line is important since it marks the end of an 'entry' which starts with the metaline.
When a (python) program processes an xxx.txt file, it must identify this end-of-entry line.
There are (at least) two possible ways to make this identification

  1. line == "<LEND>"
  2. line.startswith("<LEND>")

(1) would NOT recognize <LEND>[Pagexxx].
(2) would recognize both <LEND>[Pagexxx] and <LEND>

Although I have thought of (1) as the default, I have (AFAIK) used (2) in all existing code,
and (2) doesn't care if there is additional information -- in particular, programs work with <LEND>[Pagexxx].

I still have a fondness for

<LEND>
[Pagexxx]

<L>....

Conclusion: I DO accept <LEND>[Pagexxx] if it is important for your analysis.
And I will continue to use method (2) to recognize the end of entry.
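
The difference between the two tests can be seen in a small entry reader. This is a sketch for illustration, not code from the existing CDSL programs; the generator name `entries` and its `strict` flag are hypothetical.

```python
def entries(lines, strict=False):
    """Yield each entry as a list of lines, from metaline through <LEND>.

    strict=True  uses method (1): line == "<LEND>"
    strict=False uses method (2): line.startswith("<LEND>")
    """
    entry = None
    for line in lines:
        if entry is None:
            if line.startswith("<L>"):
                entry = [line]      # a metaline opens a new entry
        else:
            entry.append(line)
            at_end = (line == "<LEND>") if strict else line.startswith("<LEND>")
            if at_end:
                yield entry
                entry = None
```

With method (1), an entry closed by <LEND>[Pagexxx] is never recognized as finished, so it swallows the following entry; with method (2) both forms close the entry.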

@Andhrabharati

Andhrabharati commented Jul 26, 2023

I think we can get rid of that [Pagexxxx] after <LEND> altogether, as the page change-over would anyway be reflected in the next meta-line's <pc> value. In effect, this is a repetition of information.

And as you had mentioned elsewhere, this <pc> or [Page....] info is not used anywhere except to link to the resp. page to display when clicked on it.

My thinking is that this [Pagexxxx] need/should not be on a separate line.

@funderburkjim
Contributor Author

this (first-part of) Fresh-look issue can be closed, as it is growing longer

Agree.

the IAST corrections matter could be continued in the parent issue #419

Prefer you to start a new issue here in PWK repository when you're ready.

@Andhrabharati

Prefer you to start a new issue here in PWK repository when you're ready.

In such a case, the referred parent issue can be closed. No need to keep it open until a new issue is opened for the <is elements.

I see many issues still remain open in various repos, though their purpose is served.

@funderburkjim
Contributor Author

the abbr.(s) in print pages are present in expanded form in the text file,

Please post the examples you have noticed. We can ask @maltenth if he recalls some reason.
Or, we may find a pattern. I can also check against an early version of pw.txt (in case I introduced these expansions!)

at many places the marked italic strings are not so in the print; and at far more places the italic strings of the print are present in normal face in the typed text.

Again, post some examples if they are at hand.

I have wondered about the significance of italic/non-italic text in PW. Maybe if this distinction were conceptually clear, we could find some way to identify (and correct) many of these mistakes in pw.txt.

@Andhrabharati

Andhrabharati commented Jul 26, 2023

I have wondered about the significance of italic/non-italic text in PW.

The italics mostly denote the meaning/explanation portions in German language, as I could see.

@gasyoun
Member

gasyoun commented Jul 29, 2023

The italics mostly denote the meaning/explanation portions in German language, as I could see.

Wonder if the preface gives a clue, if reread.

@gasyoun
Member

gasyoun commented Aug 10, 2023

I vaguely remember seeing the script somewhere (but unable to recall now), apart from PWG (5-0078) that has the same string

https://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B3%D0%B0%D1%82%D1%83%D1%80

монг. baγatur (ᠪᠠᠭᠠᠲᠦᠷ )

ᠪᠠᠭᠠᠲᠦᠷ

@Andhrabharati

Andhrabharati commented Aug 12, 2023

монг. baγatur (ᠪᠠᠭᠠᠲᠦᠷ )

@gasyoun

I could see the letter y in between and the letter t is not matching the character in the PWG print; so the word appears to be the (Mongolian) baga(?)yur.

Can your (Mongolian) friend tell why the PWG has the (Mongolian) lettering upside-down and then left-to-right (or in other words, rotated by 180 degrees)?

@Andhrabharati

Andhrabharati commented Jan 6, 2024

[Jim]

I have wondered about the significance of italic/non-italic text in PW. Maybe if this distinction were conceptually clear, we could find some way to identify (and correct) many of these mistakes in pw.txt.

[AB]

The italics mostly denote the meaning/explanation portions in German language, as I could see.

@funderburkjim
Pl. have a look at sanskrit-lexicon/MD#12 (comment), which I hope answers the point quite satisfactorily.

@Andhrabharati

@funderburkjim

I suggest closing this issue, as the is-tags were more or less attended to.
If any more corrections are still present, they would come out when full proofing of pwk takes place.
