Alternate headwords for pw #106

Closed
funderburkjim opened this issue Feb 9, 2024 · 108 comments
@funderburkjim
Contributor

funderburkjim commented Feb 9, 2024

We tackle the task of generating alternate headwords for the pw dictionary.

Preliminary outline of the approach:

  • Filter entries based on the first line of data (the line after the metaline)
  • Parse the implied headwords (based on the broken bar in that first line of entry)
    • recognize also hom and (?) roots
  • Use this parse to construct k2 of the metaline base. When there is more than one headword, this will result in a comma-separated list in k2
  • construct parallel list of k1 from the list of k2.
  • construct pw_hwextra.txt (for csl-orig) from the k1-k2 list
    • this will generate essentially duplicate entries in pw.xml for the extra 'alternate' headwords.

Note: no attempt to generate alternate headwords from upasargas of verb entries.
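
A minimal sketch of the first steps of this outline (parse the `{#...#}` spans before the broken bar, then join them into a comma-separated k2). The function names `parse_headwords` and `build_k2` are hypothetical, and the sample lines are taken from later comments in this thread; handling of hom, roots, `*` and accents is left out.

```python
import re

# One {#...#} span (SLP1 text between the delimiters).
HEADWORD = re.compile(r"\{#(.*?)#\}")

def parse_headwords(first_line):
    """Return the {#...#} strings that precede the broken bar."""
    head, bar, _body = first_line.partition("¦")
    if not bar:  # no broken bar on this line: nothing to parse
        return []
    return HEADWORD.findall(head)

def build_k2(first_line):
    """Join the parsed headwords into a comma-separated k2 value."""
    return ", ".join(parse_headwords(first_line))

# {#aNkulI#} follows the bar, so it is not picked up as a headword.
print(parse_headwords("{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}."))
print(build_k2("{#ISvarItantra#} <lex>n.</lex> und {#ISvare#} (Loc.) "
               "{#nityasuKAvasTApanam#}¦ Titel von Werken."))
```

Only text before ¦ is inspected, which matches the restriction to 'primary' headwords; anything after the bar would need the separate pattern work discussed later.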

@funderburkjim
Contributor Author

markup observations

  • <h> appears in metalines in pw-main, but not in pw-vn.
    • the present work will remove that <h>N and embed it into k2 (cf. GRA markup)
  • * - 29339 matches for "[*].*¦" . Boehtlingk.
    • A word, a meaning, a construction or a gender that has so far only been
      listed by grammarians or lexicographers has been designated with *. ref:
  • √ indicates a root . Andhrabharati addition. 3103 matches for "√{#.*¦"
  • !√ indicates a 'denominative' root. Andhrabharati addition. 1294 matches for "!√{#.*¦"

√ and ! do not appear in k2 of metaline
* appears in k2 of metaline: 29069 matches for "<k2>[*]"

@Andhrabharati

Andhrabharati commented Feb 11, 2024

@funderburkjim

It appears that you are analyzing the pw file data, for marking the prospective alt. HWs yourself.

I had mentioned earlier (in #104) that the header portion may be looked at to get these words; but there appear to be more entries that contain alt. HWs (which I had missed before in my posted file that formed the base for the cdsl version) after the broken bar.

Here are a few (43 no.s) such L-entries on a quick searching--

16678, 22390, 23281, 26293, 28410,
30597, 34090, 34556, 36315, 37930,
39355, 39831, 39852, 43702, 43761,
44078, 44931, 46971, 55339, 55378,
56092, 59172, 61689, 61801, 74511,
77163, 80864, 87626, 88417, 89051,
94286, 97985, 100189, 102651, 103657,
110117, 110192, 112382, 113429, 115433,
125488, 130735, 133969

There could be more like these, and hope you would try to get those as well.
--------------------------------------------
PS. You had tried to get to the <7k figure that I mentioned at #104, and suggested discarding the hom-tag entries in the process; but there are a little over 200 entries having a hom-tag that contain alt. HWs. [My earlier count was 6986.]

@Andhrabharati

Once you finish your exercise, I might be able to compare your file with my version (not posted so far) for any omissions/changes.

@funderburkjim
Contributor Author

Yes, I am working on those alt-headwords now. Will share my work with you for comparison, perhaps later this week.

@Andhrabharati

Andhrabharati commented Feb 13, 2024

@funderburkjim

Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.

<L>200221<pc>1-284-a<k1>ajagAva<k2>ajagAva
<L>203121<pc>2-299-c<k1>KaqIna<k2>KaqIna
<L>203284<pc>3-247-a<k1>aganDi<k2>aganDi
<L>203898<pc>3-253-c<k1>arTasaMhata<k2>arTasaMhata
<L>204099<pc>3-256-a<k1>AmravAwaka<k2>*AmravAwaka
<L>204416<pc>3-259-b<k1>kalvowaka<k2>kalvowaka
<L>204577<pc>3-261-a<k1>gArDapfzwa<k2>gArDapfzwa
<L>205639<pc>4-299-a<k1>cUlha<k2>cUlha
<L>206832<pc>5-249-c<k1>ikawI<k2>*ikawI
<L>207055<pc>5-252-a<k1>kikviwa<k2>kikviwa
<L>207313<pc>5-255-a<k1>jvalana<k2>jva/lana
<L>209005<pc>6-302-b<k1>devaniSrayaRI<k2>devaniSrayaRI
<L>209263<pc>6-305-b<k1>mAtaNgavedi<k2>*mAtaNgavedi
<L>213195<pc>7-312-a<k1>avimanas<k2>avimanas
<L>213697<pc>7-315-c<k1>asevana<k2>asevana

These entries have the alt. HWs after the broken bar, as are the pwk (main) entries listed above.

@funderburkjim
Contributor Author

systematic additional headwords

My approach to determine NEW secondary headwords for an entry is based on
analysis of the 'broken-bar' line of the entry.

  • 158370 is the current count for pw.txt entries.
    • grep -E '^<L>' ../temp_pw_2.txt | wc -l
  • 8444 entries are previously identified as multiple headwords
    • grep -E '{#.*?#}.*{#.*?#}.*¦' ../temp_pw_2.txt | wc -l
  • 3103 entries are identified as roots
    • grep -E '√.*¦' ../temp_pw_2.txt | wc -l

Excluding these two groups, there are approx. 31054 entries which might
have multiple headwords.
grep -E '^[^{√]*{#[^#]*#}[^{]*¦.*{#' ../temp_pw_2.txt | wc -l

But there are many patterns in these 31054 entries which disqualify the
entry from implying extra headwords. For example, in {#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}., {#aNkulI#} is not an extra headword. The pattern is <ab>v. l.</ab> für {#
RESTRICTED to cases with just one {#X#} after ¦

After excluding this and many other patterns, there remain 10788 candidates.
Certainly some of these should also be excluded; but at this point, the
exclusion-by-pattern approach has become unproductive.
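
The filtering described above can be sketched in Python as follows. The three regexes are the grep patterns from this comment with the braces escaped; the `classify` function, its labels, and the exact form of the "v. l. für" exclusion are assumptions for illustration, not the script actually used.

```python
import re

MULTIPLE = re.compile(r"\{#.*?#\}.*\{#.*?#\}.*¦")           # 2+ headwords before the bar
ROOT = re.compile(r"√.*¦")                                   # root markup before the bar
CANDIDATE = re.compile(r"^[^{√]*\{#[^#]*#\}[^{]*¦.*\{#")     # one headword, then {# after the bar
VL_FUER = re.compile(r"¦ <ab>v\. l\.</ab> für \{#")          # one of the excluding patterns

def classify(line):
    """Assign a broken-bar line to one of the groups discussed above."""
    if ROOT.search(line):
        return "root"
    if MULTIPLE.search(line):
        return "multiple"
    if CANDIDATE.search(line):
        after_bar = line.split("¦", 1)[1]
        # Exclusion is restricted to lines with just one {#X#} after the bar.
        if VL_FUER.search(line) and after_bar.count("{#") == 1:
            return "excluded"
        return "candidate"
    return "other"

print(classify("{#akulI#}¦ <ab>v. l.</ab> für {#aNkulI#}."))
print(classify("{#sAradIyanAmamAlA#} (besser {#SA°#})¦ <lex>f.</lex> Titel eines Werkes"))
```

A line that comes out as "candidate" still needs review; the label only means that no exclusion pattern has fired yet.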

Thus, the approach changes course to apply patterns to INCLUDE subsets of the
candidates. For such an included entry, there need to be changes to:

  • the broken bar line ( ¦ moves after last sub-headword phrase {#X#} )
  • the <k2> field of metaline.

The file temp_change_3_01.txt shows the changes made for the pattern: {#°tA#} or {#°tva#} (is the only {#X#} after ¦ ).
There are 915 cases here.
Later pw.txt will be changed by applying the changes in this file, and similar changes for other patterns.
Eventually, this approach should get most of the implied extra headwords.
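
As a rough illustration of the two changes listed above, here is a sketch that moves the broken bar behind the last `{#X#}` and expands a `°tA` / `°tva` sub-headword against its base. Both function names are hypothetical, the sample line is an invented placeholder (not an actual pw entry), and the expansion is plain concatenation with no sandhi handling.

```python
def move_bar_after_last_headword(line):
    """Move the single ¦ so that it follows the last {#...#} on the line."""
    if line.count("¦") != 1:
        return line                      # leave unusual lines untouched
    without_bar = line.replace("¦", "", 1)
    end = without_bar.rfind("#}")
    if end == -1:
        return line
    end += len("#}")
    return without_bar[:end] + "¦" + without_bar[end:]

def expand_sub_headword(base, sub):
    """Expand an abbreviated sub-headword such as '°tA' against its base."""
    if not sub.startswith("°"):
        raise ValueError("expected a sub-headword starting with °")
    return base + sub[1:]

# Invented placeholder line, used only to show the shape of the change.
line = "{#guRa#}¦, {#°tA#} <lex>f.</lex>"
print(move_bar_after_last_headword(line))
print(expand_sub_headword("guRa", "°tA"))
```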

@Andhrabharati Before proceeding much further, I wanted your take on this approach.

@Andhrabharati

Andhrabharati commented Feb 15, 2024

I would suggest 'restricting' this phase to mark and bring-out the 'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon.

We definitely need to bring-out other 'inner' HWs also, that occur in multiple ways, but this could be done in another/next phase -
(a) in-line HWs [that are within braces in the running text]
(b) implied HWs [mostly with a preceding "also" etc.]
(c) indicative HWs [like the -tA & -tva varieties, that mostly do not have any 'objective' body, but just a mention of the word]
(d) variant form HWs [with a preceding "written", "v. l.", "w. r." etc.]

And then, we should look for the composite/compound words that occur inside the body portion of the above entries, and suitably bring them out.


BTW, I see 158375 <L> entries, not 158370 as mentioned above, in the combined pw.txt.

@Andhrabharati

Andhrabharati commented Feb 15, 2024

My personal opinion is that we should mark these 'secondary' HWs with <div n="x" > style [as done in GRA], "x" covering various forms that we come across in the particular lexicon.

And then, list those various groups within the 'main' entry somewhere (like the separate althw file seen in some cdsl works), to come under the "search" criterion.

This approach would retain the digital text in a form closer to the printed work.
[BTW, this is the approach that I took in revising the MW-dev data; my ultimate goal being to bring all the cdsl works in the similar format, making it a 'theme' all across the works.]

@funderburkjim
Contributor Author

> BTW, I see 158375 <L> entries, not 158370

I merged 5 entries:

parvan L=96646 merge into 69945 pora
pravAla L=73144 merge into  73143, pravAqa
AzwakIya L=16300 merge into 16299 Azwaka
Dru L= 55950 merge into Dru 55949
peSI L=69764 merge into peSI 69763

@funderburkjim
Contributor Author

additional denominative roots

475 matches for "^!√{#[^#]*y#}¦, {#°yat[ie]#}"  already marked
102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}"  add !√ markup
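
A sketch of how the 102 unmarked lines could be given the !√ markup, using the two patterns above with braces escaped. The function name is hypothetical, and the sample line is only illustrative: the headword tilakay is mentioned later in this thread, but the rest of the line is assumed.

```python
import re

ALREADY_MARKED = re.compile(r"^!√\{#[^#]*y#\}¦, \{#°yat[ie]#\}")
UNMARKED = re.compile(r"^\{#[^#]*y#\}¦, \{#°yat[ie]#\}")

def mark_denominative(line):
    """Prefix !√ to a denominative-root line that lacks the markup."""
    if UNMARKED.match(line):
        return "!√" + line
    return line  # already marked, or not a denominative pattern

print(mark_denominative("{#tilakay#}¦, {#°yati#}"))
```

Because `UNMARKED` is anchored at the start of the line, a line that already begins with !√ does not match and is returned unchanged, so the function is safe to run twice.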

@Andhrabharati

> I merged 5 entries:

Looked for other entries with similar pattern ¦\n<LEND, and found that 5 entries (44904, 49112, 53788, 58874 and 113078) have missed the body portion.
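
The search for ¦ immediately followed by <LEND> can be sketched like this. The function name is hypothetical, and the sample entries use placeholder L-numbers and headwords rather than real pw data.

```python
import re

L_NUMBER = re.compile(r"^<L>(\d+)")

def entries_missing_body(lines):
    """Return L-numbers of entries whose broken-bar line is directly followed by <LEND>."""
    missing = []
    current = None
    for i, line in enumerate(lines[:-1]):
        m = L_NUMBER.match(line)
        if m:
            current = m.group(1)          # remember the entry we are inside
        if line.rstrip("\n").endswith("¦") and lines[i + 1].startswith("<LEND>"):
            missing.append(current)
    return missing

# Placeholder entries: the first has no body after the bar, the second does.
sample = [
    "<L>1<pc>0-000-a<k1>xa<k2>xa",
    "{#xa#}¦",
    "<LEND>",
    "<L>2<pc>0-000-a<k1>xb<k2>xb",
    "{#xb#}¦ <lex>m.</lex> placeholder body.",
    "<LEND>",
]
print(entries_missing_body(sample))  # ['1']
```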

@Andhrabharati

Andhrabharati commented Feb 16, 2024

> 102 matches for "^{#[^#]*y#}¦, {#°yat[ie]#}" add !√ markup

All these occur in the erstwhile pwkvn portion; I seem to have skipped marking them.
Probably there could be some places missing the √ mark as well in this portion.

BTW, found one entry L-45920, which has √ mark, but should be with the !√ mark; and it has a typo tilakaya for tilakay

@funderburkjim
Contributor Author

funderburkjim commented Feb 16, 2024

> other entries with similar pattern ¦\n<LEND

Those five had missing text in cdsl pw. I have added the text.
temp_missing_body_AB.txt

Note: corrected the tilakaya entry.

@funderburkjim
Contributor Author

> Probably there could be some places missing the √ mark

I generated a list of possible missing √ mark entries by
search for headword pattern consonant-root-consonant

regex without hom (python syntax)
r'^[*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'

regex with hom:  
r'^<hom>[^<]*</hom> [*]?{#[kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|][aAiIuUfFxXeEoOMH][kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|]#}¦'

513 found.
Manually examined all.

  • 358 need √ mark (251 of these are in the pwkvn entries)
  • 155 don't need √ mark.

details: temp_possible_roots_edit.txt
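
For reference, the two regexes above can be folded into one pattern with an optional hom prefix. This is only a sketch: the names `CONS`, `VOWEL`, `ROOT_SHAPE` and `possible_roots` are hypothetical, and the sample lines are invented to show which shapes match.

```python
import re

CONS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh|"
VOWEL = "aAiIuUfFxXeEoOMH"

# Optional <hom>..</hom>, optional *, then a consonant-vowel-consonant headword and the bar.
ROOT_SHAPE = re.compile(
    r"^(?:<hom>[^<]*</hom> )?[*]?\{#[" + CONS + "][" + VOWEL + "][" + CONS + r"]#\}¦"
)

def possible_roots(lines):
    """Return the lines whose headword has the consonant-vowel-consonant shape."""
    return [line for line in lines if ROOT_SHAPE.match(line)]

sample = [
    "{#gam#}¦ placeholder",                 # matches
    "<hom>1.</hom> {#pat#}¦ placeholder",   # matches, with hom
    "{#gamana#}¦ placeholder",              # too long: no match
    "√{#gam#}¦ placeholder",                # already marked: no match
]
print(possible_roots(sample))
```

Lines that already start with √ fall outside the pattern, so the list contains only unmarked candidates, which still need the manual examination described above.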

@Andhrabharati Do you agree that √ mark should be added (before ¦) for these 358?

@Andhrabharati

Andhrabharati commented Feb 16, 2024

Yes pl., somehow I had skipped these markings!!

@Andhrabharati

> Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.

Found another 8 entries in pwkvn portion, that come under this--
L-200048, L-200334, L-201819, L-206328, L-208193, L-209653, L-214672, L-221290

@Andhrabharati

> @Andhrabharati Before proceeding much further, I wanted your take on this approach.

Just curious to know your conclusion on how to proceed further on the task, @funderburkjim !!

@funderburkjim
Contributor Author

interim progress

I've marked the additional 358 roots.

> 'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon

'close' althws

I like this idea and am proceeding to see if I find any more in addition to those you have mentioned in above comments.

@Andhrabharati

Glad to hear this, @funderburkjim !

Working with a 'common' thinking/process definitely makes the collaborative effort easier, making the comparison (between the two works) quicker and more fruitful.

@Andhrabharati

I have many more entries that come under the alt.HW type & the 'root' type now.

@funderburkjim
Contributor Author

altheadwords

temp_change_1a_2_althws.txt

This file contains changes for alternate headwords from 2 sources:

  • 66 that you have listed above
  • 26 that I have added

@Andhrabharati Please review and provide corrections as needed.

Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>.
Currently: 2516 extra headwords from 1589 <k2>s.
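
A sketch of deriving k1 values and the 'extra' headwords from a comma-separated <k2>. The function names and the set of characters stripped (`*`, the accent marks `/ \ ^`, parentheses, and the ⁅ ⁆ padding markers) are assumptions based on the metalines quoted in this thread, not the actual GRA-model code; the layout of pw_hwextra.txt is not reproduced here.

```python
import re

# Characters assumed to occur in k2 but not in k1.
K2_ONLY = re.compile(r"[*/\\^()⁅⁆]")

def k1_from_k2(k2_item):
    """Normalize one k2 item to its k1 form."""
    return K2_ONLY.sub("", k2_item.strip())

def extra_headwords(k2_field):
    """Return the k1 forms of all headwords after the first, without repeats."""
    k1s = [k1_from_k2(item) for item in k2_field.split(",")]
    seen = {k1s[0]}
    extras = []
    for k1 in k1s[1:]:
        if k1 and k1 not in seen:
            seen.add(k1)
            extras.append(k1)
    return extras

print(k1_from_k2("jva/lana"))      # jvalana
print(k1_from_k2("*AmravAwaka"))   # AmravAwaka
print(extra_headwords("sAradIya, (SAradIya), SAradIyanAmamAlA"))
```

Summing `len(extra_headwords(k2))` over all metalines with a comma in <k2> would give a count comparable to the "extra headwords" figure above.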

> many more entries that come under the alt.HW type & the 'root' type now.

If you provide these, I'll make changes for them.

@Andhrabharati

Andhrabharati commented Feb 21, 2024

@funderburkjim

My file now contains 713 (main) and 682 (vn) lines differing with the CDSL (combined) file, ignoring the meta-lines (as I did not populate the k2-field yet).

; 05: 93 entries - alternate headwords

If you post your full file (containing all the changes in your 5 steps), I can do a diff with my file and list out the differing lines.

funderburkjim added a commit that referenced this issue Feb 21, 2024
@funderburkjim
Contributor Author

temp_pw_2.zip my current version
temp_change_pw_0_2.txt Changes from the current csl-orig pw.txt. All changes thus far were done while keeping the number of lines the same. This file shows the line-by-line diff.

Further details in the pwkissues/issue106 folder.
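
Because the line count is kept the same, a line-by-line diff reduces to comparing lines at the same index. A minimal sketch follows; the function name is hypothetical, the output format of temp_change_pw_0_2.txt is not reproduced, and the sample pair is the ISvare line discussed later in this thread.

```python
def line_diffs(old_lines, new_lines):
    """Return (line number, old, new) for each line that differs."""
    if len(old_lines) != len(new_lines):
        raise ValueError("files must have the same number of lines")
    return [
        (n, old, new)
        for n, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]

old = [
    "{#ISvarItantra#} <lex>n.</lex> und {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}¦ Titel von Werken.",
    "<LEND>",
]
new = [
    "{#ISvarItantra#} <lex>n.</lex> und {#ISvare#} (<ab>Loc.</ab>) {#nityasuKAvasTApanam#}¦ Titel von Werken.",
    "<LEND>",
]
for n, before, after in line_diffs(old, new):
    print(n, "old", before)
    print(n, "new", after)
```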

@Andhrabharati

Andhrabharati commented Feb 21, 2024

Thank you @funderburkjim for the files.

Seen that just over 900 lines (616: main and 315: vn) are differing between our files.

Will go through them tomorrow and after necessary corrections (if any) in my file, shall post the differing lines for your perusal and further action.

@Andhrabharati

Andhrabharati commented Feb 22, 2024

Here are the files that I had made--

  1. separated the VN data from the combined CDSL file: pwkvn_2 (CDSL).txt
  2. deleted the metaline to ease comparing with my file: temp_pwkvn_2 (CDSL).txt

And the corresponding file from my side: temp_pwkvn_2 (AB).txt
[Pl. note that my file does not contain the trailing <info(.*)> tag.]

After "incorporating" necessary corrections in my file, there are 450 differing lines in the VN portion with the CDSL file

  • some of these belong to the header portion that need to be carried into the metaline
  • some are just the relocation of the broken bar, not affecting the metaline, and
  • some are within the body portion, not affecting the metaline

Hope @funderburkjim wouldn't be having much issues in using my file.
[I can generate (and post) the diff. file, in case he feels any difficulty with the above AB file.]

@Andhrabharati

Andhrabharati commented Feb 22, 2024

Now, coming to the pw main data, here are 206 header portions with dhAtu (√) markup--
dhAtu header lines.txt

Hope, this is convenient enough to be "used" by Jim.
-----------------------------

There still remain about 390 diff. lines, out of which 34 lines contain the _ (underscore) character.

Though most of those could be removed as done by Jim (for slp1 has no scope for confusion of vowel-hiatus [but I wonder if these would all "pass" the round-trip test of conversion to another script like Devanagari or IAST and back!!]), I feel some of them need to be retained as they denote a 'space' character within the Devanagari string.

@Andhrabharati

Andhrabharati commented Feb 22, 2024

Another 30 lines have the <ls n="Chr."> markup by Jim, which do not point to Boehtlingk's Chrestomathie at all; I had marked them with the 〔...〕markup, so that they would be easily traceable for properly tagging to their resp. works.

BTW, there are quite a few such places in the pwkvn portion as well, which I had already posted above (with the same markup).

@Andhrabharati

> Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from <k2>.
> Currently: 2516 extra headwords from 1589 <k2>s.

@funderburkjim

Would you mind explaining about this 1589 number?
I see a huge number of entries that count to nearly 5-6 times of this!

@funderburkjim
Contributor Author

versions 8 and 8a

Work is in pw_8_work directory.

Request @Andhrabharati to apply the changes in the two "BEGIN Jim disagrees with AB for" sections of the pw_8_work/readme.txt file. If you accept these, then I expect the 8a version will agree with your version.

@Andhrabharati

@funderburkjim

Would you pl. have a look at this post, while I look at the pw_8a file?

@Andhrabharati

Andhrabharati commented Mar 22, 2024

Here are the 3 places where AB likes to debate with Jim's opinion.

AB: <L>124385<pc>7-121-b<k1>sAradIyanAmamAlA<k2>sAradIyanAmamAlA, (SAradIyanAmamAlA)
Jim: <L>124385<pc>7-121-b<k1>sAradIya<k2>sAradIya, (SAradIya), SAradIyanAmamAlA
Note: Also bbline change {#nAmamAlA#} -> {#°nAmamAlA#} PRINT CHANGE

;; AB remark: There is no "sAradIya" word that occurs in the literature; the word "SAradIya" has already been mentioned at L-111682, and as such there is no need to repeat the same again here.
;; AB remark: It is clearly the suggestion of Boehtlingk to consider SAradIya for sAradIya (as a print error in BÜHLER Report) in sAradIyanAmamAlA which is a single word.
;; AB remark: "Adj. (f. {#A#})" is deleted here in this session, as it appeared redundant.
;; AB remark: And there is no need to put the ° mark, taking the entry as a single word.
----------------------------

{#aBizwipA/si#}¦ <ls>ṚV. 2,20,2</ls> nach <ls>GRASSMANN.</ls> für {#aBi/zwI pAsi#}.
aBizwipA/si -> aBizwipA/(si) by print

;; AB remark: With the adopted norm that the entities having the in-text (...) and [...] be expanded with and without the brackets, this should've been made as an alt. HW group [aBizwipA/, aBizwipA/si].
;; AB remark: However, this seems not the intent here; either it should go as just the "aBizwipA/" as taken by MW, or as "aBizwipA/si" as seen in the ṚV. citation and 'matching' with the GRA emendation.
;; AB remark: In either case, this would go as a "print change".
----------------------------

{#ISvarItantra#} <lex>n.</lex> und {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}¦ Titel von Werken.
{#ISvare (Loc.) nityasuKAvasTApanam#} ->
{#ISvare#} (Loc.) {#nityasuKAvasTApanam#}

;; AB remark: I had felt that the two words (forming the name of the work) need not be separated as individual words, and as such marked thus.

@Andhrabharati

Andhrabharati commented Mar 22, 2024

Now, about the slp1 hiatus places--
If Jim feels no need for these, in spite of my above post, I have no issues in having the hiatus removed at such places.

BTW, there is another place where it is not required, "it does exist" [at L-2991]!

@Andhrabharati

Andhrabharati commented Mar 22, 2024

My present version data has additional differences [in non-metalines] in pwk_main (few: ~150) and pwkvn (lot many: ~15k) portions; but the comparison could probably be stopped here.

@Andhrabharati

Andhrabharati commented Mar 22, 2024

On a 2nd thought, I have 'modified' both CDSL and AB files a bit; now, the difference line count is just over 700.

And, here are the modified files--
pw (CDSL) 8a.zip
[This has few blank lines inserted]

pw integrated (AB) v1 (for CDSL).zip
[This now has pwk main and vn portions integrated]

@funderburkjim
Contributor Author

funderburkjim commented Mar 22, 2024

  1. I am unclear on your pw(CDSL)8a version
    • how is it related to the temp_pw_8a version that I uploaded?
    • What use should I make of it?
  2. I suspect the pw.integrated version is your latest candidate for final version. Right?
    • how different is it from your pw(CDSL)8a version?

Each of these versions has 764942 lines, and my uploaded temp_pw_8a.txt has 764934 lines. Where do the extra 8 lines in your versions come from?

Also, in the vn section of both your versions, you omit the <info n="sup_X"/> field. This is needed for the displays to show the [supplement volume X] note .

@Andhrabharati

Andhrabharati commented Mar 22, 2024

> 1. I am unclear on your pw(CDSL)8a version
>   • how is it related to the temp_pw_8a version that I uploaded?
>   • What use should I make of it?

Yes, it is the same file with some changes done inside.
It can be used to get the diff.s wrt the pw.integrated version; of course your original temp_8a file could also be used, but it will give more (500+) differences.

> 2. I suspect the pw.integrated version is your latest candidate for final version. Right?
>   • how different from your pw(CDSL)8a version ?

Yes, for time being. [And I thought of not doing any more 'independent' updates in it from my side.]
As mentioned above, it has some 700 differences wrt the pw(CDSL)8a

> Each of these versions has 764942 lines. and my uploaded temp_pw_8a.txt has 764934 lines -- where do the extra 8 lines in your versions come from?

I have added extra blank lines after the <H> lines as were at the earlier versions of the pwkvn file, that were removed in your recent file(s).

> Also, in the vn section of both your versions, you omit the <info n="sup_X"/> field. This is needed for the displays to show the [supplement volume X] note .

Do you want me to upload the files with the info tags retained as is?

@funderburkjim
Contributor Author

> Do you want me to upload the files with the info tags retained as is?

If you can do that readily, then yes. Otherwise I can find a way to do it.

@Andhrabharati

Andhrabharati commented Mar 22, 2024

They are not immediately available; I need to spend a little time to make them.

Probably, it might be better if you do it yourself.

@funderburkjim
Contributor Author

Re 'L=124384' -- in your files, you have {#°nAmamAlA#} but you mention there is no need to put the ° mark.

re {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}

This is not proper -- since <ab>Loc.</ab> is not Sanskrit. --
The abbreviation needs to be outside of {#...#}

@funderburkjim
Contributor Author

> be better if you do it yourself.

OK, I'll do that.

@Andhrabharati

Andhrabharati commented Mar 22, 2024

> Re 'L=124384' -- in your files, you have {#°nAmamAlA#} but you mention there is no need to put the ° mark.

My mistake; I had initially reverted that line in my file to match yours, and then posted the comments without correcting my file accordingly.

This is how I wanted it to be--
<L>124385<pc>7-121-b<k1>sAradIyanAmamAlA<k2>⁅sAradIya⁆nAmamAlA, (⁅SAradIya⁆nAmamAlA)
{#sAradIyanAmamAlA#} (besser {#SA°#})¦ <lex>f.</lex> Titel eines Werkes <ls>BÜHLER, Rep. No. 780</ls>.
<LEND>

> re {#ISvare (<ab>Loc.</ab>) nityasuKAvasTApanam#}
>
> This is not proper -- since <ab>Loc.</ab> is not Sanskrit. -- The abbreviation needs to be outside of {#...#}

So, do we go with the two words separately marked as {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}?
[This is not a big point for me to debate upon.]

@funderburkjim
Contributor Author

> do we go with the two words separately marked as {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}?

Yes - I can't think of a better solution at the moment.

I've found the extra lines.

That's all my questions for now -- will proceed with analysis/implementation of your changes.

@funderburkjim
Contributor Author

Regarding '_'

You are definitely right that a round-trip of transcoding of X (slp1 -> hk -> slp1) does not result in X when X has certain properties (such as an 'ai' or 'au' hiatus, also 'bh', 'gh', and maybe a few other cases).

A similar comment regarding IAST instead of hk.

My view has been that iast and hk should be viewed as faulty and/or incomplete transcoding schemes for Devanagari. cdsl could take upon itself the task of extending hk and iast to 'remedy' such problems. But, I have not thought the user reward for such a task is great enough to justify the effort, since such anomalies are rare.

While thinking about this, I noticed that the 'simple-search (input=simple)' display needs to be revised so that 'prauga' (MW) yields not only 'prOga' (slp1) but also 'prauga' (slp1).
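
The round-trip problem can be seen with a toy transcoder. The mapping below is a deliberately tiny subset (E/ai, O/au, B/bh, G/gh) chosen for illustration; it is not the cdsl transcoding implementation, and the function names are hypothetical.

```python
# Toy subset of the slp1 -> hk correspondences, for illustration only.
SLP1_TO_HK = {"E": "ai", "O": "au", "B": "bh", "G": "gh"}
HK_TO_SLP1 = {hk: slp1 for slp1, hk in SLP1_TO_HK.items()}

def slp1_to_hk(text):
    return "".join(SLP1_TO_HK.get(ch, ch) for ch in text)

def hk_to_slp1(text):
    """Greedy conversion: a two-letter hk sequence wins over single letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in HK_TO_SLP1:
            out.append(HK_TO_SLP1[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

for word in ("prOga", "prauga"):
    hk = slp1_to_hk(word)
    back = hk_to_slp1(hk)
    print(word, "->", hk, "->", back, "(round trip ok)" if back == word else "(round trip fails)")
```

Both slp1 spellings produce the same hk string 'prauga', so the way back cannot recover the hiatus form. The same collision is why a simple-search for 'prauga' would need to offer both slp1 candidates.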

@Andhrabharati

> My view has been that iast and hk should be viewed as faulty and/or incomplete transcoding schemes for Devanagari.

I've seen that slp1 itself also has the drawback of failing in the round-trip conversion, deva - slp1 - deva (or slp1 - deva - slp1) at such places!!

funderburkjim added a commit that referenced this issue Mar 24, 2024
@funderburkjim
Contributor Author

temp_pw_9b.txt

temp_pw_9b.zip

This incorporates almost all of AB's latest batch of changes.
See also change_8b_9.txt, change_9_9a.txt and change_9a_9b.txt for how I analyzed the many different kinds of changes proposed by AB.
See diff_9b_ab_2.txt for the differences between temp_pw_9b.txt and AB's final file pw.integrated.AB.v1.for.CDSL.txt.

The changes are also integrated into the displays (locally):
image

@Andhrabharati When you sign off on temp_pw_9b.txt, I'll install it at Cologne.

@funderburkjim
Contributor Author

funderburkjim commented Mar 24, 2024

> I've seen that slp1 itself also has the drawback of failing in the round-trip conversion, deva - slp1 - deva (or slp1 - deva - slp1) at such places!!

I'll believe it when I see it!

I doubt that the Ralph Bunker/Peter Scharf implementation of slp1-deva transcoding has an invertibility problem, but it may be that my implementation is imperfect.

When (if) you encounter such an instance, open a new issue and provide full details, so I can reproduce the problem, and hopefully correct any such imperfections.

@Andhrabharati

Andhrabharati commented Mar 25, 2024

> @Andhrabharati When you sign off on temp_pw_9b.txt, I'll install it at Cologne.

Great to see that practically no differences exist between the two versions.

Here are the final changes--

  1. While at two entries (L-17562 and L-73947) the hiatus is removed in the header portion, it remained in the metaline.
  2. The final form concluded at L-124385 prompted me to look for other places having "(besser" and found 3 entries--
    diff_9b-1.txt
  3. The SUrpa°RaKI at L-113882 prompted me to look for other places having "[a-z]°[a-z]" and found 8 lines, out of which 5 are typo or print errors
  • 232306 212306 . {#daSa°Sata°#} -> , {#daSa°#}, {#Sata°#}
  • 294400 {#nizAda°tva#} -> {#nizAdatva#} ;; print change
  • 565127 SUrpa°RaKI -> SUrpaRaKI
  • 597908 {#sarvaM°yam#} -> {#sarvaM °yam#}
  • 645392 ,%} {#pa°da#} -> %}. {#pada°#}

and the remaining 3 lines are the only 'rare' cases having the ° mark within the string (in the digital text; probably there might be few more, which would come out if and when a full proofing takes place to match the file data with the print - i.e. typo errors) [should we make these changes? if so, what's the best way to do so?]

  • 68425 {#A°nipuRe — dEve#} ;; {#A⁅parvaBaNga⁆nipuRe — dEve#}
  • 202051 212051 {#tri°jyotizmatI#} ;; {#tri⁅zwub⁆jyotizmatI#}
  • 306317 {#mAMsaM Sva°nipAtitam#} ;; {#mAMsaM Sva⁅daSanAnaNge⁆nipAtitam#}
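
The search for a ° inside a string can be sketched as below. The pattern uses `[A-Za-z]` on both sides, on the assumption that the original search was case-insensitive (SUrpa°RaKI has an uppercase letter after the mark); the function name is hypothetical.

```python
import re

# A ° with a letter directly on both sides, i.e. inside a word rather than at its edge.
INNER_MARK = re.compile(r"[A-Za-z]°[A-Za-z]")

def lines_with_inner_mark(lines):
    """Return (line number, line) pairs that contain a word-internal ° mark."""
    return [(n, line) for n, line in enumerate(lines, start=1) if INNER_MARK.search(line)]

sample = ["{#nizAda°tva#}", "{#°tA#}", "{#SA°#}", "SUrpa°RaKI"]
print(lines_with_inner_mark(sample))
```

Ordinary uses such as `{#°tA#}` or `{#SA°#}` have a `#` on one side of the mark and are therefore not reported.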

@Andhrabharati

Andhrabharati commented Mar 25, 2024

This is one of the longest sessions that took place-- though many a time going beyond the "subject matter" (due to my 'uncontrolled' way of corrections!)-- but bringing the text into a good form now.

I would like Jim to think of opening two more issues

  • one to tackle the long-pending (for over three years now, since I had "promised" to give out my results if the corrections are done in cdsl data as per my proposal) "resolution" of ls-entities; this exercise shall now also include making the simplistic ls-tooltips and

  • another to "integrate" the vn portion into the main pwk portion, in the same way as done in GRA [this would eliminate the 'pure index entries' (without any "body matter") in the pwk7 vn and retain the 'proper' vn entries having some objective "body matter", and bring the total vn entries count close to what SCH mentioned (14450) from the present 22611 and contain the entries]; I had touched this point (of removing the 'index entries') in the very initial days after the pwkvn got typed by Thomas and added as a new repo as point 3b but it did not get Jim's nod for some reason [point 3a got corrected in the present session!!].

I shall take responsibility for these two tasks (the first one does not need much time, and which only I can do [as of now]), but the 2nd one might take a week or so [which Jim could also try out as in GRA initially, and then I had jumped in to give finishing touches jointly].

Look forward to know what Jim decides on this.

@Andhrabharati

Andhrabharati commented Mar 25, 2024

Finally here is the concluding post from my side at this issue--

If I give a brief about the spl. markup introduced for the filled-up portions at the HW level, probably Jim might appreciate my idea and take up necessary action further (as I intended).

While in vast majority cases, the "padding" is done at the front of the compound word (as ⁅X⁆°Y), in just 91 cases it is done at the end (as X°⁅Y⁆).

I had presumed that we should somehow have the difference, and thus used the spl. markers '⁅ ⁆'; though the regular '[ ]' could've been used, as it has been used for other purposes in the print, I had thought of having a separate mark to avoid ambiguity.

Jim is requested to recall his opinion on the topic [as note 2 in L-12291.AB.revised_JF.txt, while I was working at MD last], wrt the status in MW.

Now, what use did I have in mind for this marker in practice?

  • In case, if (and when) we decide to have the main text (i.e. the header portion) itself changed with the "padded strings" [as done in case of MW], the markers will come in handy.

  • We can programmatically match the strings in the k2-field (with marker) and in the following header portion (without marker) and change the header strings easily, and then remove the markers in the k2-field.
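
A sketch of that matching step, under the assumption that a padded k2 item has the shape ⁅X⁆°Y (or X°⁅Y⁆) and that the header shows the same string without the padded part. All function names are hypothetical and the sample header is an invented placeholder, not a pw entry.

```python
import re

PADDING = re.compile(r"⁅[^⁆]*⁆")

def header_form(k2_item):
    """The string as printed in the header: the padded part is absent."""
    return PADDING.sub("", k2_item)

def expanded_form(k2_item):
    """The full word: markers and the ° abbreviation sign removed."""
    return k2_item.replace("⁅", "").replace("⁆", "").replace("°", "")

def expand_header(header, k2_items):
    """Replace abbreviated header strings by their padded counterparts from k2."""
    for item in k2_items:
        if "⁅" in item:
            header = header.replace(
                "{#" + header_form(item) + "#}",
                "{#" + expanded_form(item) + "#}",
            )
    return header

# Invented placeholder data, used only to show the mechanics.
k2_items = ["guRa", "⁅guRa⁆°tA"]
print(expand_header("{#guRa#}, {#°tA#}¦ <lex>f.</lex>", k2_items))
```

After the header has been rewritten this way, the markers in k2 could be dropped with `expanded_form`, which is the last step described above.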

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Mar 25, 2024
funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Mar 25, 2024
funderburkjim added a commit to sanskrit-lexicon/csl-apidev that referenced this issue Mar 25, 2024
@funderburkjim
Contributor Author

temp_pw_9c.zip has the few changes mentioned by AB above.

change_9b_9c.txt has the changes.

This version is now installed at Cologne.

Additional revisions of repositories csl-corrections, csl-apidev, hwnorm1 (see commit links above).

The final version changes about 42000 lines out of 764942, or about 5% of lines.
There are now about 12000 'alternate' headwords for pw.
This work has taken about 6 weeks.

Now closing this issue. Will make a 'placeholder' issue for some additional TODOs.
