Alternate headwords for pw #106
markup observations
|
It appears that you are analyzing the pw file data to mark the prospective alt. HWs yourself. I had mentioned earlier (in #104) that the header portion may be looked at to get these words; but there appear to be more entries that contain alt. HWs after the broken bar (which I had missed in my posted file that formed the base for the cdsl version). Here are a few (43 in number) such L-entries found on a quick search: 16678, 22390, 23281, 26293, 28410, There could be more like these, and I hope you will try to get those as well. |
Once you finish your exercise, I might be able to compare your file with my version (not posted so far) for any omissions/changes. |
Yes, I am working on those alt-headwords now. Will share my work with you for comparison, perhaps later this week. |
Now, I've looked for the 'missed' alt. HW entries in the erstwhile pwkvn file, and found 15 such.
These entries have the alt. HWs after the broken bar, as are the pwk (main) entries listed above. |
systematic additional headwords
My approach to determine NEW secondary headwords for an entry is based on
Excluding these two groups, there are approx. 31054 entries which might But there are many patterns in these 31054 entries which disqualify the After excluding this and many other patterns, there remain 10788 candidates. Thus, the approach changes course to apply patterns to INCLUDE subsets of the
The file temp_change_3_01.txt @Andhrabharati Before proceeding much further, I wanted your take on this approach. |
I would suggest 'restricting' this phase to mark and bring-out the 'primary HWs' (as I termed them) [single or 'grouped' entries], that occur at the beginning of the entry in the printed lexicon. We definitely need to bring-out other 'inner' HWs also, that occur in multiple ways, but this could be done in another/next phase - And then, we should look for the composite/compound words that occur inside the body portion of the above entries, and suitably bring them out. BTW, I see 158375 |
My personal opinion is that we should mark these 'secondary' HWs with And then, list those various groups within the 'main' entry somewhere (like the separate althw file seen in some cdsl works), to come under the "search" criterion. This approach would retain the digital text in a form closer to the printed work. |
I merged 5 entries:
|
additional denominative roots
|
Looked for other entries with similar pattern |
All these occur in the erstwhile pwkvn portion; I seem to have skipped marking them. BTW, found one entry L-45920, which has √ mark, but should be with the !√ mark; and it has a typo tilakaya for tilakay |
Those five had missing text in cdsl pw. I have added the text. Note: corrected the tilakaya entry. |
I generated a list of possible missing √ mark entries by
513 found.
details: temp_possible_roots_edit.txt @Andhrabharati Do you agree that √ mark should be added (before ¦) for these 358? |
Yes pl., somehow I had skipped these markings!! |
Found another 8 entries in pwkvn portion, that come under this-- |
Just curious to know your conclusion on how to proceed further on the task, @funderburkjim !! |
interim progress
I've marked the additional 358 roots.
I like this idea and am proceeding to see if I find any more in addition to those you have mentioned in above comments. |
Glad to hear this, @funderburkjim ! Working with a 'common' thinking/process definitely makes the collaborative effort easier, making the comparison (between the two works) quicker and more fruitful. |
I have many more entries that come under the alt.HW type & the 'root' type now. |
altheadwords
This file contains changes for alternate headwords from 2 sources:
@Andhrabharati Please review and provide corrections as needed. Note: I have used the GRA model for deriving extra entries for PW (pw_hwextra.txt) from
If you provide these, I'll make changes for them. |
My file now contains 713 (main) and 682 (vn) lines differing with the CDSL (combined) file, ignoring the meta-lines (as I did not populate the k2-field yet).
If you post your full file (containing all the changes in your 5 steps), I can do a diff with my file and list out the differing lines. |
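A comparison of this kind (differing lines, ignoring the meta-lines) could be sketched roughly as below. This is only an illustration, not the script either of us used: it assumes the two files are line-aligned and that a meta-line is one starting with `<L>`; the helper names `is_meta` and `differing_lines` are made up for the example.

```python
from itertools import zip_longest


def is_meta(line):
    # Assumption: a meta-line is the entry header line, starting with '<L>'.
    return line.startswith("<L>")


def differing_lines(lines_a, lines_b):
    """Return (line_number, a, b) for each position where the two files differ,
    skipping positions where both files have a meta-line."""
    diffs = []
    pairs = zip_longest(lines_a, lines_b, fillvalue="")
    for n, (a, b) in enumerate(pairs, start=1):
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if is_meta(a) and is_meta(b):
            continue  # differences in meta-lines (e.g. the k2 field) are ignored
        if a != b:
            diffs.append((n, a, b))
    return diffs


if __name__ == "__main__":
    cdsl = ["<L>1<pc>1-001<k1>a<k2>a", "{#a#}¦ text", "<LEND>"]
    ab = ["<L>1<pc>1-001<k1>a<k2>a, b", "{#a#}¦ text changed", "<LEND>"]
    for n, a, b in differing_lines(cdsl, ab):
        print(n, a, "=>", b)
```

If the two files do not have the same number of lines, a positional comparison like this reports everything after the first extra line as different, so the line counts should be reconciled first.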
temp_pw_2.zip: my current version. Further details are in the pwkissues/issue106 folder. |
Thank you @funderburkjim for the files. I see that just over 900 lines (616: main and 315: vn) differ between our files. I will go through them tomorrow and, after making any necessary corrections in my file, shall post the differing lines for your perusal and further action. |
Here are the files that I had made--
And the corresponding file from my side: temp_pwkvn_2 (AB).txt After "incorporating" necessary corrections in my file, there are 450 differing lines in the VN portion with the CDSL file
Hope @funderburkjim won't have many issues in using my file. |
Now, coming to the pw main data, here are 206 header portions with dhAtu (√) markup-- Hope, this is convenient enough to be "used" by Jim. There still remain about 390 diff. lines, out of which 34 lines contain the _ (underscore) character. Though most of those could be removed as done by Jim (for slp1 has no scope for confusion of vowel-hiatus [but I wonder if these would all "pass" the round-trip test of conversion to another script like Devanagari or IAST and back!!]), I feel some of them need to be retained as they denote a 'space' character within the Devanagari string. |
Another 30 lines have the BTW, there are quite a few such places in the pwkvn portion as well, which I had already posted above (with the same markup). |
Would you mind explaining about this 1589 number? |
versions 8 and 8a
Work is in pw_8_work directory.
Request @Andhrabharati to apply the changes in the two |
Would you pl. have a look at this post, while I look at the pw_8a file? |
Here are the 3 places where AB likes to debate with Jim's opinion.
;; AB remark: There is no "sAradIya" word that occurs in the literature; the word "SAradIya" has already been mentioned at L-111682, and as such there is no need to repeat the same again here.
;; AB remark: With the adopted norm that the entities having the in-text (...) and [...] be expanded with and without the brackets, this should've been made as an alt. HW group [aBizwipA/, aBizwipA/si].
;; AB remark: I had felt that the two words (forming the name of the work) need not be separated as individual words, and as such marked thus. |
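The norm mentioned in the second remark (expanding in-text (...) and [...] parts both with and without the bracketed material) can be illustrated with a small sketch. The input form `aBizwipA/(si)` is assumed here purely for illustration, and `expand_optional` is a made-up helper name, not something from the cdsl code.

```python
import re

# A bracketed optional part: (...) or [...]
OPTIONAL = re.compile(r"[\(\[]([^\)\]]*)[\)\]]")


def expand_optional(headword):
    """Return the headword variants without and with each bracketed part."""
    m = OPTIONAL.search(headword)
    if m is None:
        return [headword]
    before, after = headword[:m.start()], headword[m.end():]
    variants = []
    # First drop the bracketed text, then keep it (without the brackets).
    for candidate in (before + after, before + m.group(1) + after):
        for expanded in expand_optional(candidate):
            if expanded not in variants:
                variants.append(expanded)
    return variants


if __name__ == "__main__":
    print(expand_optional("aBizwipA/(si)"))
```

For the assumed input this gives the group `aBizwipA/`, `aBizwipA/si`, i.e. the alt. HW group proposed in the remark.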
Now, about the slp1 hiatus places-- BTW, there is another place where it is not required, "it does exist" [at L-2991]! |
My present version data has additional differences [in non-metalines] in pwk_main (few: ~150) and pwkvn (lot many: ~15k) portions; but the comparison could probably be stopped here. |
On second thought, I have 'modified' both CDSL and AB files a bit; now, the difference line count is just over 700. And, here are the modified files-- pw integrated (AB) v1 (for CDSL).zip |
Each of these versions has 764942 lines, and my uploaded temp_pw_8a.txt has 764934 lines -- where do the extra 8 lines in your versions come from? Also, in the vn section of both your versions, you omit the |
Yes, it is the same file with some changes done inside.
Yes, for time being. [And I thought of not doing any more 'independent' updates in it from my side.]
I have added extra blank lines after the
Do you want me to upload the files with the info tags retained as is? |
If you can do that readily, then yes. Otherwise I can find a way to do it. |
They are not immediately available; I need to spend a little time to make them. Probably, it might be better if you do it yourself. |
Re 'L=124384' -- in your files, you have re This is not proper -- since |
OK, I'll do that. |
My mistake; initially I had reverted the line in my file to match yours; later I posted the comments but did not correct my file accordingly. This is how I wanted it to be--
So, do we go with the two words separately marked as {#ISvare#} (Loc.) {#nityasuKAvasTApanam#}? |
Yes - I can't think of a better solution at the moment. I've found the extra lines. That's all my questions for now -- will proceed with analysis/implementation of your changes. |
Regarding '_': You are definitely right that a round-trip of transcoding of X (slp1 -> hk -> slp1) does not result in X when X has certain properties (such as an 'ai' or 'au' hiatus, also 'bh', 'gh', and maybe a few other cases). A similar comment applies to IAST in place of hk. My view has been that iast and hk should be viewed as faulty and/or incomplete transcoding schemes for Devanagari. cdsl could take upon itself the task of extending hk and iast to 'remedy' such problems. But I have not thought the user reward for such a task is great enough to justify the effort, since such anomalies are rare. While thinking about this, I noticed that the 'simple-search (input=simple)' display needs to be revised so that 'prauga' (MW) yields not only 'prOga' (slp1) but also 'prauga' (slp1). |
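To make the round-trip point concrete, here is a toy sketch. It is not the cdsl transcoder: the mapping covers only the four cases named above (slp1 E, O, B, G versus hk ai, au, bh, gh), all other characters are passed through unchanged, and the function names are made up for the example.

```python
# Partial, illustrative mapping only; a real slp1/hk table is much larger.
SLP1_TO_HK = {"E": "ai", "O": "au", "B": "bh", "G": "gh"}
HK_TO_SLP1 = {hk: slp1 for slp1, hk in SLP1_TO_HK.items()}


def slp1_to_hk(text):
    # Every slp1 character maps independently.
    return "".join(SLP1_TO_HK.get(ch, ch) for ch in text)


def hk_to_slp1(text):
    # hk digraphs are read greedily, so 'a' + 'u' is always taken as 'au'.
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in HK_TO_SLP1:
            out.append(HK_TO_SLP1[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def round_trips(text):
    return hk_to_slp1(slp1_to_hk(text)) == text


if __name__ == "__main__":
    for word in ("prOga", "prauga"):
        print(word, "->", slp1_to_hk(word), "->", hk_to_slp1(slp1_to_hk(word)))
```

In this sketch slp1 'prOga' survives the round trip, while slp1 'prauga' (with the a-u hiatus) comes back as 'prOga', because hk has no way to write the two vowels separately.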
I've seen that slp1 itself also has the drawback of failing in the round-trip conversion, deva - slp1 - deva (or slp1 - deva - slp1) at such places!! |
temp_pw_9b.txt
This incorporates almost all of AB's latest batch of changes. The changes are also integrated into the displays (locally): @Andhrabharati When you sign off on temp_pw_9b.txt, I'll install it at Cologne. |
I'll believe it when I see it! I doubt that the Ralph Bunker/Peter Scharf implementation of slp1-deva transcoding has an invertibility problem, but it may be that my implementation is imperfect. When (if) you encounter such an instance, open a new issue and provide full details, so I can reproduce the problem, and hopefully correct any such imperfections. |
Great to see that practically no differences exist between the two versions. Here are the final changes--
and the remaining 3 lines are the only 'rare' cases having the ° mark within the string (in the digital text; probably there might be few more, which would come out if and when a full proofing takes place to match the file data with the print - i.e. typo errors) [should we make these changes? if so, what's the best way to do so?]
|
This is one of the longest sessions that took place-- though at many times going beyond the "subject matter" (due to my 'uncontrolled' way of corrections!)-- but bringing the text into a good form now. I would like Jim to think of opening two more issues
I shall take responsibility for these two tasks (the first one does not need much time, and which only I can do [as of now]), but the 2nd one might take a week or so [which Jim could also try out as in GRA initially, and then I had jumped in to give finishing touches jointly]. Look forward to know what Jim decides on this. |
Finally here is the concluding post from my side at this issue--
While in the vast majority of cases the "padding" is done at the front of the compound word (as ⁅X⁆°Y), in just 91 cases it is done at the end (as X°⁅Y⁆). I had presumed that we should somehow have the difference, and thus used the spl. markers '⁅ ⁆'; though the regular '[ ]' could've been used, as it has been used for other purposes in the print, I had thought of having a separate mark to avoid ambiguity. Jim is requested to recall his opinion on the topic [as note 2 in L-12291.AB.revised_JF.txt, while I was working at MD last], wrt the status in MW. Now, what use did I have in mind for this marker in practice?
|
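As a rough illustration of how the two padding positions could be told apart mechanically, here is a sketch that looks only at the '⁅ ⁆' markers and the ° sign as described above. The function name and the sample strings are hypothetical, and real entries may need a more careful pattern.

```python
import re

# Padding before the ° sign, as in ⁅X⁆°Y
FRONT_PADDING = re.compile(r"⁅[^⁆]*⁆°")
# Padding after the ° sign, as in X°⁅Y⁆
END_PADDING = re.compile(r"°⁅[^⁆]*⁆")


def classify_padding(text):
    """Return 'front', 'end', or None for a compound string."""
    if FRONT_PADDING.search(text):
        return "front"
    if END_PADDING.search(text):
        return "end"
    return None


if __name__ == "__main__":
    for sample in ("⁅X⁆°Y", "X°⁅Y⁆", "XY"):
        print(sample, classify_padding(sample))
```

Counting the 'end' results over the file would be one way to cross-check the figure of 91 cases.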
Ref: sanskrit-lexicon/PWK#106 Installed temp_pw_9c.txt
temp_pw_9c.zip has the few changes mentioned by AB above. change_9b_9c.txt has the changes. This version is now installed at Cologne. Additional revisions of repositories csl-corrections, csl-apidev, hwnorm1 (see commit links above). The final version changes about 42000 lines out of 764942, or about 5% of lines. Now closing this issue. Will make a 'placeholder' issue for some additional TODOs. |
We tackle the task of generating alternate headwords for pw dictionary.
Preliminary outline of the approach:
Note: no attempt to generate alternate headwords from upasargas of verb entries.