With or without linebreaks #419

Open
drdhaval2785 opened this issue Jan 17, 2024 · 9 comments
Labels
Documentation How TXT , XML work

Comments

@drdhaval2785
Contributor

Dear all,
This issue has been on my mind for a long time.
Many CDSL dictionaries keep the line breaks of the printed dictionaries; many others do not.

This issue is devoted to deciding whether line breaks are useful or not.

Pros

  1. Helps locate a sentence in the printed dictionary, because the digitized text looks almost the same as the print.
  2. May help in highlighting the blocks in the PDF for a given Lnum.

Cons

  1. Unnecessary hyphenation.
  2. Hyphenated words that fall on the edge of lines cannot be searched without some hack.
  3. In many applications / frontends, the display breaks at line breaks where a continuous display would have been better. Tablets, mobiles and browsers have different screen sizes. On smaller screens such as mobiles, one line of CDSL data may wrap onto a second line and then break abruptly. This was particularly true of StarDict apps.
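
To illustrate con 2, here is a minimal sketch of the kind of hack needed before a hyphenated word can be searched. The function name and the sample text are hypothetical and not taken from any CDSL file. It assumes every end-of-line hyphen is a print hyphenation, which is not always true: a genuine compound hyphen at a line end would be removed as well.

```python
import re

def join_hyphenated_lines(text: str) -> str:
    """Rejoin words split by an end-of-line hyphen, then flatten line breaks.

    Assumption: a hyphen directly before a newline is print hyphenation.
    """
    # "concu-\npiscere" -> "concupiscere"
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Remaining line breaks become plain spaces.
    return joined.replace("\n", " ")

# Hypothetical sample with one hyphenated line end and one ordinary line end.
sample = "petere, appetere, desiderare, concu-\npiscere. BR. 2. 11.\n12. 13. 16."
flat = join_hyphenated_lines(sample)
print(flat)
```

A search for "concupiscere" fails on `sample` but succeeds on `flat`.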

Should we switch from keeping line breaks to dropping them?

The question deserves attention, because @Andhrabharati submits his major corrections in the latter format. If we agree in principle to go with that format, we would avoid the hassle of analysing diffs, which takes a lot of time.

What about invertibility?

We can have a json with lnum as key. It will hold "old" and "new" text blobs. So, it will be possible to go back and forth.

If we make a change to the new data, the diff can be computed and carried over to the old data, if desired.

Only changes at line ends cannot be carried back computationally. They will have to be handled manually.
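
A minimal sketch of such a store, assuming a plain dict keyed by lnum and serialized with Python's `json` module. The helper names and the sample blobs are hypothetical. The whitespace comparison shows why ordinary line ends invert cleanly while hyphenated line ends need manual handling.

```python
import json

def make_store(entries):
    """Build {lnum: {"old": ..., "new": ...}} from (lnum, old, new) tuples."""
    return {lnum: {"old": old, "new": new} for lnum, old, new in entries}

def same_modulo_linebreaks(old: str, new: str) -> bool:
    """True when two blobs differ only in whitespace, including line breaks."""
    return old.split() == new.split()

# Hypothetical blobs: "old" keeps the print line break, "new" does not.
old = "{#arT#}¦ 10. {%P.%} petere, postulare\n{%r%} in vocalem {%i%}."
new = "{#arT#}¦ 10. {%P.%} petere, postulare {%r%} in vocalem {%i%}."

store = make_store([("578", old, new)])
serialized = json.dumps(store, ensure_ascii=False, indent=1)
restored = json.loads(serialized)

print(same_modulo_linebreaks(old, new))                       # plain line end
print(same_modulo_linebreaks("concu-\npiscere", "concupiscere"))  # hyphenated line end
```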

Historical experience

We made a quantum jump when we moved from Anglicised Sanskrit (AS) to IAST / metaline. We also had an invertibility principle then.
But in practice, no one has ever shown any interest in carrying changes made to the IAST version back to the AS version. The same may happen here. Much ado about nothing.

View

My view is that we should do away with line breaks.

What do others say?

@drdhaval2785
Contributor Author

[Image: example of ugly line breaks in the frontend of StarDict apps.]

@Andhrabharati

Andhrabharati commented Jan 17, 2024

@drdhaval2785

I am glad that your very first 'keen' attempt of looking into my file(s) 'prompted' you to think of changing the 'stand' (that stood for many years now).

I know (for sure) that Jim could add a few more Cons to your list and I have many more (but that is not worth spending my time on).

And apart from 'leaving away' the line-breaks, I make several other 'important' structural changes in my files.
[Probably you will notice them and bring them on board for discussion/voting, as you spend some more time looking at the various files that I have posted.]

Coming to retaining the line-breaks, I think they should be retained at "verse blocks" in VCP and SKD, which often span multiple columns (and even multiple pages). Reading such long unbroken matter would be a bad experience, as the reader's mind is now, more or less, 'tuned' to the "semantic breaks" introduced in printing. But within the 'prose' paragraphs, they can be done away with.

Now coming to the Pros that you had listed--

As I understand, the need for looking at the print dictionary comes up, to compare the digital text, mostly for correcting the errors being reported by the users (or otherwise). How is this being dealt with in the case of MW, which is the most-reported work [I would roughly estimate it to be 90-95% in the user feedback], whose digital text does not contain the line-breaks and has also deviated (a bit too) much from the print (except for keeping the page-column info intact)?

@Andhrabharati

Speaking of the MW digital text format, I thought I should 'leak' that my current work 'prompted' me to make some major structural changes in it, some of them moving closer to the print matter.

I am sure that this would create some hiccups, if (and when) I post my MW work.

@drdhaval2785
Contributor Author

Any thoughts @funderburkjim?

@gasyoun gasyoun added the Documentation How TXT , XML work label Feb 4, 2024
@gasyoun
Member

gasyoun commented Feb 4, 2024

@drdhaval2785

> We can have a json with lnum as key. It will hold "old" and "new" text blobs. So, it will be possible to go back and forth.

And double the size of each dictionary?

> My view is that we should do away with line breaks.

I'm for it. But that would take years for just this one task and stop all the others. Is it worth it now?

@Andhrabharati

> need for looking at the print dictionary comes up, to compare the digital text, mostly for correcting the errors being reported by the users (or otherwise).

exactly

> MW, that is the mostly reported work [I would roughly estimate it to be 90-95% in the user feedback]

right

@Andhrabharati

> But that would take years for just this one task and stop all the others, is it worth now?

On what basis did you arrive at this, Marcis?

I have been doing this (removing or alternatively marking the line-breaks) in just a few minutes for each CDSL dictionary that I work on!

@funderburkjim
Contributor

> (removing or alt. marking the line-breaks)

In AB's revision to the MD dictionary (sanskrit-lexicon/csl-orig@2dffafb), he introduced the
convention of using a special character to indicate line breaks: 🞄 (U+1F784).

{#a#}¦ <hom>1.</hom> a, <ab>pn.</ab> {%root used in the inflexion of%} 
idam 🞄{%and in some particles%}: a-tra, a-tha.

make_xml.py can 'ignore' this character, so it doesn't get in the way of displays.

This seems like a good solution.
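
A minimal sketch of what 'ignoring' the character could look like at display time. This is not the actual make_xml.py code; the function name is hypothetical and the sample line is built from the MD example above.

```python
# U+1F784, the line-break marker character used in the MD revision.
LB_MARK = "\U0001F784"

def strip_linebreak_marker(line: str) -> str:
    """Drop the line-break marker so it never reaches the display."""
    return line.replace(LB_MARK, "")

source_line = "idam " + LB_MARK + "{%and in some particles%}: a-tra, a-tha."
display_line = strip_linebreak_marker(source_line)
print(display_line)
```

The marker stays in xxx.txt, so the print line ends remain recoverable, while the display shows continuous text.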

As a general point, I think that the preservation of line breaks has served its purpose.
The original 'later' digitizations provided by @maltenth honored line breaks -- this was in part to
help in the internal double-entry error detection process applied to the output of the Sanskrit typists.

Then, when I came to make displays for these later dictionaries, I thought it was best to preserve line breaks in the displays to aid in correction investigation.

We are now in the process of making major revisions to these original forms -- adding markup, tooltips, links, etc. to make the dictionary displays more useful. These changes also provide the basis for future NLP-type work with the dictionary corpora (e.g. DAtu extraction).

So line-break preservation is no longer as useful as it once was. For some dictionaries (Burnouf and Apte90 come to mind), I used a <lbinfo> tag to preserve line-break info. But I think AB's use of a special character is better -- it doesn't get in the way as much.

Current opinion: for CDSL dictionaries where line-breaks are currently preserved, use the special character. But feel free to use multiline forms in xxx.txt (e.g. at the <div n="pfx">) for semantically meaningful breaks. For example:

OLD (CURRENT)
<L>578<pc>018-b<k1>arT<k2>arT
{#arT#}¦ 10. {%P.%} (v. {#arTa#}) petere, postulare (gr. <lang n="greek">αἰτέω</lang> dissoluto
{%r%} in vocalem {%i%}, cf. {#arTa#}).
<div n="pfx">c. {#pra#} petere, appetere, desiderare, concupiscere. BR. 2. 11.
12. 13. 16. IN. 5. 33. SU. 1. 26. 3. 11.
<div n="pfx">c. {#sam#} cogitare, putare, existimare. UR. 18. 9. 18. 5. infr.
<LEND>

NEW? 
<L>578<pc>018-b<k1>arT<k2>arT
{#arT#}¦ 10. {%P.%} (v. {#arTa#}) petere, postulare (gr. <lang n="greek">αἰτέω</lang> dissoluto 🞄{%r%} in vocalem {%i%}, cf. {#arTa#}).🞄
<div n="pfx">c. {#pra#} petere, appetere, desiderare, concupiscere. BR. 2. 11.🞄 12. 13. 16. IN. 5. 33. SU. 1. 26. 3. 11.🞄
<div n="pfx">c. {#sam#} cogitare, putare, existimare. UR. 18. 9. 18. 5. infr.
<LEND>
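
A rough sketch of how the OLD form could be turned into something like the NEW form. It assumes a single convention, a space followed by the marker at each joined line end, and it keeps real line breaks only before the `<L>`, `<div` and `<LEND>` lines. The function name is hypothetical. Unlike the hand-made NEW example, it does not put a marker at the end of a line that precedes a `<div`.

```python
LB_MARK = "\U0001F784"  # the line-break marker character

def mark_linebreaks(entry_lines):
    """Join continuation lines with ' ' + marker; keep structural lines apart."""
    out = []
    for line in entry_lines:
        starts_block = line.startswith(("<L>", "<div", "<LEND>"))
        # Never merge into the <L> metaline; never merge a structural line.
        if out and not starts_block and not out[-1].startswith("<L>"):
            out[-1] = out[-1] + " " + LB_MARK + line
        else:
            out.append(line)
    return out

old_entry = [
    "<L>578<pc>018-b<k1>arT<k2>arT",
    "{#arT#}¦ 10. {%P.%} (v. {#arTa#}) petere, postulare dissoluto",
    "{%r%} in vocalem {%i%}, cf. {#arTa#}).",
    '<div n="pfx">c. {#pra#} petere, appetere. BR. 2. 11.',
    "12. 13. 16. IN. 5. 33.",
    '<div n="pfx">c. {#sam#} cogitare, putare.',
    "<LEND>",
]
new_entry = mark_linebreaks(old_entry)
print("\n".join(new_entry))
```

The sample entry is abridged from the Burnouf-style example above, so the exact text is illustrative only.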

@funderburkjim
Contributor

funderburkjim commented Feb 6, 2024

@Andhrabharati Do you convert line-breaks (`\n`) to

  • `🞄` (one character)
  • ` 🞄` (two characters, space before 🞄)
  • `🞄 ` (two characters, space after 🞄)

Also, how do you handle end-of-line hyphens?

@vvasuki
Copy link
Collaborator

vvasuki commented Feb 7, 2024

> make_xml.py can 'ignore' this character, so it doesn't get in the way of displays.
>
> This seems like a good solution.

No. If you're using XML, use a (specially defined) XML tag and not some ad hoc special-meaning symbol.
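
A minimal sketch of the tag-based alternative, assuming an empty element named `<lb/>` (the name TEI uses for line beginnings; CDSL has not defined one in this thread). The helper name and the wrapping `<body>` element are hypothetical. The parse step only checks that the converted line is still well-formed XML.

```python
import xml.etree.ElementTree as ET

LB_MARK = "\U0001F784"  # the line-break marker character

def marker_to_tag(line: str, tag: str = "lb") -> str:
    """Replace each marker character with an empty XML element."""
    return line.replace(LB_MARK, "<" + tag + "/>")

txt_line = ("{#a#}¦ <hom>1.</hom> a, <ab>pn.</ab> {%root used in the inflexion of%} idam "
            + LB_MARK + "{%and in some particles%}: a-tra, a-tha.")
xml_line = marker_to_tag(txt_line)

# Wrap in a root element and parse to confirm well-formedness.
root = ET.fromstring("<body>" + xml_line + "</body>")
print(len(root.findall("lb")))
```

With a tag, a display can skip or render the break using ordinary XML tooling instead of special-casing one character.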

5 participants