vac2a - picture data in TE #19

Open
funderburkjim opened this issue Mar 18, 2021 · 23 comments

@funderburkjim
Contributor

In the process of preparing hiatus-corrections (#18),
I discovered that there are some (about 200) awkward lines in the vac2.txt (Tirupati data).

These lines were selected from vac2 based on one of two criteria:

  1. the corresponding line of vcp.txt (Cologne data) is <Picture> or
  2. the vac2 line is 300 or more characters long.
    • this is twice as long as any vcp.txt line

About half the 200 lines thus selected actually satisfy both criteria.
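
For concreteness, a minimal sketch of how such a selection could be scripted, assuming vac2.txt and vcp.txt are line-aligned UTF-8 files and that the Cologne placeholder is literally the string <Picture>; the function name and output are hypothetical illustrations, not the actual workflow used.

```python
# Hypothetical sketch of the two selection criteria described above.
# Assumes vac2.txt and vcp.txt are line-aligned, UTF-8 text files.
def select_awkward_lines(vac2_path="vac2.txt", vcp_path="vcp.txt", max_len=300):
    with open(vac2_path, encoding="utf-8") as vac2, \
         open(vcp_path, encoding="utf-8") as vcp:
        for lnum, (vac2_line, vcp_line) in enumerate(zip(vac2, vcp), start=1):
            is_picture = vcp_line.strip() == "<Picture>"        # criterion 1
            is_long = len(vac2_line.rstrip("\n")) >= max_len    # criterion 2
            if is_picture or is_long:
                yield lnum, is_picture, is_long

if __name__ == "__main__":
    hits = list(select_awkward_lines())
    both = sum(1 for _, pic, lng in hits if pic and lng)
    print(f"{len(hits)} lines selected; {both} satisfy both criteria")
```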

funderburkjim added a commit that referenced this issue Mar 18, 2021
@funderburkjim
Contributor Author

For the lines so identified, two actions were taken (a sketch of both follows below):

  1. The lines of both vac2 and vcp2 were put into the file vac2a_picturedata.txt. We may later want to try to understand why these lines are present in the vac2 version of Tirupati data; but at the current level of study, these lines are just in the way.
  2. A new version of vac2.txt was made called vac2a.txt. In this version, those 200 lines are just represented by a '?' character.
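
A minimal sketch of how these two steps might be scripted, again assuming the files are line-aligned and that the set of selected line numbers is already known; the function name and the record layout written to vac2a_picturedata.txt are guesses for illustration, not the repository's actual format.

```python
# Hypothetical sketch of the two actions described above.
# Assumes vac2.txt and vcp2.txt are line-aligned and that `selected`
# holds the (1-based) numbers of the ~200 lines identified earlier.
def split_picture_data(selected, vac2="vac2.txt", vcp2="vcp2.txt",
                       out_main="vac2a.txt", out_side="vac2a_picturedata.txt"):
    selected = set(selected)
    with open(vac2, encoding="utf-8") as f_vac2, \
         open(vcp2, encoding="utf-8") as f_vcp2, \
         open(out_main, "w", encoding="utf-8") as main_out, \
         open(out_side, "w", encoding="utf-8") as side_out:
        for lnum, (vac2_line, vcp2_line) in enumerate(zip(f_vac2, f_vcp2), start=1):
            if lnum in selected:
                # park both versions of the line for later study
                side_out.write(f"{lnum} vac2: {vac2_line}")
                side_out.write(f"{lnum} vcp2: {vcp2_line}")
                main_out.write("?\n")   # placeholder line in vac2a.txt
            else:
                main_out.write(vac2_line)
```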

@gasyoun
Member

gasyoun commented Mar 18, 2021

@Andhrabharati do I understand right that those pages with images should be rescanned, or would just cutting them out as they are be enough?

@gasyoun gasyoun added the bug label Mar 18, 2021
@Andhrabharati

just cutting is enough.

@Andhrabharati

Probably the issues I talked about in the meta2 file could be done first, as they are kind of "identified":

  1. ai & au cases
  2. Picture cases
  3. Column/Tabular data places

And now the abbr. place corrections.

@Andhrabharati

Andhrabharati commented Mar 18, 2021

I guess the tabular data could simply be rendered as / marked, instead of just with a space (or with varying Cn tagging).

And I was thinking of taking it up now.

@Andhrabharati

Here is the list of tables and pictures in the Vacaspatya, made with reference to the print pages (for whatever it is worth).

List of tables (tabular data) and Pictures.txt

@gasyoun
Member

gasyoun commented Mar 18, 2021

I guess the tabular data could simply be rendered as / marked, instead of just with a space, or with varying Cn tagging.

We can use the GitHub markup language there, right?

funderburkjim added a commit that referenced this issue Mar 18, 2021
@funderburkjim
Contributor Author

I've reorganized the file names a bit in the visible part of the repository.

  • vcp.txt contains the latest version of the Cologne digitization
    • this is in metaline format
    • at some point, the version of vcp.txt in this repository will be moved to the
      production version at csl-orig/v02/vcp/vcp.txt. But meanwhile the version in
      this 'vcp' repository will be ahead of the production version.
  • vac2.txt contains the latest version of the Tirupati digitization
  • vcp2.txt is based on vcp.txt, but in a format comparable to that of vac2.
    Also, its 'difference number' field is based on both vac2.txt and vcp.txt.
  • readme_changes.txt will hold a log so we can keep track of the changes.

Thus far, I've implemented changes pertaining to:

And am looking for other low-hanging fruit

@gasyoun
Member

gasyoun commented Mar 18, 2021

And am looking for other low-hanging fruit

Are you sure we are ready to go as deep as headword issues for now?
@drdhaval2785 are the API issues closed? It seems to be endless, this
journey inside the Sanskrit-Sanskrit dictionaries that nobody even uses.

@drdhaval2785
Contributor

I think the API issue is closed.
Any further API development would be needed as and when, during frontend development, we need some information in a specific format. We need a frontend developer for the same. The API is closable from my side.

Regarding Sanskrit-Sanskrit dictionaries, they are actually the ones many people like me use exclusively. If I get no hits in SKD and VCP, only then do I turn to MW.
So, if dictionaries were to be corrected for texts, I would put these two at a much higher priority. Priorities may differ in Europe or America, but on the Indian subcontinent, Sanskrit-Sanskrit dictionaries are widely used.

@Andhrabharati

One might compare how MW and VCP grew up from the same roots: WIL, SKD, and the German one (PWG). I guess VCP would include many of the corrections to PWG noted in PWK, as Taranatha had the full set of manuscripts belonging to the Vedic branch in his possession (or accessible to him).

@Andhrabharati

Monier-Williams had to "wait" for PWK and other works to be published. MW99 also has a couple of entries taken from Apte90 (by Cappeller, as mentioned in its front pages).

@Andhrabharati

Are you sure we are ready to go as deep as headword issues for now?

This reminds me of the work I started back in 2016; I had finished the HW correction for the vowels part. It had treated double (multiple) HWs much better than the exercise at Cologne (Usha and Jim).

@Andhrabharati

Andhrabharati commented Mar 19, 2021

The Meld exercise is to look (mainly) at the differing lines in vcp2 and vac2 (and refer to scans to decide the corrections).

But I strongly feel the other lines also need to be read once against the scans, as both the digitisations (TPT and Koeln) have erred at many places.

@gasyoun
Member

gasyoun commented Mar 19, 2021

We need a frontend developer for the same.

I might have found one. I need your understanding of the tasks in more detail.

The API is closable from my side.

Got it. What was the next priority in your Dec 2020 list?

Sanskrit-Sanskrit dictionaries are widely used

Widely used, and yet only 2 people from India are interested in cleaning this vast ocean. All of our time and energy could go here and every other task would stop, because there is no end if we go that deep inside these two oceans.

VCP would include many of the corrections to PWG noted in PWK

Guess not; it sounds just like some fantasy.

Taranatha had the full set of manuscripts belonging to the Vedic branch in his possession (or accessible to him)

That indeed sets him apart, but has he used his advantage in full?

Monier-Williams had to "wait" for PWK and other works to be published

Exactly, around 10 years.

It had treated double (multiple) HWs much better than the exercise at Cologne

Not sure I understand what you mean. Can you give a sample, please?

The Meld exercise is to look (mainly) at the differing lines in vcp2 and vac2 (and refer to scans to decide the corrections).

I mean it was not in the 2021 priority list, and it could stop every other task in the list for just this one. There are some minor tasks that only Jim can give an answer to, but solving and integrating VCP corrections would swallow everything. Even MW is huge, but there is no way back. @drdhaval2785 are you personally eager to put your koshas aside and work on VCP as intensively as proposed above? @Andhrabharati is still in the MW pond; the VCP ocean might not be what he is interested in, I do not know. He works like a bull, but is that something you both can concentrate on without distracting Jim from the priorities set 3 months ago?

@Andhrabharati

I am not at MW99; there has been no full-time work on my side for the past few weeks.

@Andhrabharati

Andhrabharati commented Mar 19, 2021

Widely used, and yet only 2 people from India are interested in cleaning

How many people did you get for MW and PWG, to clean/correct them worldwide? (Forget about occasional feedback.)

@funderburkjim
Contributor Author

as both the digitizations (TPT and Koeln) have erred at many places

Agree. In a recent comparison, I have also found differences between TE and the scan; sometimes Cologne is right (agrees with the scan) and TE is wrong, sometimes TE is right and Cologne is wrong; and (probably) sometimes both Cologne and TE disagree with the scan.

@Andhrabharati

Good to see someone getting my intent correctly.

I was just thinking of starting full proofing of the Vacaspatyam on my own.

For me, neither of the digitisations is satisfactory enough, and I see no point in spending time just comparing them and correcting both.

And the way things are moving here to remove the Bengal flavor (dialect) is quite against our (AB) principles of handling the texts.

One might recall how the great Panini never took to normalising words by taking any one school as a standard, but just had them stay side by side.

One can compare different schools, but never let one school override another.

So I would better stay out of this exercise.

@gasyoun
Member

gasyoun commented Mar 21, 2021

I was just thinking of starting full proofing of the Vacaspatyam on my own.

Are you still?

How many people did you get for MW and PWG, to clean/correct them worldwide? (Forget about occasional feedback.)

@Andhrabharati

So isn't 2 far better than 0?

Yes, I might start the work sometime soon; probably after finishing the MW99 annexure portion.

@drdhaval2785
Contributor

@Andhrabharati and @funderburkjim ,

I know that both digitizations are bad. But speaking strictly in mathematical terms, and assuming the two digitizations to be completely independent:

Let us assume 1/100 letters are wrong in each digitization. Then the probability of both digitizations being wrong at the same letter is 1/10000. Do we want to spend precious time on this minuscule fraction? Maybe once we have corrected the differing errors, it may be taken up. Not before.
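
Restating that back-of-the-envelope calculation as a small sketch (the 1/100 per-letter error rate is the assumption stated above, not a measured figure):

```python
# Independence assumption: the two digitizations err independently,
# so the chance of both being wrong at the same letter is the product.
p_tpt = 1 / 100      # assumed per-letter error rate, Tirupati digitization
p_koeln = 1 / 100    # assumed per-letter error rate, Cologne digitization
p_both = p_tpt * p_koeln
print(p_both)        # 0.0001, i.e. 1 letter in 10000 wrong in both at the same spot
```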

@gasyoun
Member

gasyoun commented Mar 22, 2021

Do we want to spend precious time on this minuscule fraction? Maybe once we have corrected the differing errors, it may be taken up. Not before.

Agree.
