Note to self - Understanding the file update process #8
Comments
@drdhaval2785 If you decide to attempt to 'reduce the bloat', suggest you do it in v02. One thing causing the bloat is the long tag names (entrydetails, etc.).

Right. I am doing this in v2 only.

Your comment regarding long tag names causing bloat is quite correct.

Created a file harsa1.xml, which is compact in nature, with minimal duplication. I think there is no fun in trying to complicate downstream applications like sqlite.py and the webtc2 logic to take care of this randomness. Having an alternate XML suffices for my purpose. The Cologne workflow can continue as usual. The file harsa1.xml would be an additional resource, if I want to use it in stardict or other places.
This is a note to @drdhaval2785.
Just noting it here, so that my step-by-step understanding and thinking out loud is not lost.
I am dropping the idea of changing the XML file drastically.
Understand what is happening
I will try to understand the changes a dictionary goes through from the file anhk1.txt to harsa.txt, harsa.xml and harsa.sqlite.
I have only a flimsy knowledge of how this pipeline works in other Cologne dictionaries. I will explore and document, step by step, what the input is, what the output is and which script connects them. This will help me get a better grasp of what to do to minimize the duplication in every possible manner.
Input file - anhk1.txt
This file is the annotated version of the anekārthanāmamālā of Harśakīrti. See sanskrit-lexicon/COLOGNE#405 (comment).
Format
harsa.txt
This file is the SLP1 version of anhk1.txt.
It is generated by the prep/harsa/redo.sh file.
The workhorse is the prep/harsa/convert.py script.
Format
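To make the later discussion concrete, here is a sketch of what one harsa.txt entry presumably looks like, reconstructed from the metaline description further down and the verse quoted later in this note (the exact fields after `<L>` and their values are my assumptions):

```
<L>1<pc>1<k1>ka<k2>ka
sUrye veDasi vAyO kaH kaM suKe mastake jale .
anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..
<LEND>
```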
harsa.txt to csl-orig
csl-orig is the place where Cologne stores its dictionary data.
Generate local displays
```sh
cd csl-pywork
sh generate_dict.sh harsa ../apps/harsa
```
This code does three things:

1. Creates the apps/harsa directory.
2. Runs redo_hw.sh.
3. Runs redo_xml.sh.

The two redo scripts require further examination.
The log generated in the process is quite helpful for understanding what is happening under the hood.
redo_hw.sh
This uses three scripts (hw.py, hw2.py and hw0.py) and generates three output files: harsahw.txt, harsahw2.txt and harsahw0.txt.
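I have not inspected redo_hw.sh itself; a plausible shape, assuming each script is invoked with the input and output files named below (the actual invocations may differ):

```sh
# Hypothetical sketch of redo_hw.sh; argument conventions are assumptions.
python3 hw.py  harsa.txt harsa_hwextra.txt harsahw.txt
python3 hw2.py harsahw.txt  harsahw2.txt
python3 hw0.py harsahw2.txt harsahw0.txt
```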
hw.py
Reads two input files: harsa.txt and harsa_hwextra.txt (currently blank).
Generates harsahw.txt.
Format of harsahw.txt
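An illustrative line, assuming colon-separated fields in the order L:pc:k1:k2:ln1:ln2 (the separator and field order are my guesses, based on the field list that follows and the harsahw2.txt format shown later):

```
1:1:ka:ka:53:60
```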
Here, L stands for lnum, pc for page-column, k1 for key1, k2 for key2, ln1 for the starting line of the entry in harsa.txt and ln2 for the ending line of the entry.
This would mean that the entry with lnum 1 starts at line 53 and ends at line 60 of harsa.txt. Note that line 53 is the metaline starting with `<L>`, and line 60 is the metaline `<LEND>` that marks the end of the entry. Thus, the range is inclusive of the metalines.

hw2.py
Reads harsahw.txt file as input.
Generates harsahw2.txt file as output.
Format
The entries are in `pc:k1:ln1:ln2:lnum` format.
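Continuing the illustrative values from the harsahw.txt sketch above, such a line might look like this:

```
1:ka:53:60:1
```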
hw0.py
Reads harsahw2.txt file as input.
Generates harsahw0.txt file as output.
In the present case, harsahw2.txt and harsahw0.txt are identical, as there are no differences between key1 and key2 in these Sanskrit koshas.
redo_xml.sh
This script generates harsa.xml from harsa.txt and harsahw.txt, with the help of the make_xml.py script.
make_xml.py
This is the most important workhorse of the whole process. It generates the XML file from the TXT file.
harsa.xml
Format
- The `<h>` tag holds key1 and key2.
- `<body>` holds `<hwdetails>` and `<entrydetails>`.
- `<tail>` holds L and pc.
- hwdetails is a list of hwdetail elements, each holding an (hw-gender, meaning) pair.
- entrydetails is a list of verses.
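Putting this together, a sketch of one entry as it presumably appears in harsa.xml, reconstructed from the description above and the fragments quoted below (the `<H1>` wrapper, the key1/key2 element names and the pc value are my assumptions):

```xml
<H1>
 <h><key1>ka</key1><key2>ka</key2></h>
 <body>
  <hwdetails>
   <hwdetail><hw><s>ka-puM</s></hw><meaning><s>sUrya,veDas,vAyu</s></meaning></hwdetail>
   <hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail>
  </hwdetails>
  <entrydetails>
   <entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail>
   <entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail>
  </entrydetails>
 </body>
 <tail><L>1</L><pc>1</pc></tail>
</H1>
```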
I should stop at this juncture and analyse the information being captured in harsa.xml.
Flaws in harsa.xml
The following is the original information:
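(Reconstructed from the XML fragments quoted below; the meanings of ka-puM are inferred from the verse.)

```
ka-puM   : sUrya, veDas, vAyu
ka-klI   : suKa, mastaka, jala
Sloka-puM: anuzwuB, yaSas
loka-puM : Buvana, jana
```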
This shows that there are four headwords with their associated meanings. Ideally, when I search for sUrya, I should get only the ka-puM headword with its meanings and the verse. But the present record for sUrya in harsa.xml carries the rest as well.

Flaw 1. One can see that there is superfluous inclusion of the other three hwdetail blocks:

```xml
<hwdetail><hw><s>ka-klI</s></hw><meaning><s>suKa,mastaka,jala</s></meaning></hwdetail><hwdetail><hw><s>Sloka-puM</s></hw><meaning><s>anuzwuB,yaSas</s></meaning></hwdetail><hwdetail><hw><s>loka-puM</s></hw><meaning><s>Buvana,jana</s></meaning></hwdetail>
```

Flaw 2. One can see that the entrydetails are copied into all 12 headword records, unnecessarily:

```xml
<entrydetails><entrydetail><s>sUrye veDasi vAyO kaH kaM suKe mastake jale .</s></entrydetail><entrydetail><s>anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</s></entrydetail></entrydetails>
```
Because of these duplications, the file size increases dramatically: harsa.txt at 32 KB bloats to an 810 KB harsa.xml, an almost 25-fold increase. That is OK for a small lexicon such as this one, but there are lexica around 2 MB in size, for which the bloat would be very high.
proposed new harsa1.xml
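A minimal sketch of what such an entry might look like, based on the description that follows (the attribute names and the comma-separated eid list on entrydetail are my assumptions):

```xml
<e>
 <hwdetails>
  <hwdetail L="1" eid="1"><hw>ka-puM</hw><meaning>sUrya,veDas,vAyu</meaning></hwdetail>
  <hwdetail L="1" eid="2"><hw>ka-klI</hw><meaning>suKa,mastaka,jala</meaning></hwdetail>
  <hwdetail L="1" eid="3"><hw>Sloka-puM</hw><meaning>anuzwuB,yaSas</meaning></hwdetail>
  <hwdetail L="1" eid="4"><hw>loka-puM</hw><meaning>Buvana,jana</meaning></hwdetail>
 </hwdetails>
 <entrydetails>
  <entrydetail eid="1,2,3,4">sUrye veDasi vAyO kaH kaM suKe mastake jale .
   anuzwubyaSasoH Sloko lokastu Buvane jane .. 1 ..</entrydetail>
 </entrydetails>
</e>
```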
Here, eid is an extra id, used to identify each hw-meaning pair in anekArthaka koshas. It may similarly be used in samAnArthaka koshas. eid will be continuous throughout the file, so that it is possible to refer internally to a samAnArthaka or anekArthaka group in the dictionary, or from other dictionaries.

If we search for a word among the headwords in the hwdetails section, we will be able to get the L and eid for the searched word. We can use the same to fetch the entry from the entrydetails section, and display only the relevant eid (a code sketch follows below). The entry is shown indented just for the sake of readability; otherwise, it would be on a single line.

As this format reduces the problems of duplication, I will try to explore it. Otherwise, we will stick to the format used by Jim.
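For concreteness, here is a sketch of the lookup flow just described, written against the hypothetical harsa1.xml structure sketched above (element and attribute names are the same assumptions):

```python
import xml.etree.ElementTree as ET

def lookup(word, xml_file="harsa1.xml"):
    """Find word among the hwdetails, then fetch only the matching entries."""
    root = ET.parse(xml_file).getroot()
    hits = []
    # Step 1: scan the hwdetails section; collect (L, eid) for every hwdetail
    # whose headword or meaning list contains the searched word.
    for hwd in root.iter("hwdetail"):
        text = (hwd.findtext("hw") or "") + "," + (hwd.findtext("meaning") or "")
        if word in text.split(","):
            hits.append((hwd.get("L"), hwd.get("eid")))
    eids = {eid for _, eid in hits}
    # Step 2: pull from entrydetails only the entries whose eid list overlaps.
    entries = [ed.text for ed in root.iter("entrydetail")
               if eids & set((ed.get("eid") or "").split(","))]
    return hits, entries
```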
redo_postxml.sh
Generate sqlite file
pywork/sqlite/sqlite.py is the script.
xxx.sqlite is an SQLite database containing a single table, named after the dictionary code ('dictcode').
Each row of the table has three items: key, lnum and line. Each row = (key, lnum, line) tuple is curated from the xxx.xml file. These rows are written into the sqlite file in default batches of 10000 entries.
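This is not the actual sqlite.py, but a minimal sketch of the batched-insert pattern described above (the table schema and function names are assumptions):

```python
import sqlite3

def write_sqlite(dictcode, rows, batch_size=10000):
    """Write (key, lnum, line) tuples into <dictcode>.sqlite in batches."""
    conn = sqlite3.connect(f"{dictcode}.sqlite")
    cur = conn.cursor()
    # One table per dictionary, named after the dictcode (schema assumed).
    cur.execute(f"CREATE TABLE IF NOT EXISTS {dictcode} "
                "(key TEXT, lnum INTEGER, line TEXT)")
    batch = []
    for row in rows:                  # row = (key, lnum, line) from xxx.xml
        batch.append(row)
        if len(batch) == batch_size:  # flush every 10000 rows by default
            cur.executemany(f"INSERT INTO {dictcode} VALUES (?,?,?)", batch)
            batch = []
    if batch:                         # flush the final partial batch
        cur.executemany(f"INSERT INTO {dictcode} VALUES (?,?,?)", batch)
    conn.commit()
    conn.close()
```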
Generate query_dump file
pywork/webtc2/init_query.py is the script.
It generates the query_dump.txt file from the xxx.xml file.
It is used for advanced search.