ejf/hwnorm1c #4

Open
funderburkjim opened this issue Feb 25, 2016 · 2 comments

Comments

@funderburkjim

The ejf/hwnorm1c directory initially contains the normalization program used in the Cologne hwnorm1 display.

In the hwnorm1c.txt file, there is one line for each normalized spelling.

For instance, the first line and its explanation:

a:a:AP,AP90,BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MCI,MD,MW,MW72,PD,PE,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT;aM:PD,SKD;aH:PD,SKD

The line has two parts, separated by the first colon:
first part: a   (the normalized spelling)
second part: a:AP,AP90,BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MCI,MD,MW,MW72,PD,PE,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT;aM:PD,SKD;aH:PD,SKD

The second part contains a sequence of parts, separated by semicolons. There are three such parts in this example:
1. a:AP,AP90,BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MCI,MD,MW,MW72,PD,PE,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
2. aM:PD,SKD
3. aH:PD,SKD

Each of these three parts itself has two parts, separated by a colon:
  a non-normalized spelling,
  and a comma-separated list of the dictionaries that have a headword with this spelling.

Here's another example:

uBayavat:uBayavat:MW;uBayavant:PW,PWG

The distrib.py program provides code to parse the lines of hwnorm1c.txt into a series of HWnormc objects.
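
The line format described above can be split with plain string operations. The following is only a minimal sketch of that idea, not the code in distrib.py; the function name `parse_hwnorm1c_line` and the tuple layout it returns are assumptions made for illustration.

```python
def parse_hwnorm1c_line(line):
    """Split one hwnorm1c.txt line into (normalized spelling, variants).

    Hypothetical helper: each variant is a pair
    (non-normalized spelling, list of dictionary codes).
    """
    # The first colon separates the normalized spelling from the rest.
    normkey, rest = line.rstrip("\r\n").split(":", 1)
    variants = []
    # The rest is a semicolon-separated sequence of spelling:dictionaries.
    for part in rest.split(";"):
        spelling, dictlist = part.split(":")
        variants.append((spelling, dictlist.split(",")))
    return normkey, variants


normkey, variants = parse_hwnorm1c_line("uBayavat:uBayavat:MW;uBayavant:PW,PWG")
print(normkey)
for spelling, dictionaries in variants:
    print(" ", spelling, dictionaries)
```

For the second example line, this yields the normalized spelling uBayavat with two variants: uBayavat in MW, and uBayavant in PW and PWG.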

distrib.txt counts how many normalized spellings occur in exactly one dictionary, exactly two dictionaries, etc.
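
A count of this kind could be computed roughly as follows. This is a sketch under assumptions, not the actual distrib.py logic: the function name `dictionary_count_distribution` is hypothetical, and it assumes a normalized spelling "occurs in" a dictionary when any of its non-normalized spellings lists that dictionary.

```python
from collections import Counter


def dictionary_count_distribution(lines):
    """Map 'number of distinct dictionaries' to 'number of normalized spellings'.

    Hypothetical helper; input lines follow the hwnorm1c.txt format.
    """
    distribution = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _normkey, rest = line.split(":", 1)
        dictionaries = set()
        for part in rest.split(";"):
            _spelling, dictlist = part.split(":")
            # A dictionary listed under several variants is counted once.
            dictionaries.update(dictlist.split(","))
        distribution[len(dictionaries)] += 1
    return distribution


sample = ["uBayavat:uBayavat:MW;uBayavant:PW,PWG"]
for ndicts, nspellings in sorted(dictionary_count_distribution(sample).items()):
    print(ndicts, nspellings)
```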

@funderburkjim

The normalize_key function in hwnorm1c.py is used to compute a normalized spelling for any given headword. All spellings are in SLP1 transliteration.

Here is an explanation of the current normalization rules, as copied from here:

hwnorm1 normalization rules

These rules are independent of the dictionary.

Use homorganic nasal rather than anusvara
normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'
'ttr' is 'tr' (pattra v. patra)
ending 'ant' is 'at'
'cC' is 'C' (Jan 27, 2015)
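
To make the rules concrete, here is a rough sketch of how they might be applied with regular expressions. It is not the normalize_key function from hwnorm1c.py. The function names are hypothetical, the order in which the rules are applied is an assumption, 'rxx' is read as "r followed by a doubled consonant", and the homorganic nasal rule is assumed to apply only to anusvara (M) before a stop consonant.

```python
import re

# SLP1 stops grouped by class, with the nasal of each class.
NASAL_FOR_STOPS = [
    ("kKgG", "N"),  # velar
    ("cCjJ", "Y"),  # palatal
    ("wWqQ", "R"),  # retroflex
    ("tTdD", "n"),  # dental
    ("pPbB", "m"),  # labial
]
SLP1_CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def homorganic_nasal(key):
    """Replace anusvara 'M' before a stop by that stop's class nasal."""
    for stops, nasal in NASAL_FOR_STOPS:
        key = re.sub("M(?=[%s])" % stops, nasal, key)
    return key


def normalize_sketch(key):
    """Apply the listed rules in list order (the order is an assumption)."""
    key = homorganic_nasal(key)
    # 'rxx' -> 'rx' and 'fxx' -> 'fx', reading x as any single consonant.
    key = re.sub(r"([rf])([%s])\2" % SLP1_CONSONANTS, r"\1\2", key)
    key = re.sub(r"(aM|aH)$", "a", key)  # ending 'aM' or 'aH' is 'a'
    key = re.sub(r"uH$", "u", key)       # ending 'uH' is 'u'
    key = re.sub(r"iH$", "i", key)       # ending 'iH' is 'i'
    key = key.replace("ttr", "tr")       # pattra v. patra
    key = re.sub(r"ant$", "at", key)     # ending 'ant' is 'at'
    key = key.replace("cC", "C")
    return key


for headword in ["aM", "aH", "uBayavant", "pattra"]:
    print(headword, "->", normalize_sketch(headword))
```

Under these assumptions, aM and aH both normalize to a, uBayavant to uBayavat, and pattra to patra, which agrees with the hwnorm1c.txt examples above.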

@funderburkjim

The 'redo.sh' script recomputes

  • hwnorm1c.txt sanhw1.txt (as it appears in the CORRECTIONS repository)
  • distrib.txt
