Revise graph compilation #11

Open
chiarcos opened this issue Aug 13, 2020 · 0 comments

Update /stable/dicts-w-legend.gif and /dicts/dicts-w-legend.gif.
Revise scripts (stable/scripts/build-dict-graph.sh, stable/scripts/build-dict-graph-incl-exp.sh) such that they use only the statistics in the files langs.tsv and lang-pairs.tsv that each data set should provide (issue #10).
The classification of language codes is currently hard-wired in both of these scripts. Create a mapping file in scripts (/stable/scripts/langs.tsv) with the following tab-separated columns:

TAG NAME GROUP AFFILIATION

with
TAG: BCP-47 tag (primary language tag only, ignore everything after -)
NAME: name according to ISO 639-3
GROUP: major language group or geographic region
AFFILIATION: major language group (free text) or other comments
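The TAG rule above (keep only the primary language tag, ignore everything after -) could be sketched as follows. This is a minimal illustration, not code from the existing scripts; the function name `primary_subtag` is hypothetical, and lowercasing the result is an assumption about how tags should be normalized.

```python
def primary_subtag(tag: str) -> str:
    """Return the primary language subtag of a BCP-47 tag.

    Hypothetical helper: everything after the first '-' is dropped,
    as the TAG column description asks. Lowercasing is an assumption.
    """
    return tag.strip().split("-", 1)[0].lower()


if __name__ == "__main__":
    for example in ("en-GB", "zh-Hant-TW", "de"):
        print(example, "->", primary_subtag(example))
```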

e.g.,

en English GERMANIC Germanic, Indo-European
zh Chinese EAST_ASIA Sino-Tibetan
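Reading such a mapping file might look like the sketch below. It assumes four tab-separated columns as described above, and it tolerates an optional header row and a missing AFFILIATION column; both are assumptions, not decisions made in this issue. The names `LangInfo` and `load_lang_mapping` are hypothetical.

```python
import csv
import io
from typing import Dict, NamedTuple


class LangInfo(NamedTuple):
    name: str
    group: str
    affiliation: str


def load_lang_mapping(stream) -> Dict[str, LangInfo]:
    """Parse TAG, NAME, GROUP, AFFILIATION rows from a tab-separated stream."""
    mapping: Dict[str, LangInfo] = {}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip empty lines
        if row[0].strip() == "TAG":
            continue  # skip an optional header row (assumption)
        if len(row) < 3:
            raise ValueError(f"expected at least 3 columns, got: {row!r}")
        # AFFILIATION is treated as optional here (assumption)
        affiliation = row[3].strip() if len(row) > 3 else ""
        mapping[row[0].strip()] = LangInfo(row[1].strip(), row[2].strip(), affiliation)
    return mapping


# The two example rows from this issue, with explicit tabs.
SAMPLE = (
    "en\tEnglish\tGERMANIC\tGermanic, Indo-European\n"
    "zh\tChinese\tEAST_ASIA\tSino-Tibetan\n"
)

if __name__ == "__main__":
    for tag, info in load_lang_mapping(io.StringIO(SAMPLE)).items():
        print(tag, info.group)
```

In practice the stream would be an opened /stable/scripts/langs.tsv rather than the in-memory sample.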

Current set of GROUPs:
Indo-European (different shades of grey):
GERMANIC
CELTIC
ROMANCE
ITALIC
SLAVIC
BALTIC
IRANIAN
INDIAN (incl. Romani)
OTHER_IE (Albanian, Greek, Armenian, Anatolian, Tokharian, etc.)
Note that pidgins and creoles are classified along with the language they derive from, e.g., English-based creoles are classified like English (but note this under AFFILIATION).
Note that artificial languages based on European languages, e.g., Esperanto, are not considered Indo-European.

other languages (different colors)
AFROASIATIC (called SEMITIC in the script)
ALTAIC (Turkic, Mongolic, Tungusic, excluding Korean and Japanese)
URALIC
DRAVIDIAN
CAUCASIAN (NE, NW and South Caucasian)
PACIFIC (native languages of Australia, Papua-New Guinea, Austronesian, incl. Malagasy)
SUBSAHARIC (native languages of Africa, excluding Afroasiatic and immigrant languages such as Malagasy)
EAST_ASIA (languages of Eastern and Southern Asia that are neither Indo-European, Dravidian, Austronesian nor Altaic)
AMERICA (native languages of North or South America)

In addition, some languages remain unclassified (e.g., artificial languages, Basque, Sumerian, Elamite, etc.).
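Once the GROUP values come from the mapping file, the split between grey shades and colors could be decided as in the sketch below. It only uses the group names listed above; the function name `color_class`, the returned labels, and the handling of the legacy name SEMITIC as an alias for AFROASIATIC are assumptions for illustration.

```python
INDO_EUROPEAN_GROUPS = frozenset({
    "GERMANIC", "CELTIC", "ROMANCE", "ITALIC", "SLAVIC",
    "BALTIC", "IRANIAN", "INDIAN", "OTHER_IE",
})

OTHER_GROUPS = frozenset({
    "AFROASIATIC", "ALTAIC", "URALIC", "DRAVIDIAN", "CAUCASIAN",
    "PACIFIC", "SUBSAHARIC", "EAST_ASIA", "AMERICA",
})

# The current scripts call AFROASIATIC "SEMITIC"; accept both (assumption).
LEGACY_ALIASES = {"SEMITIC": "AFROASIATIC"}


def color_class(group: str) -> str:
    """Return 'grey' for Indo-European groups, 'color' for the other
    listed groups, and 'unclassified' for anything else."""
    group = LEGACY_ALIASES.get(group.strip().upper(), group.strip().upper())
    if group in INDO_EUROPEAN_GROUPS:
        return "grey"
    if group in OTHER_GROUPS:
        return "color"
    return "unclassified"


if __name__ == "__main__":
    for g in ("GERMANIC", "SEMITIC", "EAST_ASIA", ""):
        print(repr(g), "->", color_class(g))
```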

Note that the current set of GROUPs is not meant to be linguistically adequate; its differing levels of granularity (coarse-grained geographic region, macro-family, language family) merely reflect the composition of the dataset. Separating the mapping table from the code is a first step towards implementing a more linguistically adequate classification.
