Revise graph compilation #11

chiarcos · 2020-08-13T08:56:41Z

Update /stable/dicts-w-legend.gif and /dicts/dicts-w-legend.gif.
Revise scripts (stable/scripts/build-dict-graph.sh, stable/scripts/build-dict-graph-incl-exp.sh) such that they use only the statistics in the files langs.tsv and lang-pairs.tsv that each data set should provide (issue #10).
The classification of language codes is currently hard-wired in both of these scripts. Create a mapping file in scripts (/stable/scripts/langs.tsv) with the following tab-separated columns:

TAG NAME GROUP AFFILIATION

with
TAG: BCP-47 tag (primary language tag only, ignore everything after -)
NAME: name according to ISO 639-3
GROUP: major language group or geographic region
AFFILIATION: major language group (free text) or other comments

e.g.,

en English GERMANIC Germanic, Indo-European
zh Chinese EAST_ASIA Sino-Tibetan
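
A minimal sketch of how a script could read such a mapping file, assuming Python and a header-less, tab-separated file with the four columns described above. The function names `primary_tag` and `load_lang_map` are hypothetical and are not part of the existing scripts:

```python
import csv
import io


def primary_tag(tag):
    """Return the primary language subtag of a BCP-47 tag (the part before the first '-')."""
    return tag.split("-", 1)[0].lower()


def load_lang_map(lines):
    """Parse TAG, NAME, GROUP, AFFILIATION rows into a dict keyed by primary TAG."""
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, comment lines, and rows lacking TAG, NAME and GROUP.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        # AFFILIATION is treated as optional here (an assumption of this sketch).
        row = row + [""] * (4 - len(row))
        tag, name, group, affiliation = (field.strip() for field in row[:4])
        mapping[primary_tag(tag)] = {
            "name": name,
            "group": group,
            "affiliation": affiliation,
        }
    return mapping


# The two example rows from above, with real tab separators.
sample = (
    "en\tEnglish\tGERMANIC\tGermanic, Indo-European\n"
    "zh\tChinese\tEAST_ASIA\tSino-Tibetan\n"
)
lang_map = load_lang_map(io.StringIO(sample))
print(lang_map[primary_tag("en-GB")]["group"])  # prints GERMANIC
```

Looking up codes through `primary_tag` means that tags such as `en-GB` or `zh-Hant` resolve to the same row as `en` and `zh`.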

Current set of GROUPs:
Indo-European (different shades of grey):
GERMANIC
CELTIC
ROMANCE
ITALIC
SLAVIC
BALTIC
IRANIAN
INDIAN (incl. Romani)
OTHER_IE (Albanian, Greek, Armenian, Anatolian, Tokharian, etc.)
Note that Pidgins and Creoles are classified along with the language they derive from, e.g., English-based Creoles are grouped with English (but mark that under AFFILIATION).
Note that artificial languages based on European languages, e.g., Esperanto, are not considered Indo-European.

Other languages (different colors):
AFROASIATIC (called SEMITIC in the script)
ALTAIC (Turkic, Mongolic, Tungusic, excluding Korean and Japanese)
URALIC
DRAVIDIAN
CAUCASIAN (NE, SW, NW Caucasian)
PACIFIC (native languages of Australia, Papua-New Guinea, Austronesian, incl. Malagasy)
SUBSAHARIC (native languages of Africa, excluding Afroasiatic and immigrant languages such as Malagasy)
EAST_ASIA (languages of Eastern and Southern Asia that are neither Indo-European, Dravidian, Austronesian nor Altaic)
AMERICA (native languages of North or South America)

In addition, some languages remain unclassified (e.g., artificial languages, Basque, Sumerian, Elamite, etc.)
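
The GROUP inventory above could be checked when the mapping file is loaded, so that typos in the GROUP column do not silently produce new categories. The following is only a sketch: the label `UNCLASSIFIED` and the function names `normalize_group` and `palette_kind` are assumptions made for illustration, since the issue does not name a label for unclassified languages or say how they are drawn.

```python
# Indo-European groups (drawn in shades of grey according to the list above).
IE_GROUPS = {
    "GERMANIC", "CELTIC", "ROMANCE", "ITALIC", "SLAVIC",
    "BALTIC", "IRANIAN", "INDIAN", "OTHER_IE",
}

# All other groups (drawn in different colors according to the list above).
OTHER_GROUPS = {
    "AFROASIATIC", "ALTAIC", "URALIC", "DRAVIDIAN", "CAUCASIAN",
    "PACIFIC", "SUBSAHARIC", "EAST_ASIA", "AMERICA",
}

# Hypothetical fallback label; not defined in the issue.
UNCLASSIFIED = "UNCLASSIFIED"


def normalize_group(group):
    """Return a known GROUP label, or UNCLASSIFIED for anything else."""
    group = group.strip().upper()
    if group in IE_GROUPS or group in OTHER_GROUPS:
        return group
    return UNCLASSIFIED


def palette_kind(group):
    """Return 'grey' for Indo-European groups, 'color' for other groups, else None."""
    group = normalize_group(group)
    if group in IE_GROUPS:
        return "grey"
    if group in OTHER_GROUPS:
        return "color"
    return None  # styling of unclassified languages is left open here


print(normalize_group("germanic"), palette_kind("EAST_ASIA"))
```

Keeping the two sets separate mirrors the grey versus color distinction in the graph legend, while the fallback keeps languages such as Basque or Esperanto out of both.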

Note that the current set of GROUPs is not meant to be linguistically adequate; its different levels of granularity (coarse-grained geographic region, macro-family, language family) merely reflect the composition of the dataset. Separating the mapping table from the code is a first step towards implementing a more linguistically adequate classification.


chiarcos added the enhancement label Aug 13, 2020

chiarcos assigned max-ionov Aug 13, 2020

chiarcos added this to the version 1.0 milestone May 28, 2021
