Update /stable/dicts-w-legend.gif and /dicts/dicts-w-legend.gif.
Revise scripts (stable/scripts/build-dict-graph.sh, stable/scripts/build-dict-graph-incl-exp.sh) such that they use only the statistics in the files langs.tsv and lang-pairs.tsv that each data set should provide (issue #10).
The classification of language codes is currently hard-wired in both of these scripts. Create a mapping file in scripts (/stable/scripts/langs.tsv) with the following tab-separated columns:
TAG NAME GROUP AFFILIATION
with
TAG: BCP-47 tag (primary language tag only, ignore everything after -)
NAME: name according to ISO 639-3
GROUP: major language group or geographic region
AFFILIATION: major language group (free text) or other comments
e.g.,
en English GERMANIC Germanic, Indo-European
zh Chinese EAST_ASIA Sino-Tibetan
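A minimal sketch of how a script might read such a mapping file and classify a language code. It assumes the file has no header row and that unknown tags fall back to a label called UNCLASSIFIED; the function names and that label are hypothetical, not taken from the existing scripts.

```python
import csv
import io

# Hypothetical sample content; the real file would be /stable/scripts/langs.tsv.
SAMPLE_LANGS_TSV = (
    "en\tEnglish\tGERMANIC\tGermanic, Indo-European\n"
    "zh\tChinese\tEAST_ASIA\tSino-Tibetan\n"
)


def load_groups(handle):
    """Map each primary language tag (TAG column) to its GROUP column."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        tag, _name, group = row[0], row[1], row[2]
        groups[tag.lower()] = group
    return groups


def primary_tag(bcp47):
    """Keep only the primary language subtag, e.g. 'zh-Hant-TW' -> 'zh'."""
    return bcp47.split("-", 1)[0].lower()


def classify(bcp47, groups, default="UNCLASSIFIED"):
    """Return the GROUP for a BCP-47 tag, or the (assumed) default label."""
    return groups.get(primary_tag(bcp47), default)


groups = load_groups(io.StringIO(SAMPLE_LANGS_TSV))
print(classify("en-GB", groups))
print(classify("zh-Hant-TW", groups))
print(classify("eu", groups))
```

To use the real file instead of the sample, pass `open("langs.tsv", newline="", encoding="utf-8")` to `load_groups`.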
Current set of GROUPs:
Indo-European (different shades of grey):
GERMANIC
CELTIC
ROMANCE
ITALIC
SLAVIC
BALTIC
IRANIAN
INDIAN (incl. Romani)
OTHER_IE (Albanian, Greek, Armenian, Anatolian, Tokharian, etc.)
(Note that pidgins and creoles are classified along with the language they derive from, e.g., English-based creoles are grouped like English; mark this under AFFILIATION.
Note that artificial languages based on European languages, e.g., Esperanto, are not considered Indo-European.)
Other languages (different colors):
AFROASIATIC (called SEMITIC in the script)
ALTAIC (Turkic, Mongolic, Tungusic, excluding Korean and Japanese)
URALIC
DRAVIDIAN
CAUCASIAN (NE, NW, and South Caucasian)
PACIFIC (native languages of Australia and Papua New Guinea, plus Austronesian languages, incl. Malagasy)
SUBSAHARIC (native languages of Africa, excluding Afroasiatic and immigrant languages such as Malagasy)
EAST_ASIA (languages of Eastern and Southern Asia that are neither Indo-European, Dravidian, Austronesian nor Altaic)
AMERICA (native languages of North or South America)
In addition, some languages remain unclassified (e.g., artificial languages, Basque, Sumerian, Elamite, etc.)
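The GROUP list above could be encoded next to the mapping table so that the scripts can decide whether a language gets a shade of grey or a distinct color. This is only a sketch: the set names and the `group_kind` helper are hypothetical, and no concrete color values are assumed.

```python
# GROUP values listed in this issue; the constant names are illustrative.
IE_GROUPS = {
    "GERMANIC", "CELTIC", "ROMANCE", "ITALIC", "SLAVIC",
    "BALTIC", "IRANIAN", "INDIAN", "OTHER_IE",
}
OTHER_GROUPS = {
    "AFROASIATIC", "ALTAIC", "URALIC", "DRAVIDIAN", "CAUCASIAN",
    "PACIFIC", "SUBSAHARIC", "EAST_ASIA", "AMERICA",
}


def group_kind(group):
    """Tell how a GROUP value would be drawn in the graph legend."""
    if group in IE_GROUPS:
        return "indo-european"  # drawn in a shade of grey
    if group in OTHER_GROUPS:
        return "other"  # drawn in a distinct color
    return "unclassified"  # anything not in the current GROUP list


print(group_kind("GERMANIC"))
print(group_kind("EAST_ASIA"))
print(group_kind("SEMITIC"))
```

A check like this would also flag rows whose GROUP still uses an old label, such as SEMITIC, once the mapping file replaces the hard-wired classification.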
Note that the current set of GROUPs is not meant to be linguistically adequate; its differing levels of granularity (coarse-grained geographic region, macro-family, language family) merely reflect the composition of the dataset. Separating the mapping table from the code is a first step towards implementing a more linguistically adequate classification.