Consolidate statistics #10

Open
chiarcos opened this issue Aug 13, 2020 · 0 comments
For every dataset (stable and experimental), provide a langs.tsv file and a lang-pairs.tsv file in the root directory of the dataset.

Use the following structure:

langs.tsv:
TAG<TAB>FILE<TAB>ENTRIES<TAB>LICENSE

TAG: primary BCP47 language tag, omitting subtags, e.g., en for en-US
FILE: OntoLex RDF file, which may be inside a (zip or other) archive. A file within an archive should be separated from the archive path with a colon (:)
ENTRIES: number of lexical entries (i.e., number of lexical entry URIs)
LICENSE: license acronym

example:

en<TAB>ontolex/archive.zip:en/dict1.ttl<TAB>10000<TAB>CC-BY 4.0

Note that multiple dictionaries per language variety can exist.
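
To illustrate the layout, here is a minimal Python sketch of how a consumer might read langs.tsv. It assumes the file has no header row and that FILE uses forward-slash relative paths, so the first colon can be taken as the archive separator; neither point is fixed by this issue. The names `LangRow` and `parse_langs` are hypothetical.

```python
import csv
import io
from typing import NamedTuple, Optional


class LangRow(NamedTuple):
    tag: str
    archive: Optional[str]  # None when FILE is not inside an archive
    path: str               # path of the OntoLex RDF file (within the archive, if any)
    entries: int
    license: str


def parse_langs(text: str) -> list[LangRow]:
    """Parse the content of a langs.tsv file (assumed to have no header row)."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        tag, file_, entries, license_ = fields
        # Assumption: the first ":" separates the archive path from the member path.
        archive, sep, member = file_.partition(":")
        if sep:
            rows.append(LangRow(tag, archive, member, int(entries), license_))
        else:
            rows.append(LangRow(tag, None, file_, int(entries), license_))
    return rows


sample = "en\tontolex/archive.zip:en/dict1.ttl\t10000\tCC-BY 4.0\n"
rows = parse_langs(sample)
```

Since multiple dictionaries per language variety can exist, a consumer should expect several rows with the same TAG and should not key a lookup on TAG alone.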

lang-pairs.tsv:
SRC<TAB>TGT<TAB>FILE<TAB>ROWS<TAB>SOURCES

SRC: source language tag (see TAG above)
TGT: target language tag (see TAG above)
FILE: TIAD-TSV file (see FILE above)
ROWS: number of rows in FILE, i.e., number of translation pairs. FILE must not contain duplicates.
SOURCES: one or more source files; each should correspond to a FILE entry in langs.tsv so that the license can be recovered
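
The constraints on a lang-pairs.tsv row could be checked roughly as follows. This is only a sketch: `check_pair_row` is a hypothetical helper, it assumes the TIAD-TSV file has no header row, and it takes SOURCES as an already-split list because this issue does not say how multiple sources are delimited within the column.

```python
from collections import Counter


def check_pair_row(declared_rows, tiad_lines, sources, langs_files):
    """Return a list of problems found for one lang-pairs.tsv row.

    declared_rows: the ROWS value from lang-pairs.tsv
    tiad_lines:    lines of the TIAD-TSV file named in FILE (assumed headerless)
    sources:       the SOURCES entries, already split by the caller
    langs_files:   set of FILE values taken from langs.tsv
    """
    problems = []
    lines = [line for line in tiad_lines if line.strip()]  # ignore blank lines

    # ROWS must equal the number of translation pairs in FILE.
    if len(lines) != declared_rows:
        problems.append(f"ROWS is {declared_rows} but FILE has {len(lines)} rows")

    # FILE must not contain duplicates.
    duplicates = [line for line, count in Counter(lines).items() if count > 1]
    if duplicates:
        problems.append(f"FILE contains {len(duplicates)} duplicated row(s)")

    # Each source must match a langs.tsv FILE entry so the license can be recovered.
    for source in sources:
        if source not in langs_files:
            problems.append(f"SOURCES entry not found in langs.tsv: {source}")

    return problems
```

An empty result means the row passed all three checks; a consolidation script could run this over every row and fail if any list is non-empty.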
