Corpus Description
The corpus for this machine learning competition provides a graph of research publications linked to the datasets used in that research, plus other metadata. It uses the ADRF vocabulary, which is based on DCAT and other vocabularies.
The corpus is provided in both TTL and JSON-LD serialization formats. The former is more human-readable and can have axioms applied for consistency checking, while the latter is generally more usable for machines. It takes about two lines of Python to convert between the two formats.
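As an illustration of that conversion, here is a minimal sketch that assumes the third-party `rdflib` package (version 6 or later, which bundles JSON-LD support). The text above does not name a library, so this choice, the `convert` helper, and the `SAMPLE_TTL` snippet with its `example.org` URI are all illustrative assumptions, not part of the corpus tooling.

```python
# Tiny made-up Turtle document, used only to exercise the sketch.
SAMPLE_TTL = """
@prefix dct: <http://purl.org/dc/terms/> .

<http://example.org/dataset-1>
    dct:title "Example Dataset" ;
    dct:alternative "ExDS" .
"""


def convert(text, source_format, target_format):
    """Parse RDF text in one serialization and return it in another.

    Format names follow rdflib's conventions, e.g. "turtle" and "json-ld".
    """
    # rdflib is a third-party package; importing it here keeps this
    # sketch loadable even where it is not installed.
    from rdflib import Graph

    graph = Graph()
    graph.parse(data=text, format=source_format)
    return graph.serialize(format=target_format)
```

Calling `convert(SAMPLE_TTL, "turtle", "json-ld")` returns the JSON-LD text; passing that result back with the two format names swapped converts in the other direction.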
Entities in the graph each have a `title` field and an `id` (a generated UUID), and as much as possible they are uniquely identified by persistent identifiers.
These linked data annotations have been verified by domain experts.
entity | persistent identifier | required fields | optional fields | notes |
---|---|---|---|---|
dataset | ADRF vocab ID | `id`, `title`, `provider` | `doi`, `url`, `alt_title` (list), `description`, `date` | |
provider | ROR | `id`, `title` | `ror`, `url`, `description` | |
publication | DOI | `id`, `title`, `datasets` (list), `journal` | `doi`, `url`, `pdf` (open access URL) | open access PDFs are downloaded and provided in a public S3 bucket |
journal | ISSN | `id`, `titles` (list), `issn` (list) | `url` | first element in `titles` list is the ISO 4 standard abbreviation; first element in `issn` list is the linking ISSN |
author | ORCID | `id`, `title` | `orcid`, `url` | |
topic | madsrdf:Authority | `label` | | |
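
The required fields in the table can be checked mechanically. The sketch below assumes each entity has already been loaded into a plain Python dict keyed by the field names above. That layout, the `REQUIRED_FIELDS` mapping, and the `missing_fields` helper are hypothetical and not part of the corpus scripts.

```python
# Required fields per entity type, copied from the table above.
REQUIRED_FIELDS = {
    "dataset": ("id", "title", "provider"),
    "provider": ("id", "title"),
    "publication": ("id", "title", "datasets", "journal"),
    "journal": ("id", "titles", "issn"),
    "author": ("id", "title"),
    "topic": ("label",),
}


def missing_fields(kind, record):
    """Return the required fields that are absent or empty in `record`.

    `kind` is one of the entity names in REQUIRED_FIELDS; `record` is
    assumed to be a dict mapping field names to values.
    """
    missing = []
    for field in REQUIRED_FIELDS[kind]:
        value = record.get(field)
        # Treat None, empty strings, and empty lists as missing.
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing
```

For instance, `missing_fields("dataset", {"id": "x", "title": "t"})` reports that `provider` is missing, while a record carrying every required field yields an empty list.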
For examples of how to read and write the corpus files in Python, in both TTL and JSON-LD formats, see the `write_corpus()` method in the `gen_ttl.py` script.
The corpus will be extended over time, with updates managed using GitHub tags and versioning. After each update, previous entries in the leaderboard will get re-evaluated.
Note that names in the `dct:alternative` field are merely informational: they record what our human annotators encountered when reading PDFs to identify dataset references manually.
For the purposes of the competition, the ML models don't need to use them in any way.