Issue with the Ntriple file from Kaggle #1

Open
vemonet opened this issue Apr 2, 2020 · 5 comments

Comments

vemonet commented Apr 2, 2020

First I would like to thank you for this KG and its documentation!

I tried to deploy your notebooks on my infrastructure (in JupyterLab, running as the root user).

I ran into issues when loading the N-Triples file provided on Kaggle: https://www.kaggle.com/group16/covid19-literature-knowledge-graph

  • encoding: loading the downloaded RDF into the rdflib graph in your notebook produces this error: http://dbpedia.org/datatype/polishZ\u0142oty does not look like a valid URI, trying to serialize this will break.
  • missing language tag: Ontotext GraphDB failed to load the file with this error: datatype rdf:langString requires a language tag

Not sure if the encoding issue is due to my environment (running Ubuntu 18.04)
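
To see how many triples are affected before converting anything, a scan along the following lines may help. It is only a heuristic sketch, not part of the original notebooks: `check_line` and `summarize` are hypothetical helper names, the regular expression treats anything between `<` and `>` as an IRI (so angle brackets inside literals could cause false positives), and it assumes the datatype is written as the full rdf:langString IRI, as N-Triples has no prefixes.

```python
import re
from collections import Counter

# Assumption: in N-Triples the datatype is spelled out as a full IRI.
LANGSTRING = "^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString>"
# Heuristic: anything between angle brackets is treated as an IRI.
IRI_RE = re.compile(r"<([^<>\s]*)>")


def check_line(line):
    """Return a list of problem labels for one N-Triples line."""
    problems = []
    # An explicit rdf:langString datatype means the literal has no language tag.
    if LANGSTRING in line:
        problems.append("langString-without-tag")
    for iri in IRI_RE.findall(line):
        # Flag \uXXXX escapes and raw non-ASCII characters inside IRIs.
        if "\\u" in iri or any(ord(ch) > 127 for ch in iri):
            problems.append("non-ascii-iri")
            break
    return problems


def summarize(lines):
    """Count each problem label over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(check_line(line))
    return counts
```

Passing an open file handle for the downloaded file to `summarize` gives a rough count for each of the two problems.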

I found a rather clean way to solve those issues:

apt-get install raptor2-utils
rapper -i ntriples -o turtle kg.nt > ugent-covid-kg.ttl
  • then replace ^^rdf:langString with ^^<http://www.w3.org/2001/XMLSchema#string>:
find ugent-covid-kg.ttl -type f -exec sed -i "s/\^\^rdf:langString/^^<http:\/\/www.w3.org\/2001\/XMLSchema#string>/g" {} +

# Or keep langString and use english tag as default
find ugent-covid-kg.ttl -type f -exec sed -i "s/\^\^rdf:langString/@en/g" {} +
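
For environments without sed (or to run the fix inside a notebook), the same substitution can be sketched in Python. This is an illustrative equivalent under the same assumptions as the commands above: `fix_langstring` is a hypothetical helper, and like sed it replaces every occurrence of `^^rdf:langString` blindly, without parsing the Turtle.

```python
# Replacement datatype used when no default language is given.
XSD_STRING = "^^<http://www.w3.org/2001/XMLSchema#string>"


def fix_langstring(turtle_text, default_lang=None):
    """Replace untagged rdf:langString datatypes in Turtle text.

    With default_lang=None the literals become xsd:string;
    otherwise they get the given language tag (e.g. "en").
    """
    replacement = "@" + default_lang if default_lang else XSD_STRING
    return turtle_text.replace("^^rdf:langString", replacement)
```

Reading ugent-covid-kg.ttl, passing its contents through `fix_langstring`, and writing the result back should match the effect of either sed command, depending on whether `default_lang` is set.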

I uploaded the Notebooks to this GitHub repository and detailed the process to download the ntriples: https://github.com/MaastrichtU-IDS/covid-kg-notebooks/#download-data

I loaded the graph into a GraphDB triplestore. It can be browsed, and its URIs resolved, with this web browser:
http://trek.semanticscience.org/describe?uri=http://idlab.github.io/covid19#ffe663e4ef5018da41f057533520b9d85ec86e18&endpoint=https://graphdb.dumontierlab.com/repositories/covid-kg

I will add a search index and HCLS descriptive metadata soon, if you are interested.

@GillesVandewiele
Owner

Legend! Thanks for writing this out, we will try to integrate this in our pipeline so that the issue is resolved for next versions.

@vemonet
Author

vemonet commented Apr 21, 2020

Hi, I noticed that the latest version available on Kaggle seems to have solved those encoding issues, thanks!

The version 11 file (500M) is half the size of the version 9 file (1G).

I cannot find MeSH keywords in the latest version (they were previously defined using http://idlab.github.io/covid19#paragraphEntities ).

I can only find DBpedia mappings, defined using http://idlab.github.io/covid19#hasConcept

Is this expected?
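
One way to compare which relations each version contains is to count the predicates in the N-Triples file. The sketch below is an assumption-laden illustration rather than anything from the repository: `predicate_counts` is a hypothetical helper that relies on the N-Triples rule that subject and predicate contain no whitespace, and it does not validate the triples.

```python
from collections import Counter


def predicate_counts(lines):
    """Count predicate IRIs over an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Subject and predicate are the first two whitespace-separated tokens.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("<"):
            counts[parts[1].strip("<>")] += 1
    return counts
```

Running it on an open handle for each version of the file and looking up http://idlab.github.io/covid19#paragraphEntities and http://idlab.github.io/covid19#hasConcept in the result would show whether either predicate is present at all.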

@GillesVandewiele
Owner

Hi, @bsteenwi made some changes to the final version to reduce the size. He did indeed remove some of the relations, but I am not sure which ones exactly...

@bsteenwi
Collaborator

Hi, the latest version of the KG is indeed missing some links.
We recreated the KG with concepts extracted using DBpedia Spotlight and tried to find correlations between papers based on these concepts.

I will update the mapping scripts in this repository, so it is easier to see which relations are available.

@vemonet
Author

vemonet commented Apr 22, 2020

OK, we were planning to integrate your KG with the MeSH vocabulary and complementary resources (other COVID publication KGs, drug and pathway databases, etc.). We are less interested in the DBpedia mappings, mainly due to data quality issues.

Do you plan to make the MeSH annotations available again soon?
I could look into re-running the code you wrote to generate them, but if you plan to put them back, that would be even better :)

A small note also: for MeSH URIs you are using HTTPS (e.g. https://id.nlm.nih.gov/mesh/D007251),
whereas the MeSH vocabulary and prefix.cc use HTTP (http://id.nlm.nih.gov/mesh/).
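
Until that is changed upstream, the scheme could be rewritten on the consumer side. A minimal sketch, assuming the MeSH IRIs appear in angle brackets in the N-Triples file; `normalize_mesh` is a hypothetical helper and only touches IRIs, not URLs that happen to appear inside literals:

```python
MESH_HTTPS = "https://id.nlm.nih.gov/mesh/"
MESH_HTTP = "http://id.nlm.nih.gov/mesh/"


def normalize_mesh(line):
    """Rewrite HTTPS MeSH IRIs on one N-Triples line to the HTTP form."""
    # Matching on the leading "<" restricts the rewrite to IRI references.
    return line.replace("<" + MESH_HTTPS, "<" + MESH_HTTP)
```

Applying it line by line while copying the file keeps every other triple unchanged.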

Thanks!

3 participants