isabelline/BioNLPDatasets

BioNLPDatasets

Repo for Bio NLP Resources

Contents

  • Named Entity Recognition
  • Named Entity Normalization
  • Relation Extraction
  • Large Scale Pubmed Corpus

Named Entity Recognition

  • Drug Protein NER

Disease

  • NCBI Disease Corpus: 793 PubMed abstracts containing 6,892 disease mentions, which map to 790 unique disease concepts from Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM). 91% of the mentions map to a single disease concept. The corpus is divided into training, development, and test sets.

Mutation Mentions of various kinds (Protein, DNA...)

  • tmVar: The tmVar corpus contains 500 PubMed articles manually annotated with mutation mentions of various kinds.

Chemical Disease Interaction

  • BC5CDR: The BC5CDR corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions.

  • CTD: The weakly labeled corpus used by Peng et al. (2016) consists of 18,410 abstracts and 33,224 chemical-induced disease (CID) relations. The raw data was extracted from curated data in the CTD-Pfizer collaboration, which provides document-level annotations of drug-disease and drug-phenotype interactions.
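The two chemical-disease corpora above differ a lot in size, so a per-document relation density can help when comparing them. The following is a minimal sketch that uses only the counts quoted in this list; the `CorpusStats` class and its field names are hypothetical helpers introduced for illustration and are not part of either dataset's distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Hypothetical container for the counts quoted in this README."""
    name: str
    documents: int
    relations: int

    @property
    def relations_per_document(self) -> float:
        # Average number of annotated relations per document.
        return self.relations / self.documents


# Counts copied from the list entries above; labels are illustrative only.
CORPORA = [
    CorpusStats("BC5CDR", documents=1500, relations=3116),
    CorpusStats("Weakly labeled (Peng et al., 2016)", documents=18410, relations=33224),
]

for corpus in CORPORA:
    print(f"{corpus.name}: {corpus.relations_per_document:.2f} relations per document")
```

These ratios are only a rough guide: the second corpus carries document-level, weakly labeled annotations, so its density is not directly comparable to manually annotated mentions.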

Chemical and Drug

Relation Extraction

Gene-Disease

  • GAD: The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases, including summary data extracted from publications on candidate gene and GWAS studies. We use GAD for the development of a corpus on associations between genes and diseases (downloaded on January 21st, 2013).

  • EU-ADR: The corpus has been annotated for drugs, disorders, genes, and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations, three experts annotated a set of 100 abstracts.

Chemical-Protein

Protein-Protein

  • PPI: A new and much improved binarization of BioInfer, as reported in Heimonen et al., "Complex-to-Pairwise Mapping of Biological Relationships using a Semantic Network Representation".

Drug-Drug Interaction

Drug-ADE

  • ADE: Corpus from "Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports".

  • TAC2017: The DDIExtraction2013 Shared Task focuses on extraction of drug-drug interactions.

  • SMM4H: Fourth Social Media Mining for Health (#SMM4H) Shared Task at ACL 2019

  • ADRMine: Corpus from "Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts".

Large Scale Pubmed Corpus

Pubmed

  • PubMed Phrases: The dataset contains a collection of 705,915 PubMed Phrases (Kim et al., 2018) that are beneficial for information retrieval and human comprehension.

Useful Links
