Repo for Bio NLP Resources
- Named Entity Recognition
- Named Entity Normalization
- Relation Extraction
- Large Scale PubMed Corpus
- Drug Protein NER
- NCBI Disease Corpus: 793 PubMed abstracts with 6,892 disease mentions, mapped to 790 unique disease concepts from Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM). 91% of the mentions map to a single disease concept. The corpus is divided into training, development, and test sets.
- tmVar: The tmVar corpus contains 500 PubMed articles manually annotated with mutation mentions of various kinds.
- BC5CDR: The BC5CDR corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions.
- CTD: The weakly labeled corpus used in Peng et al. (2016) consists of 18,410 abstracts and 33,224 CID relations. The raw data was extracted from curated data in the CTD-Pfizer collaboration, with document-level annotations of drug-disease and drug-phenotype interactions.
- GAD: The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases, including summary data extracted from publications on candidate gene and GWAS studies. We use GAD for the development of a corpus on associations between genes and diseases (downloaded on January 21st, 2013).
- EU-ADR: The corpus has been annotated for drugs, disorders, genes, and their inter-relationships. For each of the drug-disorder, drug-target, and target-disorder relations, three experts annotated a set of 100 abstracts.
- PPI: This is a new, and much improved, binarization of BioInfer as reported in Heimonen et al., Complex-to-Pairwise Mapping of Biological Relationships using a Semantic Network Representation.
- ADE: Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports.
- TAC2017: The DDIExtraction2013 Shared Task focuses on extraction of drug-drug interactions.
- SMM4H: Fourth Social Media Mining for Health (#SMM4H) Shared Task at ACL 2019.
- ADRMine: Corpus from "Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts".
- PubMed Phrases: The dataset contains a collection of 705,915 PubMed Phrases (Kim et al., 2018) that are beneficial for information retrieval and human comprehension.
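
Several of the corpora above (for example NCBI Disease, BC5CDR, and tmVar) are commonly distributed in a PubTator-style plain-text layout: `PMID|t|title` and `PMID|a|abstract` lines, followed by tab-separated mention lines and, for relation corpora, tab-separated relation lines. The following is a minimal reading sketch under that assumption. The function name `parse_pubtator` and the sample record are made up for illustration and are not part of this repo; check the actual files you download before relying on it.

```python
from collections import defaultdict


def parse_pubtator(text):
    """Parse PubTator-style text into a dict keyed by PMID.

    Assumed layout (verify against the real corpus files):
      PMID|t|title
      PMID|a|abstract
      PMID<TAB>start<TAB>end<TAB>mention<TAB>type[<TAB>concept_id]
      PMID<TAB>relation<TAB>arg1<TAB>arg2
    """
    docs = defaultdict(
        lambda: {"title": "", "abstract": "", "mentions": [], "relations": []}
    )
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            pmid, kind, body = head
            docs[pmid]["title" if kind == "t" else "abstract"] = body
            continue
        fields = line.split("\t")
        if len(fields) >= 5 and fields[1].isdigit() and fields[2].isdigit():
            # Mention line: character offsets index into title + " " + abstract.
            docs[fields[0]]["mentions"].append(
                {
                    "start": int(fields[1]),
                    "end": int(fields[2]),
                    "text": fields[3],
                    "type": fields[4],
                    "concept": fields[5] if len(fields) > 5 else "",
                }
            )
        elif len(fields) == 4:
            # Relation line, e.g. a chemical-induced disease (CID) pair.
            docs[fields[0]]["relations"].append(tuple(fields[1:]))
    return dict(docs)


# Made-up record for illustration only; not taken from any corpus.
SAMPLE = (
    "12345|t|Aspirin-induced asthma\n"
    "12345|a|We report asthma after aspirin.\n"
    "12345\t0\t7\tAspirin\tChemical\tD001241\n"
    "12345\t16\t22\tasthma\tDisease\tD001249\n"
    "12345\tCID\tD001241\tD001249\n"
)

if __name__ == "__main__":
    for pmid, doc in parse_pubtator(SAMPLE).items():
        print(pmid, doc["title"])
        for m in doc["mentions"]:
            print("  mention:", m["text"], m["type"], m["concept"])
        for rel in doc["relations"]:
            print("  relation:", rel)
```

Corpora shipped in other formats (BioC XML, brat standoff, CSV) would need a different reader; this sketch only covers the plain-text layout described above.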