About the HDR UK Phenotype Portal |
Welcome to the HDR UK National Phenomics Resource Project, a project funded by Health Data Research UK.
When patients interact with physicians, or are admitted into hospital, information is collected electronically on their symptoms, diagnoses, laboratory test results, and prescriptions. This information is stored securely in Electronic Health Records (EHR) and is a valuable resource for researchers and clinicians for improving health and healthcare. However, EHR vary in detail and quality and contain many inconsistencies. As a result, researchers and data providers spend considerable time creating complex computer programs to fix and statistically analyse the information in EHR and identify which patients have which disease. Currently, there is no means to share these tools across institutions in the UK, resulting in duplication of effort. Reproducibility of research is also hampered, as others do not have access to the precise methods and definitions used in a particular study. This project addresses these issues by creating an open resource for EHR users (researchers, clinicians, the NHS and data providers) to share their methods.
Phenomics refers to the science of deriving new knowledge for health by studying multiple conditions in new ways. This involves studying all currently recognized diseases – so called ‘phenome wide’ approaches. In order to do this efficiently phenomics approaches require the creation of computable definitions of diseases, health states and traits, including temporal components of these (i.e. change and rate of change over time). It covers the full spectrum of health and disease across the entire life course and is relevant to a wide range of potential stakeholders and beneficiaries.
A primary reason for using data from EHR is the creation of phenotype algorithms to identify disease status, onset and progression. Phenotyping (describing the characteristics of disease) is challenging, however, as EHR data are collected for different purposes, have variable data quality and often require significant harmonisation. While considerable effort goes into these algorithms, there is no consistent methodology for creating and evaluating them and no centralised repository for depositing and sharing them.
We will create a national platform for dissemination of citable algorithms (incl. validations) and tools which will reduce duplication of effort and improve research reproducibility. We will explore methods for creating computable representations of algorithms for integration into actionable analytics for healthcare. Finally, we will fundamentally shift the EHR cultural landscape by a robust incentivisation programme, providing guidelines on best practices, cross-disciplinary training, and ensuring alignment with other international initiatives.
Through this project, we will deliver a fundamental step-change in the current EHR community in the UK by bringing together health data scientists, clinicians, computer scientists, public health experts and data curators under the FAIR principles (www.force11.org). The National Phenomics Resource will facilitate the dissemination and re-use of algorithms, tools and methods by the community. By establishing a national standard for creating, evaluating and representing phenotypes, we will accelerate the impact of discovery through increased transparency and replicability and maximise the usability and value of existing data repositories to new users. Finally, we will take the first steps towards establishing computational biomedical knowledge objects (e.g. guidelines with embedded phenotypes endorsed by NICE) which will enable the creation of actionable health analytics in the NHS.
When patients interact with physicians, or are admitted into hospital, information is collected electronically on their symptoms, diagnoses, laboratory test results, and prescriptions and stored in Electronic Health Records (EHR). EHR are a valuable resource for researchers and clinicians as they provide comprehensive information about a patient's health and healthcare over long periods of time.
A primary use-case for EHR is the creation of phenotyping algorithms used to identify disease status, onset and progression, or to extract information on risk factors or biomarkers. These complex algorithms enable researchers to extract information from EHR, statistically analyse it and use the findings to improve human health. While considerable effort goes into creating these algorithms, there is no consistent methodology for creating and evaluating them and no centralised repository for depositing and sharing them.
The HDR UK Phenotype Portal will facilitate the dissemination and re-use of algorithms, tools and methods by the community by providing a national resource for curating such algorithms and their associated metadata.
For an algorithm to be included in the Phenotype Portal, it must satisfy the following criteria:
- Define a disease (e.g. hypertension), lifestyle risk factor (e.g. smoking) or biomarker (e.g. blood pressure)
- Derive information from one or more electronic health record data sources. This can include national and local sources. The definition of EHR includes administrative data such as billing/claims data, and clinical audits.
- Have one or more peer-reviewed outputs associated with it e.g. journal publication, scientific conferences, policy white papers etc.
- Provide evidence of how the phenotyping algorithm was validated.
Phenotyping algorithms are stored in the Phenotype Portal using a combination of YAML, CSV and markdown files. There are two main components to each algorithm: a) the phenotype definition file (which is in YAML and markdown) and b) one or more terminology files (also known as codelists), which are stored as CSV files. The sections below provide information on their schema and contents.
All phenotype definition files associated with a phenotype use a common naming pattern:
AUTHORSURNAME_NAME_UUID.md
for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj.md
Phenotype files are stored in the _phenotypes directory.
Codelist files follow a similar pattern:
AUTHORSURNAME_NAME_UUID_TERMINOLOGY.csv
for example: axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv
Codelist files are stored in the codelists directory.
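As an illustration of the naming patterns above, the following Python sketch splits a file name into its components. It assumes that no component contains an underscore and that codelist file names carry the author surname, as the example file names do. The names `PortalFileName` and `parse_portal_filename` are hypothetical and not part of the Portal itself.

```python
from typing import NamedTuple, Optional


class PortalFileName(NamedTuple):
    """Components of a phenotype or codelist file name (hypothetical helper type)."""
    author: str
    name: str
    phenotype_id: str
    terminology: Optional[str]  # None for phenotype definition (.md) files


def parse_portal_filename(filename: str) -> PortalFileName:
    """Split a file name on underscores, assuming no component contains one."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {filename!r}")
    parts = stem.split("_")
    if extension == "md" and len(parts) == 3:
        # AUTHORSURNAME_NAME_UUID.md
        return PortalFileName(parts[0], parts[1], parts[2], None)
    if extension == "csv" and len(parts) == 4:
        # AUTHORSURNAME_NAME_UUID_TERMINOLOGY.csv (as in the example file names)
        return PortalFileName(parts[0], parts[1], parts[2], parts[3])
    raise ValueError(f"unexpected file name layout: {filename!r}")
```

For instance, parsing `axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv` would give the author `axson`, the name `bronchiestasis`, the identifier `ZckoXfUWNXn8Jn7fdLQuxj` and the terminology `ICD10`.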
The phenotype definition file is a markdown file with a YAML header. The YAML header is used to record metadata fields capturing information about the algorithm, the data sources, controlled clinical terminologies and other information.
For example, the code snippet below displays the metadata associated with the bronchiectasis phenotyping algorithm submitted by the HDR UK BREATHE Hub (you can view the raw file directly on the repository).
title: Bronchiestasis
name: Bronchiestasis
phenotype_id: ZckoXfUWNXn8Jn7fdLQuxj
type: Disease or Syndrome
group: Respiratory
data_sources:
- Clinical Practice Research Datalink GOLD
- Clinical Practice Research Datalink Aurum
- Hospital Episode Statistics APC for CPRD GOLD
- Hospital Episode Statistics APC for CPRD Aurum
- Death Registration data for CPRD GOLD
- Death Registration data for CPRD Aurum
- UK Biobank
clinical_terminologies:
- Read Version 2
- SNOMED-CT
- ICD-10
- ICD-11
validation:
- prognostic
codelists:
- axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD10.csv
- axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_ICD11.csv
- axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_SNOMEDCT.csv
- axson_bronchiestasis_ZckoXfUWNXn8Jn7fdLQuxj_UKBIOBANK.csv
valid_event_data_range: 01/01/2001 - 31/12/2019
sex:
- Female
- Male
author:
- Eleanor L Axson
- Jennifer K Quint
publications:
status: BETA
date: 2019-06-20
modified_date: 2019-06-20
version: 1
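A minimal Python sketch of separating the YAML header from the markdown body is shown here. It assumes the header is delimited by lines containing only `---`, as is conventional for YAML front matter; the document above does not state the delimiter, so treat this as an assumption. The function name `split_front_matter` is hypothetical.

```python
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (yaml_header, markdown_body) for a phenotype definition file.

    Assumes the header is enclosed by lines consisting only of '---'.
    If the file does not start with '---', the header is returned empty.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    raise ValueError("header opened with '---' but was never closed")
```

The returned header text could then be passed to a YAML parser (for example `yaml.safe_load` from the third-party PyYAML package) to obtain the metadata as a dictionary.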
The metadata fields required are the following:
- title (string): Phenotype (long) name
- name (string): Phenotype (short) name
- data_sources (list of strings): Names of the data sources that the phenotype draws information from. These should be identical, if possible, to the names used to identify individual datasets in the HDR Gateway.
- clinical_terminologies (list of strings): List of controlled clinical terminologies that are used by the phenotype algorithm.
- validation (list of strings): Evidence supporting the robustness of the phenotype. Valid values:
  - prognostic: the ability to replicate known prognostic associations
  - aetiologic: the ability to replicate known associations with risk factors
  - genetic: the ability to replicate associations with known regions or variants
  - cross-source: whether the algorithm has been evaluated in a similar external data source
  - casenote review: whether the algorithm has been validated through manual review of clinical notes (this usually results in PPV and NPV values)
  - cross-country: whether the algorithm has been evaluated in a similar external healthcare system
- codelists (list of strings): (unordered) list of CSV terminology files associated with the phenotype
- phenotype_id (string): Universally unique phenotype identifier, generated using the `shortuuid` Python module
- group (string): Disease group for phenotype
- valid_event_data_range (list of strings): DD/MM/YYYY date range for events
- sex (list of strings): list of sexes valid for the phenotype
- author (list of strings): list of phenotype authors
- publications (list of strings): list of publications
- status (string): 'DRAFT' or 'FINAL' status
- date (string): date created
- modified_date (string): date last modified
- version (integer): integer version of phenotype, default '1'
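The field list above can be turned into a simple pre-submission check. The sketch below is illustrative only and is not the Portal's own validation: it assumes the metadata has already been loaded into a Python dictionary, and the names `REQUIRED_FIELDS`, `VALID_VALIDATION` and `check_metadata` are hypothetical. It checks that each listed field is present, that `validation` entries come from the documented values, and that `valid_event_data_range` holds two DD/MM/YYYY dates in order.

```python
from datetime import datetime

# Field names taken from the list above.
REQUIRED_FIELDS = (
    "title", "name", "data_sources", "clinical_terminologies", "validation",
    "codelists", "phenotype_id", "group", "valid_event_data_range", "sex",
    "author", "publications", "status", "date", "modified_date", "version",
)

# Documented values for the 'validation' field.
VALID_VALIDATION = {
    "prognostic", "aetiologic", "genetic",
    "cross-source", "casenote review", "cross-country",
}


def check_metadata(meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in meta:
            problems.append(f"missing field: {field}")
    for value in meta.get("validation") or []:
        if value not in VALID_VALIDATION:
            problems.append(f"unknown validation value: {value}")
    date_range = meta.get("valid_event_data_range")
    if isinstance(date_range, str):
        try:
            # Dates use '/' internally, so '-' only separates start and end.
            start, end = (
                datetime.strptime(part.strip(), "%d/%m/%Y")
                for part in date_range.split("-")
            )
            if start > end:
                problems.append("valid_event_data_range: start is after end")
        except ValueError:
            problems.append("valid_event_data_range: expected 'DD/MM/YYYY - DD/MM/YYYY'")
    return problems
```

The check only tests for the presence of a key, so a field that is listed but left empty (as `publications` is in the example header) is not reported.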
Codelist files are specified as CSV files with one term per row - for example:
ICD-10 code,ICD-10 term
J47,Bronchiectasis
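A codelist in this layout can be read with Python's standard `csv` module. The sketch below assumes the first row is a header, as in the example above; the function name `read_codelist` is hypothetical.

```python
import csv
import io


def read_codelist(handle) -> list:
    """Read a codelist from an open text file, one dictionary per term row.

    Keys come from the header row (e.g. 'ICD-10 code', 'ICD-10 term').
    """
    return [dict(row) for row in csv.DictReader(handle)]


# Usage with the two-line example from the text, held in memory.
sample = io.StringIO("ICD-10 code,ICD-10 term\nJ47,Bronchiectasis\n")
terms = read_codelist(sample)
```

When reading from disk, opening the file with `newline=""` is the approach recommended by the `csv` module documentation.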
You can download a sample template file from the repository.
If you have a phenotyping algorithm that meets the eligibility requirements, we invite you to submit it in one of the following ways:
- by 📫 email s.denaxas at ucl.ac.uk
- by 🐙 pull request (PR) on the GitHub repository
- by ⬆ Google form