feat: adding unpublished studies (opentargets#290)
* feat: adding unpublished studies

* feat: updating all gwas catalog sources

* feat: generalizing study ingestion to accept list of files

* fix: removing unused configuration

* feat: adding unpublished ancestries as well

* fix: updating ancestry config name

* Apply suggestions from code review

Co-authored-by: Kirill Tsukanov <[email protected]>

---------

Co-authored-by: Kirill Tsukanov <[email protected]>
DSuveges and tskir authored Dec 5, 2023
1 parent bc5f2b3 commit 6118bb9
Showing 4 changed files with 32 additions and 16 deletions.
13 changes: 9 additions & 4 deletions config/datasets/gcp.yaml
@@ -11,10 +11,15 @@ anderson: gs://genetics-portal-input/v2g_input/andersson2014/enhancer_tss_associ
javierre: gs://genetics-portal-input/v2g_input/javierre_2016_preprocessed.parquet
jung: gs://genetics-portal-raw/pchic_jung2019/jung2019_pchic_tableS3.csv
thurman: gs://genetics-portal-input/v2g_input/thurman2012/genomewideCorrs_above0.7_promoterPlusMinus500kb_withGeneNames_32celltypeCategories.bed8.gz
catalog_associations: ${datasets.inputs}/v2d/gwas_catalog_v1.0.2-associations_e110_r2023-09-11.tsv
catalog_studies: ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2023-09-11.tsv
catalog_ancestries: ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2023-09-11.tsv
catalog_sumstats_lut: ${datasets.inputs}/v2d/harmonised_list-r2023-09-11.txt
catalog_associations: ${datasets.inputs}/v2d/gwas_catalog_v1.0.2-associations_e110_r2023-11-24.tsv
catalog_studies:
# To get a complete representation of all GWAS Catalog studies, we need to ingest the list of unpublished studies from a different file.
- ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2023-11-24.tsv
- ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-studies-r2023-11-24.tsv
catalog_ancestries:
- ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2023-11-24.tsv
- ${datasets.inputs}/v2d/gwas-catalog-v1.0.3-unpublished-ancestries-r2023-11-24.tsv
catalog_sumstats_lut: ${datasets.inputs}/v2d/harmonised_list-r2023-11-24a.txt
ukbiobank_manifest: gs://genetics-portal-input/ukb_phenotypes/neale2_saige_study_manifest.190430.tsv
l2g_gold_standard_curation: ${datasets.inputs}/l2g/gold_standard/curation.json
gene_interactions: ${datasets.inputs}/l2g/interaction # 23.09 data
4 changes: 2 additions & 2 deletions config/step/gwas_catalog.yaml
@@ -1,6 +1,6 @@
_target_: otg.gwas_catalog.GWASCatalogStep
catalog_studies_file: ${datasets.catalog_studies}
catalog_ancestry_file: ${datasets.catalog_ancestries}
catalog_study_files: ${datasets.catalog_studies}
catalog_ancestry_files: ${datasets.catalog_ancestries}
catalog_associations_file: ${datasets.catalog_associations}
catalog_sumstats_lut: ${datasets.catalog_sumstats_lut}
variant_annotation_path: ${datasets.variant_annotation}
12 changes: 6 additions & 6 deletions src/otg/gwas_catalog.py
@@ -23,8 +23,8 @@ class GWASCatalogStep:
Attributes:
session (Session): Session object.
catalog_studies_file (str): Raw GWAS catalog studies file.
catalog_ancestry_file (str): Ancestry annotations file from GWAS Catalog.
catalog_study_files (list[str]): List of raw GWAS Catalog studies files.
catalog_ancestry_files (list[str]): List of raw ancestry annotations files from GWAS Catalog.
catalog_sumstats_lut (str): GWAS Catalog summary statistics lookup table.
catalog_associations_file (str): Raw GWAS catalog associations file.
variant_annotation_path (str): Input variant annotation path.
@@ -35,8 +35,8 @@ class GWASCatalogStep:
"""

session: Session = MISSING
catalog_studies_file: str = MISSING
catalog_ancestry_file: str = MISSING
catalog_study_files: list[str] = MISSING
catalog_ancestry_files: list[str] = MISSING
catalog_sumstats_lut: str = MISSING
catalog_associations_file: str = MISSING
variant_annotation_path: str = MISSING
@@ -50,10 +50,10 @@ def __post_init__(self: GWASCatalogStep) -> None:
# Extract
va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)
catalog_studies = self.session.spark.read.csv(
self.catalog_studies_file, sep="\t", header=True
self.catalog_study_files, sep="\t", header=True
)
ancestry_lut = self.session.spark.read.csv(
self.catalog_ancestry_file, sep="\t", header=True
self.catalog_ancestry_files, sep="\t", header=True
)
sumstats_lut = self.session.spark.read.csv(
self.catalog_sumstats_lut, sep="\t", header=False
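Review note: the hunk above swaps a single path for a list of paths in `spark.read.csv`, which accepts either form and returns the rows of all listed files as one DataFrame. The sketch below mimics that behaviour with the standard library only, so it can be run without Spark. `read_tsv_files` and `demo` are hypothetical helpers, not part of `otg`, and the column names and accessions in the demo data are made up; the real published and unpublished files may not share the same columns.

```python
import csv
import tempfile
from pathlib import Path


def read_tsv_files(paths: list[str]) -> list[dict[str, str]]:
    """Read several tab-separated files with header rows into one list of rows.

    Hypothetical stand-in for passing a list of paths to a CSV reader:
    rows keep the order of the input files.
    """
    rows: list[dict[str, str]] = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            # Each file carries its own header line, consumed by DictReader.
            rows.extend(csv.DictReader(handle, delimiter="\t"))
    return rows


def demo() -> list[dict[str, str]]:
    """Write two tiny made-up study files and read them back together."""
    with tempfile.TemporaryDirectory() as tmp:
        published = Path(tmp) / "studies.tsv"
        unpublished = Path(tmp) / "unpublished-studies.tsv"
        published.write_text(
            "STUDY ACCESSION\tTRAIT\nGCST000001\theight\n", encoding="utf-8"
        )
        unpublished.write_text(
            "STUDY ACCESSION\tTRAIT\nGCST900001\tbmi\n", encoding="utf-8"
        )
        return read_tsv_files([str(published), str(unpublished)])
```

Calling `demo()` returns one row per study across both files, which is the effect the commit aims for when it lists the published and unpublished files together in the config.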
19 changes: 15 additions & 4 deletions utils/update_GWAS_Catalog_data.sh
@@ -55,17 +55,28 @@ wget -q ${RELEASE_URL}/gwas-catalog-download-studies-v1.0.3.txt \
-O gwas-catalog-v1.0.3-studies-r${YEAR}-${MONTH}-${DAY}.tsv
logging "File gwas-catalog-v1.0.3-studies-r${YEAR}-${MONTH}-${DAY}.tsv saved."

wget -q ${RELEASE_URL}/gwas-catalog-unpublished-studies-v1.0.3.tsv \
-O gwas-catalog-v1.0.3-unpublished-studies-r${YEAR}-${MONTH}-${DAY}.tsv
logging "File gwas-catalog-v1.0.3-unpublished-studies-r${YEAR}-${MONTH}-${DAY}.tsv saved."

wget -q ${RELEASE_URL}/gwas-catalog-download-ancestries-v1.0.3.txt \
-O gwas-catalog-v1.0.3-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv
logging "File gwas-catalog-v1.0.3-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv saved."

wget -q ${RELEASE_URL}/gwas-catalog-unpublished-ancestries-v1.0.3.tsv \
-O gwas-catalog-v1.0.3-unpublished-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv
logging "File gwas-catalog-v1.0.3-unpublished-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv saved."


wget -q ${BASE_URL}/summary_statistics/harmonised_list.txt -O harmonised_list-r${YEAR}-${MONTH}-${DAY}.txt
logging "File harmonised_list-r${YEAR}-${MONTH}-${DAY}.txt saved."

logging "Copying files to GCP..."
gsutil -q cp file://$(pwd)/gwas_catalog_v1.0.2-associations_e${ENSEMBL}_r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -q cp file://$(pwd)/gwas-catalog-v1.0.3-studies-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -q cp file://$(pwd)/gwas-catalog-v1.0.3-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -q cp file://$(pwd)/harmonised_list-r${YEAR}-${MONTH}-${DAY}.txt ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/gwas_catalog_v1.0.2-associations_e${ENSEMBL}_r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/gwas-catalog-v1.0.3-studies-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/gwas-catalog-v1.0.3-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/harmonised_list-r${YEAR}-${MONTH}-${DAY}.txt ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/gwas-catalog-v1.0.3-unpublished-studies-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/
gsutil -mq cp file://$(pwd)/gwas-catalog-v1.0.3-unpublished-ancestries-r${YEAR}-${MONTH}-${DAY}.tsv ${GCP_TARGET}/

logging "Done."
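
Review note: the script and `config/datasets/gcp.yaml` have to agree on file names, since the config points at whatever the script uploads. A minimal sketch of the naming scheme used by the `wget -O` targets above, assuming `YEAR`, `MONTH` and `DAY` are zero-padded as in `r2023-11-24`. `release_file_names` is a hypothetical helper for cross-checking names and is not part of the repository.

```python
from datetime import date


def release_file_names(release: date, ensembl: int) -> dict[str, str]:
    """Build the dated file names the update script saves for one release.

    Hypothetical helper: the patterns are copied from the script's
    `wget -O` targets; the dictionary keys are illustrative only.
    """
    stamp = release.strftime("%Y-%m-%d")  # e.g. 2023-11-24
    return {
        "associations": f"gwas_catalog_v1.0.2-associations_e{ensembl}_r{stamp}.tsv",
        "studies": f"gwas-catalog-v1.0.3-studies-r{stamp}.tsv",
        "unpublished_studies": f"gwas-catalog-v1.0.3-unpublished-studies-r{stamp}.tsv",
        "ancestries": f"gwas-catalog-v1.0.3-ancestries-r{stamp}.tsv",
        "unpublished_ancestries": f"gwas-catalog-v1.0.3-unpublished-ancestries-r{stamp}.tsv",
        "sumstats_lut": f"harmonised_list-r{stamp}.txt",
    }


names = release_file_names(date(2023, 11, 24), ensembl=110)
```

Comparing `names` with the paths in the config is a quick way to spot entries that the script would not produce under this scheme.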
