Do pics (opentargets#57)
* feat: experimental LD using hail

* feat: handling of the lower triangle

* docs: more comments explaining what's going on

* feat: non-working implementation of ld information

* feat: ld information based on precomputed index

* fix: no longer necessary

* feat: missing ld.py added

* docs: intervals functions examples

* fix: typo

* refactor: larger scale update in the GWAS Catalog data ingestion

* feat: precommit updated

* feat: first pics implementation derived from gnomAD LD information

* feat: modularise iterating over populations

* feat: finalizing GWAS Catalog ingestion steps

* fix: test fixed, due to changes in logic

* feat: ignoring .coverage file

* feat: modularised pics method

* feat: integrating pics

* chore: smoothing out bits

* feat: cleanup of the pics logic

No select statements, more concise functions and carrying over all required information

* fix: slight updates

* feat: map gnomad positions to ensembl positions LD

* fix: use cumsum instead of postprob

* feat: update studies schemas

* feat: working on integrating ingestion with pics

* feat: support for hail doctests

* test: _query_block_matrix example

* feat: ignore hail logs for testing

* feat: ignore TYPE_CHECKING blocks from testing

* feat: pics benchmark notebook (co-authored by Irene)

* feat: new finding on r > 1

* docs: Explanation about the coordinate shift

* test: several pics doctests

* test: fixed test for log neg pval

* style: remove variables only used for return

* feat: parametrise liftover chain

* refactor: consolidating some code

* feat: finishing pics

* chore: adding tests to effect harmonization

* chore: adding more tests

* feat: benchmarking new dataset

* fix: resolving minor bugs around pics

* Apply suggestions from code review

Co-authored-by: Irene López <[email protected]>

* fix: applying review comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: addressing some more review comments

* fix: update GnomAD join

* fix: bug sorted out in pics filtering

* feat: abstracting QC flagging

* feat: solving clumping

* fix: clumping is now complete

Co-authored-by: David <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: Irene López <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
5 people authored Jan 16, 2023
1 parent 8c5c532 commit 7e42f94
Showing 27 changed files with 5,716 additions and 807 deletions.
4 changes: 4 additions & 0 deletions .coveragerc
@@ -0,0 +1,4 @@
[report]
exclude_lines =
    pragma: no cover
    if TYPE_CHECKING:
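
The `if TYPE_CHECKING:` rule above is easiest to see in a small module. The following is a minimal sketch, not code from this commit: `describe` and the `Decimal` import are hypothetical names chosen for illustration. At runtime `typing.TYPE_CHECKING` is `False`, so the guarded import never executes and would otherwise be reported as uncovered.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime,
    # which is why the .coveragerc rule drops this block from the report.
    from decimal import Decimal


def describe(value: "Decimal") -> str:
    """Return a short label; the quoted annotation avoids a runtime import."""
    return f"value={value}"
```

Because setting `exclude_lines` replaces coverage.py's default exclusion list, the default `pragma: no cover` pattern has to be listed again alongside the new rule.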
33 changes: 32 additions & 1 deletion Makefile
@@ -1,6 +1,6 @@
PROJECT_ID ?= open-targets-genetics-dev
REGION ?= europe-west1
-CLUSTER_NAME ?= il-coloc
+CLUSTER_NAME ?= ${USER}-genetics-etl
PROJECT_NUMBER ?= $$(gcloud projects list --filter=${PROJECT_ID} --format="value(PROJECT_NUMBER)")
APP_NAME ?= $$(cat pyproject.toml| grep name | cut -d" " -f3 | sed 's/"//g')
VERSION_NO ?= $$(poetry version --short)
@@ -24,13 +24,27 @@ setup-dev: ## Setup dev environment

build: clean ## Build Python Package with Dependencies
@echo "Packaging Code and Dependencies for ${APP_NAME}-${VERSION_NO}"
@rm -rf ./dist
@poetry build
@cp ./src/*.py ./dist
@poetry run python ./utils/configure.py --cfg job > ./dist/config.yaml
@echo "Uploading to Dataproc"
@gsutil cp ./dist/${APP_NAME}-${VERSION_NO}-py3-none-any.whl gs://genetics_etl_python_playground/initialisation/
@gsutil cp ./utils/initialise_cluster.sh gs://genetics_etl_python_playground/initialisation/

prepare_pics: ## Create cluster for variant annotation
gcloud dataproc clusters create ${CLUSTER_NAME} \
--image-version=2.0 \
--project=${PROJECT_ID} \
--region=${REGION} \
--master-machine-type=n1-highmem-96 \
--enable-component-gateway \
--num-master-local-ssds=1 \
--master-local-ssd-interface=NVME \
--metadata="PACKAGE=gs://genetics_etl_python_playground/initialisation/${APP_NAME}-${VERSION_NO}-py3-none-any.whl" \
--initialization-actions=gs://genetics_etl_python_playground/initialisation/initialise_cluster.sh \
--single-node \
--max-idle=10m

prepare_variant_annotation: ## Create cluster for variant annotation
gcloud dataproc clusters create ${CLUSTER_NAME} \
@@ -150,6 +164,23 @@ run_gwas: ## Ingest gwas dataset on a dataproc cluster
gcloud dataproc jobs submit pyspark ./dist/run_gwas_ingest.py \
--cluster=${CLUSTER_NAME} \
--files=./dist/config.yaml \
--properties='spark.jars=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.executor.extraClassPath=./hail-all-spark.jar,spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator' \
--py-files=gs://genetics_etl_python_playground/initialisation/${APP_NAME}-${VERSION_NO}-py3-none-any.whl \
--project=${PROJECT_ID} \
--region=${REGION}

run_pics: ## Run pics method
gcloud dataproc jobs submit pyspark ./dist/pics_experiment.py \
--cluster=${CLUSTER_NAME} \
--files=./dist/config.yaml \
--properties='spark.jars=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.executor.extraClassPath=./hail-all-spark.jar,spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator' \
--project=${PROJECT_ID} \
--region=${REGION}

run_precompute_ld_index: ## Precompute ld-index information
gcloud dataproc jobs submit pyspark ./dist/run_precompute_ld_indexes.py \
--cluster=${CLUSTER_NAME} \
--files=./dist/config.yaml \
--properties='spark.jars=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar,spark.executor.extraClassPath=./hail-all-spark.jar,spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator' \
--project=${PROJECT_ID} \
--region=${REGION}
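
The same long `--properties` value appears in the `run_gwas`, `run_pics` and `run_precompute_ld_index` targets. As a reading aid, the sketch below rebuilds that value from its five settings and splits it back into key/value pairs. `HAIL_JAR` and `parse_properties` are illustrative names, not part of the Makefile or of gcloud.

```python
# Path of the Hail jar as it appears in the Makefile targets above.
HAIL_JAR = (
    "/opt/conda/miniconda3/lib/python3.8/site-packages/"
    "hail/backend/hail-all-spark.jar"
)

# The comma-separated value handed to --properties, one setting per entry.
properties = ",".join(
    [
        f"spark.jars={HAIL_JAR}",
        f"spark.driver.extraClassPath={HAIL_JAR}",
        "spark.executor.extraClassPath=./hail-all-spark.jar",
        "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator",
    ]
)


def parse_properties(value: str) -> dict:
    """Split a 'key=value,key=value' string into a dictionary."""
    return dict(item.split("=", 1) for item in value.split(","))


for key, setting in parse_properties(properties).items():
    print(f"{key:32} {setting}")
```

Read this way, the string sets the Hail jar for the driver and the executors and switches Spark to Kryo serialization with Hail's registrator.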
50 changes: 46 additions & 4 deletions configs/etl/reference.yaml
@@ -83,15 +83,57 @@ variant_index:

gwas_ingest:
inputs:
-variant_annotation: gs://ot-team/dsuveges/variant_annotation/2022-08-11
-gwas_catalog_associations: ${etl.inputs}/v2d/gwas_catalog_v1.0.2-associations_e107_r2022-09-14.tsv
-gwas_catalog_studies: ${etl.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2022-09-14.tsv
-gwas_catalog_ancestries: ${etl.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2022-09-14.tsv
+variant_annotation: ${etl.outputs}/variant_annotation
gnomad_populations:
- id: afr
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.afr.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.afr.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.afr.common.ld.variant_indices_2mb.parquet
- id: amr
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.amr.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.amr.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.amr.common.ld.variant_indices_2mb.parquet
- id: asj
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.asj.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.asj.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.asj.common.ld.variant_indices_2mb.parquet
- id: eas
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.eas.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.eas.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.eas.common.ld.variant_indices_2mb.parquet
- id: fin
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.fin.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.fin.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.fin.common.ld.variant_indices_2mb.parquet
- id: nfe
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nfe.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.nfe.common.ld.variant_indices_2mb.parquet
- id: est
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.est.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.est.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.est.common.ld.variant_indices_2mb.parquet
- id: nwe
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nwe.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.nwe.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.nwe.common.ld.variant_indices_2mb.parquet
- id: seu
index: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.seu.common.ld.variant_indices.ht
matrix: gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.seu.common.adj.ld.bm
parsed_index: ${etl.inputs}/ld/gnomad_r2.1.1.seu.common.ld.variant_indices_2mb.parquet
gwas_catalog_associations: ${etl.inputs}/v2d/gwas_catalog_v1.0.2-associations_e107_r2022-11-29.tsv
gwas_catalog_studies: ${etl.inputs}/v2d/gwas-catalog-v1.0.3-studies-r2022-11-29.tsv
gwas_catalog_ancestries: ${etl.inputs}/v2d/gwas-catalog-v1.0.3-ancestries-r2022-11-29.tsv
summary_stats_list: ${etl.inputs}/v2d/harmonised_list.txt
grch37_to_grch38_chain: gs://hail-common/references/grch37_to_grch38.over.chain.gz
outputs:
gwas_catalog_associations: ${etl.outputs}/gwas_catalog_associations
gwas_catalog_studies: ${etl.outputs}/gwas_catalog_studies
pics_credible_set: ${etl.outputs}/pics_credible_set
parameters:
ingest_unpublished_studies: False
p_value_cutoff: 5e-8
ld_window: 2_000_000
min_r2: 0.5
k: 6.4 # Empiric constant that can be adjusted to fit the curve, 6.4 recommended.
machine: ${machine.default}
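
The `min_r2` and `k` parameters above feed the PICS calculation. The PICS source added by this commit is not shown in this excerpt, so the following is only a sketch of the commonly described PICS formulation, written under stated assumptions: the function names are hypothetical, r2 is clamped to [0, 1] (the commit log mentions finding r > 1), and a lead variant with r2 = 1 is given a relative posterior of 1. Posteriors are normalised, ranked, and accumulated into a running sum from which a credible set could be cut.

```python
import math
from typing import Dict, List, Tuple


def pics_relative_posterior(neglog_p: float, r2: float, k: float = 6.4) -> float:
    """Relative (unnormalised) PICS posterior for one variant in LD with the lead."""
    r2 = min(max(r2, 0.0), 1.0)  # assumption: clamp out-of-range LD values
    mu = neglog_p * r2  # expected -log10(p) given LD with the lead
    sd = math.sqrt(1.0 - math.sqrt(r2) ** k) * math.sqrt(neglog_p) / 2.0
    if sd == 0.0:
        return 1.0  # perfect LD: treated like the lead variant itself
    # Two-sided normal tail: 2 * P(X > neglog_p) with X ~ N(mu, sd)
    return math.erfc((neglog_p - mu) / (sd * math.sqrt(2.0)))


def pics_posteriors(
    neglog_p: float, ld: Dict[str, float], min_r2: float = 0.5, k: float = 6.4
) -> List[Tuple[str, float, float]]:
    """Return (variant, posterior, cumulative posterior), ranked by posterior."""
    relative = {
        variant: pics_relative_posterior(neglog_p, r2, k)
        for variant, r2 in ld.items()
        if r2 >= min_r2  # drop weakly linked variants, as min_r2 suggests
    }
    total = sum(relative.values())
    ranked = sorted(relative.items(), key=lambda item: item[1], reverse=True)
    results, cumulative = [], 0.0
    for variant, value in ranked:
        posterior = value / total
        cumulative += posterior
        results.append((variant, posterior, cumulative))
    return results


# Toy input: lead variant with -log10(p) = 10 and two LD partners.
for row in pics_posteriors(10.0, {"lead": 1.0, "a": 0.8, "b": 0.3}):
    print(row)
```

In this toy input, variant `b` falls below `min_r2` and is dropped before normalisation. `p_value_cutoff` and `ld_window` act earlier in such a pipeline (selecting lead associations and the LD search window) and are not modelled here.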