GitHub - jvierstra/tensorqtl: Ultrafast GPU-based QTL mapper

tensorQTL

tensorQTL is a GPU-enabled QTL mapper, achieving ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.

Note: this is a fork of the Broad tensorQTL repo with allowing for formulaic definition of QTL regression for arbitrary interaction terms. Variant filtering is a work in progress -- in this current implementation which focused on categorical interaction terms (i.e., cell type) we require at least 3 individuals for each genotype (0, 1, 2) for each categorical term. The filtering function can be found in the filter_term_samples function in core.py (line ~114). This can be customized for your particular dataset/use case.

To see how the interaction mode works please see this noteboook. Currently, there is no command-line implementation for interaction terms.

If you use tensorQTL in your research, please cite the following paper: Taylor-Weiner, Aguet, et al., Genome Biol. 20:228, 2019.

Empirical beta-approximated p-values are computed as described in FastQTL (Ongen et al., 2016).

Install

$ git clone [email protected]:jvierstra/tensorqtl.git

You can install tensorQTL using pip:

cd tensorflow
pip install .

or directly from this repository:

$ git clone [email protected]:jvierstra/tensorqtl.git
$ cd tensorqtl
# set up virtual environment and install
$ virtualenv venv
$ source venv/bin/activate
(venv)$ pip install -r install/requirements.txt .

Requirements

tensorQTL requires an environment configured with a GPU for optimal performance, but can also be run on a CPU. Instructions for setting up a virtual machine on Google Cloud Platform are provided here.

Input formats

Three inputs are required for QTL analyses with tensorQTL: genotypes, phenotypes, and covariates.

Phenotypes must be provided in BED format, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the center of the cis-window (usually the TSS), with start == end-1. A function for generating a BED template from a gene annotation in GTF format is available in pyqtl (io.gtf_to_tss_bed).
Covariates can be provided as a tab-delimited text file (covariates x samples) or dataframe (samples x covariates), with row and column headers.
Genotypes must be in PLINK format, which can be generated from a VCF as follows:
```
plink2 --make-bed \
    --output-chr chrM \
    --vcf ${plink_prefix_path}.vcf.gz \
    --out ${plink_prefix_path}
```
If using PLINK 1.9 or earlier, add the --keep-allele-order flag.

Alternatively, the genotypes can be provided as a dataframe (genotypes x samples).

The examples notebook below contains examples of all input files. The input formats for phenotypes and covariates are identical to those used by FastQTL.

Examples

For examples illustrating cis- and trans-QTL mapping, please see tensorqtl_examples.ipynb.

Running tensorQTL

This section describes how to run the different modes of tensorQTL, both from the command line and within Python. For a full list of options, run

python3 -m tensorqtl --help

Loading input files

This section is only relevant when running tensorQTL in Python. The following imports are required:

import pandas as pd
import tensorqtl
from tensorqtl import genotypeio, cis, trans

Phenotypes and covariates can be loaded as follows:

phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(phenotype_bed_file)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T  # samples x covariates

Genotypes can be loaded as follows, where plink_prefix_path is the path to the VCF in PLINK format (excluding .bed/.bim/.fam extensions):

pr = genotypeio.PlinkReader(plink_prefix_path)
# load genotypes and variants into data frames
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]

To save memory when using genotypes for a subset of samples, a subset of samples can be loaded (this is not strictly necessary, since tensorQTL will select the relevant samples from genotype_df otherwise):

pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)

cis-QTL mapping: permutations

This is the main mode for cis-QTL mapping. It generates phenotype-level summary statistics with empirical p-values, enabling calculation of genome-wide FDR. In Python:

cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)
tensorqtl.calculate_qvalues(cis_df, qvalue_lambda=0.85)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis

${prefix} specifies the output file name.

cis-QTL mapping: summary statistics for all variant-phenotype pairs

In Python:

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                prefix, covariates_df, output_dir='.')

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis_nominal

The results are written to a parquet file for each chromosome. These files can be read using pandas:

df = pd.read_parquet(file_name)

cis-QTL mapping: conditionally independent QTLs

This mode maps conditionally independent cis-QTLs using the stepwise regression procedure described in GTEx Consortium, 2017. The output from the permutation step (see map_cis above) is required. In Python:

indep_df = cis.map_independent(genotype_df, variant_df, cis_df,
                               phenotype_df, phenotype_pos_df, covariates_df)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --cis_output ${prefix}.cis_qtl.txt.gz \
    --mode cis_independent

cis-QTL mapping: interactions

Instead of mapping the standard linear model (p ~ g), this mode includes an interaction terms The formula can be specified as a 'R-style' using the patsy Python interface (see patsy for more information): p ~ g + term - 1. In this case -1, removes the intercept.

In Python:

res_full = cis.map_nominal_interactions(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                                phenotype_sample_df, formula=formula,
                                covariates_df=covariates_df, window=window, batch_size=1000,
                                write_output=False, prefix='qtl', output_dir='.', center=False, debug_n=-1)

The input options write_outputcontrol whether full summary statistics, respectively, are written to file (starting with prefix)

The argument center specifies wether coefficients are centered (only applied when a covariates matrix is provided).

batch_size can be changed if your get a 'CUDA out of memory error' (In this case, reduce the batch size).

debug_n can be used to test the first n phenotypes (vs. all) for debugging purposes.

The formula must specify p and g for phenotype and genotypes. These cannot occur as columns in the phenotype_sample_df dataframe.

Full summary statistics are saved as parquet files for each chromosome, in ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet. In this files, the columns b_g, b_g_se, pval_g are the effect size, standard error, and p-value of g in the model, with matching columns for i and gi.

The number of effective SNPs (m_eff) is computed for each phenotype using the eigenMT approach.

SSE (sum of squared error) and model degree of freeedom is returned for each variant-phenotype test in order to facilitate comparing nested models.

See this noteboook for a complete example of detecting QTLs with interaction terms.

trans-QTL mapping

This mode computes nominal associations between all phenotypes and genotypes. tensorQTL generates sparse output by default (associations with p-value < 1e-5). cis-associations are filtered out. The output is in parquet format, with four columns: phenotype_id, variant_id, pval, maf. In Python:

trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df,
                           return_sparse=True, pval_threshold=1e-5, maf_threshold=0.05,
                           batch_size=20000)
# remove cis-associations
trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=5000000)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode trans

Name		Name	Last commit message	Last commit date
Latest commit History 239 Commits
example		example
install		install
tensorqtl		tensorqtl
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tensorQTL

Install

Requirements

Input formats

Examples

Running tensorQTL

Loading input files

cis-QTL mapping: permutations

cis-QTL mapping: summary statistics for all variant-phenotype pairs

cis-QTL mapping: conditionally independent QTLs

cis-QTL mapping: interactions

trans-QTL mapping

About

Releases

Packages

Languages

License

jvierstra/tensorqtl

Folders and files

Latest commit

History

Repository files navigation

tensorQTL

Install

Requirements

Input formats

Examples

Running tensorQTL

Loading input files

cis-QTL mapping: permutations

cis-QTL mapping: summary statistics for all variant-phenotype pairs

cis-QTL mapping: conditionally independent QTLs

cis-QTL mapping: interactions

trans-QTL mapping

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages