Skip to content

jvierstra/tensorqtl

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tensorQTL

tensorQTL is a GPU-enabled QTL mapper, achieving ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.

Note: this is a fork of the Broad tensorQTL repo with allowing for formulaic definition of QTL regression for arbitrary interaction terms. Variant filtering is a work in progress -- in this current implementation which focused on categorical interaction terms (i.e., cell type) we require at least 3 individuals for each genotype (0, 1, 2) for each categorical term. The filtering function can be found in the filter_term_samples function in core.py (line ~114). This can be customized for your particular dataset/use case.

To see how the interaction mode works please see this noteboook. Currently, there is no command-line implementation for interaction terms.

If you use tensorQTL in your research, please cite the following paper: Taylor-Weiner, Aguet, et al., Genome Biol. 20:228, 2019.

Empirical beta-approximated p-values are computed as described in FastQTL (Ongen et al., 2016).

Install

$ git clone [email protected]:jvierstra/tensorqtl.git

You can install tensorQTL using pip:

cd tensorflow
pip install .

or directly from this repository:

$ git clone [email protected]:jvierstra/tensorqtl.git
$ cd tensorqtl
# set up virtual environment and install
$ virtualenv venv
$ source venv/bin/activate
(venv)$ pip install -r install/requirements.txt .

Requirements

tensorQTL requires an environment configured with a GPU for optimal performance, but can also be run on a CPU. Instructions for setting up a virtual machine on Google Cloud Platform are provided here.

Input formats

Three inputs are required for QTL analyses with tensorQTL: genotypes, phenotypes, and covariates.

  • Phenotypes must be provided in BED format, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the center of the cis-window (usually the TSS), with start == end-1. A function for generating a BED template from a gene annotation in GTF format is available in pyqtl (io.gtf_to_tss_bed).

  • Covariates can be provided as a tab-delimited text file (covariates x samples) or dataframe (samples x covariates), with row and column headers.

  • Genotypes must be in PLINK format, which can be generated from a VCF as follows:

    plink2 --make-bed \
        --output-chr chrM \
        --vcf ${plink_prefix_path}.vcf.gz \
        --out ${plink_prefix_path}
    

    If using PLINK 1.9 or earlier, add the --keep-allele-order flag.

    Alternatively, the genotypes can be provided as a dataframe (genotypes x samples).

The examples notebook below contains examples of all input files. The input formats for phenotypes and covariates are identical to those used by FastQTL.

Examples

For examples illustrating cis- and trans-QTL mapping, please see tensorqtl_examples.ipynb.

Running tensorQTL

This section describes how to run the different modes of tensorQTL, both from the command line and within Python. For a full list of options, run

python3 -m tensorqtl --help

Loading input files

This section is only relevant when running tensorQTL in Python. The following imports are required:

import pandas as pd
import tensorqtl
from tensorqtl import genotypeio, cis, trans

Phenotypes and covariates can be loaded as follows:

phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(phenotype_bed_file)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T  # samples x covariates

Genotypes can be loaded as follows, where plink_prefix_path is the path to the VCF in PLINK format (excluding .bed/.bim/.fam extensions):

pr = genotypeio.PlinkReader(plink_prefix_path)
# load genotypes and variants into data frames
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]

To save memory when using genotypes for a subset of samples, a subset of samples can be loaded (this is not strictly necessary, since tensorQTL will select the relevant samples from genotype_df otherwise):

pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)

cis-QTL mapping: permutations

This is the main mode for cis-QTL mapping. It generates phenotype-level summary statistics with empirical p-values, enabling calculation of genome-wide FDR. In Python:

cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)
tensorqtl.calculate_qvalues(cis_df, qvalue_lambda=0.85)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis

${prefix} specifies the output file name.

cis-QTL mapping: summary statistics for all variant-phenotype pairs

In Python:

cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                prefix, covariates_df, output_dir='.')

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode cis_nominal

The results are written to a parquet file for each chromosome. These files can be read using pandas:

df = pd.read_parquet(file_name)

cis-QTL mapping: conditionally independent QTLs

This mode maps conditionally independent cis-QTLs using the stepwise regression procedure described in GTEx Consortium, 2017. The output from the permutation step (see map_cis above) is required. In Python:

indep_df = cis.map_independent(genotype_df, variant_df, cis_df,
                               phenotype_df, phenotype_pos_df, covariates_df)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --cis_output ${prefix}.cis_qtl.txt.gz \
    --mode cis_independent

cis-QTL mapping: interactions

Instead of mapping the standard linear model (p ~ g), this mode includes an interaction terms The formula can be specified as a 'R-style' using the patsy Python interface (see patsy for more information): p ~ g + term - 1. In this case -1, removes the intercept.

In Python:

res_full = cis.map_nominal_interactions(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                                phenotype_sample_df, formula=formula,
                                covariates_df=covariates_df, window=window, batch_size=1000,
                                write_output=False, prefix='qtl', output_dir='.', center=False, debug_n=-1)

The input options write_outputcontrol whether full summary statistics, respectively, are written to file (starting with prefix)

The argument center specifies wether coefficients are centered (only applied when a covariates matrix is provided).

batch_size can be changed if your get a 'CUDA out of memory error' (In this case, reduce the batch size).

debug_n can be used to test the first n phenotypes (vs. all) for debugging purposes.

The formula must specify p and g for phenotype and genotypes. These cannot occur as columns in the phenotype_sample_df dataframe.

Full summary statistics are saved as parquet files for each chromosome, in ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet. In this files, the columns b_g, b_g_se, pval_g are the effect size, standard error, and p-value of g in the model, with matching columns for i and gi.

The number of effective SNPs (m_eff) is computed for each phenotype using the eigenMT approach.

SSE (sum of squared error) and model degree of freeedom is returned for each variant-phenotype test in order to facilitate comparing nested models.

See this noteboook for a complete example of detecting QTLs with interaction terms.

trans-QTL mapping

This mode computes nominal associations between all phenotypes and genotypes. tensorQTL generates sparse output by default (associations with p-value < 1e-5). cis-associations are filtered out. The output is in parquet format, with four columns: phenotype_id, variant_id, pval, maf. In Python:

trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df,
                           return_sparse=True, pval_threshold=1e-5, maf_threshold=0.05,
                           batch_size=20000)
# remove cis-associations
trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=5000000)

Shell command:

python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
    --covariates ${covariates_file} \
    --mode trans

About

Ultrafast GPU-based QTL mapper

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 64.6%
  • Jupyter Notebook 34.1%
  • Other 1.3%