Skip to content

GBRS Pipeline ReadMe

Mike Lloyd edited this page Aug 11, 2023 · 11 revisions

GBRS Documentation

NOTE: EMASE and GBRS are part of the same software package, and share functions between them. This workflow outputs from EMASE and GBRS, but they are derived from the same software package.

GBRS Pipeline (--workflow gbrs)

•   Step 1: Map reads with BOWTIE, and convert SAM to BAM. Note, R1 and R2 are mapped separately for paired-end data.  
•   Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.  
•   Step 3: Paired-end data: Find common alignments and compress EMASE file.  
•   Step 3: Single-end data: Compress EMASE file.  
•   Step 4: Quantify multiway expression  
•   Step 5: Genotype reconstruction   
•   Step 6: Quantify diploid expression with GBRS   
•   Step 7: Interpolate genotypes and genotype probabilities    
•   Step 8: Plot inferred genotypes  
•   Step 9: Export genotype probabilities      

Parameters for GBRS Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --concat_sampleID_delim

    • Default: "_"
    • Comment: The delimited to split FASTQ file names.
  • --concat_sampleID_positions

    • Default: "1"
    • Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
  • --bowtie_index

    • Default: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
    • Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g., /path/to/bowtie.transcripts where bowtie.transcripts.* are the full set of index files in the directory.
  • --transcripts_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
    • Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column. Note that 'length' is not used in this context. ONLY IDs are used from this file. Can be obtained from prepare_emase workflow (emase.fullTranscripts.info)
  • --gbrs_strain_list

    • Default: "A,B,C,D,E,F,G,H"
    • Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
  • --gene2transcript_csv

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
    • Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.gene2transcripts.tsv)
  • --full_transcript_info

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
    • Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from prepare_emase workflow (emase.pooled.fullTranscripts.info)
  • --emase_model

    • Default: '4'
    • Comment: Options:(1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.
  • --emission_prob_avecs

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/gbrs_emissions_all_tissues.avecs.npz
    • Comment: The emission probability vector file.
  • --trans_prob_dir

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/transition_probabilities
    • Comment: A directory containing all transition probability files.
  • --gene_position_file

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.gene_pos.ordered_ensBuild_105.npz
    • Comment: A python compressed NPZ file containing gene position in primary reference coordinates.
  • --genotype_grid

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.genome_grid.GRCm39.tsv
    • Comment: Simulated marker grid used in genotype inference.
  • --founder_hex_colors

    • Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/founder.hexcolor.info
    • Comment: A file containing hexcode colors for plotting genotypes.
  • --gbrs_expression_threshold

    • Default: 1.5
    • Comment: GBRS expression threshold limit required to quantify a gene.
  • --gbrs_sigma

    • Default: 0.12
    • Comment: GBRS scaling factor.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
emase_report.html Nextflow autogenerated report
trace.txt Nextflow trace of processes
*/gbrs/*.emase.h5 EMASE/GBRS compressed BAM h5 file
*/emase/*.isoforms.alignment_counts Multiway raw alignment counts for all transcript isoforms
*/emase/*.isoforms.expected_read_counts Multiway expected read counts for all transcript isoforms
*/emase/*.isoforms.tpm Multiway transcript per million (tpm) for all transcript isoforms
*/emase/*.genes.alignment_counts Multiway raw alignment counts for all genes
*/emase/*.genes.expected_read_counts Multiway expected read counts for all genes
*/emase/*.genes.tpm Multiway transcript per million (tpm) for all genes
*/gbrs/*.isoforms.alignment_counts Diploid raw alignment counts for all transcript isoforms
*/gbrs/*.isoforms.expected_read_counts Diploid expected read counts for all transcript isoforms
*/gbrs/*.isoforms.tpm Diploid transcript per million (tpm) for all transcript isoforms
*/gbrs/*.genes.alignment_counts Diploid raw alignment counts for all genes
*/gbrs/*.genes.expected_read_counts Diploid expected read counts for all genes
*/gbrs/*.genes.tpm Diploid transcript per million (tpm) for all genes
*/gbrs/*.interpolated.genoprobs.npz Interpolated genotype probabilities in python NPZ format
*/gbrs/*.interpolated.genoprobs.tsv Interpolated genotype probabilities in tab delimited format
*/gbrs/*.plotted.genome.pdf Plotted interpolated genotypes
*/gbrs/*.genotypes.tsv Inferred genotypes for all genes
*/stats/*.bowtie_R1.log Statistics output from read1 mapping from Bowtie
*/stats/*.bowtie_R2.log Statistics output from read2 mapping from Bowtie, if data are paired end

Pipeline Options Outputs

There are no optional outputs for this workflow.

Clone this wiki locally