-
Notifications
You must be signed in to change notification settings - Fork 10
GBRS Pipeline ReadMe
NOTE: EMASE
and GBRS
are part of the same software package, and share functions between them. This workflow outputs from EMASE
and GBRS
, but they are derived from the same software package.
• Step 1: Map reads with BOWTIE, and convert SAM to BAM. Note, R1 and R2 are mapped separately for paired-end data.
• Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.
• Step 3: Paired-end data: Find common alignments and compress EMASE file.
• Step 3: Single-end data: Compress EMASE file.
• Step 4: Quantify multiway expression
• Step 5: Genotype reconstruction
• Step 6: Quantify diploid expression with GBRS
• Step 7: Interpolate genotypes and genotype probabilities
• Step 8: Plot inferred genotypes
• Step 9: Export genotype probabilities
flowchart TD
p0((Sample))
p3[BOWTIE_R1]
p3.5[BOWTIE_R1]
p4[BOWTIE_R2]
p5[GBRS_BAM2EMASE_R1]
p5.5[GBRS_BAM2EMASE_R1]
p6[GBRS_BAM2EMASE_R2]
p7[EMASE_GET_COMMON_ALIGNMENT]
p8[GBRS_COMPRESS_PE]
p8.5[GBRS_COMPRESS_SE]
p9[GBRS_QUANTIFY]
note1{Paired\nEnd\nData}
note2{Single\nEnd\nData}
o1([Compressed EMASE h5 File]):::output
o2([EMASE Isoform\nTPM Counts]):::output
o3([EMASE Isoform\nExpected Counts]):::output
o4([EMASE Isoform\nAlignment Counts]):::output
o5([EMASE Gene\nTPM Counts]):::output
o6([EMASE Gene\nExpected Counts]):::output
o7([EMASE Gene\nAlignment Counts]):::output
p10[GBRS_RECONSTRUCT]
p11[GBRS_QUANTIFY_GENOTYPES]
p12[GBRS_INTERPOLATE]
p13[GBRS_PLOT]
p14[GBRS_EXPORT]
o8([GBRS Isoform\nTPM Counts]):::output
o9([GBRS Isoform\nExpected Counts]):::output
o10([GBRS Isoform\nAlignment Counts]):::output
o11([GBRS Gene\nTPM Counts]):::output
o12([GBRS Gene\nExpected Counts]):::output
o13([GBRS Gene\nAlignment Counts]):::output
o14([Infered Genotypes\nby Gene]):::output
o15([Plotted Genome]):::output
o16([Interpolated Genoprobs]):::output
p0 -..-> |If Paired End Data| note1
p0 -..-> |If Single End Data| note2
subgraph paired_end [ ]
note1 --> p3
note1 --> p4
subgraph read_1 [ ]
p3 --> p5
end
subgraph read_2 [ ]
p4 --> p6
end
p5 --> p7
p6 --> p7
p7 --> p8
end
subgraph single_end [ ]
note2 --> p3.5
p3.5 --> p5.5
p5.5 --> p8.5
end
p8 --> o1
p8.5 --> o1
o1 --> p9
p9 --> o2
p9 --> o3
p9 --> o4
p9 --> o5
p9 --> o6
p9 --> o7
o5 --> p10
p10 --> o14
p10 --> p11
p10 --> p12
o1 --> p11
p12 --> p13
p12 --> p14
p11 --> o8
p11 --> o9
p11 --> o10
p11 --> o11
p11 --> o12
p11 --> o13
p13 --> o15
p14 --> o16
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style single_end stroke:#333,stroke-width:2px
style paired_end stroke:#333,stroke-width:2px
-
--pubdir
- Default:
/<PATH>
- Comment: The directory that the saved outputs will be stored.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
- Default:
-
--csv_input
- Default:
null
- Comment: Provide a CSV manifest file with the header: "sampleID,sex,generation,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided,
--download_data
can be specified.
- Default:
-
--sample_folder
- Default:
/<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- Default:
-
--extension
- Default:
.fastq.gz
- Comment: The expected extension for the input read files.
- Default:
-
--pattern
- Default:
"*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this
READ_NAME_R1_MoreText.fastq.gz
orREAD_NAME_R1.fastq.gz
- Default:
-
--read_type
- Default:
PE
- Comment: Options:
PE
andSE
. Type of reads: paired end (PE) or single end (SE).
- Default:
-
--concat_lanes
- Default:
false
- Comment: Options:
false
andtrue
. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- Default:
-
--concat_sampleID_delim
- Default:
"_"
- Comment: The delimited to split FASTQ file names.
- Default:
-
--concat_sampleID_positions
- Default:
"1"
- Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
- Default:
-
--bowtie_index
- Default:
/projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
- Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g.,
/path/to/bowtie.transcripts
where bowtie.transcripts.* are the full set of index files in the directory.
- Default:
-
--transcripts_info
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
- Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column. Note that 'length' is not used in this context. ONLY IDs are used from this file. Can be obtained from
prepare_emase
workflow (emase.fullTranscripts.info)
- Default:
-
--gbrs_strain_list
- Default:
"A,B,C,D,E,F,G,H"
- Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
- Default:
-
--gene2transcript_csv
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
- Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from
prepare_emase
workflow (emase.gene2transcripts.tsv)
- Default:
-
--full_transcript_info
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
- Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from
prepare_emase
workflow (emase.pooled.fullTranscripts.info)
- Default:
-
--emase_model
- Default:
'4'
- Comment: Options:(1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.
- Default:
-
--emission_prob_avecs
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/gbrs_emissions_all_tissues.avecs.npz
- Comment: The emission probability vector file.
- Default:
-
--trans_prob_dir
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/transition_probabilities
- Comment: A directory containing all transition probability files.
- Default:
-
--gene_position_file
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.gene_pos.ordered_ensBuild_105.npz
- Comment: A python compressed NPZ file containing gene position in primary reference coordinates.
- Default:
-
--genotype_grid
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.genome_grid.GRCm39.tsv
- Comment: Simulated marker grid used in genotype inference.
- Default:
-
--founder_hex_colors
- Default:
/projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/founder.hexcolor.info
- Comment: A file containing hexcode colors for plotting genotypes.
- Default:
-
--gbrs_expression_threshold
- Default:
1.5
- Comment: GBRS expression threshold limit required to quantify a gene.
- Default:
-
--gbrs_sigma
- Default:
0.12
- Comment: GBRS scaling factor.
- Default:
NOTE: *
Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
emase_report.html |
Nextflow autogenerated report |
trace.txt |
Nextflow trace of processes |
*/gbrs/*.emase.h5 |
EMASE/GBRS compressed BAM h5 file |
*/emase/*.isoforms.alignment_counts |
Multiway raw alignment counts for all transcript isoforms |
*/emase/*.isoforms.expected_read_counts |
Multiway expected read counts for all transcript isoforms |
*/emase/*.isoforms.tpm |
Multiway transcript per million (tpm) for all transcript isoforms |
*/emase/*.genes.alignment_counts |
Multiway raw alignment counts for all genes |
*/emase/*.genes.expected_read_counts |
Multiway expected read counts for all genes |
*/emase/*.genes.tpm |
Multiway transcript per million (tpm) for all genes |
*/gbrs/*.isoforms.alignment_counts |
Diploid raw alignment counts for all transcript isoforms |
*/gbrs/*.isoforms.expected_read_counts |
Diploid expected read counts for all transcript isoforms |
*/gbrs/*.isoforms.tpm |
Diploid transcript per million (tpm) for all transcript isoforms |
*/gbrs/*.genes.alignment_counts |
Diploid raw alignment counts for all genes |
*/gbrs/*.genes.expected_read_counts |
Diploid expected read counts for all genes |
*/gbrs/*.genes.tpm |
Diploid transcript per million (tpm) for all genes |
*/gbrs/*.interpolated.genoprobs.npz |
Interpolated genotype probabilities in python NPZ format |
*/gbrs/*.interpolated.genoprobs.tsv |
Interpolated genotype probabilities in tab delimited format |
*/gbrs/*.plotted.genome.pdf |
Plotted interpolated genotypes |
*/gbrs/*.genotypes.tsv |
Inferred genotypes for all genes |
*/stats/*.bowtie_R1.log |
Statistics output from read1 mapping from Bowtie |
*/stats/*.bowtie_R2.log |
Statistics output from read2 mapping from Bowtie, if data are paired end |
There are no optional outputs for this workflow.
The required input header is: sampleID,lane,fastq_1,fastq_2
. Samples can be provided either paired or un-paired.
- The
sampleID
column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID. - The
lane
column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis. - The
fastq_1
andfastq_2
columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
sampleID,sex,generation,lane,fastq_1,fastq_2
Sample_42,F,42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,F,42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,M,10,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,M,80,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,sex,generation,lane,fastq_1,fastq_2
Sample_42,F,42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,F,42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,M,10,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,M,80,Lane_1,/path/to/sample_10191_001_R1.fastq.gz