GBRS Pipeline ReadMe
Mike Lloyd edited this page Aug 11, 2023 · 11 revisions
NOTE: EMASE and GBRS are part of the same software package and share functions. This workflow produces outputs labeled EMASE and GBRS, but both are derived from that one package.
• Step 1: Map reads with Bowtie and convert SAM to BAM. Note: R1 and R2 are mapped separately for paired-end data.
• Step 2: Convert BAM to EMASE. Note: R1 and R2 are converted separately for paired-end data.
• Step 3: Paired-end data: Find common alignments and compress EMASE file.
• Step 3: Single-end data: Compress EMASE file.
• Step 4: Quantify multiway expression
• Step 5: Genotype reconstruction
• Step 6: Quantify diploid expression with GBRS
• Step 7: Interpolate genotypes and genotype probabilities
• Step 8: Plot inferred genotypes
• Step 9: Export genotype probabilities
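The step order above, including the paired-end versus single-end branch at Step 3, can be sketched as a small Python helper. This only illustrates the workflow logic and is not the pipeline's actual Nextflow code; the function name `pipeline_steps` is hypothetical.

```python
from typing import List


def pipeline_steps(read_type: str = "PE") -> List[str]:
    """Return the ordered step descriptions for 'PE' or 'SE' reads."""
    if read_type not in ("PE", "SE"):
        raise ValueError(f"read_type must be 'PE' or 'SE', got {read_type!r}")

    paired = read_type == "PE"
    # For paired-end data, Steps 1 and 2 run on R1 and R2 separately.
    per_mate = " (R1 and R2 separately)" if paired else ""
    # Step 3 is the only step whose content depends on the read type.
    step3 = (
        "Find common alignments and compress EMASE file"
        if paired
        else "Compress EMASE file"
    )
    return [
        "Map reads with Bowtie and convert SAM to BAM" + per_mate,
        "Convert BAM to EMASE" + per_mate,
        step3,
        "Quantify multiway expression",
        "Genotype reconstruction",
        "Quantify diploid expression with GBRS",
        "Interpolate genotypes and genotype probabilities",
        "Plot inferred genotypes",
        "Export genotype probabilities",
    ]


if __name__ == "__main__":
    for number, step in enumerate(pipeline_steps("SE"), start=1):
        print(f"Step {number}: {step}")
```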
- --pubdir
  - Default: /<PATH>
  - Comment: The directory in which saved outputs will be stored.
- --sample_folder
  - Default: /<PATH>
  - Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- --extension
  - Default: .fastq.gz
  - Comment: The expected extension for the input read files.
- --pattern
  - Default: "*_R{1,2}*"
  - Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz.
- --read_type
  - Default: PE
  - Comment: Options: PE and SE. Type of reads: paired end (PE) or single end (SE).
- --concat_lanes
  - Default: false
  - Comment: Options: false and true. If this boolean is specified, FASTQ files will be concatenated by sample. Use it when a sample's reads are divided across individual sequencing lanes.
- --concat_sampleID_delim
  - Default: "_"
  - Comment: The delimiter on which to split FASTQ file names.
- --concat_sampleID_positions
  - Default: "1"
  - Comment: The number of elements to keep after splitting on the chosen delimiter in the sample name.
- --bowtie_index
  - Default: /projects/compsci/omics_share/mouse/GRCm39/transcriptome/indices/imputed/rel_2112_v8/bowtie/bowtie.transcripts
  - Comment: Path to the bowtie index. Include the bowtie prefix in this path (e.g., /path/to/bowtie.transcripts, where bowtie.transcripts.* are the full set of index files in the directory).
- --transcripts_info
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.fullTranscripts.info
  - Comment: A file containing all transcript IDs. NOTE: These IDs must not contain haplotype IDs. This file must also have a 'length' column, although 'length' is not used in this context; ONLY IDs are used from this file. Can be obtained from the prepare_emase workflow (emase.fullTranscripts.info).
- --gbrs_strain_list
  - Default: "A,B,C,D,E,F,G,H"
  - Comment: A list of haplotype names corresponding to genomes used in hybrid genome construction (e.g., 'A,B,C,D,E,F,G,H').
- --gene2transcript_csv
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.gene2transcripts.tsv
  - Comment: A file containing all gene to transcript ID translations. NOTE: These IDs must not contain haplotype IDs. Can be obtained from the prepare_emase workflow (emase.gene2transcripts.tsv).
- --full_transcript_info
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/emase.pooled.fullTranscripts.info
  - Comment: A file containing all transcript IDs with transcript lengths. NOTE: These IDs must contain haplotype IDs. Can be obtained from the prepare_emase workflow (emase.pooled.fullTranscripts.info).
- --emase_model
  - Default: '4'
  - Comment: Options: (1, 2, 3, 4). 1: reads are apportioned among genes first, then between alleles, and then among isoforms. 2: reads are apportioned among genes first, then among isoforms, and then between alleles. 3: reads are apportioned among genes first, then among each isoform-allele combination.
- --emission_prob_avecs
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/gbrs_emissions_all_tissues.avecs.npz
  - Comment: The emission probability vector file.
- --trans_prob_dir
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/transition_probabilities
  - Comment: A directory containing all transition probability files.
- --gene_position_file
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.gene_pos.ordered_ensBuild_105.npz
  - Comment: A Python compressed NPZ file containing gene positions in primary reference coordinates.
- --genotype_grid
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/ref.genome_grid.GRCm39.tsv
  - Comment: Simulated marker grid used in genotype inference.
- --founder_hex_colors
  - Default: /projects/compsci/omics_share/mouse/GRCm39/supporting_files/emase_gbrs/rel_2112_v8/founder.hexcolor.info
  - Comment: A file containing hexcode colors for plotting genotypes.
- --gbrs_expression_threshold
  - Default: 1.5
  - Comment: GBRS expression threshold limit required to quantify a gene.
- --gbrs_sigma
  - Default: 0.12
  - Comment: GBRS scaling factor.
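The file-name parameters in the list above (--pattern, --concat_sampleID_delim, and --concat_sampleID_positions) can be illustrated with a short Python sketch. It emulates the glob brace expansion with the standard library and assumes that the sample ID is built from the first N delimiter-separated elements of the file name; the pipeline's own Nextflow logic may differ in detail. The helper names `expand_braces`, `matches_pattern`, and `sample_id`, and the example file names, are hypothetical.

```python
import fnmatch
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand {a,b,...} groups in a glob pattern into plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches_pattern(file_name: str, pattern: str = "*_R{1,2}*") -> bool:
    """Check a file name against a pattern such as the --pattern default."""
    return any(
        fnmatch.fnmatchcase(file_name, plain)
        for plain in expand_braces(pattern)
    )


def sample_id(file_name: str, delim: str = "_", positions: int = 1) -> str:
    """Assumption: keep the first `positions` elements after splitting on `delim`."""
    return delim.join(file_name.split(delim)[:positions])


if __name__ == "__main__":
    for name in ("READ_NAME_R1_MoreText.fastq.gz", "READ_NAME_R2.fastq.gz"):
        print(name, matches_pattern(name))
    print(sample_id("SAMPLEA_L001_R1.fastq.gz"))
    print(sample_id("MOUSE_01_L001_R1.fastq.gz", positions=2))
```

Under these assumptions, files from different lanes that share the same leading elements resolve to the same sample ID, which is what allows them to be concatenated when --concat_lanes is set.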
NOTE: * represents a wildcard: a placeholder for values that will be filled in from input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
emase_report.html | Nextflow autogenerated report |
trace.txt | Nextflow trace of processes |
*/gbrs/*.emase.h5 | EMASE/GBRS compressed BAM h5 file |
*/emase/*.isoforms.alignment_counts | Multiway raw alignment counts for all transcript isoforms |
*/emase/*.isoforms.expected_read_counts | Multiway expected read counts for all transcript isoforms |
*/emase/*.isoforms.tpm | Multiway transcript per million (tpm) for all transcript isoforms |
*/emase/*.genes.alignment_counts | Multiway raw alignment counts for all genes |
*/emase/*.genes.expected_read_counts | Multiway expected read counts for all genes |
*/emase/*.genes.tpm | Multiway transcript per million (tpm) for all genes |
*/gbrs/*.isoforms.alignment_counts | Diploid raw alignment counts for all transcript isoforms |
*/gbrs/*.isoforms.expected_read_counts | Diploid expected read counts for all transcript isoforms |
*/gbrs/*.isoforms.tpm | Diploid transcript per million (tpm) for all transcript isoforms |
*/gbrs/*.genes.alignment_counts | Diploid raw alignment counts for all genes |
*/gbrs/*.genes.expected_read_counts | Diploid expected read counts for all genes |
*/gbrs/*.genes.tpm | Diploid transcript per million (tpm) for all genes |
*/gbrs/*.interpolated.genoprobs.npz | Interpolated genotype probabilities in python NPZ format |
*/gbrs/*.interpolated.genoprobs.tsv | Interpolated genotype probabilities in tab delimited format |
*/gbrs/*.plotted.genome.pdf | Plotted interpolated genotypes |
*/gbrs/*.genotypes.tsv | Inferred genotypes for all genes |
*/stats/*.bowtie_R1.log | Statistics output from read1 mapping from Bowtie |
*/stats/*.bowtie_R2.log | Statistics output from read2 mapping from Bowtie, if data are paired end |
There are no optional outputs for this workflow.
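To check what the interpolated genotype probability NPZ output contains before loading it, one option is to list the arrays stored inside it. An NPZ file is a ZIP archive of .npy members, so the sketch below uses only the Python standard library. The function name `list_npz_arrays` is hypothetical, and the array names returned are whatever GBRS wrote; none are assumed here.

```python
import zipfile
from typing import List


def list_npz_arrays(npz_path: str) -> List[str]:
    """Return the names of the arrays stored in an NPZ archive."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is saved as a member called <array_name>.npy.
        return [
            member[: -len(".npy")]
            for member in archive.namelist()
            if member.endswith(".npy")
        ]
```

When NumPy is available, `numpy.load(path)` opens the same file and exposes these names through its `.files` attribute, with each array accessible by name.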