Skip to content

Commit

Permalink
Deployed from Rework Hi-C workflow proposal acb94d7
Browse files Browse the repository at this point in the history
  • Loading branch information
stjudecloud-cloudy committed Apr 4, 2024
1 parent 53ca036 commit 1751035
Show file tree
Hide file tree
Showing 49 changed files with 679 additions and 577 deletions.
32 changes: 16 additions & 16 deletions 0001-rnaseq-workflow-v2.0.0.html
Original file line number Diff line number Diff line change
Expand Up @@ -250,13 +250,13 @@ <h2 id="gene-model-post-processing"><a class="header" href="#gene-model-post-pro
<p>Originally, I had posed this question to the group:</p>
<blockquote>
<ul>
<li>Previously, we were filtering out anything not matching &quot;level 1&quot; or &quot;level 2&quot; from the gene model. This was due to best practices outlined during our RNA-Seq Workflow v1.0 discussions. I propose we revert this for the following reasons:
<li>Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model. This was due to best practices outlined during our RNA-Seq Workflow v1.0 discussions. I propose we revert this for the following reasons:
<ul>
<li>The first sentence in section 2.2.2 of the <a href="https://github.com/alexdobin/STAR/blob/2.7.1a/doc/STARmanual.pdf">STAR 2.7.1.a manual</a>: &quot;The use of the most comprehensive annotations for a given species is strongly recommended&quot;. So it seems the author recommends you use the most comprehensive gene model.</li>
<li>Here is what <a href="https://www.gencodegenes.org/pages/faq.html">the GENCODE FAQ</a> has to say about the level 3 annotations: &quot;Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found&quot;. Given that the GENCODE geneset is the union of automated annotations from the <code>Ensembl-genebuild</code> and manual curation of the <code>Ensembl-Havana</code> team, this level should be harmless in the event that levels 1 &amp; 2 don't apply.</li>
<li>The first sentence in section 2.2.2 of the <a href="https://github.com/alexdobin/STAR/blob/2.7.1a/doc/STARmanual.pdf">STAR 2.7.1.a manual</a>: "The use of the most comprehensive annotations for a given species is strongly recommended". So it seems the author recommends you use the most comprehensive gene model.</li>
<li>Here is what <a href="https://www.gencodegenes.org/pages/faq.html">the GENCODE FAQ</a> has to say about the level 3 annotations: "Ensembl loci where they are different from the Havana annotation or where no Havana annotation can be found". Given that the GENCODE geneset is the union of automated annotations from the <code>Ensembl-genebuild</code> and manual curation of the <code>Ensembl-Havana</code> team, this level should be harmless in the event that levels 1 &amp; 2 don't apply.</li>
<li>Last, the various other pipelines in the community don't tend to remove these features:
<ul>
<li>The <a href="https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-command-line-parameters">GDC documentation</a> does not currently describe any filtering of their <a href="https://www.gencodegenes.org/human/release_22.html">GENCODE v22</a> GTF (see the section labeled &quot;Step 1: Building the STAR index&quot;).</li>
<li>The <a href="https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-command-line-parameters">GDC documentation</a> does not currently describe any filtering of their <a href="https://www.gencodegenes.org/human/release_22.html">GENCODE v22</a> GTF (see the section labeled "Step 1: Building the STAR index").</li>
<li>ENCODE is not filtering any features, they are just changing some of the contig names in the <a href="https://www.gencodegenes.org/human/release_24.html">GENCODE v24</a> GTF (see the analysis <a href="#ENCODE-GTF-generation">in the appendix</a>).</li>
<li>The TOPMed pipeline <a href="https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md#reference-annotation">does outline some postprocessing</a> they are doing to the <a href="https://www.gencodegenes.org/human/release_26.html">GENCODE v26</a> GTF, but it's mostly just collapsing exons. I inspected the script, which does have some functionality built in to blacklist a list of transcript IDs. However, the documentation does not specify that this is turned on by default.</li>
</ul>
Expand Down Expand Up @@ -306,20 +306,20 @@ <h2 id="reference-files"><a class="header" href="#reference-files">Reference fil
<pre><code class="language-bash">wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz -O GRCh38_no_alt.fa.gz
gunzip GRCh38_no_alt.fa.gz

echo &quot;a6da8681616c05eb542f1d91606a7b2f GRCh38_no_alt.fa&quot; &gt; GRCh38_no_alt.fa.md5
echo "a6da8681616c05eb542f1d91606a7b2f GRCh38_no_alt.fa" &gt; GRCh38_no_alt.fa.md5
md5sum -c GRCh38_no_alt.fa.md5
# &gt; GRCh38_no_alt.fa: OK
</code></pre>
</li>
<li>
<p>For the gene model, we use the GENCODE v31 &quot;comprehensive gene annotation&quot; GTF for the &quot;CHR&quot; regions. You can get a copy of the gene annotation file <a href="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz">here</a>. For the exact steps to generate the gene model we use, you can run the following commands:</p>
<p>For the gene model, we use the GENCODE v31 "comprehensive gene annotation" GTF for the "CHR" regions. You can get a copy of the gene annotation file <a href="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz">here</a>. For the exact steps to generate the gene model we use, you can run the following commands:</p>
<pre><code class="language-bash">wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
echo &quot;0aac86e8a6c9e5fa5afe2a286325da81 gencode.v31.annotation.gtf.gz&quot; &gt; gencode.v31.annotation.gtf.gz.md5
echo "0aac86e8a6c9e5fa5afe2a286325da81 gencode.v31.annotation.gtf.gz" &gt; gencode.v31.annotation.gtf.gz.md5
md5sum -c gencode.v31.annotation.gtf.gz.md5
# &gt; gencode.v31.annotation.gtf.gz: OK

gunzip gencode.v31.annotation.gtf.gz
echo &quot;4e22351ae216e72aa57cd6d6011960f8 gencode.v31.annotation.gtf&quot; &gt; gencode.v31.annotation.gtf.md5
echo "4e22351ae216e72aa57cd6d6011960f8 gencode.v31.annotation.gtf" &gt; gencode.v31.annotation.gtf.md5
md5sum -c gencode.v31.annotation.gtf.md5
# &gt; gencode.v31.annotation.gtf: OK
</code></pre>
Expand Down Expand Up @@ -394,7 +394,7 @@ <h2 id="workflow"><a class="header" href="#workflow">Workflow</a></h2>
<p>Run <code>picard SortSam</code> on the <code>STAR</code>-aligned BAM file. Note that this is much more memory efficient than using <code>STAR</code>'s built-in sorting (which often takes 100GB+ of RAM).</p>
<pre><code class="language-bash">picard SortSam I=$STAR_BAM \ # Input BAM.
O=$SORTED_BAM \ # Coordinate-sorted BAM.
SO=&quot;coordinate&quot; \ # Specify the output should be coordinate-sorted
SO="coordinate" \ # Specify the output should be coordinate-sorted
CREATE_INDEX=false \ # Explicitly do not create an index at this step, in case the default changes.
CREATE_MD5_FILE=false \ # Explicity do not create an md5 checksum at this step, in case the default changes.
COMPRESSION_LEVEL=5 \ # Explicitly set the compression level to 5, although, at the time of writing, this is the default.
Expand Down Expand Up @@ -425,8 +425,8 @@ <h2 id="workflow"><a class="header" href="#workflow">Workflow</a></h2>
<pre><code class="language-bash">htseq-count -f bam \ # Specify input file as BAM.
-r pos \ # Specify the BAM is position sorted.
-s $PROVIDED_OR_INFERRED \ # Strandedness as specified by the lab and confirmed by `ngsderive strandedness` above. Typically `reverse` for St. Jude Cloud data.
-m union \ # For reference, GDC uses &quot;intersection-nonempty&quot;.
-i gene_name \ # I'd like to use the colloquial gene name here. For reference, GDC uses &quot;gene_id&quot; here. Needs input from reviewers.
-m union \ # For reference, GDC uses "intersection-nonempty".
-i gene_name \ # I'd like to use the colloquial gene name here. For reference, GDC uses "gene_id" here. Needs input from reviewers.
--secondary-alignments ignore \ # Elect to ignore secondary alignments. Needs input from reviewers.
--supplementary-alignments ignore \ # Elect to ignore supplementary alignments. Needs input from reviewers.
$STAR_SORTED_BAM \ # STAR-aligned, coordinate-sorted BAM.
Expand Down Expand Up @@ -492,7 +492,7 @@ <h3 id="gencode-feature-comparisons"><a class="header" href="#gencode-feature-co
<ul>
<li>
<p>Commands that were used in the comparison of feature types between <code>GENCODE v21</code> and <code>GENCODE v31</code>:</p>
<pre><code class="language-bash">gawk '$0 !~ /^#/ { features[$3] += 1; total += 1 } END {for (key in features) { print &quot;Feature &quot; key &quot;: &quot; features[key] &quot; (&quot; (features[key]/total)*100 &quot;%)&quot; }; print &quot;Total: &quot; total}' gencode.v21.annotation.gtf
<pre><code class="language-bash">gawk '$0 !~ /^#/ { features[$3] += 1; total += 1 } END {for (key in features) { print "Feature " key ": " features[key] " (" (features[key]/total)*100 "%)" }; print "Total: " total}' gencode.v21.annotation.gtf
# Feature exon: 1162114 (45.6341%)
# Feature CDS: 696929 (27.3671%)
# Feature UTR: 275100 (10.8027%)
Expand All @@ -503,7 +503,7 @@ <h3 id="gencode-feature-comparisons"><a class="header" href="#gencode-feature-co
# Feature transcript: 196327 (7.7094%)
# Total: 2546594

gawk '$0 !~ /^#/ { features[$3] += 1; total += 1 } END {for (key in features) { print &quot;Feature &quot; key &quot;: &quot; features[key] &quot; (&quot; (features[key]/total)*100 &quot;%)&quot; }; print &quot;Total: &quot; total}' gencode.v31.annotation.gtf
gawk '$0 !~ /^#/ { features[$3] += 1; total += 1 } END {for (key in features) { print "Feature " key ": " features[key] " (" (features[key]/total)*100 "%)" }; print "Total: " total}' gencode.v31.annotation.gtf
# Feature exon: 1363843 (47.3247%)
# Feature CDS: 755320 (26.2092%)
# Feature UTR: 308315 (10.6984%)
Expand All @@ -517,7 +517,7 @@ <h3 id="gencode-feature-comparisons"><a class="header" href="#gencode-feature-co
</li>
<li>
<p>Commands that were used in the comparison of <code>gene_types</code> for <code>gene</code> features between <code>GENCODE v21</code> and <code>GENCODE v31</code>:</p>
<pre><code class="language-bash">gawk '$3 ~ /gene/' gencode.v21.annotation.gtf | gawk 'match($0, /gene_type &quot;([A-Za-z0-9]+)&quot;/, a) { types[a[1]] += 1; total += 1 } END {for (key in types) { print &quot;Gene type &quot; key &quot;: &quot; types[key] &quot; (&quot; (types[key]/total)*100 &quot;%)&quot; }; print &quot;Total: &quot; total}'
<pre><code class="language-bash">gawk '$3 ~ /gene/' gencode.v21.annotation.gtf | gawk 'match($0, /gene_type "([A-Za-z0-9]+)"/, a) { types[a[1]] += 1; total += 1 } END {for (key in types) { print "Gene type " key ": " types[key] " (" (types[key]/total)*100 "%)" }; print "Total: " total}'
# Gene type rRNA: 549 (2.54508%)
# Gene type antisense: 5542 (25.6919%)
# Gene type pseudogene: 29 (0.13444%)
Expand All @@ -528,7 +528,7 @@ <h3 id="gencode-feature-comparisons"><a class="header" href="#gencode-feature-co
# Gene type snRNA: 1912 (8.86375%)
# Total: 21571

gawk '$3 ~ /gene/' gencode.v31.annotation.gtf | gawk 'match($0, /gene_type &quot;([A-Za-z0-9]+)&quot;/, a) { types[a[1]] += 1; total += 1 } END {for (key in types) { print &quot;Gene type &quot; key &quot;: &quot; types[key] &quot; (&quot; (types[key]/total)*100 &quot;%)&quot; }; print &quot;Total: &quot; total}'
gawk '$3 ~ /gene/' gencode.v31.annotation.gtf | gawk 'match($0, /gene_type "([A-Za-z0-9]+)"/, a) { types[a[1]] += 1; total += 1 } END {for (key in types) { print "Gene type " key ": " types[key] " (" (types[key]/total)*100 "%)" }; print "Total: " total}'
# Gene type ribozyme: 8 (0.0351463%)
# Gene type scaRNA: 49 (0.215271%)
# Gene type lncRNA: 16840 (73.983%)
Expand All @@ -546,7 +546,7 @@ <h3 id="gencode-feature-comparisons"><a class="header" href="#gencode-feature-co
</li>
<li>
<p>The following is a quick command that can be used on a GENCODE GTF to summarize the level counts/percentages in the GTF file. This was used to quantify how much of the GTF would be eliminated when all level 3 features were removed.</p>
<pre><code class="language-bash">gawk 'match($0, /level ([0-9]+)/, a) { levels[a[1]] += 1; total += 1 } END {for (key in levels) { print &quot;Level &quot; key &quot;: &quot; levels[key] &quot; (&quot; (levels[key]/total)*100 &quot;%)&quot; }; print &quot;Total : &quot; total}' gencode.v31.annotation.gtf
<pre><code class="language-bash">gawk 'match($0, /level ([0-9]+)/, a) { levels[a[1]] += 1; total += 1 } END {for (key in levels) { print "Level " key ": " levels[key] " (" (levels[key]/total)*100 "%)" }; print "Total : " total}' gencode.v31.annotation.gtf
# Level 1: 186461 (6.4701%)
# Level 2: 2450972 (85.0475%)
# Level 3: 244453 (8.4824%)
Expand Down
4 changes: 2 additions & 2 deletions 0002-scRNA-Seq-workflow.html
Original file line number Diff line number Diff line change
Expand Up @@ -230,7 +230,7 @@ <h2 id="dependencies"><a class="header" href="#dependencies">Dependencies</a></h
<pre><code class="language-bash">cargo install --git https://github.com/stjude/fqlib.git --tag v0.8.0
</code></pre>
<p>Additionally, you will need to install <code>Cell Ranger</code> from 10x genomics to perform the alignment and feature calling.</p>
<pre><code class="language-bash">curl -o cellranger-7.0.0.tar.gz &quot;https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.0.0.tar.gz?Expires=1659064381&amp;Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci03LjAuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2NTkwNjQzODF9fX1dfQ__&amp;Signature=SGKFTznrgspKBPQ2oRZ5C9KazslNWsTkxtkGnnaaPtrOR2iihhvkJ5frOlT5ahkABzBlHf~8EjLqcJ6HofbiQ4McbrHsmAnRggfv2QK0WScF40qDE3kGm~Z57VumO5homkdJPEf9r5DlbMzlArE5-UBH91F0rMMribGjwABXJCjUsAuet0klbUg~~SzCxoNMfTvo3Nn7Mxv7ls51LQf72riNit9zne6WsHynIY7YBHaquVftTi-C6bMZw5A3NoekyA7LBTZwc6sFjrsFrf3s3BpYKeRkBtPdkDMljME9JLDo9wXM7Lf1SfTj1vtNVUDoNhMnTFplDNzyBaDEDm7GVg__&amp;Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&quot;
<pre><code class="language-bash">curl -o cellranger-7.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.0.0.tar.gz?Expires=1659064381&amp;Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci03LjAuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2NTkwNjQzODF9fX1dfQ__&amp;Signature=SGKFTznrgspKBPQ2oRZ5C9KazslNWsTkxtkGnnaaPtrOR2iihhvkJ5frOlT5ahkABzBlHf~8EjLqcJ6HofbiQ4McbrHsmAnRggfv2QK0WScF40qDE3kGm~Z57VumO5homkdJPEf9r5DlbMzlArE5-UBH91F0rMMribGjwABXJCjUsAuet0klbUg~~SzCxoNMfTvo3Nn7Mxv7ls51LQf72riNit9zne6WsHynIY7YBHaquVftTi-C6bMZw5A3NoekyA7LBTZwc6sFjrsFrf3s3BpYKeRkBtPdkDMljME9JLDo9wXM7Lf1SfTj1vtNVUDoNhMnTFplDNzyBaDEDm7GVg__&amp;Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

echo '30855cb96a097c9cab6b02bdb520423f cellranger-7.0.0.tar.gz' &gt; cellranger-7.0.0.tar.gz.md5

Expand All @@ -247,7 +247,7 @@ <h2 id="reference-files"><a class="header" href="#reference-files">Reference fil
<p>We will use the 10x provided reference files version 2020-A. You can get the file by running the following commands:</p>
<pre><code class="language-bash">wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz -O 10x-GRCh38-2020-A.tar.gz

echo &quot;dfd654de39bff23917471e7fcc7a00cd 10x-GRCh38-2020-A.tar.gz&quot; &gt; 10x-GRCh38-2020-A.tar.gz.md5
echo "dfd654de39bff23917471e7fcc7a00cd 10x-GRCh38-2020-A.tar.gz" &gt; 10x-GRCh38-2020-A.tar.gz.md5
md5sum -c 10x-GRCh38-2020-A.tar.gz.md5
# &gt; 10x-GRCh38-2020-A.tar.gz: OK

Expand Down
Loading

0 comments on commit 1751035

Please sign in to comment.