Document advanced features

deeptools · Jul 13, 2016 · 5fa189d · 5fa189d
1 parent dce8341
commit 5fa189d

Showing 17 changed files with 83 additions and 16 deletions.
diff --git a/deeptools/heatmapper.py b/deeptools/heatmapper.py
@@ -70,6 +70,7 @@ def chopRegionsFromMiddle(exonsInput, left=0, right=0):
     the center point of the exons.
 
     The steps are as follows:
+
      1) Find the center point of the set of exons (e.g., [(0, 200), (300, 400), (800, 900)] would be centered at 200)
        * If a given exon spans the center point then the exon is split
      2) The given number of bases at the end of the left-of-center list are extracted
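
The first docstring step above (finding the exon center and splitting an exon that spans it) can be sketched as follows. This is only an illustration under stated assumptions, not the body of `chopRegionsFromMiddle`: the function name `split_exons_at_center` is hypothetical, exons are assumed to be sorted, non-overlapping, 0-based half-open `(start, end)` tuples, and the center is assumed to be the point where half of the total exonic length has been covered.

```python
def split_exons_at_center(exons):
    """Split sorted exons into left-of-center and right-of-center lists.

    Hypothetical helper for illustration only; it is not deepTools code.
    """
    total = sum(end - start for start, end in exons)
    half = total // 2  # exonic bases that belong to the left side
    left, right = [], []
    covered = 0  # exonic bases seen before the current exon
    for start, end in exons:
        length = end - start
        if covered + length <= half:
            # Exon lies entirely left of the center point.
            left.append((start, end))
        elif covered >= half:
            # Exon lies entirely right of the center point.
            right.append((start, end))
        else:
            # Exon spans the center point, so it is split there.
            center = start + (half - covered)
            left.append((start, center))
            right.append((center, end))
        covered += length
    return left, right


# The docstring example is centered at genomic position 200.
left, right = split_exons_at_center([(0, 200), (300, 400), (800, 900)])
print(left, right)
```

With the docstring example, the left list ends at position 200 and no exon needs to be split; an input such as `[(0, 100), (200, 500)]` would instead split the second exon at position 300.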

diff --git a/docs/content/advanced_features.rst b/docs/content/advanced_features.rst
@@ -0,0 +1,10 @@
+Advanced features
+=================
+
+Some of the features of deepTools are not self-explanatory. Below, we provide links to longer expositions on these more advanced features:
+
+ * :doc:`feature/blacklist`
+ * :doc:`feature/metagene`
+ * :doc:`feature/read_extension`
+ * :doc:`feature/unscaled_regions`
+ * :doc:`feature/read_offsets`
diff --git a/docs/content/feature/blacklist.rst b/docs/content/feature/blacklist.rst
@@ -0,0 +1,12 @@
+Blacklist Regions
+=================
+
+There are many sources of bias in ChIPseq experiments. Among the most prevalent of these is signal arising from "blacklist" regions (see `Carroll et al. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989762/>`__ and the references therein for historical context). Blacklisted regions show notably enriched signal across many ChIP experiment types (e.g., regardless of what is being IPed or the experimental conditions). Including these regions can lead not only to false-positive peaks, but can also throw off between-sample normalization. An example of this is found below:
+
+.. image:: ../../images/feature-blacklist0.png
+
+The region on chromosome 9 starting around position 3 million marks the start of an annotated satellite repeat. As this region contains vastly more reads than expected, slight differences in enrichment here between samples can cause errors in between-sample scaling, thereby masking signal in non-repetitive regions. This can be seen in the IGV screenshot below, where the blacklisted region is just off the side of the screen.
+
+.. image:: ../../images/feature-blacklist1.png
+
+Note that the signal outside of the blacklisted region is slightly depressed due to the blacklisted region. Such regions can be excluded using the `--blackListFileName` option available throughout deepTools. The subtraction of these regions is accounted for in all normalizations.
diff --git a/docs/content/feature/metagene.rst b/docs/content/feature/metagene.rst
@@ -0,0 +1,12 @@
+Metagene analyses
+=================
+
+By default, `computeMatrix` uses the signal over entire contiguous regions (e.g., transcripts) for computing its output. While this is typically quite useful, in cases such as RNAseq the results are less than ideal. Take, for example, the gene model and coverage profile below:
+
+.. image:: ../../images/feature-metagene0.png
+
+If clustering were done using such blocky coverage then the results would be biased by the number of exons and their positions. It's normally preferable to ignore intronic regions and use only the signal in exons (denoted by blocks in the gene model). This can be accomplished by using the `--metagene` option in `computeMatrix` and supplying a BED12 or GTF file as a set of regions:
+
+.. image:: ../../images/feature-metagene1.png
+
+Note that for GTF files the regions used to define exons can be easily modified. For example, for RiboSeq samples it's preferable to use annotated coding regions, which can be done by specifying `--exonID CDS`. Likewise, entire genes can be used rather than transcripts by specifying `--transcriptID gene --transcript_id_designator gene_id`.
diff --git a/docs/content/feature/read_extension.rst b/docs/content/feature/read_extension.rst
@@ -0,0 +1,30 @@
+Read extension
+==============
+
+In the majority of NGS experiments, DNA (or RNA) is fragmented into small stretches and only the ends of these fragments are sequenced. For many applications, it's desirable to quantify coverage of the entire original fragments over the genome. Consequently, there is an `--extendReads` option present throughout deepTools. This works as follows:
+
+Paired-end reads
+----------------
+
+ 1. Regions of the genome are sampled to determine the median fragment/read length.
+ 2. The genome is subdivided into disjoint regions. Each of these regions comprises one or more bins of some desired size (specified by `-bs`).
+ 3. For each region, all alignments overlapping it are gathered. In addition, all alignments within 2000 bases are gathered, as 2000 bases is the maximum allowed fragment size.
+ 4. The resulting collection of alignments are all extended according to their fragment length, which for paired-end reads is indicated in BAM files.
+
+   - For singletons, the expected fragment length from step 1 is used.
+
+ 5. For each of the extended reads, the count in each bin that it overlaps is incremented.
+
+Single-end reads
+----------------
+
+ 1. An extension length, L, is specified.
+ 2. The genome is subdivided into disjoint regions. Each of these regions comprises one or more bins of some desired size (specified by `-bs`).
+ 3. For each region, all alignments overlapping it are gathered. In addition, all alignments within 2000 bases are gathered, as 2000 bases is the maximum allowed fragment size.
+ 4. The resulting collection of alignments are all extended to length L.
+ 5. For each of the extended reads, the count in each bin that it overlaps is incremented.
+
+Blacklisted regions
+-------------------
+
+The question likely arises as to how alignments originating inside of blacklisted regions are handled. In short, any alignment contained completely within a blacklisted region is ignored, regardless of whether it would extend into a non-blacklisted region or not. Alignments only partially overlapping blacklisted regions are treated as normal, as are pairs of reads that span over a blacklisted region. This is primarily for the sake of performance, as otherwise each extended read would need to be checked to see if it overlaps a blacklisted region.
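
The single-end steps and the blacklist rule described above can be sketched roughly as follows. This is a simplified illustration rather than the deepTools implementation: the function names are hypothetical, alignments are assumed to be 0-based half-open `(start, end, is_reverse)` tuples, reverse-strand reads are assumed to be extended towards lower coordinates, and reads already at least as long as the extension length are assumed to be left unchanged.

```python
def extend_single_end(start, end, is_reverse, frag_len):
    """Extend an alignment to frag_len in its direction of sequencing."""
    if end - start >= frag_len:
        return start, end  # assumption: long reads are not shortened
    if is_reverse:
        return max(0, end - frag_len), end
    return start, start + frag_len


def fully_blacklisted(start, end, blacklist):
    """True if the unextended alignment lies completely inside a blacklisted interval."""
    return any(b_start <= start and end <= b_end for b_start, b_end in blacklist)


def bin_coverage(alignments, region_start, region_end, bin_size, frag_len, blacklist=()):
    """Count extended reads per bin within one region."""
    n_bins = -(-(region_end - region_start) // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start, end, is_reverse in alignments:
        # The blacklist check uses the alignment itself, before extension.
        if fully_blacklisted(start, end, blacklist):
            continue
        ext_start, ext_end = extend_single_end(start, end, is_reverse, frag_len)
        lo = max(ext_start, region_start)
        hi = min(ext_end, region_end)
        if lo >= hi:
            continue  # extended read does not overlap this region
        first = (lo - region_start) // bin_size
        last = (hi - 1 - region_start) // bin_size
        for i in range(first, last + 1):
            counts[i] += 1
    return counts


alignments = [(100, 150, False), (380, 430, True), (210, 260, False)]
print(bin_coverage(alignments, 0, 500, 100, 200, blacklist=[(200, 300)]))
```

In this toy input the third alignment is skipped because it lies entirely within the blacklisted interval, even though its extension would reach beyond it, while the first alignment is still counted in the bin that overlaps the blacklist.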
diff --git a/docs/content/feature/read_offsets.rst b/docs/content/feature/read_offsets.rst
@@ -0,0 +1,8 @@
+Offsetting signal to a given position
+=====================================
+
+A growing number of experiment types need to be analyzed by focusing the signal from each alignment at a single point. As an example, RiboSeq alignments are typically offset by around 12 bases so that the pause signal is centered on the translation start site. Alternatively, in GROseq experiments, the pause around the TSS becomes centered by using the 1st base of each read. This can be accomplished within `bamCoverage` using the `--Offset` option. A visual example is below:
+
+.. image:: ../../images/feature-offset0.png
+
+The alignments shown above overlap a transcript, denoted as a blue box, which in this case represents only the coding sequence. If the alignments are from a RiboSeq experiment then the signal from each alignment should be set at the ~12th base of each alignment. The section on the right denotes the resulting signal intensity, with the expected large peak at the translation start site.
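
The idea of collapsing each alignment to a single offset position can be sketched as below. This is a minimal illustration, not the `bamCoverage` code path: the function names are hypothetical, alignments are assumed to be 0-based half-open `(start, end, is_reverse)` tuples without gaps, the offset is assumed to be 1-based and counted from the 5' end of the read (so from the right-hand end for reverse-strand reads), and reads shorter than the offset are assumed to contribute nothing.

```python
from collections import Counter


def offset_position(start, end, is_reverse, offset):
    """Return the 0-based genomic position of the offset-th base of a read."""
    if offset < 1 or offset > end - start:
        return None  # assumption: reads shorter than the offset are skipped
    if is_reverse:
        return end - offset
    return start + offset - 1


def offset_signal(alignments, offset):
    """Accumulate one count per alignment at its offset position."""
    signal = Counter()
    for start, end, is_reverse in alignments:
        pos = offset_position(start, end, is_reverse, offset)
        if pos is not None:
            signal[pos] += 1
    return signal


# Three reads whose 12th base falls on position 111, plus one read that is too short.
reads = [(100, 130, False), (100, 128, False), (93, 123, True), (200, 205, False)]
print(offset_signal(reads, 12))
```

With an offset of 1 the same sketch places the signal on the first sequenced base of each read, which corresponds to the GROseq case mentioned above.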
diff --git a/docs/content/feature/unscaled_regions.rst b/docs/content/feature/unscaled_regions.rst
@@ -0,0 +1,7 @@
+Unscaled regions
+================
+
+Some experiments aim to quantify the distribution of pausing of factors, such as PolII, throughout gene or transcript bodies. PolII and many other factors show pausing (i.e., accumulation of signal) near the start/end of transcripts. As scaling is normally performed to make all regions the same length, the breadth of the paused region could be scaled differently in each transcript. This would, in turn, cause biases during clustering or other analyses. In such cases, the `--unscaled5prime` and `--unscaled3prime` options in `computeMatrix` can be used. These exclude regions at one or both ends of transcripts (or other regions) from scaling, thereby allowing raw signal profiles to be compared across transcripts. An example of this from `Ferrari et al. 2013 <http://www.sciencedirect.com/science/article/pii/S2211124713005603>`__ is shown below:
+
+.. image:: ../../images/feature-unscaled0.png
+
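
The effect of leaving the ends of a region unscaled can be sketched as follows. This is an illustration under stated assumptions and not how `computeMatrix` is implemented: the function name and parameters are hypothetical, the input is assumed to be a per-base signal for one region oriented 5' to 3', and the body is scaled by averaging it into a fixed number of roughly equal-width bins.

```python
def scale_with_unscaled_ends(signal, n_bins, unscaled5=0, unscaled3=0):
    """Scale the body of a region to n_bins while keeping both ends per-base."""
    body_end = len(signal) - unscaled3
    head = list(signal[:unscaled5])          # kept as-is (5' end)
    body = list(signal[unscaled5:body_end])  # scaled to n_bins
    tail = list(signal[body_end:])           # kept as-is (3' end)
    if len(body) < n_bins:
        raise ValueError("region body is shorter than the number of bins")
    # Integer bin edges; each bin holds at least one base because len(body) >= n_bins.
    edges = [i * len(body) // n_bins for i in range(n_bins + 1)]
    scaled = [sum(body[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]
    return head + scaled + tail


profile = [10, 10, 1, 2, 3, 4, 5, 6, 9]
print(scale_with_unscaled_ends(profile, n_bins=3, unscaled5=2, unscaled3=1))
```

Because the first two and the last values are passed through unchanged, a pause at either end keeps the same width in every region regardless of how long the body is.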
diff --git a/docs/content/list_of_tools.rst b/docs/content/list_of_tools.rst
@@ -100,11 +100,7 @@ We offer several ways to filter those BAM files on the fly so that you don't nee
 
 These parameters are optional and available throughout deepTools.
 
-.. note::  In version 2.3 we introduced a sampling method to correct the effect of filtering when normalizing using
-``bamCoverage`` or ``bamCompare``. For previous versions, if you know that your files will be strongly affected by
- the filtering  of duplicates or reads of low quality then consider removing
- those reads *before* using ``bamCoverage`` or ``bamCompare``, as the filtering
- by deepTools is done *after* the scaling factors are calculated!
+.. note::  In version 2.3 we introduced a sampling method to correct the effect of filtering when normalizing using ``bamCoverage`` or ``bamCompare``. For previous versions, if you know that your files will be strongly affected by the filtering of duplicates or reads of low quality then consider removing those reads *before* using ``bamCoverage`` or ``bamCompare``, as the filtering by deepTools is done *after* the scaling factors are calculated!
 
 
 Tools for BAM and bigWig file processing

diff --git a/docs/content/tools/plotHeatmap.rst b/docs/content/tools/plotHeatmap.rst
@@ -118,5 +118,4 @@ we combine different colormap colors, different scales and the new  `--boxAround
 
 .. image:: ../../images/test_plots/ExampleHeatmap4.png
 
-.. tip:: **More examples** can be found in our
-`Gallery <http://deeptools.readthedocs.org/en/latest/content/example_gallery.html#normalized-chip-seq-signals-and-peak-regions>`_.
+.. tip:: **More examples** can be found in our `Gallery <http://deeptools.readthedocs.org/en/latest/content/example_gallery.html#normalized-chip-seq-signals-and-peak-regions>`_.
diff --git a/docs/images/feature-blacklist0.png b/docs/images/feature-blacklist0.png
diff --git a/docs/images/feature-blacklist1.png b/docs/images/feature-blacklist1.png
diff --git a/docs/images/feature-metagene0.png b/docs/images/feature-metagene0.png
diff --git a/docs/images/feature-metagene1.png b/docs/images/feature-metagene1.png
diff --git a/docs/images/feature-offset0.png b/docs/images/feature-offset0.png
diff --git a/docs/images/feature-unscaled0.png b/docs/images/feature-unscaled0.png
diff --git a/docs/index.rst b/docs/index.rst
@@ -29,6 +29,7 @@ Contents:
 
    content/installation
    content/list_of_tools
+   content/advanced_features
    content/example_usage
    content/changelog
    content/help_galaxy_intro

diff --git a/docs/source/deeptools.rst b/docs/source/deeptools.rst
@@ -100,15 +100,6 @@ deeptools.mapReduce module
     :undoc-members:
     :show-inheritance:
 
-
-deeptools.readBed module
-------------------------
-
-.. automodule:: deeptools.readBed
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
 deeptools.utilities module
 --------------------------