Merge pull request #151 from timothymillar/f/docs

Version 0.8.0
PlantandFoodResearch · Jan 10, 2023 · 5fb4fab · 5fb4fab
2 parents 55cbba1 + d7908d9
commit 5fb4fab
Show file tree

Hide file tree

Showing 26 changed files with 829 additions and 198 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -12,7 +12,7 @@ repos:
     hooks:
       - id: black
         language_version: python3
-  - repo: https://gitlab.com/pycqa/flake8
+  - repo: https://github.com/PyCQA/flake8
     rev: 3.8.3
     hooks:
       - id: flake8

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,11 +2,15 @@
 
 ## Unreleased 
 
+
+## Beta v0.8.0
+
 New Features:
 - Combine `--bam`, `--bam-list` and `--sample-bam` arguments #128
 - Combine `--ploidy` and `--sample-ploidy` arguments #128
 - Combine `--inbreeding` and `--sample-inbreeding` arguments #128
 - Combine `--mcmc-temperatures` and `--sample-mcmc-temperatures` arguments #128
+- Improvements to documentation
 
 
 

diff --git a/README.rst b/README.rst
@@ -1,7 +1,9 @@
 MCHap
 =====
 
-Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
+MCHap is suite of command line tools for micro-haplotype assembly and genotype
+calling in autopolyploids.
+The primary components of MCHap are ``mchap assemble`` and ``mchap call``.
 
 Installation
 ------------
@@ -15,192 +17,50 @@ From the root directory of this repository run:
     python setup.py sdist
     pip install dist/mchap-*.tar.gz
 
+You should then be able to use the command line tool ``mchap`` which is a wrapper
+around ``mchap assemble`` and ``mchap call``.
 
-Usage
------
-
-Basic assemble example
-~~~~~~~~~~~~~~~~~~~~~~
-
-The wrapper CLI tool is ``mchap`` and assembly sub-tool is ``assemble``.
-At minimum this tool requires the following inputs:
-
-- One or more bam files of reads aligned to a reference genome
-- A bed file of target genomic loci for assembly
-- An indexed VCF file containing a 'reference set' of SNPs
-- An indexed fasta file containing the reference genome
-
-The ``assemble`` and ``call`` sub-tools write out an uncompressed VCF file
-to standard output.
-This should generally be compressed and written to a file:
-
-.. code:: bash
-
-    mchap assemble ... | bgzip > haplotypes.vcf.gz
-
-
-A simple example of running ``mchap assemble`` may look like this:
+MCHap includes a suite of unit tests which can be run from the root directory of
+this repository with:
 
 .. code:: bash
 
-    $ mchap assemble \
-        --bam file1.bam file2.bam file3.bam \
-        --targets loci.bed \
-        --variants snps.vcf.gz \
-        --reference reference.fasta \
-        --ploidy 4 \
-        --cores 8 | bgzip > haplotypes.vcf.gz
+    pytest -v ./
 
 
-Note that the input VCF should be indexed with tabix and the bam files
-and reference genome should be indexed with samtools.
+MCHap assemble
+--------------
 
-The `full list of arguments`_ for ``assemble`` can be accessed with:
+``mchap assemble`` is used for de novo assembly of micro-haplotypes in one or 
+more individuals.
+Haplotypes are assembled from aligned reads in BAM files using known SNVs 
+(single nucleotide variants) from a VCF file.
+A BED file is also required to specify the assembled loci.
+The output of ``mchap assemble`` is a VCF file with assembled micro-haplotype variants
+and genotype calls (some genotype calls may be incomplete).
+See the `MCHap assemble documentation`_ for further information..
 
-.. code:: bash
-
-    mchap assemble -h
+MCHap call
+----------
 
+``mchap call`` is used for (re-) calling genotypes using a set of known 
+micro-haplotypes.
+Genotypes are called using aligned reads in BAM files and known micro-haplotype alleles
+from a VCF file.
+The output of ``mchap call`` is a VCF file with micro-haplotype variants
+and genotype calls (all genotype calls will be complete).
+It is often beneficial to re-call genotypes with ``mchap call`` using the micro-haplotypes
+reported by ``mchap assemble``, particularly in populations of related samples.
+See the `MCHap call documentation`_ for further information.
 
-Job array example
-~~~~~~~~~~~~~~~~~
+Funding
+-------
 
-The ``assemble`` sub-tool can make use of multiple cores using ``--cores``
-argument which uses Pythons build in multiprocessing library.
-However this method of parallel processing can be difficult and inefficient
-fow large scale assembles in a shared HPC environment.
-In such a case it may be more efficient to submit many individual assemblies
-each producing a separate output VCF file and then merging the resulting
-VCFs.
-The following is an example of using the `asub`_ script for array submission
-using the ``--region`` and ``--region-id`` parameters to specify each
-individual locus as a sub-job:
+The development of MCHap was partially funded by the "Tools for Polyploids" Specialty Crop Research Initiative
+(NIFA USDA SCRI Award # 2020-51181-32156).
 
-.. code:: bash
+.. image:: docs/img/tools-for-polyploids.png
+   :width: 300
 
-    JOBNAME='myjob'
-    VCFDIR="./$JOBNAME.vcf"
-    mkdir $VCFDIR
-    while read line; do
-    contig=$(echo "$line" | cut -f 1)
-    start=$(echo "$line" | cut -f 2)
-    stop=$(echo "$line" | cut -f 3)
-    name=$(echo "$line" | cut -f 4)
-    region="$contig:$start-$stop"
-    cat << EOF
-    mchap assemble \
-        --region "$region" \
-        --region-id "$name" \
-        --variants "$VARIANTS" \
-        --reference "$REFERENCE" \
-        --sample-bam "$SAMPLE_BAMS" \
-        | bgzip > $VCFDIR/$name.vcf.gz
-    EOF
-    done <"$BEDFILE" | asub -c 100 -j "$JOBNAME"
-
-
-The above example will iterate though a 4 column bed file (``$BEDFILE``) of
-target loci and create a job array with one sub-job per entry in the bed file.
-Each sub-job will output a VCF file with a single record in a directory called
-"myjob.vcf" (from the ``$VCFDIR`` variable).
-Each VCF file will be named based on the locus name in column 4 of the bed file
-(hence these names must be unique).
-In this example we use the  ``--sample-bam`` option to specify the bam file for
-each sample explicitly as this is more efficient when running many small
-(sub-) jobs.
-Note that the ``--region-id`` argument is only used to set the id of each
-variant record in the VCF output and may be omitted in which case the the
-variant records will be un-named.
-
-
-Performance
------------
-
-The performance of ``mchap assemble`` will depend largely on your data set
-but can be tuned using the available parameters.
-Generally speaking assembles will be slower for higher ploidy organisms,
-with higher read-depths, or with more SNPs falling within each locus in the
-BED file.
-
-Jit compilation
-~~~~~~~~~~~~~~~
-
-MCHap heavily utilizes the numba JIT compiler to speed up MCMC simulations.
-However, the first time you run MCHap on a new system it will have to
-compile the functions that make use of the numba JIT compiler and the 
-compiled functions are then cached for reuse.
-This means that MCHap may run a bit slower the first time it's run after
-installation.
-
-Running on multiple threads
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The simplest way to speed up ``mchap assemble`` is to allow it to run on more
-that one CPU thread using the ``--cores`` argument.
-The maximum number of threads that can be utilized is the number of target
-loci specified in the input BED file.
-Note that if an assembly is run with more than ``1`` thread then the records
-in the output VCF may be in a different order than the loci were in the
-input BED file so it may be necessary to sort the VCF file.
-
-Tuning MCMC parameters
-~~~~~~~~~~~~~~~~~~~~~~
-
-The ``mchap assemble`` program uses Markov chain Monte-Carlo (MCMC)
-simulations to assemble haplotypes at each locus of each sample.
-Reducing the number of steps or complexity of steps will speed up the
-assembly bu may lower the reliability of the results.
-The number of steps is configured with ``--mcmc-steps`` and the number
-that will be removed as burn-in with ``--mcmc-burn``.
-It is recommended to remove at least ``100`` steps as burn-in and that
-at least ``1000`` steps should be kept to calculate posterior probabilities.
-
-The complexity of steps can also be configured by adjusting the proportion
-of structural sub-steps using the ``--mcmc-recombination-step-probability``
-and ``--mcmc-partial-dosage-step-probability`` arguments.
-These arguments represent the probability that a structural sub-step of
-that type will be performed as part of a step in the MCMC simulation.
-These sub-steps can be important for convergence so it is not recommended
-to reduce their probability much lower than ``0.25``.
-
-There is also an additional parameter called ``--mcmc-dosage-step-probability``
-which is used to configure the probability of a "full" dosage-swap sub-step.
-This sub-step type is particularly important for identifying the correct
-dosage of a genotype and is computationally very simple so its probability
-should not be less than ``0.5`` and it is generally recommended to leave it
-at its default value of ``1.0``
-
-Fixing SNPs that are likely to be homozygous
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As mentioned above, the number of SNPs present in a locus has a significant
-impact on the assembly speed.
-The ``--mcmc-fix-homozygous`` argument can be used to identify SNPs that
-have a high probability of being homozygous and 'fixing' them so that they
-do not vary during the assemble process.
-This is applied on a per sample bases and will 'fix' SNPs in one sample
-even if they vary in others.
-The default value for this argument is ``0.999`` and so it will only 'fix'
-SNPs that are extremely unlikely to be heterozygous.
-Reducing this value to ``0.99`` could speed up the assembly process but
-lowering it too much may result in incorrectly called haplotypes especially
-in higher ploidy organisms.
-
-Parallel-tempering
-~~~~~~~~~~~~~~~~~~
-
-The ``mchap assemble`` program can use parallel-tempering to reduce the
-risk of multi-modality and thereby reduce the chance of incorrectly
-assembled haplotypes.
-However, parallel-tempering is computationally intensive as an additional
-MCMC simulation is run for each additional temperature.
-To balance this trade-off it's possible to specify parallel-temperature
-on a per-sample basis using the ``--sample-mcmc-temperatures`` parameter.
-For example, when assembling haplotypes for samples of a pedigree it may
-be desirable to specify multiple temperatures for founding individuals
-to ensure that the founding alleles are identified without using
-parallel-tempering for all of the progeny derived from those founders.
-
-
-.. _`full list of arguments`: cli-assemble-help.txt
-.. _`asub`: https://github.com/lh3/asub
+.. _`MCHap assemble documentation`: docs/assemble.rst
+.. _`MCHap call documentation`: docs/call.rst