refactored, create pip package, INCOMPLETE #1

Open · wants to merge 3 commits into base: sk_v17

Changes from all commits

4 changes: 2 additions & 2 deletions .gitignore
@@ -1,7 +1,7 @@
conda/
.*
database/
binny_*
binny
test*
!.gitignore
*.egg-info
**/__pycache__
8 changes: 8 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,8 @@
include bin/*
recursive-include binny *
include config/*
include envs/*
include test/*
include LICENSE
include README.md
include Snakefile
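
Since this PR is about turning binny into a pip package, the MANIFEST.in above controls which extra files end up in the source distribution. A minimal sketch of how it would typically be exercised, assuming the packaging metadata (setup.py or pyproject.toml, not shown in this diff) is in place; the package name in the last line is an assumption, as the PR is marked INCOMPLETE:
```
python -m pip install --upgrade build
python -m build --sdist              # MANIFEST.in decides which extra files land in the sdist
pip install dist/binny-*.tar.gz      # package/archive name assumed for illustration
```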
151 changes: 49 additions & 102 deletions README.md
@@ -1,123 +1,70 @@
# binny
[![DOI](https://zenodo.org/badge/327396590.svg)](https://zenodo.org/badge/latestdoi/327396590)

## Installing binny
For binny to work, you need [conda](https://www.anaconda.com/).

1) Clone this repository to your disk:
```
git clone https://github.com/a-h-b/binny.git
```
Change into the binny directory:
```
cd binny
```
At this point, you have all the scripts you need to run the workflow using snakemake; you just need some data and a database (see below). If you want to use the **comfortable binny wrapper**, follow points 2-6.

2) Adjust the file VARIABLE_CONFIG to your requirements (use a tab between the variable name and your setting; an illustrative example follows this list):
* SNAKEMAKE_VIA_CONDA - set this to true if you don't have snakemake in your path and want to install it via conda. Leave it empty if you don't need an additional snakemake.
* LOADING_MODULES - insert a bash command to load modules, if you need them to run conda. Leave it empty if you don't need to load a module.
* SUBMIT_COMMAND - insert the bash command you usually use to submit a job to your cluster to run on a single CPU for a few days. You only need this if you want to have the snakemake top instance running in a submitted job. Alternatively, you can run it on the frontend via tmux; leave this empty if you want to use that version and have [tmux](https://github.com/tmux/tmux/wiki) installed.
* SCHEDULER - insert the name of the scheduler you want to use (currently `slurm` or `sge`). This determines the cluster config given to snakemake, e.g. the cluster config file for slurm is config/slurm.config.yaml. Also check that the settings in that file are correct. If you have a different system, contact us (https://github.com/a-h-b/binny/issues).
* MAX_THREADS - set this to the maximum number of cores you want to use in a run. If you don't set this, the default will be 50 (which is more than will be used). Users can override this setting at runtime.
* NORMAL_MEM_EACH - set the size of the RAM of one core of your normal compute nodes (e.g. 8G). If you're not planning to use binny to submit to a cluster, you don't need to set this.
* BIGMEM_MEM_EACH - set the size of the RAM of one core of your bigmem (or highmem) compute nodes. If you're not planning to use binny to submit to a cluster or don't have separate bigmem nodes, you don't need to set this.
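
For orientation, a filled-in VARIABLE_CONFIG could look like the sketch below. All values are placeholders; the module and submit commands in particular depend entirely on your cluster, and the separator between name and value must be a tab.
```
SNAKEMAKE_VIA_CONDA	true
LOADING_MODULES	module load Anaconda3
SUBMIT_COMMAND	sbatch --time=48:00:00 --cpus-per-task=1
SCHEDULER	slurm
MAX_THREADS	24
NORMAL_MEM_EACH	8G
BIGMEM_MEM_EACH	16G
```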
# binny

## Requirements
At the moment binny only runs on Linux. \
Conda (plus, optionally but recommended, Mamba) as well as Git need to be available.
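
A quick way to check that the prerequisites are available on your system:
```
conda --version
mamba --version   # optional, but recommended for faster environment creation
git --version
```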

3) Decide how you want to run binny if you let it submit jobs to the cluster.
Do only one of the two:
* if you want to submit the process running snakemake to the cluster:
```
cp runscripts/binny_submit.sh binny
chmod 755 binny
```
* if you want to keep the process running snakemake on the frontend using tmux:
```
cp runscripts/binny_tmux.sh binny
chmod 755 binny
```
## Quickstart
Here is a quick guide to installing and test-running binny. Please check out the longer description below to set up binny in a cluster environment.

4) **optional**: Install snakemake via conda:
If you want to use snakemake via conda (and you've set SNAKEMAKE_VIA_CONDA to true), install the environment, as [recommended by Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html):
```
conda install -c conda-forge mamba
mamba create --prefix $PWD/conda/snakemake_env
conda activate $PWD/conda/snakemake_env
mamba install -c conda-forge -c bioconda snakemake
conda deactivate
```

1) Clone this repository with git
```
# git clone https://github.com/a-h-b/binny.git
git clone -b sk_v17 https://github.com/ohickl/binny.git
cd binny
```

5) **optional**: Set permissions / PATH:
binny is meant to be used by multiple users. Set the permissions accordingly (a sketch of possible commands follows this list). I'd suggest:
* read access to all files for all users, plus
* execution rights for the binny file and the .sh scripts in the subfolder submit_scripts
* read, write and execution rights for the conda subfolder
* adding the binny directory to your path.
* It can also be useful to make the VARIABLE_CONFIG file non-writable, because you will always need it. The same goes for config.default.yaml once you've set the paths to the databases you want to use (see below).
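
A minimal sketch of such a setup, assuming the repository was cloned into `binny/` in the current directory; adapt the paths and permission scheme to your site's policy:
```
chmod -R a+rX binny/                             # read access to all files for all users
chmod 755 binny/binny binny/submit_scripts/*.sh  # execution rights for the wrapper and submit scripts
chmod -R ug+rwX binny/conda                      # read, write and execution rights for the conda subfolder (once it exists)
export PATH="$PWD/binny:$PATH"                   # add the binny directory to your PATH
```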
2) Create the binny environment (mamba is recommended for speed over conda).

6) Initialize conda environments:
This run sets up the conda environments that will be usable by all users and downloads a database of essential genes from https://webdav-r3lab.uni.lu/public/R3lab/IMP/essential.hmm:
```
./binny -i config/config.init.yaml
```
This step will take several minutes to an hour. It creates a folder named "database", which contains the database of essential genes. You can move this elsewhere and specify the path in the config, if you wish.
I strongly suggest **removing one line from the activation script** after the installation, namely the one reading `R CMD javareconf > /dev/null 2>&1 || true`, because you won't need it later and it can cause trouble if two users run it at the same time. You can do this by running:
```
sed -i "s/R CMD javareconf/#R CMD javareconf/" conda/*/etc/conda/activate.d/activate-r-base.sh
```

7) **Optional** test run:
You should be able to do a test run with:
```
./binny -l -n "TESTRUN" -r config/config.test.yaml
```
If all goes well, binny will run in the current session, load the conda environments, and create and fill a directory called testoutput. A completed run contains a file "contigs2bin.tsv".
If you don't want to see binny's guts at this point, you can also run this with the -c or -f settings to submit to your cluster or start a tmux session (see How to run binny below).

* Set up the environment manager (and optionally the channel priority)
```
# Choose env manager
my_env_manager='mamba' # or 'conda'

# Optional: Set path to desired location to install the conda environment into
my_conda_env_path="absolute/path/to/conda/env/dir" # adjust path here

# If the conda channel priority is set to 'strict', the env creation will likely fail,
# so you might need to use:
# conda config --set channel_priority flexible
```

## How to run binny
To run binny, you need a config file plus data:
* The config file (in yaml format) is read by Snakemake to determine the inputs, arguments and outputs; a hypothetical sketch is shown below.
* You need contigs in a fasta file and the alignments of metagenomic reads in bam format; both have to be set in the config file.
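
The real key names are defined in binny's own config files (config.default.yaml, config/config.test.yaml), so the snippet below is only a hypothetical illustration of the kind of yaml you would point binny at; the key names are made up and the paths are placeholders.
```
# hypothetical sketch -- check config.default.yaml / config/config.test.yaml for the actual keys
assembly: /path/to/contigs.fa      # contigs in fasta format
bam: /path/to/reads_mapped.bam     # alignments of metagenomic reads
outputdir: /path/to/binny_output
threads: 4
```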

### Using the binny wrapper
As shown in the installation description above, binny can be run in a single step by calling the binny executable. Since most of the configuration is done via the config file, the options are very limited. You can either:
* -c run (submit to a cluster) binny and make a report (-r), or
* -l run (in the current terminal) binny and make a report (-r), or
* -f run (in a tmux session on the frontend) binny *only available in the tmux installation* and make a report (-r), or
* just make a report (-r), or
* run a dryrun (-d), or
* unlock a working directory, if a run was killed (-u)
* initialize the conda environments only (-i) - you should only need this during the installation.
It is strongly recommended to **first run a dryrun on a new configuration**, which will tell you within a few seconds and without submission to a cluster whether your chosen steps work together, the input files are where you want them, and your sample file is formatted correctly. In all cases you need the config file as the last argument.

```
binny -d -r config.yaml
```

* Create binny environment
```
# ${my_env_manager} set in previous step.
# Optionally set target path to install to with --prefix ${my_conda_env_path}
${my_env_manager} env create --file workflow/envs/binny.yaml

# If necessary change conda channel priority back to strict:
# conda config --set channel_priority strict
```
You can also set the maximum number of CPUs to run at the same time with -t. The defaults (1 for local/frontend runs and 50 for clusters) are reasonable for many settings; if you don't know what this means, you probably don't have to worry. But you may want to increase the numbers for larger datasets or bigger infrastructure, or decrease them to match your environment's constraints.
You can add a name for your main job (-n NAME), e.g.:

```
binny -c -n RUNNAME -r config.yaml
```

3) Database and Mantis setup with test run.
```
${my_env_manager} activate binny # or ${my_env_manager} activate ${my_conda_env_path}/binny

./binny --outputdir test_output \
        --assembly test/contigs_4bins.fa \
        --bam test/reads_4bins*.bam \
        --threads 4
```
Note that spaces in RUNNAME are not allowed and dots will be replaced by underscores.

If you use the tmux version, you can see the tmux process running by typing `tmux ls`. You can also follow the progress by checking the standard error file, e.g. `tail RUNNAME_XXXXXXXXXX.stderr`.
If all goes well, binny will run in the current session, load the CheckM data, set up Mantis, and create and fill a directory called `test_output`. A completed run should contain four fasta files with one bin each in `test_output/bins`.
If so, binny is good to go.
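
To quickly check the outcome (the expected count of four bin fastas follows from the description above):
```
ls test_output/bins/            # should list four fasta files, one bin each
ls test_output/bins/ | wc -l    # expected: 4
```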

Depending on your dataset and settings and your cluster's scheduler, the workflow will take a few minutes to hours to finish.
To view all available parameters and accompanying explanations, use `./binny --help`.

### Running snakemake manually
Once metagenomic data and the config file are present, the workflow can be started from the binny directory by the snakemake command:
```
snakemake -s Snakefile --configfile /PATH/TO/YOUR/CONFIGFILE --use-conda
```
If you're using a computing cluster, add your cluster's submission command and the maximum number of jobs you want to run at the same time, e.g.:
```
snakemake -j 50 -s Snakefile --cluster "qsub -l h_rt={resources.runtime},h_vmem=8G -pe smp {threads} -cwd" --configfile /PATH/TO/YOUR/CONFIGFILE --use-conda
```
This will submit most steps as their own job to your cluster's queue. The same can be achieved with a [cluster configuration](https://snakemake.readthedocs.io/en/stable/executing/cluster-cloud.html#cluster-execution):
```
snakemake -j 50 -s Snakefile --cluster-config PATH/TO/SCHEDULER.config.yaml --cluster "{cluster.call} {cluster.runtime}{resources.runtime} {cluster.mem_per_cpu}{resources.mem} {cluster.threads}{threads} {cluster.partition}" --configfile /PATH/TO/YOUR/CONFIGFILE --use-conda
```
If you want to share the conda installation with colleagues, use the `--conda-prefix` argument of Snakemake
```
snakemake -j 50 -s Snakefile --cluster-config PATH/TO/SCHEDULER.config.yaml --cluster "{cluster.call} {cluster.runtime}{params.runtime} {cluster.mem_per_cpu}{resources.mem} {cluster.threads}{threads} {cluster.partition}" --use-conda --conda-prefix /PATH/TO/YOUR/COMMON/CONDA/DIRECTORY
```
Depending on your dataset and settings, and your cluster's queue, the workflow will take a few minutes to days to finish.

### CheckM databases

The marker gene data file `checkm_data_2015_01_16.tar.gz` is downloaded from [here](https://data.ace.uq.edu.au/public/CheckM_databases), and the following files are processed:
* taxon_marker_sets.tsv
* tigrfam2pfam.tsv
* checkm.hmm

The processed marker gene file, `taxon_marker_sets_lineage_sorted.tsv`, can be found in the `database` directory by default and is generated using `remove_unused_checkm_hmm_profiles.py` found under
`workflow/scripts`.
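
For reference, fetching and unpacking the marker data by hand would look roughly like the sketch below; binny's setup normally takes care of this, the download URL is assembled from the link and filename above, and the exact invocation of the processing script is not documented here, so it is omitted.
```
# Illustration only -- the workflow downloads and processes these files itself.
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzf checkm_data_2015_01_16.tar.gz
# the processed inputs are taxon_marker_sets.tsv, tigrfam2pfam.tsv and checkm.hmm
```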