two-steps-clustering

Snakemake pipeline to cluster proteomes in two steps with MMseqs2, MAFFT, HHsuite and MCL:

Proteins are clustered into subfamilies in a first round using MMseqs2.
For each subfamily, an MSA is generated using MAFFT.
MSA profiles are compared all-vs-all using hhblits.
Subfamilies are clustered into families using MCL, based on the probability and coverage values obtained with hhblits.
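
As a rough illustration of the last step, the sketch below filters pairwise profile hits by probability and coverage and writes them as an edge list that MCL can read in its label (--abc) format. The four-column table layout, the file names and the thresholds (90 and 0.5) are assumptions made for this example only; the pipeline's actual parsing and cutoffs are defined in the workflow and config file.

```shell
# Work in a scratch directory so nothing in the repo is touched.
workdir=$(mktemp -d)

# Hypothetical input, one profile-profile hit per line:
# query <TAB> target <TAB> probability <TAB> coverage
printf 'sf1\tsf2\t98.5\t0.92\nsf1\tsf3\t40.0\t0.95\nsf2\tsf4\t97.0\t0.30\n' \
    > "$workdir/hits.tsv"

# Keep non-self pairs that pass both (assumed) thresholds and print
# "nodeA nodeB weight", using the probability as the edge weight.
awk -F'\t' -v min_prob=90 -v min_cov=0.5 \
    '$1 != $2 && $3 >= min_prob && $4 >= min_cov { print $1, $2, $3 }' \
    "$workdir/hits.tsv" > "$workdir/edges.abc"

cat "$workdir/edges.abc"

# With MCL installed, the edge list could then be clustered with, e.g.:
#   mcl "$workdir/edges.abc" --abc -o "$workdir/families.txt"
```

Only the sf1/sf2 pair survives here: the second hit fails the probability threshold and the third fails the coverage threshold.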

Installation

The pipeline is written to run as a Snakemake workflow. After cloning the repo, create a conda environment named spc that contains Snakemake:

# clone the repo
$ git clone https://github.com/dcarrillox/two-steps-clustering.git
$ cd two-steps-clustering

# create minimum environment for execution
$ conda create -n spc --file=conda-linux-64.lock

# activate environment
$ conda activate spc

# test Snakemake
(spc) $ snakemake --version
6.6.1

Execution

The pipeline's input is a single FASTA file containing all the proteins to be clustered. Provide the path to this file in the config file, config/config.yaml, under the proteome section. Execution parameters can also be tuned in the same config file.
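
Before launching a run, it can help to sanity-check the input FASTA. The sketch below counts the sequences and lists identifiers that appear more than once. The example file is generated on the fly and is purely hypothetical; the check itself is a suggestion and not part of the pipeline.

```shell
# Hypothetical example file; point fasta_file at the FASTA set in
# config/config.yaml to check real data instead.
fasta_file=$(mktemp)
printf '>p1\nMKV\n>p2\nMLA\n>p1\nMKV\n' > "$fasta_file"

# Number of sequences = number of header lines.
n_seqs=$(grep -c '^>' "$fasta_file")

# Identifiers (first word of each header) that occur more than once.
dup_ids=$(grep '^>' "$fasta_file" | cut -d' ' -f1 | sort | uniq -d | tr -d '>')

echo "sequences: $n_seqs"
echo "duplicated IDs: ${dup_ids:-none}"
```

For the example file this reports three sequences and flags p1 as duplicated.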

To run the pipeline:

# adjust number of threads with the -j argument
$ snakemake -j 16 --use-conda --conda-frontend mamba

The final results, with proteins assigned to subfamilies and families, should be written to results/protein_clustering_results.tsv.
