darcyabjones edited this page Jun 4, 2021 · 1 revision

Predector


Predector is a pipeline to run numerous secretome and fungal effector prediction tools, and to combine them in usable and informative ways.

The pipeline currently includes: SignalP (3, 4, 5), TargetP (v2), DeepLoc, TMHMM, Phobius, DeepSig, CAZyme finding (with dbCAN), Pfamscan, searches against PHI-base, Pepstats, ApoplastP, LOCALIZER and EffectorP 1 and 2. These results are summarised as a table that includes most information that would typically be used for secretome analysis. Effector candidates are ranked using a learning-to-rank machine learning method, which balances the tradeoff between secretion prediction and effector property prediction, achieving higher sensitivity, comparable specificity, and better ordering than naive combinations of these features. We recommend that users combine these ranked effector scores with experimental evidence or homology matches to prioritise other more expensive efforts (e.g. cloning or structural modelling).

We hope that Predector can become a platform enabling multiple secretome analyses, with a special focus on eukaryotic (currently only fungal) effector discovery. We also seek to establish data-informed best practices for secretome analysis tasks, where previously there was only a loose consensus, and to make it easy to follow them.

Predector is designed to be run on complete predicted proteomes, as you would get after gene prediction or from databases like UniProt. Although the pipeline will happily run with processed mature proteins or peptide fragments, the analyses that are run as part of the pipeline are not intended for this purpose, and any results from such input should be treated with extreme caution.

Quick install

1. Install Conda, Docker, or Singularity

We provide automated ways of installing dependencies using conda environments (linux OS only), or docker or singularity containers.

Please follow the instructions at one of the following links to install:

NB. We cannot support conda environments on Mac or Windows. Please use a Linux virtual machine or one of the containerised options.

2. Download the proprietary software dependencies

Predector runs several tools that we cannot download for you automatically. Please register for and download each of the following tools, and place them all somewhere that you can access from your terminal. Where you have a choice between versions for different operating systems, you should always take the Linux version (even if using Mac or Windows).

Note that DTU (SignalP etc) don't keep older patches and minor versions available. If the specified version isn't available to download, another version with the same major number should be fine.
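Before building anything, it can help to confirm that every archive is actually present in the current directory. The following is a minimal sketch rather than part of Predector: the function name check_archives is hypothetical, and the filenames in the usage comment are the ones used in the install command in step 3, so adjust them if your downloads are named differently.

```shell
# Hypothetical helper (not shipped with Predector): report any archive
# that is missing. Filenames are passed as arguments.
check_archives() {
    local missing=0 archive
    for archive in "$@"; do
        if [ ! -f "${archive}" ]; then
            echo "missing: ${archive}" >&2
            missing=1
        fi
    done
    # Returns 0 only if every file was found.
    return "${missing}"
}

# Example usage (filenames as in step 3; change them to match your downloads):
# check_archives signalp-3.0.Linux.tar.Z signalp-4.1g.Linux.tar.gz \
#     signalp-5.0b.Linux.tar.gz targetp-2.0.Linux.tar.gz \
#     deeploc-1.0.All.tar.gz tmhmm-2.0c.Linux.tar.gz phobius101_linux.tar.gz
```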

3. Build the conda environment or container

We provide an install script that should install the dependencies for the majority of users.

In the following command, substitute the assigned value of ENVIRONMENT for conda, docker, or singularity as suitable. Make sure you're in the same directory as the proprietary source archives. If the names below don't match the filenames you have exactly, adjust the command accordingly. For singularity and docker container building you may be prompted for your root password (via sudo).

ENVIRONMENT=docker

curl -s "https://raw.githubusercontent.com/ccdmb/predector/1.0.0/install.sh" \
| bash -s "${ENVIRONMENT}" \
    -3 signalp-3.0.Linux.tar.Z \
    -4 signalp-4.1g.Linux.tar.gz \
    -5 signalp-5.0b.Linux.tar.gz \
    -t targetp-2.0.Linux.tar.gz \
    -d deeploc-1.0.All.tar.gz \
    -m tmhmm-2.0c.Linux.tar.gz \
    -p phobius101_linux.tar.gz

This will create the conda environment (named predector), or the docker (tagged predector/predector:1.0.0) or singularity (file ./predector.sif) containers.

Take note of the message given upon completion, which will tell you how to use the container or environment with predector.

If you have issues during installation or want to customise where things are built, please consult the extended documentation, or save the install script locally and run install.sh --help.

4. Install NextFlow

NextFlow requires a bash compatible terminal, and Java version 8+. We require NextFlow version 21 or above. Extended install instructions are available at: https://www.nextflow.io/.

curl -s https://get.nextflow.io | bash

Or using conda:

# Quote the version spec, otherwise the shell treats ">" as a redirect.
conda install -c bioconda "nextflow>=21"
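To confirm that the installed version meets the "21 or above" requirement, a small check along these lines may be useful. This is a sketch under assumptions: the function name nextflow_version_ok is hypothetical, and the usage comment assumes that the last field printed by nextflow -v is the version number.

```shell
# Hypothetical helper: succeed if a version string such as "21.04.1"
# has a major number of at least 21.
nextflow_version_ok() {
    local major="${1%%.*}"          # keep everything before the first dot
    case "${major}" in
        ''|*[!0-9]*) return 2 ;;    # not a plain number
    esac
    [ "${major}" -ge 21 ]
}

# Example usage (assumes the version is the last field of `nextflow -v`):
# nextflow_version_ok "$(nextflow -v | awk '{ print $NF }')" \
#     || echo "Please update nextflow to version 21 or above" >&2
```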

5. Test the pipeline

Run one of the commands below, using the information printed when the dependency install script finished. Make sure you use the environment that you specified in Step 3.

Using conda:

nextflow run -profile test -with-conda /home/username/path/to/environment -resume -r 1.0.0 ccdmb/predector

Using docker:

nextflow run -profile test,docker -resume -r 1.0.0 ccdmb/predector

# if your docker configuration requires sudo use this profile instead
nextflow run -profile test,docker_sudo -resume -r 1.0.0 ccdmb/predector

Using singularity:

nextflow run -profile test -with-singularity path/to/predector.sif -resume -r 1.0.0 ccdmb/predector

# or if you've built the container using docker and it's in your local docker registry.
nextflow run -profile test,singularity -resume -r 1.0.0 ccdmb/predector

Extended dependency install guide

If the quick install method doesn't work for you, you might need to run the environment build steps manually. It would be great if you could also contact us to report the issue, so that we can get the quick install instructions working for more people.

The following guides assume that you have successfully followed steps 1, 2, and 4, and aim to replace step 3.

Building the conda environment the long way

We provide a conda environment file that can be downloaded and installed. This environment contains several "placeholder" packages to deal with the proprietary software. Essentially, these placeholder packages contain scripts to take the source files of the proprietary software, and install them into the conda environment for you.

It is necessary to run both of the code blocks below to properly install the environment.

First we create the conda environment, which includes the non-proprietary dependencies and the "placeholder" packages.

# Download the environment config file.
curl -o environment.yml https://raw.githubusercontent.com/ccdmb/predector/1.0.0/environment.yml

# Create the environment
conda env create -f environment.yml
conda activate predector

To complete the installation we need to run the *-register scripts, which install the proprietary source archives you downloaded yourself. You can copy-paste the entire command below directly into your terminal. Modify the source tar archive filenames in the commands if necessary.

signalp3-register signalp-3.0.Linux.tar.Z \
&& signalp4-register signalp-4.1g.Linux.tar.gz \
&& signalp5-register signalp-5.0b.Linux.tar.gz \
&& targetp2-register targetp-2.0.Linux.tar.gz \
&& deeploc-register deeploc-1.0.All.tar.gz \
&& phobius-register phobius101_linux.tar.gz \
&& tmhmm2-register tmhmm-2.0c.Linux.tar.gz

If any of the *-register scripts fail, please contact the authors or raise an issue on GitHub (we'll try to have an FAQ set up soon).

Building the Docker container the long way

For docker and anything that supports docker images we have a prebuilt container on DockerHub containing all of the open-source components. To install the proprietary software we use this image as a base to build on with a new Dockerfile. To build the new image with the proprietary dependencies, run the command below, which can be copy-pasted directly into your terminal. Modify the source .tar archive filenames in the command if necessary. Depending on how you installed docker, you may need to use sudo docker in place of docker.

curl -s https://raw.githubusercontent.com/ccdmb/predector/1.0.0/Dockerfile \
| docker build \
  --build-arg SIGNALP3=signalp-3.0.Linux.tar.Z \
  --build-arg SIGNALP4=signalp-4.1g.Linux.tar.gz \
  --build-arg SIGNALP5=signalp-5.0b.Linux.tar.gz \
  --build-arg TARGETP2=targetp-2.0.Linux.tar.gz \
  --build-arg PHOBIUS=phobius101_linux.tar.gz \
  --build-arg TMHMM=tmhmm-2.0c.Linux.tar.gz \
  --build-arg DEEPLOC=deeploc-1.0.All.tar.gz \
  -t predector/predector:1.0.0 \
  -f - \
  .

Your container should now be available as predector/predector:1.0.0 in your local docker registry (check with docker images).

Building the Singularity container the long way

There are a few ways to build the singularity image with the proprietary software installed (the filename predector.sif in the sections below).

If you only have singularity installed, you can build the container directly by downloading the .def file and setting some environment variables with the paths to the proprietary source archives. The following commands will build this image for you, and can be copy-pasted directly into your terminal. Modify the source tar archive filenames if necessary.

# This is used to emulate the --build-args functionality of docker.
# Singularity lacks this feature. You can unset the variables after you're done.
export SIGNALP3=signalp-3.0.Linux.tar.Z
export SIGNALP4=signalp-4.1g.Linux.tar.gz
export SIGNALP5=signalp-5.0b.Linux.tar.gz
export TARGETP2=targetp-2.0.Linux.tar.gz
export PHOBIUS=phobius101_linux.tar.gz
export TMHMM=tmhmm-2.0c.Linux.tar.gz
export DEEPLOC=deeploc-1.0.All.tar.gz

# Download the .def file
curl -o ./singularity.def https://raw.githubusercontent.com/ccdmb/predector/1.0.0/singularity.def

# Build the .sif singularity image.
# Note that `sudo -E` is important, it tells sudo to keep the environment variables
# that we just set.
sudo -E singularity build \
  predector.sif \
  ./singularity.def

If you've already built the container using docker, you can convert it to singularity format. You don't need to use sudo even if your docker installation usually requires it.

singularity build predector.sif docker-daemon://predector/predector:1.0.0

Because the container images are quite large, singularity build will sometimes fail if your /tmp partition isn't big enough. In that case, set the following environment variables and remove the cache directory (rm -rf -- "${PWD}/cache") when singularity build is finished.

export SINGULARITY_CACHEDIR="${PWD}/cache"
export SINGULARITY_TMPDIR="${PWD}/cache"
export SINGULARITY_LOCALCACHEDIR="${PWD}/cache"
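The set-then-clean-up pattern described above can be wrapped so that the cache directory is removed even if the build fails. This is only a sketch: with_local_cache is a hypothetical helper name, and it deliberately deletes ./cache afterwards, so don't keep anything else in a directory of that name.

```shell
# Hypothetical helper: run a command with singularity's cache and temporary
# directories pointed at ./cache, then remove that directory.
with_local_cache() {
    local cache_dir="${PWD}/cache"
    local status=0
    mkdir -p -- "${cache_dir}"
    # The variables are set only for the command being run.
    SINGULARITY_CACHEDIR="${cache_dir}" \
    SINGULARITY_TMPDIR="${cache_dir}" \
    SINGULARITY_LOCALCACHEDIR="${cache_dir}" \
        "$@" || status=$?
    rm -rf -- "${cache_dir}"
    # Pass the exit status of the wrapped command back to the caller.
    return "${status}"
}

# Example usage (sudo -E keeps the variables, as in the build command above):
# with_local_cache sudo -E singularity build predector.sif ./singularity.def
```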

Copying environments to places where you don't have root user permission

We can't put the final container images up on DockerHub or Singularity Hub, since that would violate the proprietary license agreements. So if you don't have root user permission on the computer you're going to run the analysis on (e.g. a supercomputing cluster), you can either use the conda environments, or build a container on a different computer and copy the image over.

If the option is available to you, I would recommend using the singularity containers for HPC. Singularity container .sif files can simply be copied to wherever you're running the analysis.

Some supercomputing centres will have shifter installed, which allows you to run jobs with docker containers. Note that there are two versions of shifter and nextflow only supports one of them (the nersc one). Docker containers can be saved as a tarball and copied wherever you like.

# You could pipe this through gzip if you wanted.
docker save predector/predector:1.0.0 > predector.tar

And then on the other end:

docker load -i predector.tar

Conda environments should be buildable anywhere, since they don't require root user permission. You should just be able to follow the instructions described earlier. Just make sure that you install the environment on a shared filesystem (i.e. one that all nodes in your cluster can access).

There are also options for "packing" a conda environment into something that you can copy around (e.g. conda-pack), though we haven't tried this yet.

Hopefully, one of these options will work for you.

Common install issues

Running with docker: Unable to find image 'predector/predector:1.0.0' locally

This usually means that you haven't built the docker image locally. Remember that we cannot distribute some of the dependencies, so you need to build the container image and move it to where you'll be running.

Please check that you have the docker container in your local registry:

docker images

It's also possible that you built a different environment (e.g. conda or singularity). Check the output of conda info -e, or look for any .sif files in the directory where your source archives are.

Another possibility is that you are trying to run the pipeline using a container built for a different version of the pipeline. Please check that the version tag in docker images is the same as the pipeline that you're trying to run. Update the pipeline if necessary using nextflow pull ccdmb/predector.
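The tag check can be scripted. The sketch below uses docker images with its --format option to list the tags of the local predector/predector images; the function name has_predector_image is hypothetical. If your docker installation needs sudo, the call inside the function would need to be adjusted accordingly.

```shell
# Hypothetical helper: succeed if predector/predector with the given tag
# is present in the local docker registry.
has_predector_image() {
    # Print one tag per line, then look for an exact, literal match.
    docker images --format '{{.Tag}}' predector/predector \
        | grep -qxF -- "$1"
}

# Example usage:
# has_predector_image 1.0.0 || echo "no image for version 1.0.0 found" >&2
```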

Running with singularity: ERROR : Failed to set loop flags on loop device: Resource temporarily unavailable.

This is caused by nextflow trying to launch lots of tasks with the same singularity image at the same time. Updating singularity to version >= 3.5 should resolve the issue.

Running the pipeline

To run predector you need your input proteomes as uncompressed fasta files, and a downloaded copy of the PHI-base fasta file.
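Since the inputs need to be uncompressed, a quick sanity check before a long run can save time. The sketch below is a heuristic only, not full FASTA validation: it tests whether the first byte of a file is ">", which a gzipped file will fail. The function name is_plain_fasta is hypothetical.

```shell
# Hypothetical helper: succeed if the file exists and starts with '>'.
# Compressed files start with different bytes, so they fail this check.
is_plain_fasta() {
    [ -f "$1" ] && [ "$(head -c 1 -- "$1")" = ">" ]
}

# Example usage:
# for f in my_proteomes/*.faa; do
#     is_plain_fasta "${f}" || echo "check input file: ${f}" >&2
# done
```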

Assuming that you've installed the dependencies, and know which dependency system you're using (conda, docker, or singularity), you can run like so:

Conda:

nextflow run \
  -resume \
  -r 1.0.0 \
  -with-conda /path/to/conda/env \
  ccdmb/predector \
  --phibase phibase-latest.fas \
  --proteome "my_proteomes/*.faa"

Docker:

nextflow run \
  -resume \
  -r 1.0.0 \
  -profile docker \
  ccdmb/predector \
  --phibase phibase-latest.fas \
  --proteome "my_proteomes/*.faa"

Singularity:

nextflow run \
  -resume \
  -r 1.0.0 \
  -with-singularity ./path/to/singularity.sif \
  ccdmb/predector \
  --phibase phibase-latest.fas \
  --proteome "my_proteomes/*.faa"

Command line parameters

To get a list of all available parameters, use the --help argument.

nextflow run ccdmb/predector --help

Important parameters are:

--proteome <path or glob>
  Path to the fasta formatted protein sequences.
  Multiple files can be specified using globbing patterns in quotes.
--phibase <path>
  Path to the PHI-base fasta dataset.

-profile <string>
  Specify a pre-set configuration profile to use.
  Multiple profiles can be specified by separating them with a comma.
  Common choices: test, docker, docker_sudo

-c | -config <path>
  Provide a custom configuration file.
  If you want to customise things like how many CPUs different tasks
  can use, whether to use the SLURM scheduler etc, this is the way
  to do it. See the predector or nextflow documentation for details
  on how to write these.

-with-conda <path>
  The path to a conda environment to use for dependencies.

-with-singularity <path>
  Path to the singularity container file to use for dependencies.

--outdir <path>
  Base directory to store the pipeline results
  default: 'results'

--tracedir <path>
  Directory to store pipeline runtime information
  default: 'results/pipeline_info'

--chunk_size <int>
  The number of proteins to run as a single chunk in the pipeline
  default: 5000

--nostrip
  Don't strip the proteome filename extension when creating the output filenames
  default: false
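To get a feel for what --chunk_size means for a given proteome, you can count the sequences in a file and divide by the chunk size, rounding up. This is a rough estimate under assumptions: the function name estimate_chunks is hypothetical, and because the pipeline deduplicates sequences (see the output description), the number of chunks actually run may differ.

```shell
# Hypothetical helper: estimate how many chunks a fasta file would be split
# into, as ceil(number of sequences / chunk size).
estimate_chunks() {
    local fasta="$1" chunk_size="${2:-5000}" n
    # grep -c exits non-zero when it counts 0 matches, hence "|| true".
    n="$(grep -c '^>' "${fasta}" || true)"
    echo $(( (n + chunk_size - 1) / chunk_size ))
}

# Example usage:
# estimate_chunks my_proteomes/sample.faa 5000
```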

Profiles and configuration

Nextflow uses configuration files to specify how many cpus or RAM a task can use, or whether to use a SLURM scheduler on a supercomputing cluster etc. You can also use these config files to provide parameters.

To select different configurations, you can either use one of the preset "profiles", or you can provide your own nextflow config files to the -config parameter https://www.nextflow.io/docs/latest/config.html. This enables you to tune the number of CPUs used per task etc to your own computing system.

Profiles

We have several available profiles that configure where to find software, cpu, memory etc.

| type | profile | description |
|------|---------|-------------|
| software | docker | Run the processes in a docker container. |
| software | docker_sudo | Run the processes in a docker container, using sudo docker. |
| software | podman | Run the processes in a container using podman. |
| software | singularity | Run the process using singularity (by pulling it from the local docker registry). To use a singularity image file use the -with-singularity image.sif parameter instead. |
| cpu | c4 | Use up to 4 CPUs/cores per computer/node. |
| cpu | c8 | Use up to 8 CPUs/cores ... |
| cpu | c16 | Use up to 16 CPUs/cores ... |
| memory | r8 | Use up to 8Gb RAM per computer/node. |
| memory | r16 | Use up to 16Gb RAM |
| memory | r32 | Use up to 32Gb RAM |
| memory | r64 | Use up to 64Gb RAM |
| time | t1 | Limits process time to 1hr, 5hr, and 12hr for short, medium and long tasks. |
| time | t2 | Limits process time to 2hr, 10hr, and 24hr for short, medium and long tasks. |
| time | t3 | Limits process time to 3hr, 15hr, and 24hr for short, medium and long tasks. |
| time | t4 | Limits process time to 4hr, 20hr, and 48hr for short, medium and long tasks. |
| compute | pawsey_zeus | A combined profile to use the Pawsey supercomputing centre's Zeus cluster. This sets cpu, memory, and time parameters appropriate for using this cluster. |

You can mix and match these profiles using the -profile parameter. By default, the pipeline behaves as if you ran it with -profile c4,r8 (4 CPUs and 8 GB of memory), which should be compatible with most modern laptop computers and smaller cloud instances. To increase the number of CPUs, use for example -profile c16, which makes up to 16 cores available with the default 8 GB of memory. To make more memory available, specify one of the r* profiles, e.g. -profile c16,r32.

The time profiles (t*) are useful for limiting running times of tasks. By default the times are not limited, but these can be useful to use if you are running on a supercomputing cluster (specifying the times can get you through the queue faster) or on commercial cloud computing services (so you don't rack up an unexpected bill if something stalls somehow).

So to combine all of these things (docker containers, extra RAM and CPUs, and time limits), you can provide -profile c16,r32,t2,docker.

Custom configuration

If the preset profiles don't meet your needs you can provide a custom config file. Extended documentation can be found here: https://www.nextflow.io/docs/latest/config.html.

I'll detail some pipeline specific configuration below but I suggest you start by copying the file https://github.com/ccdmb/predector/tree/master/conf/template_single_node.config and modify as necessary.

Each nextflow task is labelled with the software name, cpu, ram, and time requirements for each task. In the config files, you can select these tasks by label.

| kind | label | description |
|------|-------|-------------|
| cpu | cpu_low | Used for single threaded tasks. Generally doesn't need to be touched. |
| cpu | cpu_medium | Used for parallelised tasks that are IO bound, e.g. signalp 3 & 4, deeploc etc. |
| cpu | cpu_high | Used for parallelised tasks that use lots of CPUs efficiently. Usually this should be all available CPUs. |
| memory | ram_low | Used for processes with low RAM requirements, e.g. downloads. |
| memory | ram_medium | Used for tasks with moderate RAM requirements, and many of the parallelised tasks (e.g. with cpu_medium). |
| memory | ram_high | Used for tasks with high RAM requirements. Usually this should be all available RAM. |
| time | time_short | Used with tasks that should be super quick like sed or splitting files etc (1 or 2 hours at the very most). |
| time | time_medium | Used for more expensive tasks, most parallelised tasks should be able to complete within this time (e.g. 5-10 hours). |
| time | time_long | Used for potentially long running tasks or tasks with times that depend on external factors, e.g. downloads. |
| software | download | Software environment for downloading things (i.e. contains wget). |
| software | posix | Software environment for using general POSIX/GNU tools. |
| software | predectorutils | Software environment for tasks that use the predector-utils scripts. |
| software | signalp3 | |
| software | signalp4 | |
| software | signalp5 | |
| software | deepsig | |
| software | phobius | |
| software | tmhmm | |
| software | deeploc | |
| software | apoplastp | |
| software | localizer | |
| software | effectorp1 | |
| software | effectorp2 | |
| software | emboss | |
| software | hmmer3 | |
| software | pfamscan | |
| software | mmseqs | |
Running different pipeline versions.

We pin the version of the pipeline to run in all of our example commands with the -r 1.0.0 parameter. This flag is optional, but recommended so that you know which version you ran. Different versions of the pipeline may output different scores, use different parameters, etc. It also reinforces the link between the pipeline version and the docker container tags.

If you have previously run predector and want to update it to use a new version, provide the new version to the -r parameter and add the -latest flag to tell nextflow to pull new changes from the GitHub repository. Likewise, you can run old versions of the pipeline by simply changing -r. You can also pull new changes without running the pipeline using nextflow pull ccdmb/predector.

Note that the software environments (conda, docker, singularity) often will not be entirely compatible between versions. You should probably rebuild the container or conda environment from scratch when changing versions. I suggest keeping copies of the proprietary dependencies handy in a folder or archive, and just building and removing the container/environment as you need it.

Providing pre-downloaded Pfam and dbCAN datasets.

Sometimes the Pfam or dbCAN servers can be a bit slow for downloads, and are occasionally unavailable which will cause the pipeline to fail. You may want to keep the downloaded databases to reuse them (or pre-download them).

If you've already run the pipeline once, they'll be in the results folder (unless you specified --outdir) so you can do:

cp -rL results/downloads ./downloads
nextflow run \
  -profile test \
  -resume ccdmb/predector \
  --pfam_hmm downloads/Pfam-A.hmm.gz \
  --pfam_dat downloads/Pfam-A.hmm.dat.gz \
  --pfam_active_site downloads/active_site.dat.gz \
  --dbcan downloads/dbCAN.txt

This will skip the download step at the beginning and just use those files, which saves a few minutes.

Pipeline output

Predector outputs several files for each input file that you provide, and some additional ones that can be useful for debugging results.

Results will always be placed under the directory specified by the parameter --outdir (./results by default).

Downloaded databases (i.e. Pfam and dbCAN) are stored in the downloads subdirectory. Deduplicated sequences, and a tab-separated values file mapping the deduplicated sequence ids to their filenames and original ids, are in the deduplicated subdirectory.

Other directories will be named after the input filenames and each contain several tables.

*-ranked.tsv

This is the main output table that includes the scores and most of the parameters that are important for effector or secretion prediction. There are a lot of columns, though generally you'll only be interested in a few of them.

  1. seqid -- The protein name in the fasta you provided.
  2. effector_score -- Float. The predector machine learning effector score for this protein.
  3. manual_effector_score -- Float. The manually created effector score, which is the sum of the products of several values in this spreadsheet. Consult the paper for details.
  4. manual_secretion_score -- Float. The manually created secretion score, which is the sum of the products of several values in this spreadsheet.
  5. phibase_effector -- Boolean [0, 1] indicating whether the protein had a significant hit to one of the phibase phenotypes: Effector, Hypervirulence, or loss of pathogenicity.
  6. phibase_virulence -- Boolean [0, 1] indicating whether the protein had a significant hit with the phenotype "reduced virulence".
  7. phibase_lethal -- Boolean [0, 1] indicating whether the protein had a significant hit with the phenotype "lethal".
  8. phibase_phenotypes -- A comma separated list of the PHI-base phenotypes in the significant hits to PHI-base.
  9. phibase_matches -- A comma separated list of the PHI-base entries that were significant hits.
  10. effector_match -- Boolean [0, 1] indicating whether the protein had a significant hit in the predector curated set of fungal effectors.
  11. effector_matches -- A comma separated list of the matches to the curated set of fungal effectors.
  12. pfam_match -- Boolean [0, 1] indicating whether the protein had a significant hit to one of the selected Pfam HMMs associated with virulence function.
  13. pfam_matches -- A comma separated list of all Pfam HMMs matched.
  14. dbcan_match -- Boolean [0, 1] indicating whether the protein had a significant hit to one of the dbCAN domains associated with virulence function.
  15. dbcan_matches -- A comma separated list of all dbCAN matches.
  16. effectorp1 -- Float. The raw EffectorP v1 prediction pseudo-probability. Values above 0.5 are considered to be effector predictions.
  17. effectorp2 -- Float. The raw EffectorP v2 prediction pseudo-probability. Values above 0.5 are considered to be effector predictions. Values below 0.6 are annotated in the raw EffectorP output as "unlikely effectors".
  18. is_secreted -- Boolean [0, 1] indicating whether the protein had a signal peptide predicted by any method, and does not have >=2 transmembrane domains predicted by either TMHMM or Phobius.
  19. any_signal_peptide -- Boolean [0, 1] indicating whether any of the signal peptide prediction methods predict the protein to have a signal peptide.
  20. apoplastp -- Float. The raw ApoplastP "apoplast" localised prediction pseudo probability. Values above 0.5 are considered to be apoplastically localised.
  21. single_transmembrane -- Boolean [0, 1] indicating whether the protein is predicted to have 1 transmembrane domain by TMHMM or Phobius (and not >1 for either), and in the case of TMHMM the predicted number of TM AAs in the first 60 residues is less than 10.
  22. multiple_transmembrane -- Boolean [0, 1] indicating whether a protein is predicted to have more than 1 transmembrane domain by either Phobius or TMHMM.
  23. molecular_weight -- Float. The predicted molecular weight (Daltons) of the protein.
  24. residue_number -- Integer. The length of the protein or number of residues/AAs.
  25. charge -- Float. The overall predicted charge of the protein.
  26. isoelectric_point -- Float. The predicted isoelectric point of the protein.
  27. aa_c_number -- Integer. The number of Cysteine residues in the protein.
  28. aa_tiny_number -- Integer. The number of tiny residues (A, C, G, S, or T) in the protein.
  29. aa_small_number -- Integer. The number of small residues (A, B, C, D, G, N, P, S, T, or V) in the protein.
  30. aa_aliphatic_number -- Integer. The number of aliphatic residues (A, I, L, or V) in the protein.
  31. aa_aromatic_number -- Integer. The number of aromatic residues (F, H, W, or Y) in the protein.
  32. aa_nonpolar_number -- Integer. The number of non-polar residues (A, C, F, G, I, L, M, P, V, W, or Y) in the protein.
  33. aa_charged_number -- Integer. The number of charged residues (B, D, E, H, K, R, or Z) in the protein.
  34. aa_basic_number -- Integer. The number of basic residues (H, K, or R) in the protein.
  35. aa_acidic_number -- Integer. The number of acidic residues (B, D, E or Z) in the protein.
  36. fykin_gap -- Float. The number of FYKIN residues + 1 divided by the number of GAP residues + 1. Testa et al. 2016 describe RIP affected regions as being enriched for FYKIN residues, and depleted in GAP residues.
  37. localizer_nuclear -- Boolean [0, 1] or None '.' indicating whether LOCALIZER predicted an internal nuclear localisation peptide. These predictions are run on mature peptides predicted by SignalP 5. Any entry with '.' indicates where the program was not run.
  38. localizer_chloro -- Boolean [0, 1] or None '.' indicating whether LOCALIZER predicted an internal chloroplast localisation peptide. These predictions are run on mature peptides predicted by SignalP 5. Any entry with '.' indicates where the program was not run.
  39. localizer_mito -- Boolean [0, 1] or None '.' indicating whether LOCALIZER predicted an internal mitochondrial localisation peptide. These predictions are run on mature peptides predicted by SignalP 5. Any entry with '.' indicates where the program was not run.
  40. signalp3_nn -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by the neural network model in SignalP 3.
  41. signalp3_hmm -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by the HMM model in SignalP 3.
  42. signalp4 -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by SignalP 4.
  43. signalp5 -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by SignalP 5.
  44. deepsig -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by DeepSig.
  45. phobius_sp -- Boolean [0, 1] indicating whether the protein is predicted to have a signal peptide by Phobius.
  46. phobius_tmcount -- Integer. The number of transmembrane domains predicted by Phobius.
  47. tmhmm_tmcount -- Integer. The number of transmembrane domains predicted by TMHMM.
  48. tmhmm_first_60 -- Float. The predicted number of transmembrane AAs in the first 60 residues of the protein by TMHMM.
  49. tmhmm_exp_aa -- Float. The predicted number of transmembrane AAs in the protein by TMHMM.
  50. tmhmm_first_tm_sp_coverage -- Float. The proportion of the first predicted TM domain that overlaps with the median predicted signal-peptide cut site. Where no signal peptide or no TM domains are predicted, this will always be 0.
  51. targetp_secreted -- Boolean [0, 1] indicating whether TargetP 2 predicts the protein to be secreted.
  52. targetp_secreted_prob -- Float. The TargetP pseudo-probability of secretion.
  53. targetp_mitochondrial_prob -- Float. The TargetP pseudo-probability of mitochondrial localisation.
  54. deeploc_membrane -- Float. DeepLoc pseudo-probability of membrane association.
  55. deeploc_nucleus -- Float. DeepLoc pseudo-probability of nuclear localisation. Note that all DeepLoc values other than "membrane" are from the same classifier, so the sum of all of the pseudo-probabilities will be 1.
  56. deeploc_cytoplasm -- Float. DeepLoc pseudo-probability of cytoplasmic localisation.
  57. deeploc_extracellular -- Float. DeepLoc pseudo-probability of extracellular localisation.
  58. deeploc_mitochondrion -- Float. DeepLoc pseudo-probability of mitochondrial localisation.
  59. deeploc_cell_membrane -- Float. DeepLoc pseudo-probability of cell membrane localisation.
  60. deeploc_endoplasmic_reticulum -- Float. DeepLoc pseudo-probability of ER localisation.
  61. deeploc_plastid -- Float. DeepLoc pseudo-probability of plastid localisation.
  62. deeploc_golgi -- Float. DeepLoc pseudo-probability of golgi apparatus localisation.
  63. deeploc_lysosome -- Float. DeepLoc pseudo-probability of lysosomal localisation.
  64. deeploc_peroxisome -- Float. DeepLoc pseudo-probability of peroxisomal localisation.
  65. signalp3_nn_d -- Float. The raw D-score for the SignalP 3 neural network.
  66. signalp3_hmm_s -- Float. The raw S-score for the SignalP 3 HMM predictor.
  67. signalp4_d -- Float. The raw D-score for SignalP 4. See discussion of choosing multiple thresholds in the SignalP FAQs.
  68. signalp5_prob -- Float. The SignalP 5 signal peptide pseudo-probability.

*.gff3

This file contains gff3 versions of results from analyses that have some positional information (e.g. signal/target peptides or alignments). The columns are:

  1. The protein seqid in your input fasta file.
  2. The analysis that gave this result. Note that for database matches, both the software and database are listed, separated by a colon (:).
  3. The closest Sequence Ontology term that could be used to describe the region.
  4. The start of the region being described (1-based).
  5. The end of the region being described (1-based inclusive).
  6. The score of the match if available. For MMSeqs2 and HMMER matches, this is the e-value. For SignalP 3-nn and 4 this will be the D-score, for SignalP 3-hmm this is the S-probability, and for SignalP 5, DeepSig, TargetP and LOCALIZER mitochondrial or chloroplast predictions this will be the probability score.
  7. The strand. This will always be unstranded (.), since proteins don't have direction in the same way nucleotides do.
  8. The phase. This will always be unspecified (.), because phase is only valid for CDS features.
  9. The GFF attributes. In here the remaining raw results and scores will be present. Of particular interest are the Gap and Target attributes, which define what database match an alignment found and the bounds in the matched sequence, and match/mismatch positions.
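
The nine columns above can be read with a few lines of Python. This is a minimal sketch, not part of predector itself: the function names, the example line (including its source and attribute values), and the protein sequence are all hypothetical. It also shows how the 1-based inclusive start and end map onto a Python slice.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into a dict keyed by column."""
    (seqid, source, type_, start, end,
     score, strand, phase, attributes) = line.rstrip("\n").split("\t")

    # Column 9 is a ';'-separated list of key=value pairs.
    attrs = {}
    if attributes not in ("", "."):
        for field in attributes.split(";"):
            if not field:
                continue
            key, _, value = field.partition("=")
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters

    return {
        "seqid": seqid,
        "source": source,
        "type": type_,
        "start": int(start),  # 1-based
        "end": int(end),      # 1-based inclusive
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attrs,
    }


def feature_sequence(protein, feature):
    """Extract the region of a protein sequence covered by a feature."""
    # 1-based inclusive coordinates -> 0-based half-open slice.
    return protein[feature["start"] - 1:feature["end"]]


# Hypothetical line and sequence, made up for illustration only.
example_line = "\t".join(
    ["prot1", "signalp5", "signal_peptide", "1", "20", "0.98", ".", ".",
     "Note=illustrative"]
)
protein = "MKFSALLAVAAGLASTAVAAPQETTR"

feature = parse_gff3_line(example_line)
print(feature["source"], feature["start"], feature["end"], feature["score"])
print(feature_sequence(protein, feature))
```

Comment lines (starting with `#`) and blank lines would need to be skipped before calling `parse_gff3_line` on a real file.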

Individual results tables

There are a number of tables that are just TSV versions of the original outputs. Most of the tools' outputs are not well described and are not in convenient formats for parsing, so we don't keep the originals around. We've done our best to retain all of the information from the original formats in the TSV versions.

The original formats are described in:

DeepLoc doesn't have any output format documentation that I can find, but hopefully it's fairly self-explanatory. Note that all DeepLoc values other than "membrane" are from the same classifier, so the sum of all of the pseudo-probabilities will be 1.

FAQ

We'll update these as we find new issues and get feedback. Please raise an issue on GitHub or email us if you have an issue not covered here.

What do predector "effector scores" actually mean?

It's best to think of the learning-to-rank scores (and the manually designed ranking scores) as arbitrary numbers that attempt to make effectors appear near the top of a sorted list. The scores will not be consistent between different versions of the model, so please be careful if you're trying to compare scores. Similarly, as with EffectorP, the scores should not be treated as a 'likelihood'. Although you can generally say that proteins with higher scores will be more like known effectors, the difference in "effector-ness" between 0 and 1 is not necessarily the same as it is between 1 and 2 (and so on).

In the upcoming paper for version 1 we present some comparisons with EffectorP classification using a score threshold of 0, but this is not how we suggest you use these scores and the threshold may not be applicable in the future if we change how the model is trained. In general, it's best to look at some additional evidence (e.g. homologues or presence-absence) and manually evaluate candidates in descending order of score (i.e. using predector as a decision support system) until you have enough to work with.
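
As a small sketch of that workflow, the snippet below sorts a table by score in descending order and takes the first few rows for manual review, rather than applying a fixed threshold. The column names `name` and `effector_score`, the helper `top_candidates`, and the example values are assumptions made up for illustration; check the header of your own ranked table for the actual names.

```python
import csv
import io


def top_candidates(rows, n, score_column="effector_score"):
    """Return the n highest-scoring rows, best first."""
    ranked = sorted(rows, key=lambda r: float(r[score_column]), reverse=True)
    return ranked[:n]


# Hypothetical table; protein names and scores are made up.
table_text = (
    "name\teffector_score\n"
    "protA\t-1.20\n"
    "protB\t2.35\n"
    "protC\t0.40\n"
    "protD\t1.10\n"
)

rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))
shortlist = top_candidates(rows, n=3)

# Only the order matters here; the gaps between scores are not comparable.
for rank, row in enumerate(shortlist, start=1):
    print(rank, row["name"], row["effector_score"])
```

Increasing `n` corresponds to reading further down the list, which is where additional evidence becomes more important.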

In the first version of the model, the predictions between 0 and 1 can contain some odd effector predictions (e.g. NRPS genes). This is because the model has tried to accommodate some unusual effectors, but the decision tree rules (with discontinuous boundaries) can let some things through that obviously aren't effectors. If you delve into the proteins with lower scores, we recommend that you manually evaluate the protein properties in the ranking sheet yourself to select candidates.

With predector we really wanted to encourage you to look at your data. Ranking separates the bulk of good proteins from bad ones, so it's easier to decide when to stop manually evaluating candidates and settle on a list. Think of it like searching for papers on the web. The first page usually contains something relevant to what you're interested in, but sometimes there are some gems on the 2nd and 3rd pages.

How should I cite predector?

Predector isn't published yet, though the manuscript is near submission. In the meantime, citing the URL of the main GitHub repository will be fine: https://github.com/ccdmb/predector.

Please also cite the dependencies that we use whenever possible. I understand that citation limits can be an issue, but the continued maintenance and development of these tools relies on these citations. There is a BibTeX formatted file with citations in the main GitHub repository, which can be imported into most citation managers. The dependency citations are also listed below.

  • Almagro Armenteros, J. J., Sønderby, C. K., Sønderby, S. K., Nielsen, H., & Winther, O. (2017). DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics, 33(21), 3387–3395. https://doi.org/10.1093/bioinformatics/btx431
  • Almagro Armenteros, J. J., Salvatore, M., Emanuelsson, O., Winther, O., von Heijne, G., Elofsson, A., & Nielsen, H. (2019). Detecting sequence signals in targeting peptides using deep learning. Life Science Alliance, 2(5). https://doi.org/10.26508/lsa.201900429
  • Almagro Armenteros, J. J., Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., von Heijne, G., & Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature Biotechnology, 37(4), 420–423. https://doi.org/10.1038/s41587-019-0036-z
  • Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. https://doi.org/10.1038/nbt.3820
  • Dyrløv Bendtsen, J., Nielsen, H., von Heijne, G., & Brunak, S. (2004). Improved Prediction of Signal Peptides: SignalP 3.0. Journal of Molecular Biology, 340(4), 783–795. https://doi.org/10.1016/j.jmb.2004.05.028
  • Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLOS Computational Biology, 7(10), e1002195. https://doi.org/10.1371/journal.pcbi.1002195
  • Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., & Punta, M. (2014). Pfam: The protein families database. Nucleic Acids Research, 42(Database issue), D222–D230. https://doi.org/10.1093/nar/gkt1223
  • Käll, L., Krogh, A., & Sonnhammer, E. L. L. (2004). A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5), 1027–1036. https://doi.org/10.1016/j.jmb.2004.03.016
  • Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology, 305(3), 567–580. https://doi.org/10.1006/jmbi.2000.4315
  • Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nature Methods, 8(10), 785–786. https://doi.org/10.1038/nmeth.1701
  • Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276–277. https://doi.org/10.1016/S0168-9525(00)02024-2
  • Savojardo, C., Martelli, P. L., Fariselli, P., & Casadio, R. (2018). DeepSig: Deep learning improves signal peptide detection in proteins. Bioinformatics, 34(10), 1690–1696. https://doi.org/10.1093/bioinformatics/btx818
  • Sperschneider, J., Catanzariti, A.-M., DeBoer, K., Petre, B., Gardiner, D. M., Singh, K. B., Dodds, P. N., & Taylor, J. M. (2017). LOCALIZER: Subcellular localization prediction of both plant and effector proteins in the plant cell. Scientific Reports, 7(1), 1–14. https://doi.org/10.1038/srep44598
  • Sperschneider, J., Dodds, P. N., Gardiner, D. M., Singh, K. B., & Taylor, J. M. (2018). Improved prediction of fungal effector proteins from secretomes with EffectorP 2.0. Molecular Plant Pathology, 19(9), 2094–2110. https://doi.org/10.1111/mpp.12682
  • Sperschneider, J., Dodds, P. N., Singh, K. B., & Taylor, J. M. (2018). ApoplastP: Prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist, 217(4), 1764–1778. https://doi.org/10.1111/nph.14946
  • Sperschneider, J., Gardiner, D. M., Dodds, P. N., Tini, F., Covarelli, L., Singh, K. B., Manners, J. M., & Taylor, J. M. (2016). EffectorP: Predicting fungal effector proteins from secretomes using machine learning. New Phytologist, 210(2), 743–761. https://doi.org/10.1111/nph.13794
  • Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988
  • Tange, O. (2020). GNU Parallel 20200522 ('Kraftwerk'). Zenodo. https://doi.org/10.5281/zenodo.3841377
  • Urban, M., Cuzick, A., Seager, J., Wood, V., Rutherford, K., Venkatesh, S. Y., De Silva, N., Martinez, M. C., Pedro, H., Yates, A. D., Hassani-Pak, K., & Hammond-Kosack, K. E. (2020). PHI-base: The pathogen–host interactions database. Nucleic Acids Research, 48(D1), D613–D620. https://doi.org/10.1093/nar/gkz904
  • Zhang, H., Yohe, T., Huang, L., Entwistle, S., Wu, P., Yang, Z., Busk, P. K., Xu, Y., & Yin, Y. (2018). dbCAN2: A meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Research, 46(W1), W95–W101. https://doi.org/10.1093/nar/gky418