Commit

Added some more documentation. Install scripts now support custom Dockerfiles and Singularity def-files for development.

darcyabjones committed Jan 13, 2022
1 parent ddf7ead commit 187ab3e
Showing 6 changed files with 138 additions and 20 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -23,7 +23,7 @@ COPY "${SIGNALP3}" \
"${DEEPLOC}" \
"${PHOBIUS}" \
"${TMHMM}" \
/tmp/onbuild
/tmp/onbuild/

# CONDA_PREFIX should be set by the base container.
RUN echo \
53 changes: 53 additions & 0 deletions docs/faq.md
@@ -128,6 +128,59 @@ conda list > conda_environment.txt
```


### Error while running SignalP 6 `ValueError: zero-size array to reduction operation maximum which has no identity`

This is a known issue with some sequences and certain versions of SignalP 6.
Unfortunately we can't do much about this other than report the troublesome sequence(s) to the developers.

If you contact us or raise an issue we can do that for you.
Please include the sequences that are causing the issue and the exact version of SignalP 6 that you downloaded so that we can be most helpful.
Otherwise, if you use GitHub, you can raise an issue yourself in [their repository](https://github.com/fteufel/signalp-6.0) (note, though, that the code there isn't actually what is distributed).
They also list contact emails in their [installation instructions](https://github.com/fteufel/signalp-6.0/blob/main/installation_instructions.md#bugs-and-questions).


As a temporary fix you can re-run the pipeline with the `--no_signalp6` parameter, which skips SignalP 6 for all sequences.
Alternatively, you can manually mark the failing chunk as completed (internally we split the input into sets of `--chunk_size` unique sequences, 5000 by default), which skips SignalP 6 only for that individual chunk.

1) Find the working directory of the task from the error message.
It will look like this:
```
Work dir:
/home/ubuntu/predector_analysis/work/7e/954be70138c4c29467945fade280ab
```

2) Set the exit code to 0 and create an empty output file:

```
DIR_CONTAINING_ERROR=/home/ubuntu/predector_analysis/work/7e/954be70138c4c29467945fade280ab
echo "0" > "${DIR_CONTAINING_ERROR}/.exitcode"
touch "${DIR_CONTAINING_ERROR}/out.ldjson"
```

3) Re-run the pipeline as you did before with the `-resume` option.

This should restart the pipeline and continue as if SignalP 6 hadn't failed (though it may still fail on a different chunk).
Note, however, that if you skip the analysis for one chunk, the manual ranking scores (and probably the learned ranking scores in the near future) won't be reliable, because the other chunks will have more information.
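Putting steps 1-3 together, the whole workaround looks roughly like this (the work directory is the example path from above, and the final `nextflow run` line is a hypothetical invocation; substitute your own):

```shell
# Mark the failed SignalP 6 task as successfully completed with an
# empty output. The directory here is the example from the error
# message above; we create it so this sketch is self-contained.
DIR_CONTAINING_ERROR="work/7e/954be70138c4c29467945fade280ab"
mkdir -p "${DIR_CONTAINING_ERROR}"  # not needed in practice; the dir already exists

echo "0" > "${DIR_CONTAINING_ERROR}/.exitcode"
touch "${DIR_CONTAINING_ERROR}/out.ldjson"

# Then repeat your original command with -resume, e.g. (hypothetical):
# nextflow run ccdmb/predector --proteomes "proteomes/*.fasta" -resume
```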


## Error while running a process with `Command exit status: 137`

This exit code usually means that a task has run out of memory.
At the time of writing this seems to happen when running SignalP 6 on relatively small computers (e.g. with <6GB RAM available).

A general strategy for reducing memory usage is to reduce `--chunk_size` to below 1000 (say, 500).
Specifically for SignalP 6 you can also try reducing the `--signalp6_bsize` to 10.
You can read more about these parameters in the [Command line parameters section](#command-line-parameters).
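For example, a re-run with both parameters reduced might look like the following (the `ccdmb/predector` pipeline name and the input glob are assumptions for illustration; use your own original command):

```shell
# Hypothetical invocation with reduced memory settings; substitute
# your own pipeline command, profile, and inputs.
nextflow run ccdmb/predector \
  -resume \
  --proteomes "proteomes/*.fasta" \
  --chunk_size 500 \
  --signalp6_bsize 10
```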

If you encounter this issue in the final steps, when the output and ranking tables are produced, it may be that one of your input fasta files is very large.
As noted in the [running the pipeline section](#running-the-pipeline), Predector was designed to handle typical proteomes.
The number of proteomes doesn't really matter, because internally we deduplicate and divide the inputs into chunks, but a single input fasta with, say, >10e5 proteins can cause issues if you don't have lots of RAM (we find that about 30GB is needed for a few million proteins).
If your proteins aren't split into proteomes (e.g. you're running on a set downloaded from UniProt), it's best to split them yourself into batches of about 20000 and then concatenate the final tables afterwards. We can guide you through dealing with this to make use of what you have already computed, so please get in touch.

If you encounter this issue with other processes please let us know.
We've done our best to keep peak memory use low for most tasks, but there may be cases that we hadn't considered.


## FAQ

9 changes: 9 additions & 0 deletions docs/running.md
@@ -51,6 +51,13 @@ See below for some ways you can typically provide files to the `--proteome` parameter.
You can find more info on the Globbing operations that are supported by Nextflow in the [Java documentation](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob).


Predector is designed to run with typical proteomes, e.g. with an average of ~15000 proteins.
Internally we de-duplicate sequences and split the fasta files into smaller chunks to reduce redundant computation, enhance parallelism, and control peak memory usage.
You do not need to concatenate your proteomes together; instead, keep them separate and use the globbing patterns above.
Inputting a single very large fasta file can cause the pipeline to fail in the final steps, when the ranking and analysis tables are produced, because the "re-duplicated" results can be extremely large.
If your dataset doesn't naturally separate into proteomes (e.g. a multi-species dataset downloaded from a UniProtKB query), it's best to split the fasta into sets of roughly 20000 sequences (e.g. using [seqkit](https://bioinf.shenwei.me/seqkit/usage/#split)) and use a globbing pattern on those split fastas.
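If seqkit isn't available, the same split can be sketched with plain POSIX shell and awk. The `split_fasta` helper below is our own illustration (not part of the pipeline), demonstrated on a toy two-record fasta:

```shell
# Split a multi-record fasta into chunks of at most N records each.
# Toy sketch only; `seqkit split` is faster and better tested for real data.
split_fasta() {
  infile="$1"; per_chunk="$2"; prefix="$3"
  awk -v n="$per_chunk" -v p="$prefix" '
    /^>/ {
      if (count % n == 0) {        # start a new chunk every n records
        if (file) close(file)
        file = p "." (count / n + 1) ".fasta"
      }
      count++
    }
    { print > file }
  ' "$infile"
}

# Demo: two records, one record per chunk.
printf '>a\nMKV\n>b\nMLL\n' > toy.fasta
split_fasta toy.fasta 1 toy_chunk
# Produces toy_chunk.1.fasta (record a) and toy_chunk.2.fasta (record b).
```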


### Accessing and copying the results

By default the results of the pipeline are stored in the `results` folder. You can change this directory using the `--outdir` parameter to the pipeline.
@@ -224,6 +231,8 @@ Those starting with two hyphens `--` are Predector-defined parameters.
In the pipeline ranking output tables we also provide manual (i.e. not machine learning) ranking scores for both effectors (`manual_effector_score`) and secretion (`manual_secretion_score`).
These are provided so that you can customise the ranking if the ML ranker isn't what you want.

> NOTE: If you decide not to run specific analyses (e.g. SignalP 6 or Pfam), this may affect comparability between different runs of the pipeline.

These scores are computed by a relatively simple linear function weighting features in the ranking table.
You can customise the weights applied to the features from the command line.
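As an illustration of what such a linear weighting looks like, here is a toy awk example over a made-up tab-separated score table. The column names and the weights are invented for the example; they are not the pipeline's actual features or default weights:

```shell
# Toy linear ranking: score = 2.0 * signal + 1.0 * effector.
# The table, columns, and weights are illustrative only.
printf 'name\tsignal\teffector\nprot1\t0.9\t0.2\nprot2\t0.1\t0.8\n' > toy_scores.tsv

awk -F'\t' 'NR > 1 { print $1, 2.0 * $2 + 1.0 * $3 }' toy_scores.tsv
# prot1 2
# prot2 1
```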

68 changes: 62 additions & 6 deletions install.sh
@@ -33,6 +33,12 @@ CONDA_COMMAND="conda"
CONDA_ENV_DIR=
CONDA_TEMPLATE=

# Only valid for Singularity
SINGULARITY_DEFFILE=

# Only valid for Docker
DOCKERFILE=

# This sets -x
DEBUG=false

@@ -117,6 +123,12 @@ Optional parameters:
--conda-template -- Use this conda environment.yml file instead of downloading it from github.
Only affects conda installs.
--singularity-deffile -- Use this singularity .def file instead of downloading it from github.
Only affects singularity installs.
--dockerfile -- Use this dockerfile instead of downloading it from github.
Only affects docker installs.
Flags:
--debug -- Increased verbosity for developer use.
-h|--help -- Show this message and exit.
@@ -228,6 +240,18 @@ case $key in
shift
shift
;;
--singularity-deffile)
check_nodefault_param "--singularity-deffile" "${SINGULARITY_DEFFILE}" "${2:-}"
SINGULARITY_DEFFILE="$2"
shift
shift
;;
--dockerfile)
check_nodefault_param "--dockerfile" "${DOCKERFILE}" "${2:-}"
DOCKERFILE="$2"
shift
shift
;;
--debug)
DEBUG=true
shift # past argument
@@ -304,6 +328,16 @@ then
[ ! -f "${CONDA_TEMPLATE:-}" ] && echo "The specified alternate conda template '${CONDA_TEMPLATE}' does not exist." 1>&2 && FAILED=true
fi

if [ ! -z "${SINGULARITY_DEFFILE:-}" ]
then
[ ! -f "${SINGULARITY_DEFFILE:-}" ] && echo "The specified alternate singularity .def file '${SINGULARITY_DEFFILE}' does not exist." 1>&2 && FAILED=true
fi

if [ ! -z "${DOCKERFILE:-}" ]
then
[ ! -f "${DOCKERFILE:-}" ] && echo "The specified alternate dockerfile '${DOCKERFILE}' does not exist." 1>&2 && FAILED=true
fi

if [ "${FAILED}" = "true" ]
then
echo 1>&2
@@ -320,7 +354,7 @@ warn_signalp6_not_installed() {
echo
echo "WARNING: SignalP 6 was not installed because you didn't provide the tar-ball."
echo "WARNING: Predector will automatically skip running SignalP 6."
echo "WARNING: You may also like to specify the `--no_signalp6` flag when running the pipeline."
echo "WARNING: You may also like to specify the '--no_signalp6' flag when running the pipeline."
}


@@ -610,8 +644,16 @@ setup_docker() {
SP6_FLAG=""
fi

curl -s "${URL}" \
| ${SUDO} docker build \
TMPFILE=".predector$$.Dockerfile"
if [ -z "${DOCKERFILE:-}" ]
then
curl -o "${TMPFILE}" -s "${URL}"
DOCKERFILE_FILE="${TMPFILE}"
else
DOCKERFILE_FILE="${DOCKERFILE}"
fi

${SUDO} docker build \
--build-arg SIGNALP3="${SIGNALP3}" \
--build-arg SIGNALP4="${SIGNALP4}" \
--build-arg SIGNALP5="${SIGNALP5}" \
@@ -621,10 +663,15 @@
--build-arg TMHMM="${TMHMM}" \
--build-arg DEEPLOC="${DEEPLOC}" \
--tag "${NAME}" \
--file - \
--file "${DOCKERFILE_FILE}" \
. \
|| RETCODE="$?"

if [ -z "${DOCKERFILE:-}" ]
then
rm -f "${TMPFILE}"
fi

if [ "${RETCODE:-0}" -ne 0 ]
then
docker_build_error
@@ -728,13 +775,22 @@ setup_singularity() {

# Download the .def file
export TMPFILE=".predector$$.def"
curl -s -o "${TMPFILE}" "${URL}"

if [ -z "${SINGULARITY_DEFFILE:-}" ]
then
curl -s -o "${TMPFILE}" "${URL}"
else
# This is necessary because singularity doesn't look
# in the local docker registry by default
sed '/^bootstrap:/s/docker[[:space:]]*$/docker-daemon/' "${SINGULARITY_DEFFILE}" > "${TMPFILE}"
fi
export SINGULARITY_DEFFILE_FILE="${TMPFILE}"

# Build the .sif singularity image.
# Note that `sudo -E` is important, it tells sudo to keep the environment variables
# that we just set.
sudo -E bash -eu -c '
singularity build "${NAME}" "${TMPFILE}" || RETCODE="$?"
singularity build "${NAME}" "${SINGULARITY_DEFFILE_FILE}" || RETCODE="$?"
rm -rf -- "${SINGULARITY_CACHEDIR}"
exit "${RETCODE:-0}"
' || RETCODE="$?"
13 changes: 13 additions & 0 deletions test/known_troubling_seqs.fasta
@@ -0,0 +1,13 @@
>P000004B9 I know that this will crash some versions of SignalP 6
MAFRLFAGITGRQLLAGGAALGGTGLAGSLIQTESERLQATEAQVQFHTSSIHPTPVGFS
PWQIRNDYPTSDILKARLKAQKDDSLPNAPSPLIPAPGLPGDFEGENAPWFKYDYEKEPE
KFAEAIREYCFDGNVDKGFRLNENKIRDWYHAPWMHYRDPNSMCTEREPINGFTFERATP
AGEFAKTQNVTLQNWAIGFYNATGATVFGDMWKDPDNPDFSQNKEFPVGTCVFKILLNNS
TPEQMPIQDGAPTMHAVISKSTSNGKERNDFASPLRLIQVDFAVVDKRSPIGWVFGTFMY
NKDQPGKGPWDRLTLVGLQWGNDHWLTNQVYDETKAEGRVAKPRECYIHKKAEDIRKREG
GTRPSWGWNGRMNGPADNFISACASCHSTSTSHPMYNGKVKDGVKQTYGMVPPLNMKPLP
PQPKEGNTFSDVMIYFRNVMGGVPFDEGVNPNNPDEYDPTYKSKVKSADYSLQLQVGWAN
YKKWKEDHETVLQSIFRKTRYVIGSELAGASDLSQRDQGRQEPTDDGPVE
>P00000D45 I know that this will crase some versions of SignalP6
MYSRLFYLKSSYIIYFEPLFSNAIINILSFINSLASPLTIFCFALSAQALSTIFYFRIFI
FIFHSWILLFHFYFTCSFKTYEHQHSKMVPAYRMQSPRALPRTYLYVWPYK
13 changes: 0 additions & 13 deletions test/test_set.fasta
@@ -122,16 +122,3 @@ MTGNRSLVQRHRKKFVVSSVLFATLFATCAITVYFSKRWLYKQHLKMTEQRFVKEQIKRRFVQTQQDSLYTLYELMPVMT
MVAFSSLICALTSIASTLMPTGLEPESSVNVTERGMYDFVLGAHNDHRRRASINYDQNYQTGGQVSYSPSNTGFSVNWNTQDDFVVGVGWTTGSSAPINFGGSFSVNSGTGLLSVYGWSTNPLVEYYIMEDNHNYPAQGTVKGTVTSDGATYTIWENTRVNEPSIQGTATFNQYISVRNSPRTSGTVTVQNHFNAWASLGLHLGQMNYQVVAVEGWGGSGSASQSVSNA
>YHG6_SCHPO
MNIYSVGLFYFFLVFIGAQAMDLDITDYQSIDNTVNIMMKDLMNYWNASSQAFVASYWWVTGATMGALLYNYELFNNDTYVDLISSSLLYNAGSGFDYQPSFEYFNLGNDDQGMWAAAAMDAAEANFSPPNSTEHSWLELTQAAFNRMSGRWDSSTCGGGLRWQAFAWLNGYSYKASVSNALLFQLSSRLARFTNESVYSDWANKIWDWTTDVGFVNTTTYAVYDGADTSTNCTTLDPSQWSYNIGIFMVGAAYMYNYTGETVWRERLDGLISHATSYFFTDDIAWDPQCEYFDDCNSDQTAFKGIFMQSFGNTIRLAPYTYDTLYPLIQTSAAAAAKQCCGGYSGTSCGIYWFWNNGTWDDNYGVQEQFSALQAVQMLMIEYAPEIATLASSTDNRSNSTYASNVVINDTNTTTTIVVKEKDRGGAGFLTFLSAIFILGASIWALVEDEEGKIPSRGKKGIAISS
>P000004B9 I know that this will crash some versions of SignalP 6
MAFRLFAGITGRQLLAGGAALGGTGLAGSLIQTESERLQATEAQVQFHTSSIHPTPVGFS
PWQIRNDYPTSDILKARLKAQKDDSLPNAPSPLIPAPGLPGDFEGENAPWFKYDYEKEPE
KFAEAIREYCFDGNVDKGFRLNENKIRDWYHAPWMHYRDPNSMCTEREPINGFTFERATP
AGEFAKTQNVTLQNWAIGFYNATGATVFGDMWKDPDNPDFSQNKEFPVGTCVFKILLNNS
TPEQMPIQDGAPTMHAVISKSTSNGKERNDFASPLRLIQVDFAVVDKRSPIGWVFGTFMY
NKDQPGKGPWDRLTLVGLQWGNDHWLTNQVYDETKAEGRVAKPRECYIHKKAEDIRKREG
GTRPSWGWNGRMNGPADNFISACASCHSTSTSHPMYNGKVKDGVKQTYGMVPPLNMKPLP
PQPKEGNTFSDVMIYFRNVMGGVPFDEGVNPNNPDEYDPTYKSKVKSADYSLQLQVGWAN
YKKWKEDHETVLQSIFRKTRYVIGSELAGASDLSQRDQGRQEPTDDGPVE
>P00000D45 I know that this will crase some versions of SignalP6
MYSRLFYLKSSYIIYFEPLFSNAIINILSFINSLASPLTIFCFALSAQALSTIFYFRIFI
FIFHSWILLFHFYFTCSFKTYEHQHSKMVPAYRMQSPRALPRTYLYVWPYK
