Hi again!
We moved our HPC to new hardware in November/December, and the dust is slowly settling. Weirdly, my modified conda version of 1.2.7 stopped working: signalp_v4 kept segfaulting.
So I gave your 1.2.8-alpha branch a try and I'm happy to report the environment resolved on the first try, without the need for a modified environment.yml. But...
Bug:
Running a full test with nextflow run -profile test -with-conda "$USW/miniconda3/envs/predector" -resume -r 1.2.8-alpha ccdmb/predector is reporting the same signalp_v4 segfaults I experienced on 1.2.7. Additionally, tmhmm is now also segfaulting.
The error replicates on both regular compute nodes and in userspace on front-end nodes. There was no difference between having the working directory on BeeGFS spinning disks or NFS SSDs.
Expected result:
A successful test run
OS:
Debian GNU / Linux 12 (bookworm)
conda
Nextflow 24.10.3
No WSL, no macOS
File systems: BeeGFS on $WORK, NFS on $USW, $HOME and $SSD
$USW and $HOME are read-only on regular compute nodes
Logs:
nextflow.log reporting on tmhmm failing
out.txt is in fact an empty file, in.fasta looks healthy
Jan-07 12:04:10.091 [TaskFinalizer-1] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=tmhmm (2); work-dir=/$WORK/temp/predector/work/3a/31b0f9b87da082d33c8e812808ca39
error [nextflow.exception.ProcessFailedException]: Process `tmhmm (2)` terminated with an error exit status (65)
Jan-07 12:04:10.115 [TaskFinalizer-1] ERROR nextflow.processor.TaskProcessor - Error executing process > 'tmhmm (2)'
Caused by:
Process `tmhmm (2)` terminated with an error exit status (65)
Command executed:
CHUNKSIZE="$(decide_task_chunksize.sh in.fasta "4" 100)"
# tail -n+2 is to remove header
parallel --halt now,fail=1 --joblog log.txt -j "4" -N "${CHUNKSIZE}" --line-buffer --recstart '>' --pipe 'tmhmm -short -d' < in.fasta | cat > out.txt
predutils r2js --pipeline-version "1.2.8-alpha" --software-version "2.0c" -o out.ldjson tmhmm out.txt in.fasta
rm -rf -- TMHMM_*
Command exit status:
65
Command output:
(empty)
Command error:
decodeanhmm 1.1g
Copyright (C) 1998 by Anders Krogh
decodeanhmm 1.1g
Copyright (C) 1998 by Anders Krogh
decodeanhmm 1.1g
Copyright (C) 1998 by Anders Krogh
Segmentation fault
decodeanhmm 1.1g
Copyright (C) 1998 by Anders Krogh
Segmentation fault
Name "main::lab" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 130.
Name "main::score" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 114.
Name "main::normscore" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 115.
Segmentation fault
Name "main::score" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 114.
Name "main::normscore" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 115.
Name "main::lab" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 130.
Name "main::score" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 114.
Name "main::normscore" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 115.
Name "main::lab" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 130.
Name "main::lab" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 130.
Name "main::normscore" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 115.
Name "main::score" used only once: possible typo at /$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3/bin/tmhmmformat.pl line 114.
Failed to parse file <out.txt>.
We could not parse any records from the input file.
It's possible that the input is empty, or that it is in the wrong format.
This can happen if an analysis fails but doesn't tell us that it failed.
Please check the input file indicated above and contact us for help if you need it.
nextflow.log on signalp_v4 and tmhmm ultimately failing
Jan-07 12:04:10.190 [TaskFinalizer-2] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=signalp_v4 (2); work-dir=/$WORK/temp/predector/work/3a/0423ae5e1e929bd321cbbd59fe6424
error [nextflow.exception.ProcessFailedException]: Process `signalp_v4 (2)` terminated with an error exit status (65)
Jan-07 12:04:10.191 [TaskFinalizer-3] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=tmhmm (1); work-dir=/$WORK/temp/predector/work/f9/0635e5eedd5eb9509d3800d10b43c1
error [nextflow.exception.ProcessFailedException]: Process `tmhmm (1)` terminated with an error exit status (65)
Jan-07 12:04:10.216 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 42; name: signalp_v4 (1); status: COMPLETED; exit: 65; error: -; workDir: /$WORK/temp/predector/work/48/a0bf2ceb1145f89b91a9bf6d91b30d]
Jan-07 12:04:10.218 [TaskFinalizer-4] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=signalp_v4 (1); work-dir=/$WORK/temp/predector/work/48/a0bf2ceb1145f89b91a9bf6d91b30d
error [nextflow.exception.ProcessFailedException]: Process `signalp_v4 (1)` terminated with an error exit status (65)
signalp_v4 (2) logs
From $WORK/temp/predector/work/3a/0423ae5e1e929bd321cbbd59fe6424
Logs for signalp_v4 (1) look very similar
command.log
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Failed to parse file </$WORK/temp/predector/work/3a/0423ae5e1e929bd321cbbd59fe6424/out.txt> at line 8.
In field 'name': Could not parse value '' as a non-empty string.
out.txt
# SignalP-4.1g euk predictions
# name Cmax pos Ymax pos Smax pos Smean D ? Dmaxcut Networks-used
# SignalP-4.1g euk predictions
# name Cmax pos Ymax pos Smax pos Smean D ? Dmaxcut Networks-used
# SignalP-4.1g euk predictions
# name Cmax pos Ymax pos Smax pos Smean D ? Dmaxcut Networks-used
# SignalP-4.1g euk predictions
# name Cmax pos Ymax pos Smax pos Smean D ? Dmaxcut Networks-used
0.000 1 0.000 1 0.000 1 0.000 0.000 N 0.450 SignalP-noTM
0.000 1 0.000 1 0.000 1 0.000 0.000 N 0.450 SignalP-noTM
0.000 1 0.000 1 0.000 1 0.000 0.000 N 0.450 SignalP-noTM
0.000 1 0.000 1 0.000 1 0.000 0.000 N 0.450 SignalP-noTM
Comments:
Interestingly enough, the registration processes for both signalp_v4 and tmhmm2 also segfault:
Registering source file /$USW/predector/dependencies/signalp-4.1g.Linux.tar.gz for signalp4 into conda environment at:
/$USW/miniconda3/envs/predector/share/signalp4-4.1g-3
Unregistering old source files if they exist.
patching file signalp
Finished registering signalp4.
Testing installation...
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Test succeeded.
signalp4 is now fully installed!!
Registering source file /$USW/predector/dependencies/tmhmm-2.0c.Linux.tar.gz for tmhmm into conda environment at:
/$USW/miniconda3/envs/predector/share/tmhmm-2.0c-3
Unregistering old source files if they exist.
patching file bin/tmhmm
Finished registering tmhmm.
Testing installation...
Segmentation fault
Test succeeded.
tmhmm is now full installed!
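Since the register scripts print "Test succeeded." right after the segfaults, it may be worth checking the exit status of the binaries by hand. Below is a minimal sketch, assuming a POSIX shell, where a process killed by a signal is reported as 128 plus the signal number (139 for SIGSEGV). describe_status is a hypothetical helper and not part of Predector. The 65 that Nextflow reports appears to come from the later parsing step rather than from the segfaulting binary itself.

```shell
# Hypothetical helper: translate an exit status into a readable message.
describe_status() {
  status="$1"
  if [ "$status" -eq 0 ]; then
    echo "ok"
  elif [ "$status" -gt 128 ]; then
    # Shells report "killed by signal N" as 128 + N (SIGSEGV is 11).
    echo "killed by signal $((status - 128))"
  else
    echo "failed with status $status"
  fi
}

describe_status 139   # killed by signal 11
describe_status 65    # failed with status 65
```

In practice this would be called as describe_status "$?" directly after running signalp4 or tmhmm on a small FASTA file.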
edit (Jan 08)
I tried my luck with Apptainer (ex-Singularity) today. The /dev/ branch 1.2.8-alpha won't create containers (it does not find the register scripts for proprietary software in post), but the /master/ 1.2.7 branch worked nicely. More than stoked to find the local Apptainer version passes all tests.
Copying the exact same environment to the HPC however? Not so great. I'd encountered the same error running the conda version of 1.2.7 on the new HPC before.
nextflow.log
ERROR ~ Error executing process > 'signalp_v4 (1)'
Caused by:
Process `signalp_v4 (1)` terminated with an error exit status (65)
Command executed:
CHUNKSIZE="$(decide_task_chunksize.sh in.fasta "4" 100)"
parallel --halt now,fail=1 --joblog log.txt -j "4" -N "${CHUNKSIZE}" --line-buffer --recstart '>' --cat 'signalp4 -t "euk" -f short "{}"' < in.fasta | cat > out.txt
predutils r2js --pipeline-version "1.2.8-alpha" --software-version "4.1g" -o out.ldjson signalp4 out.txt in.fasta
Command exit status:
65
Command output:
(empty)
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
INFO: gocryptfs not found, will not be able to use gocryptfs
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Failed to parse file <out.txt> at line 8.
The line had the wrong number of columns. Expected 12 but got 11
The only thing that I know changed is the way /tmp/ folders work. They are now managed per-session, using the $TMPDIR variable. But I don't quite see how that could cause such an error.
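To rule the temp directory in or out, a quick probe of $TMPDIR on the affected node might help. This is a minimal sketch, assuming mktemp is available. probe_tmpdir is a hypothetical helper, and passing this check would not prove that the per-session directories are unrelated to the segfaults.

```shell
# Hypothetical helper: report whether a temp directory is usable.
# Uses the first argument if given, else $TMPDIR, else /tmp.
probe_tmpdir() {
  dir="${1:-${TMPDIR:-/tmp}}"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  # mktemp fails if a file cannot be created in the directory.
  probe="$(mktemp "$dir/probe.XXXXXX" 2>/dev/null)" || {
    echo "not writable: $dir"
    return 1
  }
  rm -f -- "$probe"
  echo "usable: $dir"
}

probe_tmpdir || echo "temp directory needs attention"
```

Running it both on a front-end node and inside a job would show whether the two see different temp directories.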
edit (Jan 09)
Just like the apptainer version 1.2.7, the conda version of 1.2.8-alpha passes all tests on a local installation. I'm going to escalate this issue to our HPC admins for now, assuming it's a local anomaly.
edit (Jan 14)
We're assuming it might be related to Omnipath causing problems with OpenMPI. I'll keep trying stuff out and update this post when I find something!
edit (Jan 16)
Tried my luck setting OMPI_MCA_mtl=ofi as an environment variable as recommended, to no avail. I'm not versed in Nextflow, so I might've done something wrong. I tried two approaches on a forked version of Predector (literally just changed the config files around):
Passing the variable directly in bash: OMPI_MCA_mtl=ofi nextflow run -profile test -with-conda "/usw/bbe0337/miniconda3/envs/predector" -resume -r 1.2.8-alpha markusHaferkamp/predector
Changing nextflow.config:
profiles {
    test {
        includeConfig "$baseDir/conf/test.config"
        env.OMPI_MCA_mtl = 'ofi'
    }
}
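A way to sanity-check the shell side of the first approach is to confirm that an exported variable is visible to child processes. This is a minimal sketch in plain shell. probe_mtl is a hypothetical helper, and it only shows what a direct child of the shell inherits, not whether Nextflow forwards the variable to its tasks.

```shell
# Hypothetical helper: a child process prints the value it inherited,
# or "unset" if the variable did not reach it.
probe_mtl() {
  sh -c 'echo "${OMPI_MCA_mtl:-unset}"'
}

unset OMPI_MCA_mtl
probe_mtl                 # unset

export OMPI_MCA_mtl=ofi
probe_mtl                 # ofi
```

If the same kind of echo placed inside a task script printed "unset", that would suggest the variable is lost somewhere between the launching shell and the task.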
Back to square one it is!