When I ran dpgen and it reached the fp step, this error occurred. I believe the cause is that unconverged tasks were not categorized as failed tasks, which led to this issue. Part of the log file is as follows:
dpdispatcher.log:
2024-11-16 19:41:00,268 - INFO : job: 8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b 17340881 terminated; fail_cout is 4; resubmitting job
2024-11-16 19:41:00,295 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b re-submit after terminated; new job_id is 17341121
2024-11-16 19:41:00,509 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b job_id:17341121 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-11-16 19:41:00,509 - INFO : job: 4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 failed 3 times.
Possible remote error message: ==> /public/home/1234/DPGEN-cp2k/remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59/task.000.000518/output <==
:605 *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
CP2K output:
   ___                                                                    *
  /   \                                                                   *
 [ABORT]                                                                  *
  \___/   SCF run NOT converged. To continue the calculation regardless,  *
    |     please set the keyword IGNORE_CONVERGENCE_FAILURE.              *
  O/|                                                                     *
 /| |                                                                     *
 / \                                                       qs_scf.F:605   *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
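For reference, the keyword named in the abort banner lives in the &SCF section of the CP2K input. A minimal sketch of where it would go (section layout per standard CP2K input; only the relevant sections are shown, and this assumes a CP2K version that supports the keyword):

```
&FORCE_EVAL
  &DFT
    &SCF
      MAX_SCF 50
      ! continue the calculation instead of calling MPI_ABORT
      ! when the SCF does not converge
      IGNORE_CONVERGENCE_FAILURE .TRUE.
    &END SCF
  &END DFT
&END FORCE_EVAL
```

Whether unconverged SCF frames should be allowed into DP-GEN training data at all is a separate question; discarding such frames may be preferable to silently accepting them.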
This is the designed behavior: dpdispatcher throws an error when the command returns a non-zero exit code. You can force the shell command to return a zero exit code; see: https://stackoverflow.com/a/57189853/9567349
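A minimal sketch of that workaround, assuming a dpgen machine.json-style `command` field (the CP2K invocation here is illustrative): appending `|| true` makes the shell report exit code 0 even when CP2K aborts, so dpdispatcher will not treat the job as terminated:

```json
{
  "command": "mpirun cp2k.popt -i input.inp -o output || true"
}
```

The trade-off is that dpdispatcher can then no longer distinguish genuinely failed tasks; any downstream check has to inspect the CP2K output files instead of relying on the exit status.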
Thank you for the clarification. So can I understand this issue as follows: because multiple tasks are grouped into one job when submitting the fp step, an SCF convergence failure in a single CP2K task caused the whole job to fail? If I switch to submitting each task individually via sbatch, would that effectively resolve the issue?
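For reference, dpdispatcher packs `group_size` tasks into each submitted job, so one task per submission corresponds to `group_size: 1` in the resources section. A minimal sketch (the other resource fields are illustrative placeholders):

```json
{
  "resources": {
    "number_node": 1,
    "cpu_per_node": 32,
    "queue_name": "normal",
    "group_size": 1
  }
}
```

Note that this only isolates the failure to a single job: the unconverged task will still be resubmitted and raise the same RuntimeError after three failures unless its exit code is also handled as described above.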