When I ran dpgen and it reached the fp step, this error occurred. I believe the cause is that unconverged tasks were not categorized as failed tasks, which led to this issue. Part of the log file is as follows:
dpdispatcher.log:
2024-11-16 19:41:00,268 - INFO : job: 8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b 17340881 terminated; fail_cout is 4; resubmitting job
2024-11-16 19:41:00,295 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b re-submit after terminated; new job_id is 17341121
2024-11-16 19:41:00,509 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b job_id:17341121 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-11-16 19:41:00,509 - INFO : job: 4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 failed 3 times.
Possible remote error message: ==> /public/home/1234/DPGEN-cp2k/remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59/task.000.000518/output <==
:605 *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
CP2K output:
   ___                                                                    *
  /   \                                                                   *
 [ABORT]                                                                  *
  \___/   SCF run NOT converged. To continue the calculation regardless,  *
    |     please set the keyword IGNORE_CONVERGENCE_FAILURE.              *
  O/|                                                                     *
 /| |                                                                     *
 / \                                                       qs_scf.F:605   *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
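For reference, the keyword named in the abort banner lives in the &SCF section of the CP2K input. A minimal sketch of where it would go (section layout per standard CP2K input; only the relevant sections are shown, and this assumes a CP2K version that supports the keyword):

```
&FORCE_EVAL
  &DFT
    &SCF
      MAX_SCF 50
      ! continue the calculation instead of calling MPI_ABORT
      ! when the SCF does not converge
      IGNORE_CONVERGENCE_FAILURE .TRUE.
    &END SCF
  &END DFT
&END FORCE_EVAL
```

Whether unconverged SCF frames should be allowed into DP-GEN training data at all is a separate question; discarding such frames may be preferable to silently accepting them.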
This is the designed behavior: dpdispatcher throws an error when the command returns a non-zero exit code. You can force the shell command to return a zero exit code; see: https://stackoverflow.com/a/57189853/9567349
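A minimal sketch of that workaround, assuming a dpgen machine.json-style `command` field (the CP2K invocation here is illustrative): appending `|| true` makes the shell report exit code 0 even when CP2K aborts, so dpdispatcher will not treat the job as terminated:

```json
{
  "command": "mpirun cp2k.popt -i input.inp -o output || true"
}
```

The trade-off is that dpdispatcher can then no longer distinguish genuinely failed tasks; any downstream check has to inspect the CP2K output files instead of relying on the exit status.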
Thank you for the clarification. So can I understand this issue as follows: because multiple tasks are grouped into one job when submitting the fp step, an SCF convergence failure in a single CP2K task caused the whole job to fail? If I switch to submitting each task individually via sbatch, would that effectively resolve the issue?
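For reference, dpdispatcher packs `group_size` tasks into each submitted job, so one task per submission corresponds to `group_size: 1` in the resources section. A minimal sketch (the other resource fields are illustrative placeholders):

```json
{
  "resources": {
    "number_node": 1,
    "cpu_per_node": 32,
    "queue_name": "normal",
    "group_size": 1
  }
}
```

Note that this only isolates the failure to a single job: the unconverged task will still be resubmitted and raise the same RuntimeError after three failures unless its exit code is also handled as described above.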