
dpgen RuntimeError #506

Open

Cc-12342234 opened this issue Nov 16, 2024 · 2 comments

Comments

@Cc-12342234

When I ran dpgen and it reached the fp step, this error occurred. I believe the cause is that unconverged tasks were not categorized as failed tasks. Part of the log file (dpdispatcher.log) is as follows:
2024-11-16 19:41:00,268 - INFO : job: 8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b 17340881 terminated; fail_cout is 4; resubmitting job
2024-11-16 19:41:00,295 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b re-submit after terminated; new job_id is 17341121
2024-11-16 19:41:00,509 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b job_id:17341121 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-11-16 19:41:00,509 - INFO : job: 4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 failed 3 times.
Possible remote error message: ==> /public/home/1234/DPGEN-cp2k/remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59/task.000.000518/output <==
:605 *


===== Routine Calling Stack =====

        4 scf_env_do_scf
        3 qs_energies
        2 qs_forces
        1 CP2K

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).

CP2K output:


[ABORT] SCF run NOT converged. To continue the calculation regardless,
        please set the keyword IGNORE_CONVERGENCE_FAILURE.  (qs_scf.F:605)


@njzjz
Member

njzjz commented Nov 17, 2024

It's the designed behavior to raise an error when the command returns a non-zero exit code. You can write the shell command so that it forcibly returns a zero exit code; see: https://stackoverflow.com/a/57189853/9567349
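For concreteness, a minimal sketch of that workaround, assuming the fp command in machine.json invokes CP2K directly (the binary name, input/output file names, and MPI launcher below are assumptions, not taken from this issue):

    # Append "|| true" so the command exits 0 even when CP2K aborts,
    # e.g. on SCF non-convergence; dpdispatcher then treats the job as
    # finished instead of resubmitting it until the retry limit is hit.
    mpirun cp2k.popt -i input.inp -o output || true

Note that this only masks the exit code; the unconverged outputs themselves still need to be filtered out downstream (or CP2K's IGNORE_CONVERGENCE_FAILURE keyword used, as the banner above suggests).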

@Cc-12342234
Author

Thank you for the clarification. Do I understand correctly that this issue arises because multiple fp tasks are grouped into a single submission, so an SCF convergence failure in one CP2K task brings down the whole job? If I instead submit each task individually with sbatch, will that effectively resolve the issue?
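For reference, the number of fp tasks bundled into one job is controlled by the group_size field of the dpdispatcher resources in machine.json; setting it to 1 submits each task as its own job. A sketch of the manual per-task submission described above (the task glob and the submission script name are hypothetical):

    # Submit every fp task directory as a separate Slurm job.
    # run_cp2k.sub is a placeholder for a one-task submission script.
    for d in task.*/; do
        (cd "$d" && sbatch ../run_cp2k.sub)
    done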
