slurmctld enters a spawn-kill loop #1

Open
ocramz opened this issue Mar 17, 2016 · 3 comments

@ocramz
ocramz commented Mar 17, 2016

I observe a loop in my fork (NB: I run docker-compute in a Travis instance). Once all services are in steady state, slurmctld keeps restarting, and the following two lines appear continuously in the second part of the log:

slurmctld_1 | 2016-03-17 14:32:00,617 INFO spawned: 'slurmctld' with pid 9203
slurmctld_1 | 2016-03-17 14:32:00,652 INFO exited: slurmctld (exit status 0; not expected)

Full log, up to the beginning of the loop:

Pulling consul (qnib/consul:latest)...
latest: Pulling from qnib/consul
Digest: sha256:53b8ea7af183312ba70917f4b0f68d5631fced9ae3559d6e29923de78c7bdd52
Status: Downloaded newer image for qnib/consul:latest
Creating dockercompute_consul_1...
Pulling slurmctld (qnib/slurmctld:latest)...
latest: Pulling from qnib/slurmctld
Digest: sha256:81f8c2f2b8f07c92a2c1adca2bc2e2e70ef713ce2bee86cba845761e0254245a
Status: Downloaded newer image for qnib/slurmctld:latest
Creating dockercompute_slurmctld_1...
Pulling compute (qnib/compute:latest)...
latest: Pulling from qnib/compute
Digest: sha256:ce03ba5acd061dfa0aaaeeb48b2b72e9f802ef09df3dfda93c5f7f149ddc609a
Status: Downloaded newer image for qnib/compute:latest
Creating dockercompute_compute_1...
Attaching to dockercompute_consul_1, dockercompute_slurmctld_1, dockercompute_compute_1
consul_1    | 2016-03-17 14:20:15,886 CRIT Supervisor running as root (no user in config file)
consul_1    | 2016-03-17 14:20:15,887 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
consul_1    | 2016-03-17 14:20:15,905 INFO RPC interface 'supervisor' initialized
consul_1    | 2016-03-17 14:20:15,905 CRIT Server 'unix_http_server' running without any HTTP authentication checking
consul_1    | 2016-03-17 14:20:15,905 INFO supervisord started with pid 13
consul_1    | 2016-03-17 14:20:16,907 INFO spawned: 'consul' with pid 16
consul_1    | 2016-03-17 14:20:22,447 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
compute_1   | 2016-03-17 14:21:39,568 CRIT Supervisor running as root (no user in config file)
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/slurmd.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/slurm_update.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/munged.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/watchpsutil.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/diamond.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-api.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-client.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-server.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/rsyslog_conf.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/rsyslog.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
compute_1   | 2016-03-17 14:21:39,593 INFO RPC interface 'supervisor' initialized
compute_1   | 2016-03-17 14:21:39,593 CRIT Server 'unix_http_server' running without any HTTP authentication checking
compute_1   | 2016-03-17 14:21:39,593 INFO supervisord started with pid 13
slurmctld_1 | 2016-03-17 14:21:12,438 CRIT Supervisor running as root (no user in config file)
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/scratchsetup.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurmstats.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurmctld.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurm_update.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/munged.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/watchpsutil.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/diamond.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-api.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-client.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-server.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/rsyslog_conf.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/rsyslog.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,463 INFO RPC interface 'supervisor' initialized
slurmctld_1 | 2016-03-17 14:21:12,463 CRIT Server 'unix_http_server' running without any HTTP authentication checking
slurmctld_1 | 2016-03-17 14:21:12,464 INFO supervisord started with pid 13
slurmctld_1 | 2016-03-17 14:21:13,465 INFO spawned: 'diamond' with pid 16
slurmctld_1 | 2016-03-17 14:21:13,467 INFO spawned: 'slurmctld' with pid 17
slurmctld_1 | 2016-03-17 14:21:13,468 INFO spawned: 'slurmstats' with pid 18
slurmctld_1 | 2016-03-17 14:21:13,474 INFO spawned: 'consul' with pid 19
slurmctld_1 | 2016-03-17 14:21:13,478 INFO spawned: 'sratchsetup' with pid 21
slurmctld_1 | 2016-03-17 14:21:13,481 INFO spawned: 'rsyslog-conf' with pid 22
slurmctld_1 | 2016-03-17 14:21:13,487 INFO spawned: 'sensu-api' with pid 23
slurmctld_1 | 2016-03-17 14:21:13,488 INFO spawned: 'sensu-client' with pid 24
slurmctld_1 | 2016-03-17 14:21:13,509 INFO spawned: 'slurm_update' with pid 28
slurmctld_1 | 2016-03-17 14:21:13,515 INFO spawned: 'rsyslog' with pid 32
slurmctld_1 | 2016-03-17 14:21:13,524 INFO spawned: 'munged' with pid 34
slurmctld_1 | 2016-03-17 14:21:13,532 INFO spawned: 'watchpsutil' with pid 40
slurmctld_1 | 2016-03-17 14:21:13,534 INFO spawned: 'sensu-server' with pid 44
slurmctld_1 | 2016-03-17 14:21:13,543 INFO success: diamond entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:13,554 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:13,610 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:13,615 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: sensu-api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: slurm_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: munged entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,529 INFO success: watchpsutil entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,537 INFO exited: diamond (exit status 0; expected)
slurmctld_1 | 2016-03-17 14:21:14,959 INFO spawned: 'slurmctld' with pid 412
slurmctld_1 | 2016-03-17 14:21:14,961 INFO spawned: 'sratchsetup' with pid 413
slurmctld_1 | 2016-03-17 14:21:14,962 INFO spawned: 'sensu-server' with pid 414
slurmctld_1 | 2016-03-17 14:21:15,030 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:15,057 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:15,100 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:15,545 INFO exited: sensu-api (exit status 0; expected)
slurmctld_1 | 2016-03-17 14:21:17,847 INFO spawned: 'slurmctld' with pid 464
slurmctld_1 | 2016-03-17 14:21:17,876 INFO spawned: 'sratchsetup' with pid 465
slurmctld_1 | 2016-03-17 14:21:17,878 INFO spawned: 'sensu-server' with pid 466
slurmctld_1 | 2016-03-17 14:21:17,894 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:17,905 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:17,907 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:18,543 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:20,983 INFO spawned: 'slurmctld' with pid 520
slurmctld_1 | 2016-03-17 14:21:20,984 INFO spawned: 'sratchsetup' with pid 523
slurmctld_1 | 2016-03-17 14:21:20,986 INFO spawned: 'sensu-server' with pid 524
slurmctld_1 | 2016-03-17 14:21:21,024 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:21,026 INFO gave up: sratchsetup entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:21,030 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:21,042 INFO gave up: sensu-server entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:21,047 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:25,615 INFO spawned: 'slurmctld' with pid 594
slurmctld_1 | 2016-03-17 14:21:25,665 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:28,545 INFO success: slurmstats entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:28,545 INFO success: rsyslog-conf entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:31,479 INFO spawned: 'slurmctld' with pid 682
slurmctld_1 | 2016-03-17 14:21:31,520 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:37,581 INFO spawned: 'slurmctld' with pid 770
slurmctld_1 | 2016-03-17 14:21:37,615 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:40,596 INFO spawned: 'diamond' with pid 16
compute_1   | 2016-03-17 14:21:40,597 INFO spawned: 'consul' with pid 17
compute_1   | 2016-03-17 14:21:40,599 INFO spawned: 'rsyslog-conf' with pid 18
compute_1   | 2016-03-17 14:21:40,601 INFO spawned: 'sensu-api' with pid 19
compute_1   | 2016-03-17 14:21:40,603 INFO spawned: 'sensu-client' with pid 20
compute_1   | 2016-03-17 14:21:40,604 INFO spawned: 'slurm_update' with pid 21
compute_1   | 2016-03-17 14:21:40,614 INFO spawned: 'rsyslog' with pid 23
compute_1   | 2016-03-17 14:21:40,624 INFO spawned: 'slurmd' with pid 32
compute_1   | 2016-03-17 14:21:40,631 INFO spawned: 'munged' with pid 36
compute_1   | 2016-03-17 14:21:40,636 INFO spawned: 'watchpsutil' with pid 40
compute_1   | 2016-03-17 14:21:40,638 INFO spawned: 'sensu-server' with pid 44
compute_1   | 2016-03-17 14:21:40,639 INFO success: diamond entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
compute_1   | 2016-03-17 14:21:40,700 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:41,627 INFO success: sensu-api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,627 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,627 INFO success: slurm_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,628 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,628 INFO success: slurmd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,630 INFO success: munged entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,631 INFO exited: diamond (exit status 0; expected)
compute_1   | 2016-03-17 14:21:41,727 INFO success: watchpsutil entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,728 INFO spawned: 'sensu-server' with pid 693
compute_1   | 2016-03-17 14:21:41,742 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:42,656 INFO exited: sensu-api (exit status 0; expected)
compute_1   | 2016-03-17 14:21:43,812 INFO spawned: 'sensu-server' with pid 748
compute_1   | 2016-03-17 14:21:43,824 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:45,050 INFO spawned: 'slurmctld' with pid 866
slurmctld_1 | 2016-03-17 14:21:45,091 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:45,650 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
compute_1   | 2016-03-17 14:21:47,000 INFO spawned: 'sensu-server' with pid 800
compute_1   | 2016-03-17 14:21:47,018 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:47,047 INFO gave up: sensu-server entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:53,551 INFO spawned: 'slurmctld' with pid 981
slurmctld_1 | 2016-03-17 14:21:53,591 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:55,652 INFO success: rsyslog-conf entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:22:03,534 INFO spawned: 'slurmctld' with pid 1119
slurmctld_1 | 2016-03-17 14:22:03,570 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:13,937 INFO spawned: 'slurmctld' with pid 1263
slurmctld_1 | 2016-03-17 14:22:13,978 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:25,045 INFO spawned: 'slurmctld' with pid 1415
slurmctld_1 | 2016-03-17 14:22:25,082 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:37,589 INFO spawned: 'slurmctld' with pid 1593
slurmctld_1 | 2016-03-17 14:22:37,629 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:50,951 INFO spawned: 'slurmctld' with pid 1782
slurmctld_1 | 2016-03-17 14:22:50,985 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:05,760 INFO spawned: 'slurmctld' with pid 1976
slurmctld_1 | 2016-03-17 14:23:05,798 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:16,647 INFO exited: sensu-client (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:23:17,649 INFO spawned: 'sensu-client' with pid 2134
slurmctld_1 | 2016-03-17 14:23:19,084 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:23:21,320 INFO spawned: 'slurmctld' with pid 2173
slurmctld_1 | 2016-03-17 14:23:21,354 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:37,887 INFO spawned: 'slurmctld' with pid 2399
slurmctld_1 | 2016-03-17 14:23:37,922 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:23:43,122 INFO exited: sensu-client (exit status 1; not expected)
compute_1   | 2016-03-17 14:23:43,477 INFO spawned: 'sensu-client' with pid 2395
compute_1   | 2016-03-17 14:23:44,824 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:23:55,099 INFO spawned: 'slurmctld' with pid 2642
slurmctld_1 | 2016-03-17 14:23:55,133 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:24:13,569 INFO spawned: 'slurmctld' with pid 2886
slurmctld_1 | 2016-03-17 14:24:13,604 INFO exited: slurmctld (exit status 0; not expected)
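
To see at a glance which programs supervisord keeps restarting, the log above can be summarized per container and program. The following is a minimal sketch, not part of the original report: summarize_exits is a hypothetical helper name, and it assumes the docker-compose output format shown above, where the program name is the seventh whitespace-separated field of an "exited:" line.

```shell
#!/bin/sh
# Hypothetical helper: count "not expected" exits per container and program.
# Reads docker-compose output (supervisord log lines) from files or stdin.
summarize_exits() {
  awk '
    # Line layout: container | date time INFO exited: program (exit status N; not expected)
    / INFO exited: / && /not expected/ { count[$1 " " $7]++ }
    END { for (k in count) print count[k], k }
  ' "$@" | LC_ALL=C sort -k1,1nr -k2
}

# A few lines copied from the log above.
sample='slurmctld_1 | 2016-03-17 14:21:13,554 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:13,615 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:14,537 INFO exited: diamond (exit status 0; expected)
slurmctld_1 | 2016-03-17 14:21:15,100 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:40,700 INFO exited: sensu-server (exit status 0; not expected)'

printf '%s\n' "$sample" | summarize_exits
# 2 slurmctld_1 slurmctld
# 1 compute_1 sensu-server
# 1 slurmctld_1 sratchsetup
```

Run against the full log, a count for slurmctld that keeps growing while the other programs have settled into RUNNING or FATAL is what distinguishes the loop from ordinary start-up noise.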
@ChristianKniep
Owner

I took the liberty of formatting your quotes. I'll have a look.
This week-end I am on my way to a conference and I have to take care of the slides first.
I hope to get to it at the end of next week. Please remind me if I haven't done so.

Thx for the feed-back! I appreciate it...

EDIT: Could you access the slurmctld instance, run supervisorctl stop slurmctld, and then start /usr/local/sbin/slurmctld -D -v -c by hand? Not sure if this is easy to do in Travis...
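
The check requested above could be scripted roughly as follows. This is a sketch under assumptions: the container name dockercompute_slurmctld_1 is taken from the compose output in the log, debug_slurmctld is a hypothetical helper name, and docker exec is called without -ti so that it also works where no TTY is available.

```shell
#!/bin/sh
# Hypothetical helper: stop the supervised slurmctld, then run it in the
# foreground so that its own messages become visible.
debug_slurmctld() {
  # Container name as created by docker-compose in the log; override via $1.
  container="${1:-dockercompute_slurmctld_1}"
  # Stop the supervised instance so supervisord does not keep respawning it.
  docker exec "$container" supervisorctl stop slurmctld
  # -D: stay in the foreground, -v: verbose, -c: discard previously saved state.
  docker exec "$container" /usr/local/sbin/slurmctld -D -v -c
}
```

Calling debug_slurmctld without arguments targets the default container. Because the daemon stays in the foreground, whatever makes it return immediately should appear in the output of the second command.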

@ChristianKniep
Owner

Hey @ocramz,

I renamed the fig.yml file to docker-compose.yml and fixed Consul environment variables.
The problem was that Consul was not running as a server, which broke the slurm.conf creation.
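
A quick way to spot this condition is to ask Consul whether it has elected a leader, which only happens when at least one agent runs in server mode. The sketch below is an assumption-laden illustration, not the fix applied in the repository: consul_has_leader is a hypothetical helper name, curl is assumed to be installed, and the Consul HTTP API is assumed to be reachable on localhost:8500 (the default port, which depends on how ports are mapped).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if Consul reports a Raft leader.
consul_has_leader() {
  # Address of the Consul HTTP API; override via $1.
  addr="${1:-localhost:8500}"
  # /v1/status/leader returns the leader address as a JSON string.
  leader=$(curl -s "http://${addr}/v1/status/leader")
  # An empty reply or an empty JSON string ("") means there is no leader.
  [ -n "$leader" ] && [ "$leader" != '""' ]
}
```

For example, `consul_has_leader || echo "no Consul leader yet"` could be run before starting the services that generate slurm.conf.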

➜  docker-compute git:(master) docker-compose up -d
Creating dockercompute_consul_1
Creating dockercompute_slurmctld_1
Creating dockercompute_compute_1
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1   idle 2cafd9b00079
odd          up   infinite      1   idle 2cafd9b00079
➜  docker-compute git:(master) docker-compose scale compute=5
Creating and starting 2 ... done
Creating and starting 3 ... done
Creating and starting 4 ... done
Creating and starting 5 ... done
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1   idle 2cafd9b00079
all*         up   infinite      1    unk e3586b65af05
odd          up   infinite      1   idle 2cafd9b00079
odd          up   infinite      1    unk e3586b65af05
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle 0a0c3ede689e,ee478ce106b1
all*         up   infinite      3    unk 2cafd9b00079,6cb0e426299f,e3586b65af05
odd          up   infinite      2   idle 0a0c3ede689e,ee478ce106b1
odd          up   infinite      2    unk 2cafd9b00079,e3586b65af05
even         up   infinite      1    unk 6cb0e426299f
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      5   idle 0a0c3ede689e,2cafd9b00079,6cb0e426299f,e3586b65af05,ee478ce106b1
odd          up   infinite      4   idle 0a0c3ede689e,2cafd9b00079,e3586b65af05,ee478ce106b1
even         up   infinite      1   idle 6cb0e426299f
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 srun -N5 hostname
ee478ce106b1
0a0c3ede689e
6cb0e426299f
e3586b65af05
2cafd9b00079
➜  docker-compute git:(master)
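
In the session above, sinfo is repeated by hand until the new nodes move from unk to idle. In a CI job such as Travis this wait could be automated along the following lines. It is only a sketch: wait_for_idle is a hypothetical helper name, the container name is the one used in the transcript, and the retry count and interval are arbitrary defaults.

```shell
#!/bin/sh
# Hypothetical helper: poll sinfo until every reported node state is "idle".
# Arguments: container name, number of tries, seconds between tries.
wait_for_idle() {
  container="${1:-dockercompute_compute_1}"
  tries="${2:-30}"
  interval="${3:-2}"
  while [ "$tries" -gt 0 ]; do
    # -h: no header line, -o '%t': print only the compact node state.
    states=$(docker exec "$container" sinfo -h -o '%t' | sort -u)
    # Exactly one distinct state, and it is "idle": the cluster is ready.
    [ "$states" = "idle" ] && return 0
    tries=$((tries - 1))
    sleep "$interval"
  done
  return 1
}
```

A job could then run `wait_for_idle && docker exec dockercompute_compute_1 srun -N5 hostname`, so that srun is only attempted once all nodes have registered.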

Please close the issue if it is solved for you as well.
Thx again for the feed-back - I am depending on it!

@ChristianKniep
Owner

I enhanced the README. Could you walk through it and check whether it's consistent? I am a bit biased. :)
