| Summary: | reboot slurm controller node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | RAMYA ERANNA <reranna> |
| Component: | Configuration | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SLAC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
RAMYA ERANNA
2023-10-04 17:25:51 MDT
Just reboot the slurmctld node.

1. Stop slurmctld. Reboot the node.
2. There should be no impact. Jobs and steps run on compute nodes and will continue running. Completed jobs or steps will continue to retry sending their completion messages to the slurmctld until the slurmctld restarts.
3. Jobs and steps are state-saved, so slurmctld will recover the job queue when it restarts. You do not need to do anything.
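Spelled out as commands, that procedure might look like the following on the controller node. This is a minimal sketch assuming a systemd-managed installation; the unit name and the verification step are assumptions, not something specified in this ticket.

```
# Stop the controller cleanly, then reboot the node.
systemctl stop slurmctld
reboot

# After the node comes back up, start the controller again
# (unnecessary if the unit is enabled to start at boot).
systemctl start slurmctld

# The job queue should be recovered from the saved state.
squeue
```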
Hi,

Thank you for confirming that the running/pending jobs in the queue won't be affected by a reboot of the slurmctld node.

I'm worried about the long-running jobs in Slurm, like the ones below:
```
29638980 milano gadget4 tabel R 3-19:43:21 32 sdfmilan[011,013,024,026,033-040,047,053,060-063,069-072,101-102,111,113,204,209,212,214,218,224]
29830664 milano InDfr1 mdimauro R 1-16:05:50 5 sdfmilan[023,025,032,207,231]
29578676 milano InDfr3 mdimauro R 4-06:44:43 5 sdfmilan[019,068,127,130,203]
```
Users will not be able to submit any new jobs during the boot time. Am I right?
Thank you
Ramya
No jobs (long running or not) should be affected. Slurm is designed so that you can restart all daemons (slurmdbd, slurmctld, slurmd, slurmrestd) without affecting jobs.
When slurmctld starts, it just needs to be able to read StateSaveLocation to recover the job queue.
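As a quick illustration, you can confirm where the controller keeps its state like this (the directory and file names below are common defaults, shown as an assumed example):

```
# Ask the running controller where it saves state.
scontrol show config | grep -i StateSaveLocation

# The directory should contain the state files slurmctld reads
# back at startup (job_state, node_state, and so on).
ls /var/spool/slurmctld
```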
> Users will not be able to submit any new jobs during the boot time. Am I right?
Right, because job submission issues an RPC to the slurmctld.
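For illustration, a submission attempted while the controller is down fails with an error along these lines (a hypothetical session; the exact wording varies by Slurm version):

```
$ sbatch job.sh
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
```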
Got it. Thank you for your quick support.

Regards,
Ramya

Hi Team,

We rebooted the slurmctld node. We see the error messages below. Would you please check and suggest?

```
[2023-10-05T12:22:28.774] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:28.775] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:28.775] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:28.775] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:28.775] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:28.775] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:33.776] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:33.776] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:33.776] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:33.776] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:33.776] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:33.776] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:35.008] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.008] _job_complete: JobId=29923964 WEXITSTATUS 0
[2023-10-05T12:22:35.008] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:35.009] _job_complete: JobId=29923964 done
[2023-10-05T12:22:35.013] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.174] _job_complete: JobId=29923933 WEXITSTATUS 0
[2023-10-05T12:22:35.175] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:35.175] _job_complete: JobId=29923933 done
[2023-10-05T12:22:35.195] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.197] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:37.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:37.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:37.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:37.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:37.000] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:37.365] error: slurm_receive_msg [127.0.0.1:59996]: Zero Bytes were transmitted or received
[2023-10-05T12:22:38.778] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:38.778] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:38.778] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:38.778] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:38.779] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:38.779] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:40.482] _slurm_rpc_submit_batch_job: JobId=29924014 InitPrio=8454 usec=528
[2023-10-05T12:22:40.518] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:40.518] _job_complete: JobId=29923963 WEXITSTATUS 0
[2023-10-05T12:22:40.518] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:40.519] _job_complete: JobId=29923963 done
[2023-10-05T12:22:40.523] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:40.691] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:40.691] sched: Allocate JobId=29924014 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:40.691] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:43.059] _slurm_rpc_submit_batch_job: JobId=29924015 InitPrio=8454 usec=691
[2023-10-05T12:22:43.360] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:43.360] sched: _slurm_rpc_allocate_resources JobId=29924016 NodeList=sdfmilan232 usec=925
[2023-10-05T12:22:43.780] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:43.781] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:43.781] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:43.781] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:43.781] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:43.781] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:46.291] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595306 uid 17951
[2023-10-05T12:22:46.291] job_str_signal(3): invalid JobId=29595306
[2023-10-05T12:22:46.291] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595306 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:46.452] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:48.782] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:48.782] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:48.782] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:48.782] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:48.782] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:48.782] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:49.049] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:49.049] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:49.049] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:49.049] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:49.049] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:49.592] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:49.592] _job_complete: JobId=29923965 WEXITSTATUS 0
[2023-10-05T12:22:49.592] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:49.593] _job_complete: JobId=29923965 done
[2023-10-05T12:22:49.597] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:49.765] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:49.765] sched: Allocate JobId=29924015 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:49.765] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.828] _slurm_rpc_submit_batch_job: JobId=29924017 InitPrio=8454 usec=607
[2023-10-05T12:22:51.007] _slurm_rpc_submit_batch_job: JobId=29924018 InitPrio=8454 usec=874
[2023-10-05T12:22:52.535] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595229 uid 17951
[2023-10-05T12:22:52.535] job_str_signal(3): invalid JobId=29595229
[2023-10-05T12:22:52.535] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595229 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.547] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595319 uid 17951
[2023-10-05T12:22:52.547] job_str_signal(3): invalid JobId=29595319
[2023-10-05T12:22:52.547] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595319 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.569] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595248 uid 17951
[2023-10-05T12:22:52.569] job_str_signal(3): invalid JobId=29595248
[2023-10-05T12:22:52.569] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595248 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.586] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595292 uid 17951
[2023-10-05T12:22:52.586] job_str_signal(3): invalid JobId=29595292
[2023-10-05T12:22:52.586] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595292 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.604] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595300 uid 17951
[2023-10-05T12:22:52.604] job_str_signal(3): invalid JobId=29595300
[2023-10-05T12:22:52.604] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595300 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:53.784] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:53.784] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:53.784] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:53.784] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:53.784] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:53.784] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:54.649] _slurm_rpc_submit_batch_job: JobId=29924019 InitPrio=8454 usec=833
[2023-10-05T12:22:55.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:55.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:55.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:55.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:55.001] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:58.562] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:58.562] _job_complete: JobId=29923968 WEXITSTATUS 0
[2023-10-05T12:22:58.562] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:58.563] _job_complete: JobId=29923968 done
[2023-10-05T12:22:58.573] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:58.785] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:58.786] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:58.786] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:22:58.786] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:58.786] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:58.786] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:58.897] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.897] sched: Allocate JobId=29924017 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.897] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] sched: Allocate JobId=29924018 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] sched: Allocate JobId=29924019 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:59.470] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:59.471] _job_complete: JobId=29923966 WEXITSTATUS 0
[2023-10-05T12:22:59.471] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:59.471] _job_complete: JobId=29923966 done
[2023-10-05T12:22:59.479] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:03.787] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:03.788] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:03.788] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:23:03.788] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:03.788] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:23:03.788] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:23:07.381] error: slurm_receive_msg [127.0.0.1:48298]: Zero Bytes were transmitted or received
[2023-10-05T12:23:08.789] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:08.789] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:08.789] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:23:08.789] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:08.789] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:23:08.789] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:23:11.205] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:11.205] _job_complete: JobId=29923967 WEXITSTATUS 0
[2023-10-05T12:23:11.205] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:23:11.205] _job_complete: JobId=29923967 done
[2023-10-05T12:23:11.233] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:13.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:13.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:13.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has authentication error: No error
[2023-10-05T12:23:13.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:13.000] error: Sending PersistInit msg: Protocol authentication error
```

Thanks,
Ramya

Restarting slurmdbd and slurmctld helped to fix the errors.

Thank you,
Ramya

I have been looking over this issue, Ramya, and I have seen this happen only a few other times, at a couple of other sites. In those instances, they too reported that restarting the daemons fixed the issue. We have not been able to duplicate this ourselves, so we are not sure what causes it.

Ramya, have there been any other issues since the slurmctld node reboot?
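Since the errors above complain about an unsupported protocol_version, one low-risk check after an incident like this is to confirm that every daemon reports the same Slurm release. This is a diagnostic sketch only, not a confirmed root cause for what happened here:

```
# On the controller node:
slurmctld -V

# On the database node (sdfslurmdb in the logs above):
slurmdbd -V

# Client commands should report a matching version:
scontrol --version
```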
Hi,

One more issue which we observed: some jobs went into a runaway state. May I know why these jobs went into a runaway state?

```
[reranna@sdfmgr002 ~]$ sacctmgr show RunAwayJobs
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
ID Name Partition Cluster State TimeSubmit TimeStart TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
29904783 out ampere s3df RUNNING 2023-10-05T04:10:40 2023-10-05T04:10:48 Unknown
29922701 run.sh roma s3df RUNNING 2023-10-05T10:40:29 2023-10-05T10:40:30 Unknown
29923252 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:51 2023-10-05T11:25:54 Unknown
29923253 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923254 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923255 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923256 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923257 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923258 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:58 Unknown
29923260 glide_ery+ milano s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923261 glide_ery+ milano s3df RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54 Unknown
29923263 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:25:58 Unknown
29923264 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:25:58 Unknown
29923265 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00 Unknown
29923266 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00 Unknown
29923267 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00 Unknown
29923268 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:02 Unknown
29923269 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:02 Unknown
29923270 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:54 2023-10-05T11:26:02 Unknown
29923271 glide_ery+ roma s3df RUNNING 2023-10-05T11:25:54 2023-10-05T11:26:06 Unknown
29923784 himem roma s3df RUNNING 2023-10-05T11:58:54 2023-10-05T11:58:54 Unknown
29923785 usdf_medi+ roma s3df RUNNING 2023-10-05T11:59:03 2023-10-05T11:59:04 Unknown
29923786 medium roma s3df RUNNING 2023-10-05T11:59:08 2023-10-05T11:59:11 Unknown
29923787 usdf_rubin roma s3df RUNNING 2023-10-05T11:59:08 2023-10-05T11:59:11 Unknown
29923788 usdf_test roma s3df RUNNING 2023-10-05T11:59:10 2023-10-05T11:59:11 Unknown
29923789 test milano s3df RUNNING 2023-10-05T11:59:24 2023-10-05T11:59:24 Unknown
29923790 usdf_himem roma s3df RUNNING 2023-10-05T11:59:29 2023-10-05T11:59:33 Unknown
29923791 usdf_medi+ roma s3df RUNNING 2023-10-05T11:59:34 2023-10-05T11:59:34 Unknown
29923792 rubin roma s3df RUNNING 2023-10-05T11:59:38 2023-10-05T11:59:39 Unknown
29923793 himem roma s3df RUNNING 2023-10-05T11:59:54 2023-10-05T11:59:54 Unknown
29923794 usdf_rubin roma s3df RUNNING 2023-10-05T12:00:09 2023-10-05T12:00:09 Unknown
29923795 usdf_test roma s3df RUNNING 2023-10-05T12:00:09 2023-10-05T12:00:13 Unknown
29923796 medium roma s3df RUNNING 2023-10-05T12:00:10 2023-10-05T12:00:13 Unknown
29923797 test milano s3df RUNNING 2023-10-05T12:00:24 2023-10-05T12:00:26 Unknown
29923798 usdf_himem roma s3df RUNNING 2023-10-05T12:00:30 2023-10-05T12:00:32 Unknown
29923799 usdf_medi+ roma s3df RUNNING 2023-10-05T12:00:34 2023-10-05T12:00:39 Unknown
29923800 interacti+ milano s3df RUNNING 2023-10-05T12:00:35 2023-10-05T12:00:35 Unknown
29923801 rubin roma s3df RUNNING 2023-10-05T12:00:38 2023-10-05T12:00:39 Unknown
29923802 himem roma s3df RUNNING 2023-10-05T12:00:54 2023-10-05T12:00:54 Unknown
Would you like to fix these runaway jobs? (This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed. Once corrected, this will trigger the rollup to reroll usage from before the earliest submit time of all the runaway jobs.) (You have 30 seconds to decide) (N/y): y
[reranna@sdfmgr002 ~]$
```

Thanks,
Ramya

Runaway jobs are defined as jobs that are not pending in the database, but do not exist in the controller. This can happen when job-complete messages do not make it to the database. I can only guess as to what caused that to happen here. It might happen because the slurmctld filled up its cache (check your slurmctld log for "RESTART SLURMDBD NOW" messages), or because of some network or filesystem issue. It could also happen if the slurmctld did not recover those jobs when it restarted. Or possibly some other way.

Hi,

Thank you for your quick help. Please close the ticket.

Thank you,
Ramya

Closing as infogiven per comment 12. Just to clarify comment 11:

> Runaway jobs are defined as jobs that are not pending in the database, but do not exist in the controller.

I forgot to add "and do not have an end time." So the complete definition is: Runaway jobs are defined as jobs that are not pending in the database and do not have an end time, but do not exist in the controller.
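For future occurrences, the checks mentioned in the comments above can be run roughly as follows. The log path is an assumption; substitute your configured SlurmctldLogFile.

```
# List runaway jobs without fixing anything (answer N at the prompt);
# rerun and answer y to let sacctmgr repair them, as shown earlier.
sacctmgr show runawayjobs

# Check whether the controller's agent queue to slurmdbd filled up,
# one way job-completion records can fail to reach the database.
grep "RESTART SLURMDBD NOW" /var/log/slurmctld.log
```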