Ticket 17834

Summary: reboot slurm controller node
Product: Slurm    Reporter: RAMYA ERANNA <reranna>
Component: Configuration    Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: SLAC Slinky Site: ---

Description RAMYA ERANNA 2023-10-04 17:25:51 MDT
Hi Team,

We have to reboot the Slurm controller node for an emergency maintenance activity. Could you please advise on the following:

1. What are the steps to reboot a Slurm controller node (slurmctld)?
2. What is the impact on running/pending jobs?
3. What preventive steps should we take so that running/pending jobs stay in the queue after the reboot?

Thank you
Ramya
Comment 1 Marshall Garey 2023-10-04 17:37:43 MDT
Just reboot the slurmctld node.

1. Stop slurmctld, then reboot the node.
2. There should be no impact. Jobs and steps run on compute nodes and will continue running. Completed jobs or steps will retry sending their completion messages to the slurmctld until it restarts.
3. Job and step state is saved, so slurmctld will recover the job queue when it restarts. You do not need to do anything.
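The steps above can be sketched as follows for a systemd-managed controller node; the unit names and use of systemd are assumptions and may differ at your site:

```shell
# Hedged sketch, assuming a systemd unit named slurmctld.
# 1. Stop the controller daemon cleanly so it writes out its state.
sudo systemctl stop slurmctld

# 2. Reboot the node.
sudo reboot

# 3. After the node is back up, start the daemon (if it is not enabled
#    at boot) and confirm the controller is responding.
sudo systemctl start slurmctld
scontrol ping     # reports whether the primary controller is UP
squeue            # the job queue should be recovered from StateSaveLocation
```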
Comment 2 RAMYA ERANNA 2023-10-04 18:03:02 MDT
Hi,

Thank you for confirming that the running/pending jobs in the queue won't be affected by the slurmctld reboot.
I'm worried about the long-running jobs in Slurm, like the ones below:

          29638980    milano  gadget4    tabel  R 3-19:43:21     32 sdfmilan[011,013,024,026,033-040,047,053,060-063,069-072,101-102,111,113,204,209,212,214,218,224]
          29830664    milano   InDfr1 mdimauro  R 1-16:05:50      5 sdfmilan[023,025,032,207,231]
          29578676    milano   InDfr3 mdimauro  R 4-06:44:43      5 sdfmilan[019,068,127,130,203]

 
Users will not be able to submit any new jobs while the node is rebooting. Is that right?

Thank you
Ramya
Comment 3 Marshall Garey 2023-10-04 18:09:52 MDT
No jobs (long running or not) should be affected. Slurm is designed so that you can restart all daemons (slurmdbd, slurmctld, slurmd, slurmrestd) without affecting jobs.

When slurmctld starts, it just needs to be able to read StateSaveLocation to recover the job queue.

> Users will not be able to submit any new jobs during the boot time. Am I right ?
Right, because job submission issues an RPC to the slurmctld.
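One way to confirm which state directory the controller will read at startup is to query the running configuration; the exact path is site-specific, so this is just a sketch:

```shell
# Show the directory slurmctld uses to save and recover job state.
scontrol show config | grep -i StateSaveLocation

# The saved state files live in that directory and must be readable by
# the SlurmUser account when slurmctld restarts.
```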
Comment 4 RAMYA ERANNA 2023-10-04 18:16:05 MDT
Got it.
Thank you for your quick support.

Regards,
Ramya
Comment 5 RAMYA ERANNA 2023-10-05 13:23:44 MDT
Hi Team,

We rebooted the slurmctld node and are now seeing the error messages below. Could you please review and advise?

[2023-10-05T12:22:28.774] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:28.775] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:28.775] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:28.775] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:28.775] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:28.775] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:33.776] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:33.776] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:33.776] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:33.776] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:33.776] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:33.776] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:35.008] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.008] _job_complete: JobId=29923964 WEXITSTATUS 0
[2023-10-05T12:22:35.008] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:35.009] _job_complete: JobId=29923964 done
[2023-10-05T12:22:35.013] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.174] _job_complete: JobId=29923933 WEXITSTATUS 0
[2023-10-05T12:22:35.175] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:35.175] _job_complete: JobId=29923933 done
[2023-10-05T12:22:35.195] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:35.197] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:37.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:37.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:37.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:37.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:37.000] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:37.365] error: slurm_receive_msg [127.0.0.1:59996]: Zero Bytes were transmitted or received
[2023-10-05T12:22:38.778] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:38.778] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:38.778] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:38.778] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:38.779] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:38.779] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:40.482] _slurm_rpc_submit_batch_job: JobId=29924014 InitPrio=8454 usec=528
[2023-10-05T12:22:40.518] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:40.518] _job_complete: JobId=29923963 WEXITSTATUS 0
[2023-10-05T12:22:40.518] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:40.519] _job_complete: JobId=29923963 done
[2023-10-05T12:22:40.523] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:40.691] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:40.691] sched: Allocate JobId=29924014 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:40.691] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:43.059] _slurm_rpc_submit_batch_job: JobId=29924015 InitPrio=8454 usec=691
[2023-10-05T12:22:43.360] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:43.360] sched: _slurm_rpc_allocate_resources JobId=29924016 NodeList=sdfmilan232 usec=925
[2023-10-05T12:22:43.780] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:43.781] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:43.781] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:43.781] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:43.781] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:43.781] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:46.291] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595306 uid 17951
[2023-10-05T12:22:46.291] job_str_signal(3): invalid JobId=29595306
[2023-10-05T12:22:46.291] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595306 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:46.452] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:48.782] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:48.782] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:48.782] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:48.782] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:48.782] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:48.782] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:49.049] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:49.049] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:49.049] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:49.049] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:49.049] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:49.592] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:49.592] _job_complete: JobId=29923965 WEXITSTATUS 0
[2023-10-05T12:22:49.592] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:49.593] _job_complete: JobId=29923965 done
[2023-10-05T12:22:49.597] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:49.765] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:49.765] sched: Allocate JobId=29924015 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:49.765] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.049] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.050] error: slurmdbd: Invalid message version=6500, type:1425
[2023-10-05T12:22:50.828] _slurm_rpc_submit_batch_job: JobId=29924017 InitPrio=8454 usec=607
[2023-10-05T12:22:51.007] _slurm_rpc_submit_batch_job: JobId=29924018 InitPrio=8454 usec=874
[2023-10-05T12:22:52.535] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595229 uid 17951
[2023-10-05T12:22:52.535] job_str_signal(3): invalid JobId=29595229
[2023-10-05T12:22:52.535] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595229 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.547] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595319 uid 17951
[2023-10-05T12:22:52.547] job_str_signal(3): invalid JobId=29595319
[2023-10-05T12:22:52.547] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595319 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.569] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595248 uid 17951
[2023-10-05T12:22:52.569] job_str_signal(3): invalid JobId=29595248
[2023-10-05T12:22:52.569] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595248 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.586] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595292 uid 17951
[2023-10-05T12:22:52.586] job_str_signal(3): invalid JobId=29595292
[2023-10-05T12:22:52.586] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595292 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:52.604] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=29595300 uid 17951
[2023-10-05T12:22:52.604] job_str_signal(3): invalid JobId=29595300
[2023-10-05T12:22:52.604] _slurm_rpc_kill_job: job_str_signal() uid=17951 JobId=29595300 sig=9 returned: Invalid job id specified
[2023-10-05T12:22:53.784] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:53.784] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:53.784] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:53.784] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:53.784] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:53.784] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:54.649] _slurm_rpc_submit_batch_job: JobId=29924019 InitPrio=8454 usec=833
[2023-10-05T12:22:55.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:55.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:55.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:55.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:55.001] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:58.562] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:58.562] _job_complete: JobId=29923968 WEXITSTATUS 0
[2023-10-05T12:22:58.562] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:58.563] _job_complete: JobId=29923968 done
[2023-10-05T12:22:58.573] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:58.785] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:22:58.786] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:22:58.786] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:22:58.786] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:22:58.786] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:22:58.786] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:22:58.897] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.897] sched: Allocate JobId=29924017 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.897] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] sched: Allocate JobId=29924018 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:58.898] sched: Allocate JobId=29924019 NodeList=sdfrome038 #CPUs=1 Partition=roma
[2023-10-05T12:22:58.898] error: slurmdbd: Invalid message version=6500, type:1442
[2023-10-05T12:22:59.470] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:22:59.471] _job_complete: JobId=29923966 WEXITSTATUS 0
[2023-10-05T12:22:59.471] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:22:59.471] _job_complete: JobId=29923966 done
[2023-10-05T12:22:59.479] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:03.787] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:03.788] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:03.788] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:23:03.788] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:03.788] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:23:03.788] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:23:07.381] error: slurm_receive_msg [127.0.0.1:48298]: Zero Bytes were transmitted or received
[2023-10-05T12:23:08.789] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:08.789] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:08.789] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:23:08.789] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:08.789] error: Sending PersistInit msg: Protocol authentication error
[2023-10-05T12:23:08.789] error: DBD_SEND_MULT_JOB_START failure: Protocol authentication error
[2023-10-05T12:23:11.205] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:11.205] _job_complete: JobId=29923967 WEXITSTATUS 0
[2023-10-05T12:23:11.205] error: slurmdbd: Invalid message version=6500, type:1424
[2023-10-05T12:23:11.205] _job_complete: JobId=29923967 done
[2023-10-05T12:23:11.233] error: slurmdbd: Invalid message version=6500, type:1441
[2023-10-05T12:23:13.000] error: pack_msg: Invalid message version=6500, type:6500
[2023-10-05T12:23:13.000] error: auth_g_pack: protocol_version 6500 not supported
[2023-10-05T12:23:13.000] error: slurm_send_node_msg: auth_g_pack: REQUEST_PERSIST_INIT has  authentication error: No error
[2023-10-05T12:23:13.000] error: slurm_persist_conn_open: failed to send persistent connection init message to sdfslurmdb:6819
[2023-10-05T12:23:13.000] error: Sending PersistInit msg: Protocol authentication error


Thanks
Ramya
Comment 6 RAMYA ERANNA 2023-10-05 13:36:32 MDT
Restarting slurmdbd and slurmctld fixed the errors.

Thank you
Ramya
Comment 7 Jason Booth 2023-10-05 14:20:30 MDT
I have been looking over this issue, Ramya, and I have seen this happen only a few times before, at a couple of other sites. In those instances they also reported that restarting the daemons fixed the issue. We have not been able to reproduce this ourselves, so we are not sure what causes it.
Comment 9 Marshall Garey 2023-10-11 12:47:59 MDT
Ramya,

Have there been any other issues since the slurmctld node reboot?
Comment 10 RAMYA ERANNA 2023-10-11 12:59:22 MDT
Hi,

One more issue we observed: some jobs went into a runaway state. Could you explain why these jobs became runaway?

[reranna@sdfmgr002 ~]$ sacctmgr show RunAwayJobs
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
ID                 Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ------------------- 
29904783            out     ampere       s3df    RUNNING 2023-10-05T04:10:40 2023-10-05T04:10:48             Unknown 
29922701         run.sh       roma       s3df    RUNNING 2023-10-05T10:40:29 2023-10-05T10:40:30             Unknown 
29923252     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:51 2023-10-05T11:25:54             Unknown 
29923253     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923254     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923255     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923256     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923257     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923258     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:58             Unknown 
29923260     glide_ery+     milano       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923261     glide_ery+     milano       s3df    RUNNING 2023-10-05T11:25:52 2023-10-05T11:25:54             Unknown 
29923263     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:25:58             Unknown 
29923264     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:25:58             Unknown 
29923265     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00             Unknown 
29923266     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00             Unknown 
29923267     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:00             Unknown 
29923268     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:02             Unknown 
29923269     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:53 2023-10-05T11:26:02             Unknown 
29923270     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:54 2023-10-05T11:26:02             Unknown 
29923271     glide_ery+       roma       s3df    RUNNING 2023-10-05T11:25:54 2023-10-05T11:26:06             Unknown 
29923784          himem       roma       s3df    RUNNING 2023-10-05T11:58:54 2023-10-05T11:58:54             Unknown 
29923785     usdf_medi+       roma       s3df    RUNNING 2023-10-05T11:59:03 2023-10-05T11:59:04             Unknown 
29923786         medium       roma       s3df    RUNNING 2023-10-05T11:59:08 2023-10-05T11:59:11             Unknown 
29923787     usdf_rubin       roma       s3df    RUNNING 2023-10-05T11:59:08 2023-10-05T11:59:11             Unknown 
29923788      usdf_test       roma       s3df    RUNNING 2023-10-05T11:59:10 2023-10-05T11:59:11             Unknown 
29923789           test     milano       s3df    RUNNING 2023-10-05T11:59:24 2023-10-05T11:59:24             Unknown 
29923790     usdf_himem       roma       s3df    RUNNING 2023-10-05T11:59:29 2023-10-05T11:59:33             Unknown 
29923791     usdf_medi+       roma       s3df    RUNNING 2023-10-05T11:59:34 2023-10-05T11:59:34             Unknown 
29923792          rubin       roma       s3df    RUNNING 2023-10-05T11:59:38 2023-10-05T11:59:39             Unknown 
29923793          himem       roma       s3df    RUNNING 2023-10-05T11:59:54 2023-10-05T11:59:54             Unknown 
29923794     usdf_rubin       roma       s3df    RUNNING 2023-10-05T12:00:09 2023-10-05T12:00:09             Unknown 
29923795      usdf_test       roma       s3df    RUNNING 2023-10-05T12:00:09 2023-10-05T12:00:13             Unknown 
29923796         medium       roma       s3df    RUNNING 2023-10-05T12:00:10 2023-10-05T12:00:13             Unknown 
29923797           test     milano       s3df    RUNNING 2023-10-05T12:00:24 2023-10-05T12:00:26             Unknown 
29923798     usdf_himem       roma       s3df    RUNNING 2023-10-05T12:00:30 2023-10-05T12:00:32             Unknown 
29923799     usdf_medi+       roma       s3df    RUNNING 2023-10-05T12:00:34 2023-10-05T12:00:39             Unknown 
29923800     interacti+     milano       s3df    RUNNING 2023-10-05T12:00:35 2023-10-05T12:00:35             Unknown 
29923801          rubin       roma       s3df    RUNNING 2023-10-05T12:00:38 2023-10-05T12:00:39             Unknown 
29923802          himem       roma       s3df    RUNNING 2023-10-05T12:00:54 2023-10-05T12:00:54             Unknown 

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the earliest submit time of all the runaway jobs.)

 (You have 30 seconds to decide)
(N/y): y
[reranna@sdfmgr002 ~]$
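For reference, the cleanup above can be scripted without waiting on the interactive prompt. The `-i` (immediate) option is a standard sacctmgr flag that answers confirmation prompts automatically, but verify this behavior on your Slurm version before relying on it:

```shell
# List runaway jobs (interactive: prompts before fixing anything):
sacctmgr show runawayjobs

# Fix them without the confirmation prompt (-i = immediate):
sacctmgr -i show runawayjobs
```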
Thanks
Ramya
Comment 11 Marshall Garey 2023-10-13 10:45:31 MDT
Runaway jobs are defined as jobs that are not pending in the database, but do not exist in the controller. This can happen when job complete messages do not make it to the database. I can only guess as to what caused that to happen. That might happen due to slurmctld filling up its cache (check your slurmctld log for "RESTART SLURMDBD NOW" messages), or due to some network or filesystem issue. It could happen if the slurmctld did not recover those jobs when it restarted. Or possibly some other way.
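The log check suggested above can be done with a quick grep; the log path is an assumption, so look up the actual `SlurmctldLogFile` value for your site first:

```shell
# Find the controller log path for this site:
scontrol show config | grep -i SlurmctldLogFile

# Then search it for the cache-full warning mentioned above
# (path below is a hypothetical example):
grep -n "RESTART SLURMDBD NOW" /var/log/slurm/slurmctld.log
```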
Comment 12 RAMYA ERANNA 2023-10-19 14:30:07 MDT
Hi,

Thank you for your quick help. Please close the ticket


Thank you
Ramya
Comment 13 Marshall Garey 2023-10-19 14:41:42 MDT
Closing as infogiven per comment 12.
Comment 14 Marshall Garey 2023-10-19 14:42:51 MDT
Just to clarify comment 11:

> Runaway jobs are defined as jobs that are not pending in the database, but do not exist in the controller.

I forgot to add "and do not have an end time."
So the complete definition is:

Runaway jobs are defined as jobs that are not pending in the database and do not have an end time, but do not exist in the controller.