Ticket 11855

Summary: Jobs left in PD state with reason BeginTime
Product: Slurm Reporter: Roy <proutyr1>
Component: Scheduling Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart, champ, nate, proutyr1, randy
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: UMBC
Attachments: slurm.conf file
slurmd log from cnode011
slurmd log from cnode111
slurmctld log from head node.

Description Roy 2021-06-17 07:31:00 MDT
Created attachment 19991 [details]
slurm.conf file

Hi,

Running 19.05.8 -- there are plans to upgrade in the upcoming weeks.

In the meantime, we're seeing that all new jobs submitted to our system drop immediately into a Pending state with Reason=(BeginTime). 

I see in the slurmctld log a few curious entries:

Many requeues:
[2021-06-17T09:21:33.201] Requeuing JobId=3977254

Many failed prolog runs:
[2021-06-17T09:21:33.204] error: prolog_slurmctld JobId=3977980 prolog exit status 99:0

Many attempted allocations that still result in PD state:
[2021-06-17T09:21:32.791] sched: Allocate JobId=3977407 NodeList=cnode038 #CPUs=1 Partition=high_mem

Many issues with RPC counts:
[2021-06-17T09:21:30.716] sched: 233 pending RPCs at cycle end, consider configuring max_rpc_cnt
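
The max_rpc_cnt hint in that last log line refers to a SchedulerParameters option in slurm.conf; a hypothetical fragment (the value 150 is illustrative, not a tuned recommendation for this site):

```
# slurm.conf fragment (illustrative): defer scheduling cycles while the
# controller has more than this many pending RPCs.
SchedulerParameters=max_rpc_cnt=150
```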

I've attached our slurm.conf; please note we run Slurm alongside Bright Computing. I will attach the logs in a follow-up message.

I have looked through some other bug reports on this issue and have not been able to find a solution that applies to 19.05.8. I have tried holding and releasing jobs, as well as pushing back the start time of jobs; in both cases test jobs end up Pending with BeginTime as the reason.

We do run with a multifactor priority scheme using age. I see hints that this could be related.

Currently, no users can run new jobs on the system. We do still have some jobs running from before whatever hiccough started this.
Comment 1 Nate Rini 2021-06-17 08:58:44 MDT
Please attach the slurmctld log and at least a few slurmd logs.

Please try running trivial job:
> srun -vvv /usr/bin/uptime
Comment 3 Roy 2021-06-17 09:04:46 MDT
Here is the output:

[~]$ srun -vvv --partition=batch --account=test_conp --time=1 /usr/bin/uptime
srun: defined options
srun: -------------------- --------------------
srun: account             : test_conp
srun: partition           : batch
srun: time                : 00:01:00
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=26000
srun: debug:  propagating RLIMIT_NOFILE=131072
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0007
srun: debug2: srun PMI messages to port=38091
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 38666
srun: debug:  Entering _msg_thr_internal
srun: debug:  Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 7650 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug2: eio_message_socket_accept: got message connection from 10.2.15.254:61406 7
srun: PrologSlurmctld failed, job killed
srun: debug2: eio_message_socket_accept: got message connection from 10.2.15.254:61408 7
srun: Force Terminated job 3978159
srun: error: Job allocation 3978159 has been revoked
[~]$

-----

[root log]# grep 3978159 /var/log/slurmctld
[2021-06-17T11:02:45.141] sched: _slurm_rpc_allocate_resources JobId=3978159 NodeList=cnode102 usec=1171
[2021-06-17T11:02:45.324] error: prolog_slurmctld JobId=3978159 prolog exit status 99:0
[2021-06-17T11:02:45.324] unable to requeue JobId=3978159: Only batch jobs are accepted or processed
[root log]#

Logs en route
Comment 4 Roy 2021-06-17 09:08:51 MDT
Created attachment 19992 [details]
slurmd log from cnode011
Comment 5 Roy 2021-06-17 09:09:29 MDT
Created attachment 19993 [details]
slurmd log from cnode111
Comment 6 Roy 2021-06-17 09:09:38 MDT
Created attachment 19994 [details]
slurmctld log from head node.
Comment 7 Nate Rini 2021-06-17 09:09:46 MDT
(In reply to Roy from comment #3)
> srun: PrologSlurmctld failed, job killed
>
> PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob

Looks like the Bright provided prolog script is failing (and causing all the jobs to re-queue).
Comment 8 Nate Rini 2021-06-17 09:11:04 MDT
(In reply to Nate Rini from comment #7)
> (In reply to Roy from comment #3)
> > srun: PrologSlurmctld failed, job killed
> >
> > PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
> 
> Looks like the Bright provided prolog script is failing (and causing all the
> jobs to re-queue).

Please comment out the PrologSlurmctld line in the slurm.conf and restart all of the Slurm daemons to see if jobs work without the script.
Comment 9 Roy 2021-06-17 09:18:57 MDT
Hi Nate, 

That did the trick, the previously PD jobs are now running. 

Should I move this issue to Bright to determine the cause or are there other diagnostics we could address here?
Comment 10 Nate Rini 2021-06-17 09:20:25 MDT
(In reply to Roy from comment #9)
> That did the trick, the previously PD jobs are now running. 
> 
> Should I move this issue to Bright to determine the cause or are there other
> diagnostics we could address here?

That will likely be the most effective route. Slurm honors the result of the prolog script and will requeue any job when it fails.
Comment 11 Roy 2021-06-17 09:21:47 MDT
Alright. Thank you for your help with this.
Comment 12 Nate Rini 2021-06-17 09:27:48 MDT
Roy,

I'm going to close this ticket per your response. Please respond if you have any more questions.

We also generally advise against running with this config:
> DebugFlags=NO_CONF_HASH

This could result in unexpected issues if one of the nodes gets out of sync with the slurm.conf on the controller.
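
In slurm.conf that means dropping the flag, e.g. (a fragment; any other DebugFlags in use at the site would be kept):

```
# Before:
#   DebugFlags=NO_CONF_HASH
# After: remove the line (or just the NO_CONF_HASH flag) so slurmctld
# again warns when a node registers with a mismatched slurm.conf hash.
```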

Thanks,
--Nate