Ticket 5416

Summary: nodes failing to launch batch job
Product: Slurm Reporter: Michael Gutteridge <mrg>
Component: Other    Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: FHCRC - Fred Hutchinson Cancer Research Center Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Current slurm.conf 12 Jul 18
log snippet round job 66222894
controller log snippet around job 66222894
getgrouplist c prog

Description Michael Gutteridge 2018-07-12 13:33:35 MDT
Created attachment 7289 [details]
Current slurm.conf 12 Jul 18

We seem to be having frequent, but irregular, problems with jobs failing without any clear indication of what's happened.  These jobs also appear to get requeued even though "JobRequeue" is set to 0.

We haven't been able to deterministically reproduce the problem: it seems to be largely random and has occurred on most nodes throughout the cluster.  There has been only one significant change that may be at fault: we've recently switched LDAP servers, and these problems do seem to coincide with that change.  We're struggling to see the connection between the two, but I'm not ruling anything out.

What seems to happen is:

 - a job is allocated to a node.
 - roughly 90 seconds later, the controller notes "_job_complete", usually with "WEXITSTATUS 0" (though this is not consistent).
 - there is (sometimes) a socket timeout (send/recv) in the controller log around that same time
 - the controller seems to attempt requeue, even when the job doesn't request it or specifies "--no-requeue"
 - The controller then indicates "Node node1234 now responding".  "node1234" was the node assigned to this job.
 - The daemon notes "unable to confirm batch launch" and  "not able to requeue, cancelling"
 - Concurrent with this, the controller logs a requeue request, which seems to be denied ("requested operation is disabled")
 - The controller then appears to requeue the job after an additional "message send" failure to the allocated node.
 - the job is then queued in a pending state, but never seems to be allocated a node.  If the job is cancelled, the job then transitions to a "node_fail" state.

In the controller logs, for job 66222894:

37936:[2018-07-11T14:06:34.272] sched: Allocate JobID=66222876_18(66222894) NodeList=gizmof94 #CPUs=4 Partition=campus
38042:[2018-07-11T14:07:29.745] _job_complete: JobID=66222876_18(66222894) State=0x1 NodeCnt=1 WEXITSTATUS 0
38043:[2018-07-11T14:07:29.746] _job_complete: requeue JobID=66222876_18(66222894) State=0x8000 NodeCnt=1 per user/system request
38044:[2018-07-11T14:07:29.748] _job_complete: JobID=66222876_18(66222894) State=0x8000 NodeCnt=1 done
38182:[2018-07-11T14:07:43.041] _slurm_rpc_requeue: 66222894: Requested operation is presently disabled
38183:[2018-07-11T14:07:43.045] error: Slurmd error running JobId=66222894 on node(s)=gizmof94: Message send failure
38184:[2018-07-11T14:07:43.117] Requeuing JobID=66222876_18(66222894) State=0x0 NodeCnt=0

On the daemon:

173199:[2018-07-11T14:07:43.035] error: Could not confirm batch launch for job 66222894, aborting request
173200:[2018-07-11T14:07:43.039] Could not launch job 66222894 and not able to requeue it, cancelling job

I've attached log snippets for a job that was impacted.  I confirmed with the job owner that he hadn't used any requeue option.  An earlier instance of this problem occurred with another user who had explicitly requested "--no-requeue".

The root cause would appear to be an unresponsive slurmd, even though the daemon appears to be otherwise fine.  I'm still investigating what might be going wrong there.  While we've seen no other problems suggesting LDAP isn't working properly, is there something inside the daemon that would be impacted by problems with an LDAP server?  We are currently seeing hundreds of instances in the controller log indicating that nodes are not responding, though nodes never get marked down (our slurmd timeout is 2000s, which may be why).  There appear to be no other changes or issues on the underlying network, and no problems on the daemons or controller.

The requeues are concerning as well- at first guess, it looks like the controller correctly declines to requeue the job, but something then changes its mind.  Has this been noted by anyone else?  I looked at updates, but didn't see anything relevant.

I'll attach configs and logs.

Thanks

Michael
Comment 1 Michael Gutteridge 2018-07-12 13:34:15 MDT
Created attachment 7290 [details]
log snippet round job 66222894
Comment 2 Michael Gutteridge 2018-07-12 13:34:46 MDT
Created attachment 7291 [details]
controller log snippet around job 66222894
Comment 3 Brian Christiansen 2018-07-12 16:43:13 MDT
Created attachment 7293 [details]
getgrouplist c prog

Hey Michael,

I believe I have an idea of where and what's happening. I'm able to reproduce the situation by placing a sleep() in the launch batch job code in the slurmd.

What I think is happening:
1. slurmctld allocates nodes for job
2. slurmctld sends launch batch job request to the slurmd
3. slurmd receives the message
4. the slurmd attempts to get the user's groups.

e.g.
	/* lookup gids if they weren't sent by slurmctld */
	if (!req->ngids)
		req->ngids = group_cache_lookup(req->uid, req->gid,
						req->user_name, &req->gids);

5. group_cache_lookup() -> _group_cache_lookup_internal()
6. _group_cache_lookup_internal() will get the user's groups if it hasn't seen the user before OR if the entries are older than GroupUpdateTime (default: 600 seconds).

	/* Cache lookup failed or entry value was too old, fetch new
	 * value and insert it into cache.  */
	while (getgrouplist(entry->username, entry->gid,
			    entry->gids, &entry->ngids) == -1) {

7. I'm guessing that getgrouplist is taking a long time (greater than MessageTimeout).
8. The slurmd tries to respond to the slurmctld, which is waiting to hear how the batch job launch went. But the slurmctld's launch request has already timed out, so nothing is listening for the reply. So you get:

error: Could not confirm batch launch for job 66222894, aborting request

and then it attempts to clean up and tells the slurmctld that the job failed, first attempting to requeue the job -- to which the slurmctld replies that the job couldn't be requeued because the job is not requeueable.

_slurm_rpc_requeue: 66222894: Requested operation is presently disabled

9. But the job is still requeued.

Requeuing JobID=66222876_18(66222894) State=0x0 NodeCnt=0

We need to look into why this is happening.

If the user submits another job to the node within the next 10 minutes, the jobs should run fine because the slurmd will use the cached entry; but if another job comes in after the 10 minutes, the same situation will repeat.


So..... to help diagnose this and see if this is really the issue, can you run the attached C program (the test example from the getgrouplist(3) man page)?

e.g.
brian@lappy:~/ctests/getgrp$ time ./a.out brian 0
getgrouplist() returned -1; ngroups = 9

real    0m0.003s
user    0m0.000s
sys     0m0.003s

brian@lappy:~/ctests/getgrp$ time ./a.out brian 9
ngroups = 9
1003 (brian)
...

real    0m0.003s
user    0m0.000s
sys     0m0.003s



Let us know if this helps.

Thanks,
Brian
Comment 4 Brian Christiansen 2018-07-12 16:48:06 MDT
After you confirm the LDAP issues, you may want to look at using LaunchParameters=send_gids. It should work around the slow lookups in the slurmd, but you'll want to confirm any LDAP issues first.

https://slurm.schedmd.com/slurm.conf.html#OPT_send_gids
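As a sketch, the change is a single line in slurm.conf (GroupUpdateTime is shown only to note the related cache knob; 600 seconds is its default, as described above):

```
# Have slurmctld resolve the user's group list and send it with the
# batch launch request, so the slurmd does not call getgrouplist() itself.
LaunchParameters=send_gids

# Related: how long the slurmd trusts a cached group entry (seconds).
GroupUpdateTime=600
```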
Comment 5 Michael Gutteridge 2018-07-13 13:29:56 MDT
Hi

Thanks for the code- it pointed out a serious issue with LDAP server performance that was masked by nscd (to get the bad numbers I had to drop nscd's caches by restarting the service).  Enumerating group entries could take upwards of 100 seconds.

We're reverting to the old servers, which should return us to our previous (hopefully good) state.  This change is going through our systems now, so I should be able to give the all-clear soon.  We'll also be turning on `send_gids`, as I think that will really help with some of our users' jobs (we have a lot of folks who submit thousands of jobs at a go).

I take it you're thinking the requeueing behavior might be a bug in Slurm?  Once we get group enumeration fixed I think we'll be out of the woods and won't _need_ that fixed, but it'd be nice to have it work the way we'd expect.

Thanks for the help!

Michael
Comment 6 Brian Christiansen 2018-07-13 15:40:42 MDT
Good to hear! Glad to help.

For the requeue issue, Slurm is intentionally requeueing the job because it thinks the job never launched, so it tries to reschedule it.  Slurm will only requeue a job like this in three situations: 1. the batch job launch failed (this case), 2. the batch script is missing before launch, or 3. the batch job credential could not be created.

Are you good to close the bug?
Comment 7 Michael Gutteridge 2018-07-13 15:49:20 MDT
(In reply to Brian Christiansen from comment #6)
> Good to hear! Glad to help.
> 
> For the requeue issue, Slurm is intentionally requeueing the job because it
> thinks that the job never launched so it's trying to reschedule the job.
> Slurm will only requeue the job like this in 3 situations -- 1. the batch
> job launch failed (this one), 2. the batch script is missing before launch
> and 3. the batch job credential failed to create.

Ok- that makes sense.  The only thing is that it never did seem to re-run the job, though that may have been due to something else.  Maybe I'll try to replicate it some other time.

> Are you good to close the bug?

Yep- I think so.  Thanks for the help.

Michael
Comment 8 Brian Christiansen 2018-07-13 16:02:58 MDT
In my situation the job runs after the requeue. Let me know if you see something different.

Thanks,
Brian