Ticket 3676

Summary: Jobs Scheduled But Not Running
Product: Slurm    Reporter: Paul Edmon <pedmon>
Component: Scheduling    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 16.05.10
Hardware: Linux
OS: Linux
Site: Harvard University
Version Fixed: 17.02.4 17.11.0-pre1
Attachments: slurm.conf
slurm log
Log for holy2a18306

Description Paul Edmon 2017-04-07 09:25:30 MDT
I'm seeing the most confusing bug.  The scheduler says it has scheduled a job:

[2017-04-07T11:01:15.708] sched: _slurm_rpc_allocate_resources JobId=85390249 NodeList=holy2a18306 usec=8432382
[2017-04-07T11:01:20.688] prolog_running_decr: Configuration for job 85390249 is complete
[2017-04-07T11:01:20.689] Extending job 85390249 time limit by 5 secs for configuration

It shows up as running in squeue and in all the metrics, but if I go to the node where the job was supposed to run I see nothing.  Eventually, after a delay, the node logs show:

from holy2a18306:
Apr  7 11:07:36 holy2a18306 slurmd[20970]: _run_prolog: prolog with lock for job 85390249 ran for 0 seconds
Apr  7 11:07:37 holy2a18306 slurmstepd[22256]: task/cgroup: /slurm/uid_556792/job_85390249: alloc=4000MB mem.limit=4000MB memsw.limit=8000MB
Apr  7 11:07:37 holy2a18306 slurmstepd[22256]: task/cgroup: /slurm/uid_556792/job_85390249/step_extern: alloc=4000MB mem.limit=4000MB memsw.limit=8000MB
Apr  7 11:07:41 holy2a18306 slurmd[20970]: launch task 85390249.0 request from 556792.40273@10.242.104.142 (port 39059)
Apr  7 11:07:41 holy2a18306 slurmd[20970]: lllp_distribution jobid [85390249] implicit auto binding: cores,one_thread, dist 1

but you can see the massive delay, roughly 6 minutes.  I have no idea what is causing it.  The scheduler seems to think things are fine, but clearly they aren't.  It also seems intermittent: in most cases jobs launch fine, but for stretches of time they don't.

I've never seen this before.  It's super weird.  It definitely did not happen in 16.05.9; we never saw it until this week, when we upgraded to 16.05.10-2, so some change in 16.05.10-2 introduced this bug.

While not completely debilitating to the scheduler, I'm not sure whether it will eventually get permanently stuck in the state where the master thinks jobs are running when they actually aren't.  The delay period also seems significant, on the order of tens of minutes.  I don't know the occurrence rate, but this is the 2nd time it has happened this week that we know of.
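For reference, the delay above can be quantified directly from the two log excerpts. This is a minimal sketch, assuming only the log formats shown above; the year must be supplied for the slurmd line, since syslog-style timestamps omit it:

```python
# Sketch: measure the allocation-to-prolog delay for a job by comparing
# the slurmctld allocation timestamp against the slurmd prolog timestamp.
# The sample lines are the ones quoted in this ticket.
from datetime import datetime

ctld_line = ("[2017-04-07T11:01:15.708] sched: _slurm_rpc_allocate_resources "
             "JobId=85390249 NodeList=holy2a18306 usec=8432382")
slurmd_line = ("Apr  7 11:07:36 holy2a18306 slurmd[20970]: _run_prolog: "
               "prolog with lock for job 85390249 ran for 0 seconds")

def ctld_time(line):
    # slurmctld prefixes each line with an [ISO-8601 timestamp]
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")

def slurmd_time(line, year):
    # syslog-style "Apr  7 11:07:36" prefix; the year is not logged
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

delay = slurmd_time(slurmd_line, 2017) - ctld_time(ctld_line)
print(f"allocation-to-prolog delay: {delay.total_seconds():.0f} s")
# → allocation-to-prolog delay: 380 s
```

That is the "like 6 minutes" gap: about 380 seconds between the allocation RPC and the prolog actually running on holy2a18306.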
Comment 1 Paul Edmon 2017-04-07 09:46:00 MDT
The delayed state is still ongoing, so it appears to be the default state of the scheduler now.  We do run a slurmctld prolog, but it's fast, and its results appear as normal.  For instance, I currently have several jobs submitted:

[root@holy2a23301 kovac_lab]# /usr/bin/squeue -u root
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          85391328     kovac runscrip     root  R       2:54      1 holy2a23310
          85391309     kovac runscrip     root  R       3:06      1 holy2a23310
          85391322     kovac runscrip     root  R       3:06      1 holy2a23310
          85391300     kovac runscrip     root  R       3:20      1 holy2a23310
          85391295     kovac runscrip     root  R       3:27      1 holy2a23310
          85391288     kovac runscrip     root  R       3:43      1 holy2a23310
          85391282     kovac runscrip     root  R       4:18      1 holy2a23310
          85391263     kovac runscrip     root  R       4:30      1 holy2a23310
          85391248     kovac runscrip     root  R       4:32      1 holy2a23310
          85391246     kovac runscrip     root  R       4:50      1 holy2a23310
          85391242     kovac runscrip     root  R       4:59      1 holy2a23310
          85391232     kovac runscrip     root  R       5:03      1 holy2a23310
          85391224     kovac runscrip     root  R       5:05      1 holy2a23310
          85391194     kovac runscrip     root  R       5:54      1 holy2a23310

but the node itself shows nothing running, nor have the jobs written any logs:

[root@holy2a23301 kovac_lab]# ls
runscript

Even though our prolog script has run and printed out the runscript like it was supposed to:

[root@holy-slurm02 328]# ls -ltr
total 16
-rw-r--r-- 1 slurm slurm_users 9187 Apr  7 10:57 85389328
-rw-r--r-- 1 slurm slurm_users    0 Apr  7 11:03 85390328
-rw-r--r-- 1 slurm slurm_users 2338 Apr  7 11:39 85391328

the timestamp there is the time the job was originally processed by the scheduler.  So some delay is occurring between the scheduling and the actual running of the job on the node.
Comment 2 Paul Edmon 2017-04-07 13:42:37 MDT
Just an update: we decided to downgrade our Slurm version from 16.05.10-2 to 16.05.9.  We'll let you know if it recurs, but we are pretty convinced it is a bug in 16.05.10-2.
Comment 3 Tim Wickberg 2017-04-07 13:47:09 MDT
There are a few things that may explain this; between those two point releases, the possible culprits are relatively few and far between.

Would it be possible to get a larger chunk of the logs, and an updated version of your slurm.conf? The one I have on file is a bit dated, and I'd like to rule out a few possible complications.

- Tim
Comment 4 Paul Edmon 2017-04-07 13:55:27 MDT
Created attachment 4321 [details]
slurm.conf
Comment 5 Paul Edmon 2017-04-07 13:56:05 MDT
Created attachment 4322 [details]
slurm log
Comment 6 Paul Edmon 2017-04-07 13:57:43 MDT
I've added our conf and a log of the last part of today, which includes the period during which there were delays.  As far as we can tell, nothing obvious shows up in the logs regarding the delays; slurmctld saw everything operating fine.  There was just a delay between when slurmctld said it launched the job and the job appearing on the node, as I excerpted before.
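One detail from the original excerpt worth flagging when reading such logs: the allocate RPC itself reported usec=8432382, i.e. about 8.4 seconds. A hypothetical sketch for scanning a slurmctld log for slow RPCs (the 1-second threshold is arbitrary, and the sample lines are the ones quoted in this ticket, not the attachment):

```python
# Sketch: flag slow RPCs in a slurmctld log. Many "_slurm_rpc_*" lines
# end with "usec=<elapsed microseconds>"; large values can reveal
# scheduling stalls that produce no explicit error message.
import re

sample_log = """\
[2017-04-07T11:01:15.708] sched: _slurm_rpc_allocate_resources JobId=85390249 NodeList=holy2a18306 usec=8432382
[2017-04-07T11:01:20.688] prolog_running_decr: Configuration for job 85390249 is complete
"""

THRESHOLD_USEC = 1_000_000  # flag RPCs slower than 1 second

slow = [
    (m.group(1), int(m.group(2)))
    for m in re.finditer(r"\[(\S+)\].*usec=(\d+)", sample_log)
    if int(m.group(2)) > THRESHOLD_USEC
]
for stamp, usec in slow:
    print(f"{stamp}: RPC took {usec / 1e6:.1f} s")
# → 2017-04-07T11:01:15.708: RPC took 8.4 s
```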
Comment 7 Danny Auble 2017-04-11 13:58:28 MDT
Paul, has this happened since the downgrade?
Comment 8 Paul Edmon 2017-04-11 14:04:39 MDT
Nope.  16.05.9 is working as expected, and no significant delays detected.  So it is definitely something in 16.05.10-2 that was causing the issue.  Whether it was a bug or a weird interaction with our setup, I can't really say.  It seemed periodic rather than constant.  Really strange.  It's entirely possible that this is unique to our environment and 16.05.10-2.

I'm satisfied with running on 16.05.9 as a fix, as it seems to be working fine.  Our plan was to upgrade to 17.02.x at our next maintenance window in June/July, so we will bypass 16.05.10 completely; at that point 16.05.10-2 won't be a concern for us anymore.
Comment 9 Danny Auble 2017-04-11 14:35:32 MDT
That sounds good, Paul.  I don't think there is much in 16.05.10-2 that you will be missing out on.  Would you mind if we drop the severity of the bug now that you at least have a workaround?  I am assuming you downgraded your whole cluster (not just the slurmds) to 16.05.9 as well; is that correct?

I would be interested in finding out the cause of this situation, though, as 17.02 will most likely contain the same code.

I believe I have narrowed it down to 3 commits that are at least suspect.

f6d42fdbb
b988531d6
e58c22828

Do you happen to know whether the jobs that were delayed were interactive jobs (salloc) or batch jobs (sbatch)?

Could you send your slurmd log from holy2a18306?
Comment 10 Paul Edmon 2017-04-12 08:13:20 MDT
Created attachment 4340 [details]
Log for holy2a18306
Comment 11 Paul Edmon 2017-04-12 08:16:45 MDT
Sure, I've just added the log for holy2a18306.  It should cover the same period as the slurmctld log I sent.

The lag was for both srun and sbatch; it didn't matter which.  The delay, though, was only temporary: it would last around 20-30 minutes, disappear, and then reappear seemingly at random.  Since there were no errors in the logs and no obvious problems, I couldn't pin down why it was happening.  There were no issues with our network either.

When we downgraded to 16.05.9 we did it across the cluster so slurmd, slurmctld, and slurmdbd.  So everything is now at 16.05.9.

You can downgrade this bug report now, as we have a workaround: swapping back to 16.05.9.
Comment 13 Danny Auble 2017-05-17 11:34:55 MDT
Paul, I am wondering what your plans are for the 17.02 upgrade.  I would be interested to see whether this is fixed there or not.  Since we haven't been able to reproduce this, we are at a loss as to why it is happening.  Only so many lines changed from 16.05.9 -> .10.

A slightly less intrusive approach might be to patch your .9 with the commits mentioned in comment 9, one at a time, to see whether they are indeed suspect.

At the moment I am leaning towards

f6d42fdbb

or

e58c22828

If you can reproduce this on a test system, that would also be helpful, as you could then test 17.02 to see whether things are fixed.

Thanks!
Comment 14 Paul Edmon 2017-05-22 08:35:47 MDT
As it stands I was going to roll 17.02.3 on our test cluster some time this week.  I will see if I can replicate the problem there.  Our full production upgrade will probably be in a month or so.
Comment 15 Danny Auble 2017-05-22 09:59:41 MDT
Excellent.  I will await your tests.  I am hoping you can recreate it with 16.05.10 and then not with 17.02.3.
Comment 16 Danny Auble 2017-06-14 14:56:11 MDT
Paul, did you trip over this with 17.02?
Comment 17 Paul Edmon 2017-06-15 08:15:18 MDT
Not in the testing I have done.  So we are going to go ahead with the upgrade, and if it comes up again we will let you know.

-Paul Edmon-

Comment 18 Danny Auble 2017-06-15 08:29:57 MDT
Sweet! Would you be OK with closing this bug and reopening it if it shows up again?
Comment 19 Paul Edmon 2017-06-15 08:30:44 MDT
Yup, go ahead.

-Paul Edmon-

Comment 20 Danny Auble 2017-06-15 08:35:26 MDT
Sounds good