Ticket 1192 - GPU double allocation
Summary: GPU double allocation
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.03.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-10-20 07:41 MDT by Josko Plazonic
Modified: 2014-10-21 04:34 MDT
CC List: 3 users

See Also:
Site: Princeton (PICSciE)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.03.10
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Josko Plazonic 2014-10-20 07:41:58 MDT
Hi,

we had a report of a job getting allocated the same GPU as another, already running job.  Out of the 4 GPUs on that box only #1 is in use, and indeed both the already running job and the newly submitted/running job have CUDA_VISIBLE_DEVICES=1.

Any idea why the new job is getting that same GPU?  We were toying with the theory that this happened because we restarted slurmctld a number of times while adding new nodes, and that while it kept the count of used devices it forgot which ones were in use.  So when the time came to give out the next one it decided on #1 rather than 0.

Is this supposed to be handled correctly? 

Thanks!
Comment 1 Moe Jette 2014-10-20 07:55:24 MDT
This should work fine. Can you provide your configuration file(s), slurmctld log file, identify the job IDs, and any details about what was happening?
Comment 2 Josko Plazonic 2014-10-20 08:44:06 MDT
Well, it looks like this might be hard to debug after the fact - we have a number of jobs that failed due to this (as we run GPUs in exclusive mode):

311632           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311632.batch      batch                   eeb          1     FAILED      1:0 
311633           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311633.batch      batch                   eeb          1     FAILED      1:0 
311634           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311634.batch      batch                   eeb          1     FAILED      1:0 
311635           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311635.batch      batch                   eeb          1     FAILED      1:0 
311636           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311636.batch      batch                   eeb          1     FAILED      1:0 
311637           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311637.batch      batch                   eeb          1     FAILED      1:0 
311638           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311638.batch      batch                   eeb          1     FAILED      1:0 
311639           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311639.batch      batch                   eeb          1     FAILED      1:0 
311640           SCIMAJ        gpu        eeb          1     FAILED      1:0 
311640.batch      batch                   eeb          1     FAILED      1:0 
311641           SCIMAJ        gpu        eeb          1     FAILED      1:0 

but there is zero info about these jobs in the slurmctld log (we clearly do not have it set to be very verbose) or in the slurmd log.

Sadly the job that was causing the trouble has ended in the meantime and we have no other nodes where GPUs are not allocated in sequence starting from 0.

Otherwise our gres.conf is set up like this:
# For GPU id 0, 0000:02:00.0 cpus are 00ff
Name=gpu File=/dev/nvidia0 CPUs=0-7
# For GPU id 1, 0000:03:00.0 cpus are 00ff
Name=gpu File=/dev/nvidia1 CPUs=0-7
# For GPU id 2, 0000:83:00.0 cpus are ff00
Name=gpu File=/dev/nvidia2 CPUs=8-15
# For GPU id 3, 0000:84:00.0 cpus are ff00
Name=gpu File=/dev/nvidia3 CPUs=8-15

Relevant lines from slurm.conf:
GresTypes=gpu
NodeName=tiger-r11n[1-16] RealMemory=64000 Weight=1  State=UNKNOWN Gres=gpu:4
NodeName=tiger-r12n[1-16] RealMemory=64000 Weight=1  State=UNKNOWN Gres=gpu:4
NodeName=tiger-r13n[1-16] RealMemory=64000 Weight=1  State=UNKNOWN Gres=gpu:4
NodeName=tiger-r14n[3-4] RealMemory=64000 Weight=1  State=UNKNOWN Gres=gpu:4
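(For what it's worth, the gres a node advertises can be checked with something along these lines - shown only as an illustration:

scontrol show node tiger-r11n1 | grep -i gres

though that only reports the configured count, not which devices a particular job holds.)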

I assume you are saying that GPU allocations should survive a restart of slurmctld? Does it save the state to disk, or does it query the nodes?

I guess at this point there is little any of us can do, so unless you have other ideas it is OK to close this for now. What would be very useful are some suggestions on how to debug this if it comes up again. Raise the debug level on slurmctld? Which one? slurmd?

Thanks!
Comment 3 Moe Jette 2014-10-20 08:56:07 MDT
I'll try to reproduce given the information provided, but it's not much to go on.

The gres allocated to each job are saved as part of the job state, so node state should be synchronized after a reconfiguration or restart.
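As for catching this if it comes up again: raising the gres-related logging should record which devices each job is granted. A minimal slurm.conf example (option names as documented in the slurm.conf man page; please double-check the accepted values against your 14.03 install):

SlurmctldDebug=debug2
SlurmdDebug=debug2
DebugFlags=Gres

After an "scontrol reconfigure" the slurmctld log should then include the gres details as jobs are allocated, and the Gres field of "scontrol show job <jobid>" can be compared against the CUDA_VISIBLE_DEVICES value a job actually sees.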
Comment 4 Josko Plazonic 2014-10-20 09:05:51 MDT
Only if it's easy - otherwise we'd be more than OK with waiting to see if it resurfaces, along with instructions on what to do next time as far as logging/details go.  We just didn't have enough time to poke around before the problem was gone.

But we definitely had it - /proc/PROCID/environ for the running job was clearly showing CUDA_VISIBLE_DEVICES=1 (he wasn't overriding it in his script) and all new jobs were landing with the same setting as well (we had the other user add debugging statements to see why his jobs were failing).

So somehow it did happen.
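For reference, that check amounted to something like this one-liner, with PROCID standing in for the running job's process ID:

tr '\0' '\n' < /proc/PROCID/environ | grep CUDA_VISIBLE_DEVICES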
Comment 5 Moe Jette 2014-10-20 09:27:00 MDT
I've been able to reproduce this. In my experience, that's a big part of getting the problem fixed.
Comment 6 Moe Jette 2014-10-20 11:57:41 MDT
I don't have a fix yet, but I did determine where this is happening. It's the gres_plugin_job_clear() function, which clears job state in order to support job requeue, but it seems to also be clearing critical job information when slurmctld restarts.
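To sketch the failure mode in rough terms (the structures and names below are made up for illustration, not Slurm's actual internals): if the per-job record of which devices are held is cleared on restart while only the count survives, the rebuilt node state no longer marks that device as busy, and the scheduler can hand it out again.

#include <stdio.h>

struct job_gres {
    int gpu_count;            /* how many GPUs the job holds */
    unsigned int gpu_bitmap;  /* which GPU indexes it holds, one bit per GPU */
};

static void clear_job_gres(struct job_gres *jg)
{
    /* intended only for requeue, but effectively run on restart too */
    jg->gpu_bitmap = 0;
}

int main(void)
{
    /* a running job that really holds GPU 1 */
    struct job_gres running = { .gpu_count = 1, .gpu_bitmap = 1u << 1 };
    unsigned int node_in_use = 0;

    clear_job_gres(&running);          /* detail lost when the controller restarts */
    node_in_use |= running.gpu_bitmap; /* rebuild node state: nothing marked busy */

    if (!(node_in_use & (1u << 1)))
        printf("GPU 1 looks free to the controller, "
               "so a new job can be handed it as well\n");
    return 0;
}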
Comment 7 Josko Plazonic 2014-10-21 01:57:40 MDT
Fantastic! Great work (as usual, you are spoiling us here).  And here I was feeling guilty for not managing to debug it enough before the problem went away...
Comment 8 Moe Jette 2014-10-21 04:34:22 MDT
(In reply to Josko Plazonic from comment #7)
> Fantastic! Great work (as usual, you are spoiling us here).  And here I was
> feeling guilty for not managing to debug it enough before the problem went
> away...

We have a first-rate team here delivering top-quality service. If you have a problem, your first contact here may well be the code author rather than someone trying to keep you away from a developer.

I've found and fixed the problem. The fix will be in v14.03.10 when it is tagged (probably not for several weeks), but the patch is only a few lines and is available to you now here:
https://github.com/SchedMD/slurm/commit/1209a664840a431428658c0950b16078af7aff63
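One way to pick the fix up ahead of the 14.03.10 tag, assuming you build from the release tarball (adjust paths to your own source tree):

wget https://github.com/SchedMD/slurm/commit/1209a664840a431428658c0950b16078af7aff63.patch
cd slurm-14.03.7
patch -p1 < ../1209a664840a431428658c0950b16078af7aff63.patch

then rebuild and restart slurmctld.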