Hi, we had a report of a job getting allocated the same GPU as another already-running job. Out of 4 GPUs on that box only #1 is in use, and indeed both the already-running job and the newly submitted/running job have CUDA_VISIBLE_DEVICES=1. Any idea why the new job is getting that same GPU? Our theory was that this happened because we restarted slurmctld a number of times while adding new nodes, and that while it kept the count of used devices it forgot which ones were used - so when the time came to give out the next one it decided it was #1 rather than 0. Is this supposed to be handled correctly? Thanks!
This should work fine. Can you provide your configuration file(s), slurmctld log file, identify the job IDs, and any details about what was happening?
Well, it looks like this might be hard to debug after the fact - we have a number of jobs that failed due to this (as we run GPUs in exclusive mode):

311632       SCIMAJ  gpu  eeb  1  FAILED  1:0
311632.batch batch        eeb  1  FAILED  1:0
311633       SCIMAJ  gpu  eeb  1  FAILED  1:0
311633.batch batch        eeb  1  FAILED  1:0
311634       SCIMAJ  gpu  eeb  1  FAILED  1:0
311634.batch batch        eeb  1  FAILED  1:0
311635       SCIMAJ  gpu  eeb  1  FAILED  1:0
311635.batch batch        eeb  1  FAILED  1:0
311636       SCIMAJ  gpu  eeb  1  FAILED  1:0
311636.batch batch        eeb  1  FAILED  1:0
311637       SCIMAJ  gpu  eeb  1  FAILED  1:0
311637.batch batch        eeb  1  FAILED  1:0
311638       SCIMAJ  gpu  eeb  1  FAILED  1:0
311638.batch batch        eeb  1  FAILED  1:0
311639       SCIMAJ  gpu  eeb  1  FAILED  1:0
311639.batch batch        eeb  1  FAILED  1:0
311640       SCIMAJ  gpu  eeb  1  FAILED  1:0
311640.batch batch        eeb  1  FAILED  1:0
311641       SCIMAJ  gpu  eeb  1  FAILED  1:0

but there is zero info about these jobs in slurmctld (we clearly do not have it set to be verbose enough) or in slurmd. Sadly the job that was causing the trouble has ended in the meantime, and we have no other nodes where GPUs are not allocated in sequence starting from 0. Otherwise we are set up with gres.conf like:

# For GPU id 0, 0000:02:00.0 cpus are 00ff
Name=gpu File=/dev/nvidia0 CPUs=0-7
# For GPU id 1, 0000:03:00.0 cpus are 00ff
Name=gpu File=/dev/nvidia1 CPUs=0-7
# For GPU id 2, 0000:83:00.0 cpus are ff00
Name=gpu File=/dev/nvidia2 CPUs=8-15
# For GPU id 3, 0000:84:00.0 cpus are ff00
Name=gpu File=/dev/nvidia3 CPUs=8-15

Relevant lines from slurm.conf:

GresTypes=gpu
NodeName=tiger-r11n[1-16] RealMemory=64000 Weight=1 State=UNKNOWN Gres=gpu:4
NodeName=tiger-r12n[1-16] RealMemory=64000 Weight=1 State=UNKNOWN Gres=gpu:4
NodeName=tiger-r13n[1-16] RealMemory=64000 Weight=1 State=UNKNOWN Gres=gpu:4
NodeName=tiger-r14n[3-4]  RealMemory=64000 Weight=1 State=UNKNOWN Gres=gpu:4

I assume you are saying that GPU allocations should survive a restart of slurmctld? Does it save the state to disk, or does it query the nodes?
I guess at this point there is little any of us can do, so unless you have other ideas it is OK to close for now. What would be very useful are some suggestions on how to debug this if it comes up again. Raise debugging on slurmctld? To which level? On slurmd? Thanks!
I'll try to reproduce given the information provided, but it's not much to go on. The gres allocated to each job are saved as part of the job state, so node state should be synchronized after a reconfiguration or restart.
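If it resurfaces, one way to capture more detail is to raise the controller's log verbosity and enable gres debugging. The fragment below is a sketch, not your current configuration - the log path is an example, and the exact levels are a suggestion:

```conf
# slurm.conf - suggested additions for debugging gres allocation
# (log path is an example; adjust to your site)
SlurmctldDebug=debug2                           # more verbose controller logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=Gres                                 # trace gres (GPU) allocation decisions
```

These can usually also be toggled on a running cluster without a restart, via `scontrol setdebug debug2` and `scontrol setdebugflags +gres`, and turned back down afterwards.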
Only if it's easy - otherwise we'd be more than OK with waiting to see if it resurfaces, given instructions on what logging/details to collect next time. We just didn't have enough time to poke around before the problem was gone. But we definitely had it - /proc/PROCID/environ for the running job was clearly saying CUDA_VIS..=1 (he wasn't overriding it in his script), and all new jobs were landing with the same setting as well (we had the other user add debugging statements to see why his jobs were failing). So somehow it did happen.
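For reference, the /proc check we used can be scripted. This is a sketch assuming a Linux /proc filesystem; the script name is made up, and you'd get the PID from `ps` or `squeue -s` yourself:

```shell
#!/bin/sh
# gpu_env.sh (illustrative name): show which GPU a running process was handed.
# Usage: ./gpu_env.sh <pid>
pid="$1"
# /proc/<pid>/environ is NUL-separated, so convert NULs to newlines first.
tr '\0' '\n' < "/proc/$pid/environ" | grep '^CUDA_VISIBLE_DEVICES='
```

Running this against both the old and the new job's processes would have shown the duplicate CUDA_VISIBLE_DEVICES assignment directly.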
I've been able to reproduce this. In my experience, that's a big part of getting the problem fixed.
I don't have a fix yet, but I did determine where this is happening. It's the gres_plugin_job_clear() function, which clears job state in order to support job requeue, but it seems to also be clearing critical job information when slurmctld restarts.
Fantastic! Great work (as usual, you are spoiling us here). And here I was feeling guilty for not managing to debug it enough before the problem went away...
(In reply to Josko Plazonic from comment #7)
> Fantastic! Great work (as usual, you are spoiling us here). And here I was
> feeling guilty for not managing to debug it enough before the problem went
> away...

We have a first-rate team here to deliver top-quality service. If you have a problem, your first contact here may well be the code author rather than someone trying to keep you away from a developer. I've found and fixed the problem. The fix will be in v14.03.10 when tagged (probably not for several weeks), but the patch is only a few lines and is available to you now here: https://github.com/SchedMD/slurm/commit/1209a664840a431428658c0950b16078af7aff63