| Summary: | Shared=Exclusive gives access to all resources including GRES resources | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Amit Kumar <ahkumar> |
| Component: | Configuration | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | brian |
| Version: | 17.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SMU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurmctld log | | |
Description
Amit Kumar
2017-07-17 10:52:42 MDT
Amit,
There isn't a way to do this using the standard gres config settings, but you can do it with a job submit plugin. I made one using "JobSubmitPlugins=lua". There is a sample script in "contribs/lua/job_submit.lua" that you can modify and put in your etc dir (where the slurm.conf file is). I think the best way to go about it is to first check whether the job has the correct partition (the nodes with the GPUs), and if it does, set the gres to "gpu:1". Like this:
if job_desc.partition == "shared" then
job_desc.gres = "gpu:1"
job_desc.gres_alloc = "gpu:1"
job_desc.req = "gpu:1"
end
That should make it so users only need to submit simple jobs like:
sbatch job.sh
and the gpu gres will get applied automatically.
Hope that helps.
Tim
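As an illustration, Tim's snippet could be embedded in a complete job_submit.lua along these lines. This is an untested sketch: the partition name "shared" and the exact set of gres fields are taken from the snippet above, and the skeleton follows the sample in contribs/lua.

```lua
-- Sketch of a complete job_submit.lua wrapping the snippet above.
-- Place it in the same directory as slurm.conf; requires
-- JobSubmitPlugins=lua in slurm.conf. Not standalone-runnable:
-- the slurm table is provided by the slurmctld lua plugin host.

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- Force one GPU onto any job submitted to the "shared" partition
   if job_desc.partition == "shared" then
      job_desc.gres = "gpu:1"
      job_desc.gres_alloc = "gpu:1"
      job_desc.req = "gpu:1"
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   -- No modification logic needed for this use case
   return slurm.SUCCESS
end

return slurm.SUCCESS
```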
Hi Tim,

I had tried the job_submit plugin but was not successful. I followed your instructions, but I still don't see the GPU gres being allocated. I compiled with the lua option, and I do have slurm-lua installed on my head/controller nodes, but not on the login nodes; could this be an issue? My job_submit.lua is in /etc/slurm on all nodes, JobSubmitPlugins=lua is set in the slurm configuration file, and I also ran scontrol reconfigure. Am I missing anything?

Below is another attempt that did not go through. Please advise.

Thank you,
Amit

[ahkumar@login01 ~]$ scontrol show no tp001    [This was an sbatch job with a sleep command]
NodeName=tp001 Arch=x86_64 CoresPerSocket=18 CPUAlloc=36 CPUErr=0 CPUTot=36 CPULoad=0.01
AvailableFeatures=(null) ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=tp001 NodeHostName=tp001 OS=Linux
RealMemory=256000 AllocMem=256000 FreeMem=238023 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=ahkumar(507002) MCS_label=N/A
Partitions=gpgpu-1
BootTime=2017-07-15T19:43:12 SlurmdStartTime=2017-07-15T19:44:42
CfgTRES=cpu=36,mem=250G,gres/gpu=1
AllocTRES=cpu=36,mem=250G
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[ahkumar@login01 ~]$ scontrol show no p031    [This was an srun job with a sleep command]
NodeName=p031 Arch=x86_64 CoresPerSocket=18 CPUAlloc=36 CPUErr=0 CPUTot=36 CPULoad=0.01
AvailableFeatures=(null) ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=p031 NodeHostName=p031 OS=Linux
RealMemory=256000 AllocMem=256000 FreeMem=248155 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=ahkumar(507002) MCS_label=N/A
Partitions=gpgpu-1
BootTime=2017-07-15T19:41:01 SlurmdStartTime=2017-07-17T13:02:46
CfgTRES=cpu=36,mem=250G,gres/gpu=1
AllocTRES=cpu=36,mem=250G
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Addition: below is my lua script:
function slurm_job_submit(job_desc, part_list, submit_uid)
if job_desc.account == nil then
local account = "ahkumar"
slurm.log_info("slurm_job_submit: job from uid %u, setting default account value: %s",
submit_uid, account)
job_desc.account = account
end
if job_desc.partition == "gpgpu-1" then
job_desc.gres = "gpu:1"
job_desc.gres_alloc = "gpu:1"
job_desc.req = "gpu:1"
end
return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
if job_desc.comment == nil then
local comment = "GPUNODE"
slurm.log_info("slurm_job_modify: for job %u from uid %u, setting default comment value: %s",
job_rec.job_id, modify_uid, comment)
job_desc.comment = comment
end
return slurm.SUCCESS
end
slurm.log_info("initialized")
return slurm.SUCCESS
(In reply to Amit Kumar from comment #2)
> Hi Tim,
>
> I had tried job_submit plugin but I was not successful. I followed your
> instructions but I still don't see GPU gres being allocated. I had compiled
> with lua option and I do I have slurm-lua installed on my head/controller
> nodes, but not login nodes could this be an issue?

This shouldn't matter, because the controller node is what runs the lua script.

I may have misled you slightly in my last response: unless you specify the partition in the sbatch, job_desc.partition is probably going to be "nil", meaning the default partition has not yet been applied to the job. I'm betting that if you submit your job with "sbatch -p gpgpu-1 ...", your script will work. To apply the default partition if it wasn't specified in the sbatch, you can add some code like this before the other logic:

--Set the default partition if not already set
if job_desc.partition == nil then
   for name, part in pairs(part_list) do
      if part.flag_default ~= 0 then
         job_desc.partition = part.name
         break
      end
   end
end

if job_desc.partition == "gpgpu-1" then
   slurm.log_info("slurm_job_submit: Setting gres")
   job_desc.gres = "gpu:1"
   job_desc.gres_alloc = "gpu:1"
   job_desc.req = "gpu:1"
end

I thought I had tested this case before, but when I just retested it I noticed my gres was only getting applied when I used "-p" with sbatch. I added this code, and it worked just doing an sbatch with no partition specified. Sorry about that.

If you're having problems even after applying this code, here are some things that should help in troubleshooting your script. First, make sure you see this in the slurmctld.log:

job_submit.lua: initialized

If you see this entry, the controller found the lua script and loaded it fine. Second, check for compilation errors in the slurmctld.log.

Lastly, the biggest help will be to add some logging to your script so you can check variables and conditions:

slurm.log_info("slurm_job_submit: Calling job_submit.lua")

See the lua file in contribs/lua for more logging examples. Look for the logging in the slurmctld.log.

Hope that helps.
Tim

Created attachment 4952 [details]
Slurmctld log
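Tim's logging advice can be sketched as a debugging version of the submit function. This is an illustrative, untested sketch: the extra slurm.log_info calls and the tostring() guard for a nil partition are additions, not code from the thread.

```lua
-- Sketch: log what the plugin actually sees, to verify it is being
-- called at all and what values it receives (hypothetical additions).

function slurm_job_submit(job_desc, part_list, submit_uid)
   slurm.log_info("slurm_job_submit: Calling job_submit.lua")
   -- partition may be nil when the user did not pass -p, so guard it
   slurm.log_info("slurm_job_submit: uid=%u partition=%s",
                  submit_uid, tostring(job_desc.partition))
   if job_desc.partition == "gpgpu-1" then
      slurm.log_info("slurm_job_submit: Setting gres")
      job_desc.gres = "gpu:1"
   end
   return slurm.SUCCESS
end
```

These messages land in slurmctld.log, which is where the thread suggests checking for the "initialized" line as well.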
Hi Tim,

If you look at my command, I did include the -p option for my submissions, yet I have not been able to find the "job_submit.lua: initialized" line in the slurmctld log. I have even raised the debug level to 5, but did not find that specific initialized line in the log file. Attached is a snippet of the log from scheduling the job with the -p option.

Please let us know if there is an alternate way to debug this. I have added debug commands in the lua script, but with no success; I am probably doing something silly.

Thank you,
Amit

Amit,

Dumb question: since you added your job_submit.lua script to the /etc directory, have you restarted the controller? You'll need to restart the controller for it to get picked up. After restarting, look for the

job_submit.lua: initialized

line in the log. Even with logs set to debug2, that line will be around the 18th line from the beginning of the log, near:

debug: sched: slurmctld starting

Otherwise you will see an error if the controller can't find it. So either way, you should see some mention of the job_submit.lua script at the beginning of the log after restarting.

Tim

Hi Tim,

Apologies for the delay; I got caught up with production issues. Your dumb question was the smartest answer for me. I have been using reconfigure so frequently that I had forgotten there is a restart of the controller as well. All is well, and my plugin works great after the restart.

Now I want to expand this script to do more. Where would I find the job submission attributes defined in the documentation? Or should I print the lua object to get them? I wonder what attributes are modifiable in the plugin.

Thank you,
Amit

Here's the documentation for the job submit plugin (see "Lua Functions"):
https://slurm.schedmd.com/job_submit_plugins.html

To see the objects, you can look at "struct job_descriptor" and "struct job_record" in slurm.h, and "struct part_record" in slurmctld.h.

Hope that helps.
Tim

Got it!! Thank you for all of your help!!
Amit
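For reference, the restart-and-verify sequence discussed in this thread might look like the following on a systemd-based controller. The service name, log path, and job script name are assumptions; check SlurmctldLogFile in your slurm.conf for the actual log location.

```shell
# Restart the controller so a new or changed job_submit.lua is loaded
# (per this thread, "scontrol reconfigure" alone did not pick it up
# on Slurm 17.02).
systemctl restart slurmctld

# Confirm the plugin loaded (log path is an assumption)
grep "job_submit.lua: initialized" /var/log/slurm/slurmctld.log

# Submit a test job to the GPU partition and check the allocated TRES
sbatch -p gpgpu-1 job.sh
scontrol show node tp001 | grep -i tres
```

If the plugin is working, AllocTRES on the node should include gres/gpu=1 for the running job, matching CfgTRES in the scontrol output shown earlier in the thread.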