| Summary: | PrologFlags=Contain significantly changing job activity on compute nodes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Baker <d.j.baker> |
| Component: | slurmd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | d.j.baker |
| Version: | 18.08.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | OCF | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | Southampton University |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | Our slurm.conf; Our slurm.conf -- this is config used on our test node; taskprolog; cgroup.conf; slurmd.log-20181218.gz | | |
Hi David,

I have looked over the issue but I do not have enough information to determine a root cause. Would you set SlurmdDebug=debug3 on a compute node and re-test with PrologFlags=Contain? Please then attach the slurmd.log to the issue.

-Jason

Hi Jason,

Thank you for your email. I have switched up the slurmd debug level as you suggested, and restarted the slurmd on a compute node. I then submitted a job to that node and watched things develop. As expected, the prolog fails and the job dies prematurely. This is job 244126, and I have included the extract from the slurmd logs from starting the slurmd to job 244126 completing.

Best regards,
David

[2018-12-13T09:52:32.680] slurmd started on Thu, 13 Dec 2018 09:52:32 +0000
[2018-12-13T09:52:32.680] debug: attempting to run health_check [/usr/sbin/nhc]
[2018-12-13T09:52:33.954] CPUs=40 Boards=1 Sockets=2 Cores=20 Threads=1 Memory=193080 TmpDisk=96540 Uptime=69802 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-13T09:52:33.954] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
[2018-12-13T09:52:33.954] debug: AcctGatherEnergy NONE plugin loaded
[2018-12-13T09:52:33.954] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
[2018-12-13T09:52:33.955] debug: AcctGatherProfile NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_interconnect_none.so
[2018-12-13T09:52:33.955] debug: AcctGatherInterconnect NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
[2018-12-13T09:52:33.955] debug: AcctGatherFilesystem NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2018-12-13T09:52:33.956] debug2: _handle_node_reg_resp: slurmctld sent back 9 TRES.
[2018-12-13T09:52:53.013] debug3: in the service_connection
[2018-12-13T09:52:53.013] debug2: got this type of message 4005
[2018-12-13T09:52:53.013] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2018-12-13T09:52:53.013] debug2: _group_cache_lookup_internal: no entry found for djb1
[2018-12-13T09:52:53.013] debug3: state for jobid 244120: ctime:1544694634 revoked:1544694634 expires:1544694754
[2018-12-13T09:52:53.013] debug3: destroying job 244120 state
[2018-12-13T09:53:12.067] debug3: in the service_connection
[2018-12-13T09:53:12.067] debug2: got this type of message 1008
[2018-12-13T09:53:43.107] error: Waiting for JobId=244126 prolog has failed, giving up after 50 sec
[2018-12-13T09:53:43.108] Could not launch job 244126 and not able to requeue it, cancelling job
[2018-12-13T09:53:43.115] debug3: CPUs=40 Boards=1 Sockets=2 Cores=20 Threads=1 Memory=193080 TmpDisk=96540 Uptime=69871 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-13T09:53:43.251] debug3: in the service_connection
[2018-12-13T09:53:43.251] debug2: got this type of message 6011
[2018-12-13T09:53:43.251] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2018-12-13T09:53:43.251] debug: _rpc_terminate_job, uid = 84625
[2018-12-13T09:53:43.251] debug: credential for job 244126 revoked
[2018-12-13T09:53:43.251] debug2: No steps in jobid 244126 to send signal 18
[2018-12-13T09:53:43.251] debug2: No steps in jobid 244126 to send signal 15
[2018-12-13T09:53:43.251] debug2: set revoke expiration for jobid 244126 to 1544694943 UTS
[2018-12-13T09:53:43.252] debug: Waiting for job 244126's prolog to complete
[2018-12-13T09:53:43.252] debug: Finished wait for job 244126's prolog to complete
[2018-12-13T09:53:43.252] debug: [job 244126] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2018-12-13T09:53:53.582] debug: completed epilog for jobid 244126
[2018-12-13T09:53:53.583] debug3: slurm_send_only_controller_msg: sent 192
[2018-12-13T09:53:53.583] debug: Job 244126: sent epilog complete msg: rc = 0

Hi David,

I have not been able to duplicate this issue locally. The TaskProlog in your attached slurm.conf is commented out, but I would like to have you confirm whether the slurm.conf on the compute node is the same as the one you have attached. If it differs then please attach that slurm.conf.

-Jason

Created attachment 8645 [details]
Our slurm.conf -- this is config used on our test node
Sorry for the confusion over slurm.conf. I thought I had updated it on my Windows machine before attaching it to this ticket. I have now downloaded the correct slurm.conf and attached a copy this morning (14/12).
Hi Jason,

My apologies for attaching the wrong slurm.conf. I have now attached the correct copy to the ticket/bug. The task prolog should not have been commented out. As a matter of interest, which version of slurm are you using to test PrologFlags=contain?

Best regards,
David

David,
Would you also attach your /etc/slurm/taskprolog from that node.
> As a matter of interest which version of slurm are you using to test PrologFlags=contain?
18.08.0 and master. This works in both versions.
-Jason
Created attachment 8658 [details]
taskprolog

Hi Jason,

My taskprolog (and my cgroup.conf) files are attached.

Best regards,
David

Created attachment 8659 [details]
cgroup.conf
Hi Jason,

I've been taking a look at the PrologFlags=contain issue this morning. I don't think the contents of the task prolog are relevant; I changed to a simpler task prolog to no avail. When I set this option it appears that the usual job cgroups aren't created under /sys/fs/cgroup. So, for example, I would expect to find "cpuset/slurm/uid_<uid>/job_<id>". Such entries aren't created. I submit a job to the test node, however there is no evidence of the job actually executing on the node. It seems that the prolog is timed out after 50s (as shown in the slurmd log). If you are testing under slurm 18.08.0 then it would be interesting to know what's different on your system, and more to the point what stalls the creation of the cgroups on my test node. Everything works as expected without "PrologFlags=contain".

Best regards,
David
> I've been taking a look at the PrologFlags=contain issue this morning. I don't think the contents of the task prolog are relevant. I changed to a simpler taskprolog to no avail. When I set this option it does appear that the usual job cgroups aren't created under /sys/fs/cgroup. So, for example, I would expect to find "cpuset/slurm/uid_<uid>/job_<id>". Such entries aren't created.
This is interesting and makes me think that you may be missing some required packages. Can you let me know what cgroup packages you have installed and which distribution you are running? Please also send me your kernel version.
-Jason
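David's observation above is that the per-job cgroup directories never appear under /sys/fs/cgroup. A minimal sketch for spot-checking this on a node follows; it assumes the cgroup v1 layout discussed in this ticket, and the UID and job id are placeholder values taken from the log extract elsewhere in this report (substitute real values from `squeue` on the node).

```shell
# Check whether the expected per-job cgroup v1 directories exist.
# UID_TO_CHECK and JOBID are placeholders taken from this ticket's logs.
UID_TO_CHECK=57337
JOBID=243317

missing=0
for ctl in cpuset memory freezer devices; do
    d="/sys/fs/cgroup/${ctl}/slurm/uid_${UID_TO_CHECK}/job_${JOBID}"
    if [ -d "$d" ]; then
        echo "present: $d"
    else
        echo "missing: $d"
        missing=$((missing + 1))
    fi
done
echo "$missing of 4 controllers missing the job cgroup"
```

On a healthy node with ConstrainCores/ConstrainRAMSpace/ConstrainDevices enabled, the cpuset, memory and devices entries should exist for the lifetime of the job; all four reported missing while the job is allegedly running matches the symptom described here.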
Hi Jason,

The setup on the compute nodes is:

Distribution -- RH 7.4
Kernel -- 3.10.0-693.el7.x86_64
cgroup rpms -- libcgroup and libcgroup-tools

Best regards,
David

[root@red017 ~]# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 7 2 1
cpu 8 58 1
cpuacct 8 58 1
memory 4 59 1
devices 2 59 1
freezer 5 2 1
net_cls 3 1 1
blkio 10 58 1
perf_event 11 1 1
hugetlb 9 1 1
pids 6 1 1
net_prio 3 1 1

[root@red017 ~]# less /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup

David,

Thanks for that additional information. Would you also attach the slurmd.log file? I would be interested to know if there are any additional errors surrounding the creation of the cgroup.

-Jason

Created attachment 8691 [details]
slurmd.log-20181218.gz

Hi Jason,

I have attached the slurmd log from yesterday when I was running a number of test jobs.

Best regards,
David
Hi David,

Would you check that "PrologFlags=contain,alloc" is found in both the slurmctld's and the slurmds' slurm.conf. This needs to be in both slurm.conf files. In fact, I also want to point out that the entire cluster should have the same slurm.conf, otherwise you will see strange behavior, such as with communication. While reviewing the logs I noticed there was no "Processing RPC: REQUEST_LAUNCH_PROLOG"; these show up in the slurmd.log at debug2 or greater. It would also be good to have the corresponding slurmctld.logs for 2018-12-17.

-Jason

Hi Jason,

First of all, my apologies and thank you for that explanation. The key issue here was that I did not have PrologFlags=contain in the slurm.conf on the master host. Adding that statement into the slurm.conf on a compute node (and not on the slurmctld host) was confusing slurm, and as a result jobs could not start properly.

After getting PrologFlags to work I moved on to take a look at pam_slurm_adopt.so. Could I please ask a few questions re the plugin -- it's the key reason for implementing PrologFlags=contain? I decided to follow the instructions here -- https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags. So, to switch from pam_slurm.so to pam_slurm_adopt.so I did the following...

-- Commented out 'account required pam_slurm.so' in the system-auth/password-auth files
-- Added these lines to /etc/pam.d/sshd:

# - PAM config for Slurm - BEGIN
account sufficient pam_slurm_adopt.so
account required pam_access.so
# - PAM config for Slurm - END

-- Added these lines to /etc/security/access.conf:

+ : root : ALL
- : ALL : ALL

-- Do I need the file /etc/pam.d/slurm?

Does the above make sense? I can certainly login to a node when I have a job running on it. At other times ssh access to that node is barred. That aspect of pam_slurm_adopt.so appears to be working as expected.
How can I best check that my ssh session is constrained by the resources allocated to the slurm job? I submitted a job requesting 1 cpu/core to my test node, then ssh'ed into the node and finally tried to run an mpi job in the ssh session. I was surprised to see that I could apparently use all the cpu cores on the node to run my mpi job. Does that make sense?

Best regards,
David
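One way to answer the question of whether an ssh session was adopted is to inspect the session's own cgroup membership. This is a sketch under the cgroup v1 assumptions used in this ticket: an adopted session's entries in /proc/self/cgroup should contain a path like slurm/uid_<uid>/job_<id>, while an unadopted one will only show system paths (e.g. user.slice).

```shell
# Run from inside the ssh session on the compute node: check whether
# pam_slurm_adopt placed this shell into a Slurm job cgroup.
if grep -q 'slurm/uid_' /proc/self/cgroup 2>/dev/null; then
    adopted=yes
    echo "adopted: session is inside a Slurm job cgroup"
else
    adopted=no
    echo "NOT adopted: session is not constrained by the job's cgroups"
fi
echo "cgroup memberships:"
cat /proc/self/cgroup 2>/dev/null
```

If the session was adopted and ConstrainCores is in effect, `taskset -cp $$` (util-linux) should show the shell pinned to only the job's allocated core(s), which would explain, by its absence, why the mpi run above could spread across all 40 cores.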
> The key issue here was that I did not have PrologFlags=contain in the slurm.conf on the master host. Adding that statement into the slurm.conf on a compute node (and not on the slurmctld host) was confusing slurm and as a result jobs could not start properly.

Great. I am glad this resolved the issue. I will proceed to close this out since your last question was answered in bug 6130.
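The root cause was a slurm.conf divergence between the controller and a compute node. A sketch of the check that would have caught it: compare the PrologFlags line between the controller's slurm.conf and each node's copy. It is demonstrated here on two throwaway local files that reproduce this ticket's situation (flag present on the node, absent on the controller); in practice the node copy would be fetched from the compute node, or `scontrol show config | grep -i prologflags` compared against the node's file.

```shell
# Compare the PrologFlags setting between two slurm.conf copies.
# These are simulated files reproducing the mismatch from this ticket.
ctld_conf=$(mktemp)
node_conf=$(mktemp)
printf '#PrologFlags=Contain\n' > "$ctld_conf"   # controller copy: commented out
printf 'PrologFlags=Contain\n'  > "$node_conf"   # node copy: flag active

# grep returns nonzero when the line is absent, so guard with || true
ctld_flags=$(grep -i '^PrologFlags' "$ctld_conf" || true)
node_flags=$(grep -i '^PrologFlags' "$node_conf" || true)

if [ "$ctld_flags" = "$node_flags" ]; then
    match=yes
    echo "OK: PrologFlags identical on controller and node"
else
    match=no
    echo "MISMATCH: controller='${ctld_flags:-unset}' node='${node_flags:-unset}'"
fi
rm -f "$ctld_conf" "$node_conf"
```

Extending the same idea to a checksum of the whole file (e.g. md5sum on every host) catches the broader class of "entire cluster should have the same slurm.conf" problems Jason describes.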
Created attachment 8597 [details]
Our slurm.conf

Hello,

I wondered if someone could please help us to understand why the PrologFlags=contain flag is causing jobs to fail and draining compute nodes. I'm currently experimenting with PrologFlags=contain, and I've found that the addition of this flag in the slurm.conf radically changes the behaviour of jobs on the compute nodes.

When PrologFlags=contain is *commented out* in the slurm.conf, jobs are assigned to the compute node and start/execute as expected. Here is the relevant extract from the slurmd logs on that node:

[2018-12-12T09:51:40.748] _run_prolog: run job script took usec=4
[2018-12-12T09:51:40.748] _run_prolog: prolog with lock for job 243317 ran for 0 seconds
[2018-12-12T09:51:40.748] Launching batch job 243317 for UID 57337
[2018-12-12T09:51:40.762] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317: alloc=0MB mem.limit=193080MB memsw.limit=unlimited
[2018-12-12T09:51:40.763] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317/step_batch: alloc=0MB mem.limit=193080MB memsw.limit=unlimited

When PrologFlags=contain is *activated* I find the following:

-- I don't see the "_run_prolog" and the "task/cgroup" messages in the slurmd logs
-- The job prolog fails, the job fails and the job output is owned by root
-- The compute node is drained:

sinfo -lN | grep red017
....
red017 1 batch* drained 40 2:20:1 190000 0 1 (null) batch job complete f

Here is the extract from the slurmd logs:

[2018-12-12T09:56:54.564] error: Waiting for JobId=243321 prolog has failed, giving up after 50 sec
[2018-12-12T09:56:54.565] Could not launch job 243321 and not able to requeue it, cancelling job

I have attached a copy of the slurm.conf.

Best regards,
David