Ticket 6223

Summary: PrologFlags=Contain significantly changing job activity on compute nodes
Product: Slurm Reporter: David Baker <d.j.baker>
Component: slurmd Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: d.j.baker
Version: 18.08.0   
Hardware: Linux   
OS: Linux   
Site: OCF Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: Southampton University
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Our slurm.conf
Our slurm.conf -- this is config used on our test node
taskprolog
cgroup.conf
slurmd.log-20181218.gz

Description David Baker 2018-12-12 03:32:08 MST
Created attachment 8597 [details]
Our slurm.conf

Hello,

I wondered if someone could please help us to understand why the PrologFlags=contain flag is causing jobs to fail and draining compute nodes.

I'm currently experimenting with PrologFlags=contain. I've found that the addition of this flag in the slurm.conf radically changes the behaviour of jobs on the compute nodes.

When PrologFlags=contain is *commented out* in the slurm.conf, jobs are assigned to the compute node and start/execute as expected. Here is the relevant extract from the slurmd logs on that node...

[2018-12-12T09:51:40.748] _run_prolog: run job script took usec=4
[2018-12-12T09:51:40.748] _run_prolog: prolog with lock for job 243317 ran for 0 seconds
[2018-12-12T09:51:40.748] Launching batch job 243317 for UID 57337
[2018-12-12T09:51:40.762] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317: alloc=0MB mem.limit=193080MB memsw.limit=unlimited
[2018-12-12T09:51:40.763] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317/step_batch: alloc=0MB mem.limit=193080MB memsw.limit=unlimited

When PrologFlags=contain is *activated* I find the following...

-- I don't see the "_run_prolog" and the "task/cgroup" messages in the slurmd logs
-- The job prolog fails, the job fails and the job output is owned by root
-- The compute node is drained.

sinfo -lN | grep red017 ....
red017         1     batch*     drained   40   2:20:1 190000        0      1   (null) batch job complete f

Here is the extract from the slurmd logs:

[2018-12-12T09:56:54.564] error: Waiting for JobId=243321 prolog has failed, giving up after 50 sec
[2018-12-12T09:56:54.565] Could not launch job 243321 and not able to requeue it, cancelling job

I have attached a copy of the slurm.conf. 

Best regards,
David
Comment 1 Jason Booth 2018-12-12 13:53:00 MST
Hi David,

 I have looked over the issue but I do not have enough information to determine a root cause. Would you set SlurmdDebug=debug3 on a compute node and re-test with PrologFlags=Contain. Please then attach the slurmd.log to the issue.

-Jason
Comment 3 David Baker 2018-12-13 03:02:07 MST
Hi Jason,


Thank you for your email. I have raised the slurmd debug level as you suggested and restarted slurmd on a compute node. I then submitted a job to that node and watched things develop. As expected, the prolog fails and the job dies prematurely. This is job #244126, and I have included the extract from the slurmd logs from slurmd starting up to job #244126 completing.


Best regards,

David


[2018-12-13T09:52:32.680] slurmd started on Thu, 13 Dec 2018 09:52:32 +0000
[2018-12-13T09:52:32.680] debug:  attempting to run health_check [/usr/sbin/nhc]
[2018-12-13T09:52:33.954] CPUs=40 Boards=1 Sockets=2 Cores=20 Threads=1 Memory=193080 TmpDisk=96540 Uptime=69802 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-13T09:52:33.954] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
[2018-12-13T09:52:33.954] debug:  AcctGatherEnergy NONE plugin loaded
[2018-12-13T09:52:33.954] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
[2018-12-13T09:52:33.955] debug:  AcctGatherProfile NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_interconnect_none.so
[2018-12-13T09:52:33.955] debug:  AcctGatherInterconnect NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
[2018-12-13T09:52:33.955] debug:  AcctGatherFilesystem NONE plugin loaded
[2018-12-13T09:52:33.955] debug3: Success.
[2018-12-13T09:52:33.955] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2018-12-13T09:52:33.956] debug2: _handle_node_reg_resp: slurmctld sent back 9 TRES.
[2018-12-13T09:52:53.013] debug3: in the service_connection
[2018-12-13T09:52:53.013] debug2: got this type of message 4005
[2018-12-13T09:52:53.013] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2018-12-13T09:52:53.013] debug2: _group_cache_lookup_internal: no entry found for djb1
[2018-12-13T09:52:53.013] debug3: state for jobid 244120: ctime:1544694634 revoked:1544694634 expires:1544694754
[2018-12-13T09:52:53.013] debug3: destroying job 244120 state
[2018-12-13T09:53:12.067] debug3: in the service_connection
[2018-12-13T09:53:12.067] debug2: got this type of message 1008
[2018-12-13T09:53:43.107] error: Waiting for JobId=244126 prolog has failed, giving up after 50 sec
[2018-12-13T09:53:43.108] Could not launch job 244126 and not able to requeue it, cancelling job
[2018-12-13T09:53:43.115] debug3: CPUs=40 Boards=1 Sockets=2 Cores=20 Threads=1 Memory=193080 TmpDisk=96540 Uptime=69871 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-13T09:53:43.251] debug3: in the service_connection
[2018-12-13T09:53:43.251] debug2: got this type of message 6011
[2018-12-13T09:53:43.251] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2018-12-13T09:53:43.251] debug:  _rpc_terminate_job, uid = 84625
[2018-12-13T09:53:43.251] debug:  credential for job 244126 revoked
[2018-12-13T09:53:43.251] debug2: No steps in jobid 244126 to send signal 18
[2018-12-13T09:53:43.251] debug2: No steps in jobid 244126 to send signal 15
[2018-12-13T09:53:43.251] debug2: set revoke expiration for jobid 244126 to 1544694943 UTS
[2018-12-13T09:53:43.252] debug:  Waiting for job 244126's prolog to complete
[2018-12-13T09:53:43.252] debug:  Finished wait for job 244126's prolog to complete
[2018-12-13T09:53:43.252] debug:  [job 244126] attempting to run epilog [/etc/slurm/slurm.epilog.clean]
[2018-12-13T09:53:53.582] debug:  completed epilog for jobid 244126
[2018-12-13T09:53:53.583] debug3: slurm_send_only_controller_msg: sent 192
[2018-12-13T09:53:53.583] debug:  Job 244126: sent epilog complete msg: rc = 0


Comment 4 Jason Booth 2018-12-13 16:33:09 MST
Hi David,

 I have not been able to duplicate this issue locally. The TaskProlog in your attached slurm.conf is commented out but I would like to have you confirm if the slurm.conf on the compute node is the same as the one you have attached. If it differs then please attach that slurm.conf. 

-Jason
Comment 5 David Baker 2018-12-14 03:37:03 MST
Created attachment 8645 [details]
Our slurm.conf -- this is config used on our test node

Sorry for the confusion over slurm.conf. I thought I had updated it on my Windows machine before attaching it to this ticket. I have now downloaded the correct slurm.conf and attached a copy this morning (14/12).
Comment 6 David Baker 2018-12-14 03:44:25 MST
Hi Jason,


My apologies for attaching the wrong slurm.conf. I have now attached the correct copy to the ticket/bug. The task prolog should not have been commented out.


As a matter of interest which version of slurm are you using to test PrologFlags=contain?


Best regards,

David


Comment 7 Jason Booth 2018-12-14 13:51:08 MST
David,

 Would you also attach your /etc/slurm/taskprolog from that node.

> As a matter of interest which version of slurm are you using to test PrologFlags=contain?

18.08.0 and master. This works in both versions.

-Jason
Comment 8 David Baker 2018-12-14 14:51:49 MST
Created attachment 8658 [details]
taskprolog

Hi Jason,


My taskprolog (and my cgroup.conf) files are attached.


Best regards,

David


Comment 9 David Baker 2018-12-14 14:51:49 MST
Created attachment 8659 [details]
cgroup.conf
Comment 10 David Baker 2018-12-17 07:48:24 MST
Hi Jason,


I've been taking a look at the PrologFlags=contain issue this morning. I don't think the contents of the task prolog are relevant; I changed to a simpler task prolog to no avail. When I set this option, the usual job cgroups aren't created under /sys/fs/cgroup. For example, I would expect to find "cpuset/slurm/uid_<uid>/job_<id>", but such entries aren't created.


I submitted a job to the test node; however, there is no evidence of the job actually executing on the node. It seems that the prolog is timed out after 50s (as shown in the slurmd log). If you are testing under Slurm 18.08.0 then it would be interesting to know what's different on your system, and more to the point, what stalls the creation of the cgroups on my test node. Everything works as expected without PrologFlags=contain.
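A sketch of the check I have in mind, looking for the job's cgroup directories directly on the node (the uid and job id below are taken from the earlier log as examples; the mount point matches our cgroup.conf -- substitute your own values):

```shell
# Check whether Slurm created the job's cgroup directories under each
# subsystem it constrains. "missing" for all of them while the prolog
# is supposedly running matches what we see.
check_job_cgroups() {
    uid="$1"; jobid="$2"
    for subsys in cpuset memory devices freezer; do
        dir="/sys/fs/cgroup/${subsys}/slurm/uid_${uid}/job_${jobid}"
        if [ -d "$dir" ]; then
            echo "present: $dir"
        else
            echo "missing: $dir"
        fi
    done
}

check_job_cgroups 57337 243317
```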


Best regards,

David





Comment 11 Jason Booth 2018-12-17 15:45:17 MST

> I've been taking a look at the PrologFlags=contain issue this morning. I don't think the contents of the task prolog is relevant. I changed to a simpler taskprolog to no avail. When I set this option it does appear that the usual job cgroups aren't created under /sys/fs/cgroup. So, for example, I would expect to find, for example, "cpuset/slurm/uid_<uid>/job_<id>". Such entries aren't created.


This is interesting and makes me think that you may be missing some required packages. Can you let me know what cgroup packages you have installed and which distribution you are running? Please also send me your kernel version.


-Jason
Comment 12 David Baker 2018-12-18 03:21:36 MST
Hi Jason,


The setup on the compute nodes is...


Distribution -- RH 7.4

Kernel -- 3.10.0-693.el7.x86_64

cgroup rpms -- libcgroup and libcgroup-tools


Best regards,

David


[root@red017 ~]# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  7       2       1
cpu     8       58      1
cpuacct 8       58      1
memory  4       59      1
devices 2       59      1
freezer 5       2       1
net_cls 3       1       1
blkio   10      58      1
perf_event      11      1       1
hugetlb 9       1       1
pids    6       1       1
net_prio        3       1       1



[root@red017 ~]# less /etc/slurm/cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup



Comment 13 Jason Booth 2018-12-18 09:50:52 MST
David,

 Thanks for that additional information. Would you also attach the slurmd.log file. I would be interested to know if there are any additional errors surrounding the creation of the cgroup.

-Jason
Comment 14 David Baker 2018-12-18 10:19:39 MST
Created attachment 8691 [details]
slurmd.log-20181218.gz

Hi Jason,


I have attached the slurmd log from yesterday when I was running a number of test jobs.


Best regards,

David

Comment 16 Jason Booth 2018-12-18 15:07:42 MST
Hi David,

 Would you check that "PrologFlags=contain,alloc" is present in both the slurmctld's and the slurmds' slurm.conf. It needs to be in both slurm.conf files; in fact, the entire cluster should have the same slurm.conf, otherwise you will see strange behavior, for example with communication.

While reviewing the logs I noticed there was no "Processing RPC: REQUEST_LAUNCH_PROLOG".

These show up in the slurmd.log at debug2 or greater. 
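A quick way to check for those RPCs on a node is a grep over the slurmd log -- a sketch (the helper name is just for illustration; point it at whatever SlurmdLogFile is set to on your nodes):

```shell
# Count how many times slurmd logged the prolog-launch RPC. A count of
# zero while jobs are failing suggests slurmctld never sent the prolog
# request, i.e. it is not itself running with PrologFlags=contain.
count_prolog_rpcs() {
    grep -c 'Processing RPC: REQUEST_LAUNCH_PROLOG' "$1" 2>/dev/null || true
}

count_prolog_rpcs /var/log/slurm/slurmd.log
```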

It would also be good to have the corresponding slurmctld.logs for 2018-12-17.


-Jason
Comment 17 David Baker 2018-12-19 09:28:40 MST
Hi Jason,


First of all, my apologies, and thank you for that explanation. The key issue here was that I did not have PrologFlags=contain in the slurm.conf on the master host. Adding that statement to the slurm.conf on a compute node only (and not on the slurmctld host) was confusing Slurm, and as a result jobs could not start properly.
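For future reference, a checksum comparison would have caught this drift early. A minimal sketch (the helper name is illustrative; in practice I would fetch each node's copy over scp or run md5sum via pdsh and compare against the controller's copy):

```shell
# Report whether two copies of slurm.conf are identical, by checksum.
same_conf() {
    a=$(md5sum "$1" | awk '{print $1}')
    b=$(md5sum "$2" | awk '{print $1}')
    if [ "$a" = "$b" ]; then echo "OK"; else echo "MISMATCH"; fi
}

# e.g. after fetching a node's copy:
#   scp red017:/etc/slurm/slurm.conf /tmp/red017.slurm.conf
#   same_conf /etc/slurm/slurm.conf /tmp/red017.slurm.conf
```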


After getting PrologFlags to work I moved on to take a look at pam_slurm_adopt.so. Could I please ask a few questions about the plugin -- it's the key reason for implementing PrologFlags=contain? I decided to follow the instructions here -- https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags.


So, to switch from pam_slurm.so to pam_slurm_adopt.so I did the following...


-- commented out 'account     required      pam_slurm.so' in the system-auth/password-auth files


-- Added these lines to /etc/pam.d/sshd

# - PAM config for Slurm - BEGIN
account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so
# - PAM config for Slurm - END


-- Added these lines to /etc/security/access.conf

+ : root   : ALL
- : ALL    : ALL

-- do I need the file /etc/pam.d/slurm?


Does the above make sense? I can certainly log in to a node when I have a job running on it. At other times ssh access to that node is barred. That aspect of pam_slurm_adopt.so appears to be working as expected.


How can I best check that my ssh session is constrained by the resources allocated to the Slurm job? I submitted a job requesting 1 cpu/core to my test node, then ssh'ed into the node, and finally tried to run an MPI job in the ssh session. I was surprised to see that I could apparently use all the cpu cores on the node to run my MPI job. Does that make sense?
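For reference, this is the sort of check I have in mind, run inside the ssh session on the node (a sketch):

```shell
# If pam_slurm_adopt adopted the session, the cgroup paths should contain
# slurm/uid_<uid>/job_<id>, and Cpus_allowed_list should show only the
# job's allocated core(s), not all 40 cores on the node.
cat /proc/self/cgroup
grep Cpus_allowed_list /proc/self/status
```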


Best regards,

David

________________________________
From: bugs@schedmd.com <bugs@schedmd.com>
Sent: 18 December 2018 22:07
To: Baker D.J.
Subject: [Bug 6223] PrologFlags=Contain significantly changing job activity on compute nodes


Comment # 16<https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D6223%23c16&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7C525c42c0ba2f41bf048a08d665353c8a%7C4a5378f929f44d3ebe89669d03ada9d8%7C1&sdata=MUf35tvohHdb9mogoX0kjFtgHFH2yPbCPB6btpozlHw%3D&reserved=0> on bug 6223<https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D6223&data=01%7C01%7Cd.j.baker%40soton.ac.uk%7C525c42c0ba2f41bf048a08d665353c8a%7C4a5378f929f44d3ebe89669d03ada9d8%7C1&sdata=q651YnSSuzsnkWJ%2Bhx0OlBgR45WeCGdMH5pJbq%2FqmeY%3D&reserved=0> from Jason Booth<mailto:jbooth@schedmd.com>

Hi David,

 Would you check that "PrologFlags=contain,alloc" is found in both the
slurmctlds and slurmds slurm.conf. This needs to be in both slurm.conf files,
in fact, I also want to point out that the entire cluster should have the same
slurm.conf otherwise you will see strange behavior such as with communication.

While reviewing the logs I noticed there was no "Processing RPC:
REQUEST_LAUNCH_PROLOG").

These show up in the slurmd.log at debug2 or greater.

It would also be good to have the corresponding slurmctld.logs for 2018-12-17.


-Jason

________________________________
You are receiving this mail because:

  *   You reported the bug.
  *   You are on the CC list for the bug.
Comment 19 Jason Booth 2018-12-20 09:59:35 MST
> The key issue here was that I did not have PrologFlags=contain in the slurm.conf on the master host. Adding that statement into the slurm.conf on a compute node (and not on the slurmctld host) was confusing slurm and as a result jobs could not start properly.

Great. I am glad this resolved the issue. I will proceed to close this out since your last question was answered in bug 6130.