Ticket 6130

Summary: Trying to get pam_slurm_adopt to work
Product: Slurm Reporter: David Baker <d.j.baker>
Component: Other    Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.0   
Hardware: Linux   
OS: Linux   
Site: OCF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: Southampton University
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description David Baker 2018-11-29 09:56:56 MST
Hello, 

Could you help me set up the pam_slurm_adopt plugin, please?

I am hoping to evaluate pam_slurm_adopt, but I am having difficulty getting it to work on our cluster. We have been using the pam_slurm plugin without any issues -- it works as expected. In due course it would be useful to be able to upgrade.

On a test node I have commented out the reference to pam_slurm, and edited the following files...

/etc/security/access.conf -- added 
+ : root (hpcadmins) : ALL
- : ALL : ALL

/etc/pam.d/sshd -- added the following at the bottom...
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context

I have put jobs on the test node and tried to log in, however I can't...

[djb1@cyan51 slurm]$ ssh red212
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
Comment 1 David Baker 2018-11-30 04:35:44 MST
Hello,

Here's an important update to my ticket. This morning I noticed that I should really have "PrologFlags=contain" set in my slurm.conf. I edited slurm.conf and pushed it out across the cluster. As I was pushing this change out I noticed that the compute nodes were starting to go offline (draining). I submitted a couple of test jobs myself and saw that their compute nodes were drained (reason "batch job complete f"). I also saw that my job output was owned by root. This seems odd, since I read that "PrologFlags=contain" should be harmless.

Needless to say, I reversed the change and pushed the original slurm.conf back out.
Comment 2 Marshall Garey 2018-12-03 12:50:33 MST
PrologFlags=contain definitely shouldn't have caused nodes to drain. Can you attach your slurm.conf file and a slurmd log file of a node that drained from the time when you set PrologFlags=contain and saw nodes draining to when you removed PrologFlags=contain?
Comment 3 David Baker 2018-12-04 09:51:57 MST
Created attachment 8514 [details]
slurm.conf

Thank you for your email. I've attached the slurm.conf from the cluster. PrologFlags=contain was commented out before this configuration was pushed out to the cluster a second time (following the issue I've described).

The slurmd log from an affected node is not particularly interesting during this period. The modified slurm.conf (with PrologFlags=contain active) was pushed out to the nodes just after 10:00, and then again just after 11:00 (with PrologFlags=contain commented out). The slurmd log extract from this period is shown below.

The issue was triggered at 10:34 by a job (submitted by me -- I had a reservation on the node) starting on that node -- the prolog fails very soon after the job starts. I noticed that the output returned from the job was, for some reason, owned by root.

Best regards,
David

[2018-11-30T10:02:12.391] Slurmd shutdown completing
[2018-11-30T10:03:12.423] Message aggregation disabled
[2018-11-30T10:03:12.425] CPU frequency setting not configured for this node
[2018-11-30T10:03:12.428] slurmd version 18.08.0 started
[2018-11-30T10:03:12.428] slurmd started on Fri, 30 Nov 2018 10:03:12 +0000
[2018-11-30T10:03:13.703] CPUs=40 Boards=1 Sockets=2 Cores=20 Threads=1 Memory=193080 TmpDisk=96540 Uptime=4384777 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-11-30T10:09:42.439] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:09:42.450] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:19:42.511] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:19:42.521] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:29:42.500] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:29:42.510] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:34:49.033] error: Waiting for JobId=229593 prolog has failed, giving up after 50 sec
[2018-11-30T10:34:49.035] Could not launch job 229593 and not able to requeue it, cancelling job
[2018-11-30T10:39:42.584] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:39:42.594] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:49:42.400] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:49:42.410] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:59:42.607] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:59:42.617] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-11-30T10:59:59.983] Job 229658 already killed, do not launch batch job
[2018-11-30T11:00:00.003] Job 229657 already killed, do not launch batch job
[2018-11-30T11:02:28.654] Slurmd shutdown completing
[2018-11-30T11:03:28.684] Message aggregation disabled
[2018-11-30T11:03:28.686] CPU frequency setting not configured for this node
[2018-11-30T11:03:28.689] slurmd version 18.08.0 started



Comment 4 David Baker 2018-12-07 02:05:34 MST
Hello,


I wondered whether you had any updates on this ticket, please. I attached our slurm.conf and some slurmd logs from a node to an email on Monday and have not heard anything since. Do you think this behaviour is a bug in Slurm 18.08.0 or just a weird interaction with my prologue script?

It would be good to be able to move ahead with this, if possible. We are keen to implement pam_slurm_adopt.so so that user ssh sessions are contained within the envelope of their slurm job. On the other hand, I understand that it may not be straightforward to sort out the "PrologFlags" issue; however, it would be good to hear any updates on your progress so far.

Best regards,
David

Comment 5 Marshall Garey 2018-12-07 12:02:56 MST
I believe the prolog failed because of the communication issues, not because of your job. From your slurm.conf it looks like it wasn't a job prolog, but perhaps the taskprolog. I don't know why the output file is owned by root. A guess is that the file hadn't been created yet, so the slurmstepd (running as root) created the file, then wrote to it. But I haven't seen that, so I don't know for sure.

I also haven't been able to reproduce setting PrologFlags=contain causing those connection issues or job failures. It really shouldn't have anything to do with communication errors, which I think is the real problem here.

Am I right in assuming you saw these communication error messages only after you pushed out the change to slurm.conf, then they went away as soon as you reverted the change and pushed out the old slurm.conf?
Comment 6 Marshall Garey 2018-12-07 12:03:42 MST
These are the errors I'm referring to:

[2018-11-30T10:09:42.439] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-11-30T10:09:42.450] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
Comment 7 David Baker 2018-12-11 10:16:19 MST
Hello,


Apologies that I haven't had much time to look at the PrologFlags problem. I can confirm the messages in the slurm logs -- that is, the communication messages shown below. They appear in the slurmd log files regardless; these messages appear even if "PrologFlags=contain" is commented out.


Presumably there is no report of a bug relating to the use of that PrologFlag with Slurm v18, and I gather that you cannot reproduce this issue. Any thoughts on what to try next, please? Am I correct in thinking that I will need PrologFlags=contain in place and working for pam_slurm_adopt.so to work as expected?


The only reason for being interested in pam_slurm_adopt.so is that we would like to run user jobs on a set of compute nodes in a shared environment. In other words, we would like to stack small jobs from different users on a set of nodes (controlled by our "serial" partition). Obviously, users may need to log in to the nodes; however, we don't want users to be able to run processes via an ssh session that are potentially unconstrained by their slurm job. For example, a user could log on to one of the nodes via ssh and bring that node down by swamping the memory. Currently we are using pam_slurm.so. Thoughts on this, please?


Best regards,

David


[2018-12-11T16:06:07.180] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-12-11T16:06:07.190] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-12-11T16:16:07.135] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-12-11T16:16:07.145] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2018-12-11T16:26:07.160] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[2018-12-11T16:26:07.170] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received




Comment 8 Marshall Garey 2018-12-11 10:31:48 MST
Yes, you absolutely need PrologFlags=contain for pam_slurm_adopt to work. Since the error messages appear with or without PrologFlags=contain, do you also ever see nodes draining without PrologFlags=contain, with these error messages near the draining event? I really think PrologFlags=contain has nothing to do with the draining. But nodes draining is something you'll want to investigate further if it keeps happening, probably in a new ticket if we want to keep this one focused on pam_slurm_adopt.

I suggest re-adding PrologFlags=contain to slurm.conf and reconfiguring (no need to restart the daemons) and see what happens. It *shouldn't* change the behavior of jobs, except for adding the "extern" step to newly started jobs. Assuming it doesn't cause any problems, continue following the installation instructions on the pam_slurm_adopt web page, and let me know if you run into problems getting it set up.

https://slurm.schedmd.com/pam_slurm_adopt.html
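The suggested change is a one-line slurm.conf edit followed by a reconfigure. A minimal sketch (fragment only; the rest of the file is unchanged):

```
# slurm.conf -- re-enable the flag and push the file out to all nodes
PrologFlags=contain
```

Then run `scontrol reconfigure`; newly started jobs should gain the "extern" step, which you can check with, e.g., `sacct -j <jobid>` on a fresh job.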


> The only reason for being interested in pam_slurm_adopt.so is because we
> would like to run user jobs on a set of compute nodes in a shared
> environment. In other words we would like to stack small jobs from different
> users on a set of nodes (controlled by our "serial" partition). Obviously,
> users may need to login into the nodes, however we don't want users to be
> able to run processes via an ssh session that are potentially unconstrained
> by their slurm job. In other words, it is possible that a user could logon
> to one of the nodes via ssh and bring that node down by swamping the memory.
> Currently we are using pam_slurm.co. Thoughts on this, please?

Sounds good. pam_slurm_adopt will do what you want.
Comment 9 David Baker 2018-12-20 05:44:53 MST
Hello,


It looks like pam_slurm_adopt is finally working. It bars access to compute nodes until a job is allocated/running on a node, and only then can the job owner ssh in. I have been testing the plugin on my shared partition (where memory is treated as a resource). As an exercise I started an interactive session using slurm (default mem/core is 4300 MB). I then logged into the compute node via ssh and ran an executable that allocates memory (for no good reason). I can allocate 4 GB, but if I try to allocate 40 GB then the executable fails with a bus error. I checked, by the way, that I could allocate 40 GB when the node was free. So, in other words, it looks like my ssh session is being constrained by my slurm job.
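A sketch of the kind of allocation test described above (the `memhog` function and its sizes are my own illustration, not the actual executable used): it forces the shell to hold N MiB in memory, so running it inside an adopted ssh session should fail once N exceeds the job's cgroup limit.

```shell
# memhog MIB - hold roughly MIB mebibytes of data in a shell variable.
# Inside a memory-constrained cgroup, a large enough value should be
# killed by the memory limit rather than succeed.
memhog() {
    local mib=${1:-64}
    local data
    # Read the bytes from /dev/zero and convert NULs to a storable
    # character, forcing the shell to actually allocate the memory.
    data=$(head -c $((mib * 1024 * 1024)) /dev/zero | tr '\0' 'x') || return 1
    echo "held ${mib} MiB (${#data} bytes)"
}
```

Running `memhog 4096` inside the ssh session and `memhog 40960` should then reproduce the 4 GB success / 40 GB failure seen above.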


Looks like we can now close both this ticket and #6223.


Finally, I wanted to double-check that my node setup for pam_slurm_adopt is sane, please -- see below.


Best regards,

David


I decided to follow the instructions here -- https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prologflags.


So, to switch from pam_slurm.so to pam_slurm_adopt.so I did the following...


-- commented out 'account     required      pam_slurm.so' in the system-auth/password-auth files


-- Added these lines to /etc/pam.d/sshd

# - PAM config for Slurm - BEGIN
account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so
# - PAM config for Slurm - END


-- Added these lines to /etc/security/access.conf

+ : root   : ALL
- : ALL    : ALL

-- do I need the file /etc/pam.d/slurm?


Comment 10 Marshall Garey 2018-12-20 09:18:27 MST
Awesome, I'm glad to hear it's working.

Your setup for pam_slurm_adopt looks fine. However, pam_slurm_adopt already ignores root (it always allows root). So, if root is the only user you want as the exception for pam_slurm_adopt, you don't actually need pam_access.so below pam_slurm_adopt.so. If you get rid of pam_access.so, make sure to change pam_slurm_adopt.so from "sufficient" to "required." If you anticipate adding additional users to the exception list in pam_access, then you can leave it as-is.
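If root really is the only exception, the simplified stanza would look something like this (illustrative fragment for /etc/pam.d/sshd; pam_slurm_adopt always allows root itself):

```
# - PAM config for Slurm - BEGIN
# pam_access.so removed, so pam_slurm_adopt.so changes from
# "sufficient" to "required"
account    required     pam_slurm_adopt.so
# - PAM config for Slurm - END
```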

You shouldn't need the file /etc/pam.d/slurm for pam_slurm_adopt.

Just in case you didn't know already, we have updated documentation for pam_slurm_adopt here:

https://slurm.schedmd.com/pam_slurm_adopt.html
Comment 11 Marshall Garey 2018-12-20 10:00:51 MST
From bug 6223 comment 17:

"How can I best check that my ssh session is constrained by the resources allocated to the slurm job? I  submitted a job requesting 1 cpu/core to my test node, then ssh'ed into the node and finally tried to run a mpi job in the ssh session. I was surprised to see that I could apparently use all the cpu cores on the node to run my mpi job. Does that make sense?"

In order to constrain cpus, you need ConstrainCores=yes in cgroup.conf and likely task/affinity as well as task/cgroup. I suspect you don't have ConstrainCores=yes in cgroup.conf, since you don't have task/affinity. Adding task/affinity to slurm.conf will require a restart of the slurmctld and slurmd daemons.
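A minimal sketch of the settings described here (fragments only; surrounding options in both files are unchanged):

```
# cgroup.conf
ConstrainCores=yes

# slurm.conf -- task/affinity added alongside task/cgroup; this change
# requires restarting slurmctld and slurmd, not just a reconfigure
TaskPlugin=task/affinity,task/cgroup
```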
Comment 12 David Baker 2018-12-21 07:29:43 MST
Hello,


Thank you. I can confirm that my processes started in an ssh session are being constrained by the slurm job. I had "ConstrainCores=yes" configured, but had neglected to set the "affinity" plugin in slurm.conf. I set the affinity plugin today, pushed out the slurm.conf and restarted the slurm processes across the cluster.


I decided to test things at a fairly low level. That is, I started a number of interactive jobs, each with a specific core requirement. Let's say one of the jobs requested 10 cores. I then ssh'ed into the assigned node and started a set (>10 in this example) of processes/scripts, each mindlessly looping around a couple of nested do loops in bash. No matter how many times I fired up the script, I saw that all the processes were constrained to just 10 cores on the node. So, yes, I am constraining processes on my assigned cores.
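The busy-loop test above can be sketched as follows (`spin_workers` and its parameters are my own illustration of the scripts described, not the originals): start more CPU-bound workers than the job has cores, then watch in `top`/`htop` that their combined usage never exceeds the allocated core count.

```shell
# spin_workers N ITER - start N background shells, each burning ITER
# iterations of pure CPU work, then wait for all of them to finish.
# With N greater than the job's core allocation, total CPU usage should
# stay pinned to the allocated cores.
spin_workers() {
    local nworkers=${1:-4} iter=${2:-1000000}
    local w
    for ((w = 0; w < nworkers; w++)); do
        ( for ((i = 0; i < iter; i++)); do :; done ) &
    done
    wait
    echo "all $nworkers workers finished"
}
```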


I should really open another bug/ticket to ask this, but I wondered if someone could please remind me when our SchedMD contract comes to an end and what the approximate cost of renewing it is.


Best regards,

David


[2018-12-21T13:50:47.500] task/affinity: job 250925 CPU input mask for node: 0x0000000001
[2018-12-21T13:50:47.500] task/affinity: job 250925 CPU final HW mask for node: 0x0000000001
[2018-12-21T13:50:47.500] _run_prolog: run job script took usec=3
[2018-12-21T13:50:47.500] _run_prolog: prolog with lock for job 250925 ran for 0 seconds
[2018-12-21T13:50:47.513] [250925.extern] task/cgroup: /slurm/uid_57337/job_250925: alloc=4300MB mem.limit=4300MB memsw.limit=unlimited
[2018-12-21T13:50:47.513] [250925.extern] task/cgroup: /slurm/uid_57337/job_250925/step_extern: alloc=4300MB mem.limit=4300MB memsw.limit=unlimited
[2018-12-21T13:50:47.515] Launching batch job 250925 for UID 57337
[2018-12-21T13:50:47.525] [250925.batch] task/cgroup: /slurm/uid_57337/job_250925: alloc=4300MB mem.limit=4300MB memsw.limit=unlimited
[2018-12-21T13:50:47.525] [250925.batch] task/cgroup: /slurm/uid_57337/job_250925/step_batch: alloc=4300MB mem.limit=4300MB memsw.limit=unlimited
[2018-12-21T13:50:47.528] [250925.batch] task_p_pre_launch: Using sched_affinity for tasks



Comment 13 Marshall Garey 2018-12-21 09:10:05 MST
You can contact Jacob Jenson at jacob@schedmd.com for information about the contract. I don't actually know.

I'll close this ticket as infogiven. If you have more questions, just respond and it'll re-open automatically.