Ticket 11511

Summary: Providing debugging sessions for users with full access to the resources on the node
Product: Slurm Reporter: Nuance HPC Grid Admins <gridadmins>
Component: Configuration    Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: Nuance Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: CentOS
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf file from cluster needing debug queue

Description Nuance HPC Grid Admins 2021-05-03 13:57:34 MDT
Created attachment 19275 [details]
slurm.conf file from cluster needing debug queue

Hello,

  We are running several Slurm clusters in Azure orchestrated by Cyclecloud.  We are currently running Slurm 19.05.8 and Cyclecloud 7.9.9.  

  Our users are requesting the ability to open debugging sessions for jobs on both CPU and GPU compute resources.  The initial suggestion was to use an interactive srun session:

  srun -p gpu --jobid <jobid to debug> --pty bash -i

At various times, we run into conflicts with name resolution (because Cyclecloud and Slurm do not use the same names for the hosts/nodes), an inability to add a step to a jobid, or an inability to access the resource allocation for a job.

In particular, for GPU jobs that have requested generic resources for GPU devices, the srun session is unable to see the GPU devices on the node, or the devices allocated to the job, even when using the --jobid argument.  nvidia-smi reports no devices found.

What we would like to do is possibly configure an additional partition for debugging sessions that would be available on every node, but would not count against the available resources in the main partitions.  This queue would only be used for logging in and debugging running jobs.  It would need access to all available hardware resources on the running nodes, but only for debugging.
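A rough sketch of the kind of overlapping partition we have in mind, as it might appear in slurm.conf (the partition name and time limit here are illustrative, not taken from our attached configuration):

PartitionName=debug Nodes=ALL Default=NO MaxTime=4:00:00 State=UP

Nodes=ALL would make the partition span every node already defined for the cpu and gpu partitions, without removing any nodes from those partitions.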

In our cgroup.conf, we have the following:

CgroupAutomount=no
#CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainRamSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=no
ConstrainDevices=yes

For our partitions, from our cyclecloud.conf file (included in slurm.conf), we have:

PartitionName=cpu Nodes=cpu1hpc-pg0-[1-40],cpu1htc-[1-40],cpu2hpc-[1-40],cpu2htc-[1-40],cpu3hpc-[1-36],cpu3htc-[1-36] Default=YES DefMemPerCPU=6809 MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=gpu3hpc-[1-10],gpu3htc-[1-10],gpu4hpc-[1-10],gpu4htc-[1-10] Default=NO DefMemPerCPU=16343 MaxTime=INFINITE State=UP


Would there be any way to configure a third partition that doesn't use the cgroup constraints?
Comment 2 Ben Roberts 2021-05-03 16:03:11 MDT
Hello,

For the problem you're describing, we do have a solution that allows users to ssh to the node(s) their job is on and have the ssh session joined to the running job that was started by their user.  This gives them access to all the resources that were requested by the job.  This functionality is handled by the pam_slurm_adopt module.  The thing I'm unsure of is whether anything would prevent this from working in a cloud environment.  I looked into it a little, and I think that as long as you're able to look up usernames from the cloud nodes, using Azure AD Connect or some similar service, it should handle the requests correctly.  You would also need the appropriate ports open to be able to ssh to the nodes from your network.  The image used for the nodes would also need the appropriate changes made to its pam files.  

We have a guide on configuring pam_slurm_adopt that shows a good amount of detail of what needs to be done.  I would recommend starting by reading through that to make sure it meets your requirements and that you understand the process.  If you have questions about it please let me know.
https://slurm.schedmd.com/pam_slurm_adopt.html
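As a rough sketch of what the guide describes (file paths and PAM stack details vary by distribution, so treat this as illustrative rather than a drop-in configuration):

In slurm.conf, the "extern" step that adopted sessions join must be enabled:

PrologFlags=contain

In the node's PAM account stack (e.g. /etc/pam.d/sshd):

account    required    pam_slurm_adopt.so

With this in place, an ssh session by the job owner is placed into the job's cgroups, so it sees only the CPUs, memory, and devices allocated to that job.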

Thanks,
Ben
Comment 3 Nuance HPC Grid Admins 2021-05-04 09:50:31 MDT
So, there are several issues we see with using straight ssh in our Cyclecloud/slurm environment:

1.  We are using Cyclecloud/Slurm to have a completely ephemeral compute environment.  Whether a node remains provisioned is dictated by Slurm's knowledge of jobs assigned to the node.  An ssh session to a compute node does not inform Slurm that there is a need to keep the node provisioned; thus the desire to schedule an interactive job.
2. By using ssh, we would be telling our users it is acceptable to login to the compute nodes outside of the scheduler's knowledge.  We have locked out this type of access in our other environments as we do have users who will run tools/jobs on nodes that would impact the performance of jobs.  We do not want to make ssh access an option.
3.  Due to how the Slurm/Cyclecloud combination works, Slurm does not know the compute nodes' host names.  Slurm node names and Cyclecloud VM host names do not align.  So again, this becomes an issue of users having to query Slurm to get the IP address of the compute node they need to log in to.
Comment 4 Ben Roberts 2021-05-05 13:01:40 MDT
It sounds like I misunderstood your request initially.  The limitations you bring up are correct.  Using pam_slurm_adopt to ssh to the nodes is intended to happen when there is already a job on the node.  It makes it so that your ssh session is adopted into the current job so you have access to the resources requested by that job.  If you're not always going to connect to nodes with active jobs on them then this wouldn't work.  

I'm not sure that we can meet all the requirements you're asking for with a third partition, but you can configure a partition that overlaps the existing ones.  There isn't a way to disable cgroup enforcement for a particular partition.  It sounds like you want to submit jobs to nodes that currently have running jobs, which is why I thought pam_slurm_adopt might be what you were looking for.  You can configure the partitions to oversubscribe the nodes, which would allow you to have jobs from different partitions run together.  

Here's an example showing jobs from different partitions running on the same processors on the same node.  I submit two jobs from different partitions that print out the contents of the cpuset.cpus from the cgroups directory for that job.

$ sbatch -pdebug -wnode01 -n24 --mem=100M --oversubscribe --wrap='cat /sys/fs/cgroup/cpuset/slurm_node01/uid_1000/job_$SLURM_JOBID/cpuset.cpus; sleep 10'         
Submitted batch job 26529

$ sbatch -pgpu -wnode01 -n24 --mem=100M --oversubscribe --wrap='cat /sys/fs/cgroup/cpuset/slurm_node01/uid_1000/job_$SLURM_JOBID/cpuset.cpus; sleep 10'     
Submitted batch job 26530



These jobs are both running at the same time.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             26529     debug     wrap      ben  R       0:05      1 node01
             26530       gpu     wrap      ben  R       0:02      1 node01



Both jobs show in the output file that they are using CPUs 0-23.

$ cat slurm-26529.out
0-23

$ cat slurm-26530.out
0-23


I ran this example with the --oversubscribe flag included.  You can also add the 'OverSubscribe=FORCE' parameter on the partition definition to have this behavior turned on all the time.  
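For example, a partition definition with oversubscription forced on might look like this (the node list and limits are illustrative):

PartitionName=debug Nodes=node[01-10] Default=NO MaxTime=INFINITE OverSubscribe=FORCE State=UP

With OverSubscribe=FORCE on the partition, jobs submitted to it can share CPUs with jobs from overlapping partitions without each submission needing the --oversubscribe flag.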

Let me know if this sounds more like what you were looking for.

Thanks,
Ben
Comment 5 Nuance HPC Grid Admins 2021-05-11 13:39:36 MDT
Hi Ben,

  Yes, the oversubscribe option is what we would be looking to do for a debug queue.  The question becomes: can we have this oversubscribed partition configure cgroups so that any job in the partition has access to all resources?  For example, always map all GPU devices on a GPU node into any debug partition job?
Comment 6 Ben Roberts 2021-05-11 16:37:03 MDT
I'm afraid I have to back-pedal with the oversubscribe option.  When I typed up my previous response I forgot that oversubscription only applies to CPUs and not GPUs.  When you try to add GPUs to the request the second job will remain queued until the first finishes.  One option might be to define the GPUs as a GRES with the no_consume option set, but this would eliminate the cgroup enforcement on the GPUs, which would be a problem if there are going to be multiple GPU jobs on the same node.  
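For reference, a no_consume GRES is declared on the node definition in slurm.conf, roughly like this (the node name, counts, and memory value here are illustrative):

NodeName=gpu3hpc-1 CPUs=24 RealMemory=196608 Gres=gpu:no_consume:4 State=UNKNOWN

Because the GRES is never consumed, every job on the node can request the GPUs, but as noted above the devices are then no longer fenced off between jobs by the cgroup device controller.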

There is also MPS functionality that allows GPU sharing, but it's not quite what you're looking for.  There is a limitation that you can only have one GPU using MPS at a time and it works by sharing the time on the GPU.  If you're trying to debug an active job that's not going to help.
https://slurm.schedmd.com/gres.html#MPS_Management
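For completeness, MPS is configured as its own GRES alongside the GPUs, roughly like this (counts are illustrative; see the linked page for the full details):

In slurm.conf:
GresTypes=gpu,mps
NodeName=gpu3hpc-1 Gres=gpu:4,mps:400

A job then requests a share of a GPU with something like (my_app is a placeholder):
srun --gres=mps:100 ./my_app

Each job receives a fraction of a single GPU's time, which is the one-GPU-at-a-time limitation mentioned above.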

If neither attaching to the existing job with srun nor using SSH to access the node works for you then I'm afraid I don't have a good solution for this.  The closest would be to disable cgroups to allow the jobs to have access to all resources, which I don't recommend since it sounds like you do have multiple GPU jobs on these nodes at once.  Did you already open a ticket for the problem attaching to the existing job with srun?  Is that something we can look into further to try and make work?

Thanks,
Ben
Comment 7 Ben Roberts 2021-06-04 11:17:44 MDT
I wanted to follow up with you.  I know there wasn't a good option to do exactly what you are looking for, but did one of the potential workarounds sound like it might come close enough to work for you?  Let me know if you have any questions about this.

Thanks,
Ben
Comment 8 Ben Roberts 2021-07-02 11:10:03 MDT
I haven't heard any follow up questions about this so I assume the information I sent helped.  I'll close this ticket, but let us know if there's anything else we can do to help.

Thanks,
Ben