Description
ruth.a.braun
2020-12-16 12:16:01 MST
Created attachment 17210 [details]
slurm.conf.121720.rab.txt
> Would you please reproduce this with debug2 configured for the slurmd on the node you are testing with?
Added the debug (used debug3, if that's ok), but I was unable to allocate CPUs on a compute node in this partition. There are two users who have CPUs allocated, but not all of them are in use. My job goes to pending (Resources) for some reason. I could kill their jobs, but I would like help determining why my resources are pending.
Also note that the other partition we defined is not having this same issue. The other partition is OverSubscribe=EXCLUSIVE (the one that works OK and has a lot more servers in it), while the problem partition "devel" contains 4 servers and is not.

*** Ticket 10466 has been marked as a duplicate of this ticket. ***

Hi Ruth,
Can you attach your slurm.conf? What Linux distro and kernel are you running on?
Just a note: doing `slurmd -D` will not show you the stepd logs. Instead, it is recommended to run slurmd in the background and to actively monitor the slurmd.log during debugging, since that will include all the logs emitted by the steps. Could you reproduce the problem and then attach the relevant portions of your slurmd.log and slurmctld.log (rather than output from the slurmd in the foreground)?
We recently fixed a similar cgroup-related error, so I would recommend upgrading to 20.02.6 to see if that solves the issue.
Thanks
-Michael

The server running slurmctld and slurmdbd is:
# uname -a
Linux clnschedsvr1.hpc.na.xom.com 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 (Maipo)
GPU compute nodes are running:
Red Hat Enterprise Linux ComputeNode release 7.6 (Maipo)
# uname -r
3.10.0-957.27.2.el7.x86_64
Slurm.conf attached – please do not publish to others.
Best Regards,
Ruth

Created attachment 17211 [details]
debug3-test-e8002.txt
Created attachment 17212 [details]
ruthctld.log
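A minimal way to follow the suggestion above of watching the slurmd log instead of running `slurmd -D` (a sketch only; the log path below is an assumption, the real location is whatever SlurmdLogFile in slurm.conf points to):

# raise slurmd verbosity in slurm.conf (SlurmdDebug is a standard option), then:
systemctl restart slurmd
tail -f /var/log/slurm/slurmd.log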
I see the three attachments I emailed back above. Let me know if you need anything else.
Ruth, 12/17

Hi Ruth,
> Meanwhile, if I should upgrade (this cluster is not in production yet so
> I could do what I want)… should I just go directly to the latest release 20.11.1.?
You could do that, but I would recommend upgrading minor versions for now (20.02.5 --> 20.02.6) because that can be easily done in place without needing to upgrade the database or tweak your configuration. Minor version upgrades only contain bug fixes and don't introduce new features or breaking changes.
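As an illustration of such an in-place minor-version upgrade (a sketch only; it assumes 20.02.6 RPMs built the same way as the installed 20.02.5 ones and available on the node, with a package set matching the `rpm -qa` listing shown later in this ticket):

# on each compute node: stop the daemon, upgrade the RPMs in place, restart
systemctl stop slurmd
rpm -Uvh slurm-*20.02.6*.el7.x86_64.rpm
systemctl start slurmd
# on the controller node, do the same and restart slurmctld (and slurmdbd)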
Hi Michael,
Ok, I'll work on the upgrade to 20.02.6 today. I am out of office Christmas week, but I am working today and plan to check in periodically. Please continue to send info on the interpretation of my issue (and suggestions). I'll be back in the office 12/28.
Regards,
Ruth

Ruth,
On the nodes emitting the errors, *while a job causing the error is still running*, could you please run the following commands and paste the output here?:
find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
This will double-check whether the cgroup settings are being set and propagated correctly. We thought we fixed this in 20.02.6 and 20.11.0, but it's possible it did not get fixed completely.
-Michael
Alternatively, upgrade to 20.02.6 to see if that fixes things, and if not, then do what I asked in comment 14.

Will do!
Best Regards,
Ruth
Sorry this took so long, but here is the output from a compute node that's running a job and is now at slurm-20.02.6-1:
[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1
# rpm -qa|grep slurm
slurm-20.02.6-1.el7.x86_64
slurm-slurmd-20.02.6-1.el7.x86_64
slurm-pam_slurm-20.02.6-1.el7.x86_64
slurm-perlapi-20.02.6-1.el7.x86_64
slurm-devel-20.02.6-1.el7.x86_64
slurm-libpmi-20.02.6-1.el7.x86_64
slurm-torque-20.02.6-1.el7.x86_64
slurm-contribs-20.02.6-1.el7.x86_64
slurm-example-configs-20.02.6-1.el7.x86_64
[root@e4001 ~]# date
Sat Jan 2 09:03:07 EST 2021
Please use this set of output instead of my last post:
With 20.02.6-1 now running...
User xurabraun gets error:
[xurabraun@vlogin003 ~]$ srun -p devel -N 1 -n 8 --pty bash
[xurabraun@SLURM]$ srun: error: e4002: task 1: Exited with exit code 1
[xurabraun@SLURM]$ hostname
e4002.noether
[xurabraun@SLURM]$ date
Sat Jan 2 09:23:40 EST 2021
(root ssh to compute node e4002 to perform the find commands while xurabraun's job is still running)
[root@e4002 ~]# date
Sat Jan 2 09:24:40 EST 2021
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1
(In reply to ruth.a.braun from comment #18)
> [root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
> /sys/fs/cgroup/cpuset/weka/cpuset.cpus
> 1
> /sys/fs/cgroup/cpuset/system/cpuset.cpus
> 0,2-55
> /sys/fs/cgroup/cpuset/cpuset.cpus
> 0-55
It appears that Weka is using cgroups to reserve CPU 1 on that node. However, Slurm doesn't know about this, and so when the job runs on the node, it tries to set the CPU affinity for CPU 1 and fails:
[2020-12-17T14:21:42.676] [384.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-17T14:21:42.677] [384.0] sched_setaffinity(18992,128,0x2) failed: Invalid argument
[2020-12-17T14:21:42.677] [384.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2020-12-17T14:21:42.677] [384.0] error: Failed to invoke task plugins: task_p_pre_launch error
See the "sched_setaffinity(18992,128,0x2)"? The task is trying to set the CPU affinity for CPU 1 (mask 0x2), but that CPU is already taken by Weka, so it produces an EINVAL error. From https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html:
"EINVAL: The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by cpuset cgroups or the "cpuset" mechanism described in cpuset(7)."
I think the solution here is to work with Weka to stop it from reserving a CPU. Another solution is to tell Slurm that CPU 1 is off limits for that node, so that it doesn't allocate it to tasks. I think you can do this with the "CpuSpecList" parameter in slurm.conf.
-Michael

Michael,
Message received. I'm wondering also why the gpu partition does not show this issue (just the partition devel).
Fix help: could you specify what entries I should make in slurm.conf, gres.conf and/or cgroup.conf? For example, would I add this to the NodeName definition?
Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2
Best Regards,
Ruth
Created attachment 17360 [details]
image001.png
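A small diagnostic sketch to verify the diagnosis above (illustrative only, not part of the original exchange; it assumes the slurmd runs inside the "system" cpuset shown earlier, which excludes CPU 1). Mask 0x2 is binary 10, i.e. CPU 1:

# read-only: confirm which cpuset cgroup the slurmd is in and its current affinity
grep cpuset /proc/$(pidof slurmd)/cgroup
taskset -p $(pidof slurmd)
# run from a shell in that same cpuset: pinning to CPU 1 should fail with the
# same "Invalid argument" (EINVAL) that the slurmstepd logged
taskset -c 1 true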
Hi Ruth,

(In reply to ruth.a.braun from comment #22)
> Message received. I'm wondering also why the gpu partition does not show this issue (just the partition devel).
I'm not sure, without more information. Maybe the GPU nodes don't have Weka on them. Or maybe the jobs on that partition aren't being allocated CPUs restricted by cgroups, for whatever reason.

> Fix help: could you specify what entries I should make in slurm.conf, gres.conf and/or cgroup.conf?
> For example, would I add this to the NodeName definition?
> Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2
After reading the docs, I realized that CpuSpecList won't work. From https://slurm.schedmd.com/slurm.conf.html#OPT_CpuSpecList: "This option has no effect unless cgroup job confinement is also configured (TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf)."

Since you only have task/affinity specified, the next option you could try is to use CoreSpecCount=4, TaskPluginParam=SlurmdOffSpec, and add spec_cores_first to your SchedulerParameters. This will hopefully reserve the first 4 cores, which will overlap with Weka's specified core (1). However, you will need to double-check in the slurmd.log. For example:
Resource spec: Reserved abstract CPU IDs: 0-3
Resource spec: Reserved machine CPU IDs: 0-1,28-29
You want the reserved machine CPU IDs to overlap with the CPU reserved by Weka in cgroups (1). CoreSpecCount needs to be 4 (I think) in order to overlap with it. See https://slurm.schedmd.com/core_spec.html for more details on how cores are selected. Unfortunately, this will mean that four of your cores will not be usable by jobs, since it's an imprecise workaround.

To test, run
srun --exclusive grep Cpus_allowed_list /proc/self/status
to see what CPUs are allowed to the job (and, by extension, the slurmd) on the node. I imagine you will get the same error if you try this command out right now, though.

----------------

The above workaround may be quicker, but here is my actual recommendation: set TaskPlugin=task/cgroup,task/affinity in slurm.conf and then set ConstrainCores=yes in cgroup.conf. Using the task/cgroup plugin is recommended, because then jobs can't possibly use CPUs outside of their allocation. Without task/cgroup, a smart user could potentially use sched_setaffinity() in their program to use all CPUs on the node, and there would be no way to stop them.

If you decide to use task/cgroup, my guess is that it will NOT play well with Weka's cgroup settings; there will be conflicts. So you will need to figure out why Weka is reserving CPUs and tell it to stop doing that. In the long run, I think this is the best path forward.

You have a cgroup.conf file, but you aren't using any cgroup plugins in slurm.conf, so it's not doing anything. So my guess is that you actually wanted to take advantage of cgroups with Slurm to begin with. For more information on how to use cgroups, see https://slurm.schedmd.com/cgroups.html and https://slurm.schedmd.com/cgroup.conf.html.

Thanks,
-Michael

Compute nodes that run the Weka client use one CPU core (ID 1) for its purposes; it also reserves approximately 1.46 GB of memory from each compute node for its operations. Based on the info above, can you give me very specific examples for the various settings files?
-Ruth

Well, one easy option you have is to comment out the task/affinity plugin altogether.
If that is not acceptable, and if turning off Weka's cgroup reservations and using Slurm's task/cgroup plugin is also not acceptable, do this (as mentioned in comment 25):

slurm.conf
*******************
Add "spec_cores_first" to your SchedulerParameters, set "TaskPluginParam=SlurmdOffSpec", and add "CoreSpecCount=4" to the nodes that have Weka's reserved core:

SchedulerParameters=bf_window=43200,bf_resolution=600,bf_max_job_test=550,bf_max_job_part=350,bf_interval=300,bf_max_job_user=30,bf_continue,nohold_on_prolog_fail,spec_cores_first
TaskPluginParam=SlurmdOffSpec
Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CoreSpecCount=4

I'm assuming Weka reserves CPU 1 on all nodes, but if it's just a random CPU, that's a problem, so you should double-check. Then restart the slurmctld and slurmds. In the slurmd log, double-check that machine CPU ID 1 is included in the reserved machine CPU IDs, as mentioned in comment 25.
-Michael

Hi Ruth, how is the workaround going?

Hi, I just put in place the easy option: commenting out the task/affinity plugin altogether. We're testing now.

Hi Ruth, how is your testing going? Is the workaround working? Have you learned more about Weka?
-Michael

I'll go ahead and close this out. Feel free to reopen if you want to pursue this further. Thanks!
-Michael
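For completeness, a minimal sketch of the cgroup-based setup recommended in comment 25 (this assumes Weka's own CPU reservation is removed first; only the two settings named in the ticket are shown, everything else stays as in the attached slurm.conf):

# slurm.conf
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
ConstrainCores=yes

After restarting slurmctld and the slurmds, the allocation can be verified with the check from comment 25:
srun --exclusive grep Cpus_allowed_list /proc/self/status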