I have slurmd running in debug on e8001 so I can show you the symptoms I'm seeing when I type commands (and the debug output).

NodeName information:
NodeName=e8001 CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450

Debug info before I submit a job:

[root@e8001 ~]# slurmd --conf-server clnschedsvr1 -Dvvvvv
slurmd: debug: Log file re-opened
slurmd: Message aggregation disabled
slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/var/spool/slurmd/conf-cache/cgroup.conf)
slurmd: debug3: _set_slurmd_spooldir: initializing slurmd spool directory `/var/spool/slurmd`
slurmd: debug: skipping GRES for NodeName=e40[01-22] Name=gpu Type=tesla File=/dev/nvidia[0-3]
slurmd: debug: skipping GRES for NodeName=e50[01-02] Name=gpu Type=tesla File=/dev/nvidia[0-3]
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/gres_gpu.so
slurmd: debug: init: Gres GPU plugin loaded
slurmd: debug3: Success.
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:tesla:8:/dev/nvidia[0-7]
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/gpu_generic.so
slurmd: debug: init: GPU Generic plugin loaded
slurmd: debug3: Success.
slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
slurmd: debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
slurmd: debug3: gres_device_major : /dev/nvidia2 major 195, minor 2
slurmd: debug3: gres_device_major : /dev/nvidia3 major 195, minor 3
slurmd: debug3: gres_device_major : /dev/nvidia4 major 195, minor 4
slurmd: debug3: gres_device_major : /dev/nvidia5 major 195, minor 5
slurmd: debug3: gres_device_major : /dev/nvidia6 major 195, minor 6
slurmd: debug3: gres_device_major : /dev/nvidia7 major 195, minor 7
slurmd: Gres Name=gpu Type=tesla Count=8
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/topology_none.so
slurmd: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
slurmd: route default plugin loaded
slurmd: debug3: Success.
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug3: NodeName = e8001
slurmd: debug3: TopoAddr = e8001
slurmd: debug3: TopoPattern = node
slurmd: debug3: ClusterName = noether
slurmd: debug3: Confile = `/var/spool/slurmd/conf-cache/slurm.conf'
slurmd: debug3: Debug = 3
slurmd: debug3: CPUs = 56 (CF: 56, HW: 56)
slurmd: debug3: Boards = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets = 2 (CF: 2, HW: 2)
slurmd: debug3: Cores = 28 (CF: 28, HW: 28)
slurmd: debug3: Threads = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime = 2310724 = 26-17:52:04
slurmd: debug3: Block Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
slurmd: debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
slurmd: debug3: RealMemory = 386450
slurmd: debug3: TmpDisk = 452519
slurmd: debug3: Epilog = `(null)'
slurmd: debug3: Logfile = `(null)'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName = e8001
slurmd: debug3: Port = 6818
slurmd: debug3: Prolog = `(null)'
slurmd: debug3: TmpFS = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd = `/usr/sbin/slurmstepd'
slurmd: debug3: Spool Dir = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug = 10
slurmd: debug3: Pid File = `/run/slurmd.pid'
slurmd: debug3: Slurm UID = 487
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: Use PAM = 0
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/proctrack_linuxproc.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/task_affinity.so
slurmd: debug3: sched_getaffinity(0) = 0xfffffffffffffd
slurmd: task affinity plugin loaded with CPU mask 0xfffffffffffffd
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmd: debug: Munge authentication plugin loaded
slurmd: debug3: Success.
slurmd: debug: spank: opening plugin stack /var/spool/slurmd/conf-cache/plugstack.conf
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
slurmd: Munge credential signature plugin loaded
slurmd: debug3: Success.
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 20.02.5 started
slurmd: debug3: finished daemonize
slurmd: debug3: cred_unpack: job 332 ctime:1608142785 revoked:1608142786 expires:1608143136
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_linux.so
slurmd: debug: Job accounting gather LINUX plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/job_container_none.so
slurmd: debug: job_container none plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/prep_script.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/core_spec_none.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so
slurmd: debug: switch NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Successfully opened slurm listen port 6818
slurmd: slurmd started on Wed, 16 Dec 2020 13:22:17 -0500
slurmd: CPUs=56 Boards=1 Sockets=2 Cores=28 Threads=1 Memory=386450 TmpDisk=452519 Uptime=2310724 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_interconnect_none.so
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/var/spool/slurmd/conf-cache/acct_gather.conf)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.

Test user: xurabraun (logged on to a login server)

$ srun -p devel -w e8001 -n 4 hostname
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
e8001.noether
e8001.noether
e8001.noether
srun: error: e8001: task 1: Exited with exit code 1

slurmd in -Dvvvv reports this:

slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task 338.0 request from UID:4385 GID:4385 HOST:172.21.100.4 PORT:14555
slurmd: debug3: state for jobid 332: ctime:1608142785 revoked:1608142786 expires:1608143136
slurmd: debug3: destroying job 332 state
slurmd: debug: Checking credential with 724 bytes of sig data
slurmd: debug: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: slurmctld s 2 c 28; hw s 2 c 28 t 1
slurmd: debug3: task/affinity: job 338.0 core mask from slurmctld: 0x0000000000000F
slurmd: debug3: task/affinity: job 338.0 CPU final mask for local node: 0x0000000000000F
slurmd: debug3: _lllp_map_abstract_masks
slurmd: debug: binding tasks:4 to nodes:0 sockets:0:1 cores:4:0 threads:4
slurmd: lllp_distribution jobid [338] implicit auto binding: cores,one_thread, dist 8192
slurmd: _task_layout_lllp_cyclic
slurmd: debug3: task/affinity: slurmctld s 2 c 28; hw s 2 c 28 t 1
slurmd: debug3: task/affinity: job 338.0 core mask from slurmctld: 0x0000000000000F
slurmd: debug3: task/affinity: job 338.0 CPU final mask for local node: 0x0000000000000F
slurmd: debug3: _task_layout_display_masks jobid [338:0] 0x00000000000001
slurmd: debug3: _task_layout_display_masks jobid [338:1] 0x00000000000002
slurmd: debug3: _task_layout_display_masks jobid [338:2] 0x00000000000004
slurmd: debug3: _task_layout_display_masks jobid [338:3] 0x00000000000008
slurmd: debug3: _lllp_map_abstract_masks
slurmd: debug3: _task_layout_display_masks jobid [338:0] 0x00000000000001
slurmd: debug3: _task_layout_display_masks jobid [338:1] 0x00000000000002
slurmd: debug3: _task_layout_display_masks jobid [338:2] 0x00000000000004
slurmd: debug3: _task_layout_display_masks jobid [338:3] 0x00000000000008
slurmd: debug3: _lllp_generate_cpu_bind 4 17 69
slurmd: _lllp_generate_cpu_bind jobid [338]: mask_cpu,one_thread, 0x00000000000001,0x00000000000002,0x00000000000004,0x00000000000008
slurmd: debug: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x00000000000001,0x00000000000002,0x00000000000004,0x00000000000008)
slurmd: debug2: _insert_job_state: we already have a job state for job 338. No big deal, just an FYI.
slurmd: _run_prolog: run job script took usec=214
slurmd: _run_prolog: prolog with lock for job 338 ran for 0 seconds
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/var/spool/slurmd/conf-cache/cgroup.conf)
slurmd: debug3: slurmstepd rank 0 (e8001), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 338
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 487
slurmd: debug: task_p_slurmd_release_resources: affinity jobid 338
slurmd: debug: credential for job 338 revoked
slurmd: debug2: No steps in jobid 338 to send signal 999
slurmd: debug2: No steps in jobid 338 to send signal 18
slurmd: debug2: No steps in jobid 338 to send signal 15
slurmd: debug4: sent ALREADY_COMPLETE
slurmd: debug2: set revoke expiration for jobid 338 to 1608146409 UTS
slurmd: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

sacct shows the job failed:
338  hostname  devel  admins  4  FAILED  1:0
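For reference, the sacct line above can be reproduced with explicit field names (a sketch; the field list is an assumption about which columns are shown):

sacct -j 338 --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode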
Created attachment 17210 [details]
slurm.conf.121720.rab.txt

Would you please reproduce this with debug2 configured for the slurmd on the node you are testing with?
I added the debug (I used debug3, if that's OK), but I was unable to allocate CPUs on a compute node in this partition. Two users have CPUs allocated there, but not all of those CPUs are in use. My job goes to pending on Resources for some reason. I could kill their jobs, but I would like help determining why my job is pending. Also note that the other partition we defined is not having this same issue. The other partition (the one that works OK and has many more servers in it) is OverSubscribe=EXCLUSIVE, while the problem partition "devel" contains 4 servers and is not.
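For checking why the job is pending, something like the following usually shows the scheduler's stated reason (a sketch; the job ID is a placeholder and the squeue format string is just one possibility):

squeue -p devel --states=PD -o "%.10i %.9P %.8u %.2t %.10M %R"
scontrol show job <jobid> | grep -E "JobState|Reason"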
*** Ticket 10466 has been marked as a duplicate of this ticket. ***
Hi Ruth,

Can you attach your slurm.conf? What Linux distro and kernel are you running on?

Just a note: doing `slurmd -D` will not show you the stepd logs. Instead, it is recommended to run slurmd in the background and to actively monitor the slurmd.log during debugging, since that will include all the logs emitted by the steps. Could you reproduce the problem and then attach the relevant portions of your slurmd.log and slurmctld.log (rather than from the slurmd in the foreground)?

We recently fixed a similar cgroup-related error, so I would recommend upgrading to 20.02.6 to see if that solves the issue.

Thanks,
-Michael
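One way to capture that is to raise the slurmd debug level and follow the log while reproducing the failure (a sketch; the log path is an assumption, and the debug output above shows Logfile=(null), so the messages may be going to syslog instead):

# in slurm.conf (or the configless source), set e.g. SlurmdDebug=debug2, then push it out:
scontrol reconfigure
# on the compute node, follow the slurmd log while reproducing the srun failure:
tail -f /var/log/slurmd.log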
The server running slurmctld and slurmdbd is:

# uname -a
Linux clnschedsvr1.hpc.na.xom.com 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 (Maipo)

GPU compute nodes are running:
Red Hat Enterprise Linux ComputeNode release 7.6 (Maipo)
# uname -r
3.10.0-957.27.2.el7.x86_64

Slurm.conf attached – please do not publish to others.

Best Regards,
Ruth A. Braun
Created attachment 17211 [details] debug3-test-e8002.txt
Created attachment 17212 [details] ruthctld.log
I see the three attachments I emailed back above. Let me know if you need anything else. Ruth 12/17
Hi Ruth,

> Meanwhile, if I should upgrade (this cluster is not in production yet so
> I could do what I want)… should I just go directly to the latest release 20.11.1?

You could do that, but I would recommend upgrading minor versions for now (20.02.5 --> 20.02.6) because that can be easily done in place without needing to upgrade the database or tweak your configuration. Minor version upgrades only contain bug fixes and don't introduce new features or breaking changes.
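For what it's worth, a minor-version upgrade like this is usually just a package swap done host by host (a rough sketch; the RPM names are assumptions based on the RHEL 7 install, and the usual order is slurmdbd, then slurmctld, then the slurmds):

# with the 20.02.6 RPMs staged on the host:
systemctl stop slurmd            # or slurmctld/slurmdbd on the controller
rpm -Uvh slurm*-20.02.6-*.rpm
systemctl start slurmd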
Hi Michael,

Ok, I'll work on the upgrade to 20.02.6 today. I am out of office Christmas week, but working today and plan to check in periodically. Please continue to send info on the interpretation of my issue (and suggestions). I'll be back in the office 12/28.

Regards,
Ruth
Ruth,

On the nodes emitting the errors, *while a job causing the error is still running*, could you please run the following commands and paste the output here?

find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;

This will double check to see if cgroup stuff is being set and propagated correctly. We thought we fixed this in 20.02.6 and 20.11.0, but it's possible it did not get fixed completely.

-Michael
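As a supplementary check, it can also help to see which cpuset the slurmd itself is confined to, since the step inherits that restriction (a sketch; assumes slurmd is running and pidof resolves to a single PID):

cat /proc/$(pidof slurmd)/cgroup | grep cpuset
grep Cpus_allowed_list /proc/$(pidof slurmd)/status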
Alternatively, upgrade to 20.02.6 to see if that fixes things, and if not, then do what I asked in comment 14.
Will do!

Best Regards,
Ruth
Sorry this took so long, but here is output on a compute node that's running a job and at slurm-20.02.6-1:

[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1

# rpm -qa|grep slurm
slurm-20.02.6-1.el7.x86_64
slurm-slurmd-20.02.6-1.el7.x86_64
slurm-pam_slurm-20.02.6-1.el7.x86_64
slurm-perlapi-20.02.6-1.el7.x86_64
slurm-devel-20.02.6-1.el7.x86_64
slurm-libpmi-20.02.6-1.el7.x86_64
slurm-torque-20.02.6-1.el7.x86_64
slurm-contribs-20.02.6-1.el7.x86_64
slurm-example-configs-20.02.6-1.el7.x86_64

[root@e4001 ~]# date
Sat Jan 2 09:03:07 EST 2021
Please use this set of output instead of my last post. With 20.02.6-1 now running...

User xurabraun gets the error:

[xurabraun@vlogin003 ~]$ srun -p devel -N 1 -n 8 --pty bash
[xurabraun@SLURM]$ srun: error: e4002: task 1: Exited with exit code 1
[xurabraun@SLURM]$ hostname
e4002.noether
[xurabraun@SLURM]$ date
Sat Jan 2 09:23:40 EST 2021

(root ssh to compute node e4002 to perform the find commands while xurabraun's job is still running)

[root@e4002 ~]# date
Sat Jan 2 09:24:40 EST 2021
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1
(In reply to ruth.a.braun from comment #18)
> [root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
> /sys/fs/cgroup/cpuset/weka/cpuset.cpus
> 1
> /sys/fs/cgroup/cpuset/system/cpuset.cpus
> 0,2-55
> /sys/fs/cgroup/cpuset/cpuset.cpus
> 0-55

It appears that Weka is using cgroups to reserve CPU 1 on that node. However, Slurm doesn't know about this, and so when the job runs on the node, it tries to set the CPU affinity for CPU 1 and fails:

[2020-12-17T14:21:42.676] [384.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-17T14:21:42.677] [384.0] sched_setaffinity(18992,128,0x2) failed: Invalid argument
[2020-12-17T14:21:42.677] [384.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2020-12-17T14:21:42.677] [384.0] error: Failed to invoke task plugins: task_p_pre_launch error

See the "sched_setaffinity(18992,128,0x2)"? The task is trying to set the CPU affinity for CPU 1 (mask 0x2), but that CPU is already taken by Weka. So it produces an EINVAL error. From https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html:

"sched_setaffinity(2) ... EINVAL The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by cpuset cgroups or the "cpuset" mechanism described in cpuset(7)."

I think the solution here is to work with Weka to stop it from reserving a CPU. Another solution is to tell Slurm that CPU 1 is off limits for that node, so that it doesn't allocate it to tasks. You can do this I think with the "CPUSpecList" parameter in slurm.conf.

-Michael
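The same EINVAL can be reproduced outside of Slurm (a sketch; assumes cgroup v1 with the cpuset hierarchy laid out as shown above, run as root on the affected node):

# join the restricted 'system' cpuset, which excludes CPU 1, then try to pin to CPU 1:
echo $$ > /sys/fs/cgroup/cpuset/system/tasks
taskset -pc 1 $$    # sched_setaffinity() is expected to fail here with EINVAL, just like the stepd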
Michael,

Message received. Wondering also why the gpu partition does not show this issue (just the partition devel).

Fix help: could you specify what entry I make for slurm.conf, gres.conf and/or the cgroup.conf files?

For example, would I add this to the NodeName definition?

Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2

(see attached image001.png)

Best Regards,
Ruth
Created attachment 17360 [details] image001.png
Hi Ruth,

(In reply to ruth.a.braun from comment #22)
> Message received. Wondering also why the gpu partition does not show this
> issue (just the partition devel).
I'm not sure, without more information. Maybe the GPU nodes don't have Weka on them. Or maybe the jobs on that partition aren't being allocated CPUs restricted by cgroups, for whatever reason.

> Fix help: could you specify what entry I make for slurm.conf, gres.conf
> and/or the cgroup.conf files?
>
> For example, would I add this to the NodeName definition?
>
> Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28
> ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2
After reading the docs, I realized that CpuSpecList won't work. From https://slurm.schedmd.com/slurm.conf.html#OPT_CpuSpecList:

"This option has no effect unless cgroup job confinement is also configured (TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf)."

Since you only have task/affinity specified, the next option you could try is to use CoreSpecCount=4, TaskPluginParam=SlurmdOffSpec, and add spec_cores_first to your SchedulerParameters. This will hopefully reserve the first 4 cores, which will overlap with Weka's specified core (1). However, you will need to double check in the slurmd.log. For example:

Resource spec: Reserved abstract CPU IDs: 0-3
Resource spec: Reserved machine CPU IDs: 0-1,28-29

You want the reserved machine CPU IDs to overlap with the CPU reserved by Weka in cgroups (1). CoreSpecCount needs to be 4 (I think) in order to overlap with it. See https://slurm.schedmd.com/core_spec.html for more details on how cores are selected. Unfortunately, this will mean that four of your cores will not be usable by jobs, since it's an imprecise workaround.

To test, do

srun --exclusive grep Cpus_allowed_list /proc/self/status

to see what CPUs are allowed to the job (and by extension, the slurmd) on the node. I imagine you will get the same error if you try this command out right now, though.

----------------

The above workaround may be quicker, but here is my actual recommendation: set TaskPlugin=task/cgroup,task/affinity in slurm.conf and then set ConstrainCores=yes in cgroup.conf. Using the task cgroup plugin is recommended, because then jobs can't possibly use CPUs outside of their allocation. Without task/cgroup, a smart user could potentially use sched_setaffinity() in their program to use all the CPUs on the node, and there would be no way to stop them.

If you decide to use task/cgroup, my guess is that it will NOT play well with Weka's cgroup settings; there will be conflicts. So you will need to figure out why Weka is reserving CPUs and tell it to stop doing that. In the long run, I think this is the best path forward. You have a cgroup.conf file, but you aren't using any cgroup plugins in slurm.conf, so it's not doing anything. So my guess is that you actually wanted to take advantage of cgroups with Slurm to begin with. For more information on how to use cgroups, see https://slurm.schedmd.com/cgroups.html and https://slurm.schedmd.com/cgroup.conf.html.

Thanks,
-Michael
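For reference, a minimal sketch of what that recommendation would look like in the two files (only the lines shown here change; CgroupAutomount is included as an assumption, since it is commonly needed when the cgroup hierarchy is not pre-mounted):

slurm.conf:
TaskPlugin=task/cgroup,task/affinity

cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes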
Compute nodes which run the Weka client use 1 CPU core (id 1) for its purposes. It also reserves approximately 1.46 GB of memory from the compute nodes for its operations. Based on the info above, can you give me very specific examples for the various settings files? -Ruth
Well, one easy option you have is to comment out the task/affinity plugin altogether.

If that is not acceptable, and if turning off Weka's cgroup reservations and using Slurm's task/cgroup plugin is also not acceptable, do this (as mentioned in comment 25):

slurm.conf
*******************
Add "spec_cores_first" to your SchedulerParameters; set "TaskPluginParam=SlurmdOffSpec"; and add "CoreSpecCount=4" to the nodes that have Weka's reserved core:

SchedulerParameters=bf_window=43200,bf_resolution=600,bf_max_job_test=550,bf_max_job_part=350,bf_interval=300,bf_max_job_user=30,bf_continue,nohold_on_prolog_fail,spec_cores_first
TaskPluginParam=SlurmdOffSpec
Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CoreSpecCount=4

I'm assuming Weka reserves CPU 1 on all nodes, but if it's just a random CPU, that's a problem. So you should double check.

Then restart the slurmctld and slurmds. In the slurmd log, double check that machine CPU ID 1 is included in the reserved machine CPU IDs, as mentioned in comment 25.

-Michael
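After restarting, one quick way to confirm the reservation lines up with Weka's core (a sketch; the slurmd log path is an assumption, and the node/partition names are just examples):

# in the slurmd log on a reconfigured node, confirm CPU 1 is among the reserved machine CPU IDs:
grep "Reserved machine CPU IDs" /var/log/slurmd.log
# then from a login node, check which CPUs a job is actually allowed:
srun -p devel -w e4002 --exclusive grep Cpus_allowed_list /proc/self/status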
Hi Ruth, how is the workaround going?
Hi, I just put the easy option in place and commented out the task/affinity plugin altogether. We're testing now.
Hi Ruth, how is your testing going? Is the workaround working? Have you learned more about Weka? -Michael
I'll go ahead and close this out. Feel free to reopen if you want to pursue this further. Thanks! -Michael