Description
ruth.a.braun
2020-12-16 12:16:01 MST
Created attachment 17210 [details]
slurm.conf.121720.rab.txt
> Would you please reproduce this with debug2 configured for the slurmd on the node you are testing with?
Added the debug (used debug3, if that's ok), but I was unable to allocate CPUs on a compute node in this partition. There are two users who have CPUs allocated, but not all of them are in use. My job goes to pending (Resources) for some reason. I could kill their jobs, but I would like help determining why my resources are pending.
Also note that the other partition we defined is not having this same issue. The other partition is OverSubscribe=EXCLUSIVE (the one that works OK and has a lot more servers in it), while the problem partition "devel" contains 4 servers and is not.

*** Ticket 10466 has been marked as a duplicate of this ticket. ***

Hi Ruth,
Can you attach your slurm.conf? What Linux distro and kernel are you running on?
Just a note: doing `slurmd -D` will not show you the stepd logs. Instead, it is recommended to run slurmd in the background and to actively monitor the slurmd.log during debugging, since that will include all the logs emitted by the steps. Could you reproduce the problem and then attach the relevant portions of your slurmd.log and slurmctld.log (rather than output from the slurmd in the foreground)?
We recently fixed a similar cgroup-related error, so I would recommend upgrading to 20.02.6 to see if that solves the issue.
Thanks
-Michael

The server running slurmctld and slurmdbd is:
# uname -a
Linux clnschedsvr1.hpc.na.xom.com 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 (Maipo)
GPU compute nodes are running:
Red Hat Enterprise Linux ComputeNode release 7.6 (Maipo)
# uname -r
3.10.0-957.27.2.el7.x86_64
Slurm.conf attached – please do not publish to others.
Best Regards,
Ruth

Created attachment 17211 [details]
debug3-test-e8002.txt
Created attachment 17212 [details]
ruthctld.log
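A minimal way to follow the suggestion above of watching the slurmd log instead of running `slurmd -D` (a sketch only; the log path below is an assumption, the real location is whatever SlurmdLogFile in slurm.conf points to):

# raise slurmd verbosity in slurm.conf (SlurmdDebug is a standard option), then:
systemctl restart slurmd
tail -f /var/log/slurm/slurmd.log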
I see the three attachments I emailed back above. Let me know if you need anything else.
Ruth, 12/17

Hi Ruth,
> Meanwhile, if I should upgrade (this cluster is not in production yet so
> I could do what I want)… should I just go directly to the latest release 20.11.1.?
You could do that, but I would recommend upgrading minor versions for now (20.02.5 --> 20.02.6) because that can be easily done in place without needing to upgrade the database or tweak your configuration. Minor version upgrades only contain bug fixes and don't introduce new features or breaking changes.
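As an illustration of such an in-place minor-version upgrade (a sketch only; it assumes 20.02.6 RPMs built the same way as the installed 20.02.5 ones and available on the node, with a package set matching the `rpm -qa` listing shown later in this ticket):

# on each compute node: stop the daemon, upgrade the RPMs in place, restart
systemctl stop slurmd
rpm -Uvh slurm-*20.02.6*.el7.x86_64.rpm
systemctl start slurmd
# on the controller node, do the same and restart slurmctld (and slurmdbd)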
Hi Michael,
Ok, I'll work on the upgrade to 20.02.6 today. I am out of office Christmas week, but I am working today and plan to check in periodically. Please continue to send info on the interpretation of my issue (and suggestions). I'll be back in the office 12/28.
Regards,
Ruth

Ruth,
On the nodes emitting the errors, *while a job causing the error is still running*, could you please run the following commands and paste the output here?:
find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
This will double-check whether the cgroup settings are being set and propagated correctly. We thought we fixed this in 20.02.6 and 20.11.0, but it's possible it did not get fixed completely.
-Michael
Alternatively, upgrade to 20.02.6 to see if that fixes things, and if not, then do what I asked in comment 14.

Will do!
Best Regards,
Ruth
Sorry this took so long, but here is the output from a compute node that's running a job and is now at slurm-20.02.6-1:
[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4001 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1
# rpm -qa|grep slurm
slurm-20.02.6-1.el7.x86_64
slurm-slurmd-20.02.6-1.el7.x86_64
slurm-pam_slurm-20.02.6-1.el7.x86_64
slurm-perlapi-20.02.6-1.el7.x86_64
slurm-devel-20.02.6-1.el7.x86_64
slurm-libpmi-20.02.6-1.el7.x86_64
slurm-torque-20.02.6-1.el7.x86_64
slurm-contribs-20.02.6-1.el7.x86_64
slurm-example-configs-20.02.6-1.el7.x86_64
[root@e4001 ~]# date
Sat Jan 2 09:03:07 EST 2021
Please use this set of output instead of my last post:
With 20.02.6-1 now running...
User xurabraun gets error:
[xurabraun@vlogin003 ~]$ srun -p devel -N 1 -n 8 --pty bash
[xurabraun@SLURM]$ srun: error: e4002: task 1: Exited with exit code 1
[xurabraun@SLURM]$ hostname
e4002.noether
[xurabraun@SLURM]$ date
Sat Jan 2 09:23:40 EST 2021
(root ssh to compute node e4002 to perform the find commands while xurabraun's job is still running)
[root@e4002 ~]# date
Sat Jan 2 09:24:40 EST 2021
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.cpus
1
/sys/fs/cgroup/cpuset/system/cpuset.cpus
0,2-55
/sys/fs/cgroup/cpuset/cpuset.cpus
0-55
[root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/weka/cpuset.mems
0
/sys/fs/cgroup/cpuset/system/cpuset.mems
0-1
/sys/fs/cgroup/cpuset/cpuset.mems
0-1
(In reply to ruth.a.braun from comment #18)
> [root@e4002 ~]# find /sys/fs/cgroup/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
> /sys/fs/cgroup/cpuset/weka/cpuset.cpus
> 1
> /sys/fs/cgroup/cpuset/system/cpuset.cpus
> 0,2-55
> /sys/fs/cgroup/cpuset/cpuset.cpus
> 0-55
It appears that Weka is using cgroups to reserve CPU 1 on that node. However, Slurm doesn't know about this, and so when the job runs on the node, it tries to set the CPU affinity for CPU 1 and fails:
[2020-12-17T14:21:42.676] [384.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-17T14:21:42.677] [384.0] sched_setaffinity(18992,128,0x2) failed: Invalid argument
[2020-12-17T14:21:42.677] [384.0] debug: task_g_pre_launch: task/affinity: Unspecified error
[2020-12-17T14:21:42.677] [384.0] error: Failed to invoke task plugins: task_p_pre_launch error
See the "sched_setaffinity(18992,128,0x2)"? The task is trying to set the CPU affinity for CPU 1 (mask 0x2), but that CPU is already taken by Weka, so it produces an EINVAL error. From https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html:
"EINVAL: The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by cpuset cgroups or the "cpuset" mechanism described in cpuset(7)."
I think the solution here is to work with Weka to stop it from reserving a CPU. Another solution is to tell Slurm that CPU 1 is off limits for that node, so that it doesn't allocate it to tasks. I think you can do this with the "CpuSpecList" parameter in slurm.conf.
-Michael

Michael,
Message received. I'm wondering also why the gpu partition does not show this issue (just the partition devel).
Fix help: could you specify what entries I should make in slurm.conf, gres.conf and/or cgroup.conf? For example, would I add this to the NodeName definition?
Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2
Best Regards,
Ruth
Created attachment 17360 [details]
image001.png
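A small diagnostic sketch to verify the diagnosis above (illustrative only, not part of the original exchange; it assumes the slurmd runs inside the "system" cpuset shown earlier, which excludes CPU 1). Mask 0x2 is binary 10, i.e. CPU 1:

# read-only: confirm which cpuset cgroup the slurmd is in and its current affinity
grep cpuset /proc/$(pidof slurmd)/cgroup
taskset -p $(pidof slurmd)
# run from a shell in that same cpuset: pinning to CPU 1 should fail with the
# same "Invalid argument" (EINVAL) that the slurmstepd logged
taskset -c 1 true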
Hi Ruth,

(In reply to ruth.a.braun from comment #22)
> Message received. I'm wondering also why the gpu partition does not show this issue (just the partition devel).
I'm not sure, without more information. Maybe the GPU nodes don't have Weka on them. Or maybe the jobs on that partition aren't being allocated CPUs restricted by cgroups, for whatever reason.

> Fix help: could you specify what entries I should make in slurm.conf, gres.conf and/or cgroup.conf?
> For example, would I add this to the NodeName definition?
> Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CpuSpecList=0x2
After reading the docs, I realized that CpuSpecList won't work. From https://slurm.schedmd.com/slurm.conf.html#OPT_CpuSpecList: "This option has no effect unless cgroup job confinement is also configured (TaskPlugin=task/cgroup with ConstrainCores=yes in cgroup.conf)."

Since you only have task/affinity specified, the next option you could try is to use CoreSpecCount=4, TaskPluginParam=SlurmdOffSpec, and add spec_cores_first to your SchedulerParameters. This will hopefully reserve the first 4 cores, which will overlap with Weka's specified core (1). However, you will need to double-check in the slurmd.log. For example:
Resource spec: Reserved abstract CPU IDs: 0-3
Resource spec: Reserved machine CPU IDs: 0-1,28-29
You want the reserved machine CPU IDs to overlap with the CPU reserved by Weka in cgroups (1). CoreSpecCount needs to be 4 (I think) in order to overlap with it. See https://slurm.schedmd.com/core_spec.html for more details on how cores are selected. Unfortunately, this will mean that four of your cores will not be usable by jobs, since it's an imprecise workaround.

To test, run
srun --exclusive grep Cpus_allowed_list /proc/self/status
to see what CPUs are allowed to the job (and, by extension, the slurmd) on the node. I imagine you will get the same error if you try this command out right now, though.

----------------

The above workaround may be quicker, but here is my actual recommendation: set TaskPlugin=task/cgroup,task/affinity in slurm.conf and then set ConstrainCores=yes in cgroup.conf. Using the task/cgroup plugin is recommended, because then jobs can't possibly use CPUs outside of their allocation. Without task/cgroup, a smart user could potentially use sched_setaffinity() in their program to use all CPUs on the node, and there would be no way to stop them.

If you decide to use task/cgroup, my guess is that it will NOT play well with Weka's cgroup settings; there will be conflicts. So you will need to figure out why Weka is reserving CPUs and tell it to stop doing that. In the long run, I think this is the best path forward.

You have a cgroup.conf file, but you aren't using any cgroup plugins in slurm.conf, so it's not doing anything. So my guess is that you actually wanted to take advantage of cgroups with Slurm to begin with. For more information on how to use cgroups, see https://slurm.schedmd.com/cgroups.html and https://slurm.schedmd.com/cgroup.conf.html.

Thanks,
-Michael

Compute nodes that run the Weka client use one CPU core (ID 1) for its purposes; it also reserves approximately 1.46 GB of memory from each compute node for its operations. Based on the info above, can you give me very specific examples for the various settings files?
-Ruth

Well, one easy option you have is to comment out the task/affinity plugin altogether.
If that is not acceptable, and if turning off Weka's cgroup reservations and using Slurm's task/cgroup plugin is also not acceptable, do this (as mentioned in comment 25):

slurm.conf
*******************
Add "spec_cores_first" to your SchedulerParameters, set "TaskPluginParam=SlurmdOffSpec", and add "CoreSpecCount=4" to the nodes that have Weka's reserved core:

SchedulerParameters=bf_window=43200,bf_resolution=600,bf_max_job_test=550,bf_max_job_part=350,bf_interval=300,bf_max_job_user=30,bf_continue,nohold_on_prolog_fail,spec_cores_first
TaskPluginParam=SlurmdOffSpec
Nodename=DEFAULT CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=386450 CoreSpecCount=4

I'm assuming Weka reserves CPU 1 on all nodes, but if it's just a random CPU, that's a problem, so you should double-check. Then restart the slurmctld and slurmds. In the slurmd log, double-check that machine CPU ID 1 is included in the reserved machine CPU IDs, as mentioned in comment 25.
-Michael

Hi Ruth, how is the workaround going?

Hi, I just put in place the easy option: commenting out the task/affinity plugin altogether. We're testing now.

Hi Ruth, how is your testing going? Is the workaround working? Have you learned more about Weka?
-Michael

I'll go ahead and close this out. Feel free to reopen if you want to pursue this further. Thanks!
-Michael
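For completeness, a minimal sketch of the cgroup-based setup recommended in comment 25 (this assumes Weka's own CPU reservation is removed first; only the two settings named in the ticket are shown, everything else stays as in the attached slurm.conf):

# slurm.conf
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
ConstrainCores=yes

After restarting slurmctld and the slurmds, the allocation can be verified with the check from comment 25:
srun --exclusive grep Cpus_allowed_list /proc/self/status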