Ticket 10466

Summary: slurmstepd error: Failed to invoke task plugins: task_p_pre_launch error
Product: Slurm
Reporter: Kevin <kevin.m.ying>
Component: slurmstepd
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10460
Site: EM

Description Kevin 2020-12-16 15:47:07 MST
We get errors when submitting a job with srun:

[vlogin001 ~]$  srun -N 1 -n 4 -p devel -w e8002 --pty bash
srun: error: e8002: task 1: Exited with exit code 1

/var/log/messages on e8002 shows:
Dec 16 11:31:46 e8002 slurmd[44149]: launch task 314.0 request from UID:56207 GID:56207 HOST:172.21.100.1 PORT:33976
Dec 16 11:31:46 e8002 slurmd[44149]: lllp_distribution jobid [314] implicit auto binding: cores,one_thread, dist 8192
Dec 16 11:31:46 e8002 slurmd[44149]: _task_layout_lllp_cyclic
Dec 16 11:31:46 e8002 slurmd[44149]: _lllp_generate_cpu_bind jobid [314]: mask_cpu,one_thread, 0x00000000000001,0x00000000000002,0x00000000000004,0x00000000000008
Dec 16 11:31:46 e8002 slurmd[44149]: _run_prolog: run job script took usec=159
Dec 16 11:31:46 e8002 slurmd[44149]: _run_prolog: prolog with lock for job 314 ran for 0 seconds
Dec 16 11:31:46 e8002 slurmstepd[16033]: in _window_manager
Dec 16 11:31:46 e8002 slurmstepd[16041]: task_p_pre_launch: Using sched_affinity for tasks
Dec 16 11:31:46 e8002 slurmstepd[16042]: task_p_pre_launch: Using sched_affinity for tasks
Dec 16 11:31:46 e8002 slurmstepd[16040]: task_p_pre_launch: Using sched_affinity for tasks
Dec 16 11:31:46 e8002 slurmstepd[16039]: task_p_pre_launch: Using sched_affinity for tasks
Dec 16 11:31:46 e8002 slurmstepd[16040]: error: Failed to invoke task plugins: task_p_pre_launch error
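
For reading the `_lllp_generate_cpu_bind` line above, here is a minimal Python sketch that decodes each hexadecimal mask into the CPU indices it selects. The helper name `mask_to_cpus` is made up for illustration, and the sketch assumes that the masks are listed in task order and that bit N of a mask corresponds to CPU N.

```python
def mask_to_cpus(mask_hex):
    """Return the CPU indices whose bits are set in a hexadecimal affinity mask."""
    mask = int(mask_hex, 16)  # int() accepts the leading "0x" when base is 16
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Masks copied from the _lllp_generate_cpu_bind log line for job 314.
masks = "0x00000000000001,0x00000000000002,0x00000000000004,0x00000000000008"

# Assumption: the Nth mask in the list is applied to the Nth task on the node.
for task, mask_hex in enumerate(masks.split(",")):
    print(f"task {task}: CPUs {mask_to_cpus(mask_hex)}")
```

Under those assumptions, the four tasks are bound to CPUs 0, 1, 2, and 3 respectively, one CPU per task, which matches the `cores,one_thread` auto binding reported in the log.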

Restarting slurmd makes the problem go away for a few minutes, and then it reappears.
We are using slurm-20.02.5-1.el7 on both the job submission node and the compute node.

Thanks!
Comment 1 Michael Hinton 2020-12-17 10:22:38 MST
Can you attach your slurm.conf? What Linux distro and kernel are you running on?

Could you also set SlurmctldDebug=debug2, restart Slurm, reproduce the problem, and then attach the relevant portions of your slurmd.log and slurmctld.log?

I can't be sure without more logs, but we recently fixed a similar cgroup-related error, so I would recommend upgrading to 20.02.6 to see if that solves the issue.

Thanks
-Michael
Comment 2 Kevin 2020-12-17 11:00:16 MST
Thanks Michael,

My teammate also submitted a case for this problem: 10460.
You may merge these two cases.

- Kevin Ying
Comment 3 Michael Hinton 2020-12-17 11:29:10 MST
Merging this with bug 10460.

*** This ticket has been marked as a duplicate of ticket 10460 ***