Ticket 13544

Summary: Slurmd 21.08.5 fails to launch jobs on Ubuntu 18.04 with kernel 4.15.0-169-generic
Product: Slurm    Reporter: Sean Caron <scaron>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
Severity: 6 - No support contract
Priority: ---
Version: 21.08.5
Hardware: Linux
OS: Linux

Description Sean Caron 2022-03-01 12:37:11 MST
Hi there,

I'm with the University of Michigan, but not the part of the University that has an active support contract, so I'm not expecting a resolution on any particular timeline. I just wanted to report this bug in the hope that it gets fixed in a later version, and so that it's on the record if others encounter the issue.

We're running Ubuntu 18.04 LTS and Slurm 21.08.5.

One of my staff members was reloading some compute nodes this week. These nodes worked totally fine prior to reloading. We just wanted to tweak the partition layout on them.

On a reloaded node, slurmd started fine as usual, but any job sent to the node immediately failed, with the node dropping into DRAIN state due to "batch job complete failure".

Looking at slurmd.log on the node, we saw entries like:

[2022-03-01T13:21:52.850] [30331376.batch] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-01T13:21:52.850] [30331376.batch] error: called without a previous init. This shouldn't happen!
[2022-03-01T13:21:52.851] [30331376.batch] error: called without a previous init. This shouldn't happen!
[2022-03-01T13:21:52.851] [30331376.batch] error: job_manager: exiting abnormally: Slurmd could not execve job

I cranked up the debug level on slurmd and it didn't give much more useful information, but here is an example for reference. This was a simple test job, basically just running "hostname ; uptime".
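The test job was essentially the equivalent of this minimal batch script (a sketch; the SBATCH directives beyond the commands themselves are illustrative, with --mem matching the 2000MB allocation visible in the logs below):

```shell
#!/bin/bash
#SBATCH --job-name=test     # illustrative; any trivial job triggers the failure
#SBATCH --ntasks=1
#SBATCH --mem=2000

hostname
uptime
```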

[2022-03-01T13:42:26.573] [30331487.batch] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns cpuset
[2022-03-01T13:42:26.573] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cgroup.clone_children' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2022-03-01T13:42:26.573] [30331487.batch] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns cpuset
[2022-03-01T13:42:26.573] [30331487.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0'
[2022-03-01T13:42:26.573] [30331487.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0'
[2022-03-01T13:42:26.573] [30331487.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,12'
[2022-03-01T13:42:26.573] [30331487.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,12'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '0,12,0-23' for '/sys/fs/cgroup/cpuset/slurm/uid_147809'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_147809'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '0,12' for '/sys/fs/cgroup/cpuset/slurm/uid_147809/job_30331487'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_147809/job_30331487'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '0,12' for '/sys/fs/cgroup/cpuset/slurm/uid_147809/job_30331487/step_batch'
[2022-03-01T13:42:26.574] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_147809/job_30331487/step_batch'
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: xcgroup_create_slurm_cg: slurm cgroup /slurm successfully created for ns memory
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_147809'
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_147809/job_30331487'
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_147809/job_30331487/step_batch'
[2022-03-01T13:42:26.575] [30331487.batch] task/cgroup: _memcg_initialize: job: alloc=2000MB mem.limit=2000MB memsw.limit=2000MB
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '2097152000' for '/sys/fs/cgroup/memory/slurm/uid_147809/job_30331487'
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '2097152000' for '/sys/fs/cgroup/memory/slurm/uid_147809/job_30331487'
[2022-03-01T13:42:26.575] [30331487.batch] debug:  task_g_pre_setuid: task/cgroup: Unspecified error
[2022-03-01T13:42:26.575] [30331487.batch] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-01T13:42:26.575] [30331487.batch] debug:  _fork_all_tasks failed
[2022-03-01T13:42:26.575] [30331487.batch] debug2: step_terminate_monitor will run for 600 secs
[2022-03-01T13:42:26.575] [30331487.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_147809/job_30331487/step_batch'
[2022-03-01T13:42:26.575] [30331487.batch] debug:  signaling condition
[2022-03-01T13:42:26.575] [30331487.batch] debug2: step_terminate_monitor is stopping
[2022-03-01T13:42:26.575] [30331487.batch] debug2: _monitor exit code: 0

The only things that changed when the machine was reloaded were the amount of TmpDisk (we confirmed slurmd.conf and gres.conf were correctly updated to reflect the change) and the kernel version running on the compute node.
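For reference, the relevant state to compare between a working node and a failing one can be captured with something like this (a sketch; cgroup v1 paths as seen in the logs above, and the cpuset reads are optional since the hierarchy may differ per node):

```shell
#!/bin/sh
# Capture kernel and cgroup v1 state for comparison between a good and a bad node.
uname -r                     # running kernel version
grep cgroup /proc/mounts     # which cgroup controllers are mounted, and where
# Top-level cpuset values the task/cgroup plugin works beneath (if cpuset is mounted):
head /sys/fs/cgroup/cpuset/cpuset.cpus /sys/fs/cgroup/cpuset/cpuset.mems 2>/dev/null || true
```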

The most recent Ubuntu 18.04 LTS kernel that was installed on the node when it was reloaded was 4.15.0-169-generic.

We have existing nodes working fine on 4.15.0-135-generic, among other kernel versions.

We downgraded the kernel to 4.15.0-135-generic and it seems to have resolved the issue.
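For others hitting this, the downgrade amounts to roughly the following on Ubuntu 18.04 (a sketch; package names assumed from Ubuntu's usual kernel packaging scheme, and depending on your GRUB configuration you may also need to remove the -169 packages or adjust GRUB_DEFAULT so the older kernel actually boots):

```shell
# Install the known-good kernel:
sudo apt-get update
sudo apt-get install -y linux-image-4.15.0-135-generic linux-modules-4.15.0-135-generic
# Keep unattended upgrades from pulling the newer kernel back in:
sudo apt-mark hold linux-image-generic linux-headers-generic
sudo reboot
```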

I'm not sure exactly where between -135-generic and -169-generic the change that causes this problematic interaction with Slurm was introduced, but you should be able to reproduce it with 4.15.0-169-generic.

I'm also not sure whether this is a one-off issue with 4.15.0-169-generic, but I wanted to report it in case it affects all Ubuntu 18.04 4.15 kernels going forward.

Thanks,

Sean