With 20.11.0-rc1 I tried:

# srun -Z -w $(hostname) hostname
srun: do not allocate resources
srun: error: Task launch for StepId=4294930274.485580765 failed on node lyra17: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

In the slurmd logs:

[2020-11-11T15:57:47.559] launch task StepId=4294930274.485580765 request from UID:0 GID:0 HOST:172.30.140.84 PORT:40136
[2020-11-11T15:57:47.559] task/affinity: lllp_distribution: JobId=4294930274 implicit auto binding: cores, dist 1
[2020-11-11T15:57:47.559] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2020-11-11T15:57:47.559] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4294930274]: mask_cpu, 0x0000000000000000000000000000000100000000000000000000000000000001
[2020-11-11T15:57:47.560] task rank unavailable due to invalid job credential, step completion RPC impossible
[2020-11-11T15:57:47.564] [4294930274.485580765] error: No gids given in the cred.
[2020-11-11T15:57:47.565] [4294930274.485580765] error: _step_setup: no job returned
[2020-11-11T15:57:47.565] [4294930274.485580765] done with job
[2020-11-11T15:57:47.565] error: slurmstepd return code -1

The faked credential doesn't have ngids set. Since da766fb0e4a27f74fbc48b3c0beac3d3daa4dec5 that is now an error if slurm_cred_send_gids_enabled(). I think _job_fake_cred() needs to set arg.ngids and arg.gids.
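
To illustrate the kind of change I mean, here is a rough standalone sketch of filling in a group list for a given UID with getgrouplist(). This is not Slurm code: struct fake_cred_arg and fake_cred_fill_gids() are hypothetical names made up for the example, only the ngids/gids fields mirror what I mentioned above, and it assumes a glibc-style getgrouplist(). The actual fix may well look different.

```c
#define _DEFAULT_SOURCE          /* exposes getgrouplist() on glibc */
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in for the credential arguments; NOT Slurm's real struct. */
struct fake_cred_arg {
    uid_t  uid;
    gid_t  gid;
    int    ngids;   /* number of entries in gids */
    gid_t *gids;    /* malloc'd group list, owned by the caller */
};

/*
 * Look up the groups of arg->uid and store them in arg->gids / arg->ngids.
 * Returns 0 on success, -1 if the user is unknown or allocation fails.
 */
int fake_cred_fill_gids(struct fake_cred_arg *arg)
{
    assert(arg != NULL);

    struct passwd *pw = getpwuid(arg->uid);
    if (pw == NULL)
        return -1;

    int n = 16;                                  /* initial guess */
    gid_t *gids = malloc((size_t)n * sizeof(*gids));
    if (gids == NULL)
        return -1;

    if (getgrouplist(pw->pw_name, pw->pw_gid, gids, &n) == -1) {
        /* Buffer too small: n now holds the required count, so retry once. */
        gid_t *tmp = realloc(gids, (size_t)n * sizeof(*gids));
        if (tmp == NULL) {
            free(gids);
            return -1;
        }
        gids = tmp;
        if (getgrouplist(pw->pw_name, pw->pw_gid, gids, &n) == -1) {
            free(gids);
            return -1;
        }
    }

    arg->ngids = n;
    arg->gids = gids;
    return 0;
}
```

With something along these lines the faked credential would carry a non-empty group list, so a check like the one behind "No gids given in the cred." would no longer trip. The caller would need to free arg->gids once the credential has been created.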
Hey Matt, thanks for the heads up. We'll look into it. -Michael
Hi Matt, This was fixed with commits b55caa1, 68aee0d, and 09e2f88, and will land in 20.11.3. See https://github.com/SchedMD/slurm/compare/a99f9880cf16...09e2f88cdb5c. If this does not fix the issue for you, please reopen. Thanks! -Michael