Ticket 10199

Summary: no-allocate does not work with send_gids
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: slurmstepd
Assignee: Director of Support <support>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8538
Site: ORNL-OLCF
Version Fixed: 20.11.3
Ticket Depends on: 8538    
Ticket Blocks:    

Description Matt Ezell 2020-11-11 14:06:30 MST
With 20.11.0-rc1 I tried:

# srun -Z -w $(hostname) hostname
srun: do not allocate resources
srun: error: Task launch for StepId=4294930274.485580765 failed on node lyra17: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

In the slurmd logs:
[2020-11-11T15:57:47.559] launch task StepId=4294930274.485580765 request from UID:0 GID:0 HOST:172.30.140.84 PORT:40136
[2020-11-11T15:57:47.559] task/affinity: lllp_distribution: JobId=4294930274 implicit auto binding: cores, dist 1
[2020-11-11T15:57:47.559] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2020-11-11T15:57:47.559] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4294930274]: mask_cpu, 0x0000000000000000000000000000000100000000000000000000000000000001
[2020-11-11T15:57:47.560] task rank unavailable due to invalid job credential, step completion RPC impossible
[2020-11-11T15:57:47.564] [4294930274.485580765] error: No gids given in the cred.
[2020-11-11T15:57:47.565] [4294930274.485580765] error: _step_setup: no job returned
[2020-11-11T15:57:47.565] [4294930274.485580765] done with job
[2020-11-11T15:57:47.565] error: slurmstepd return code -1

The faked credential doesn't have ngids set. Since commit da766fb0e4a27f74fbc48b3c0beac3d3daa4dec5, that is an error whenever slurm_cred_send_gids_enabled() is true. I think _job_fake_cred() needs to set arg.ngids and arg.gids.
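
To illustrate the suggestion above, here is a minimal sketch of how a group list could be looked up for a user before it is placed in a credential. It is not the Slurm code and not the fix that was committed: struct fake_cred_arg and fill_fake_cred_gids() are hypothetical names invented for this example, and only the two fields mentioned in the report (ngids and gids) are modeled. It assumes Linux/glibc, where getgrouplist() takes a gid_t buffer.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes getgrouplist() on glibc */
#endif
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the credential argument in the report. */
struct fake_cred_arg {
    uid_t uid;
    gid_t gid;   /* primary group */
    int ngids;   /* number of entries in gids */
    gid_t *gids; /* heap-allocated group list, owned by the caller */
};

/*
 * Fill arg->ngids and arg->gids from the system group database.
 * Returns 0 on success, -1 on allocation or lookup failure.
 * The caller frees arg->gids.
 */
int fill_fake_cred_gids(struct fake_cred_arg *arg)
{
    assert(arg != NULL);

    int n = 16; /* initial guess; grown below if too small */
    gid_t *gids = malloc((size_t)n * sizeof(*gids));
    if (gids == NULL)
        return -1;

    struct passwd *pw = getpwuid(arg->uid);
    if (pw == NULL) {
        /* No passwd entry: fall back to the primary gid only. */
        gids[0] = arg->gid;
        n = 1;
    } else if (getgrouplist(pw->pw_name, arg->gid, gids, &n) == -1) {
        /* Buffer too small: n now holds the required count. */
        gid_t *tmp = realloc(gids, (size_t)n * sizeof(*gids));
        if (tmp == NULL) {
            free(gids);
            return -1;
        }
        gids = tmp;
        if (getgrouplist(pw->pw_name, arg->gid, gids, &n) == -1) {
            free(gids);
            return -1;
        }
    }

    arg->ngids = n;
    arg->gids = gids;
    return 0;
}
```

With a list filled in this way, ngids is always at least 1, because getgrouplist() includes the primary group passed to it. That is the property the "No gids given in the cred." check in the log above appears to require.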
Comment 2 Michael Hinton 2020-11-12 11:13:50 MST
Hey Matt, thanks for the heads up. We'll look into it.

-Michael
Comment 3 Michael Hinton 2021-01-13 16:43:29 MST
Hi Matt,

This was fixed with commits b55caa1, 68aee0d, and 09e2f88, and will land in 20.11.3. See https://github.com/SchedMD/slurm/compare/a99f9880cf16...09e2f88cdb5c. If this does not fix the issue for you, please reopen. 

Thanks!
-Michael