Ticket 10199 - no-allocate does not work with send_gids
Summary: no-allocate does not work with send_gids
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 20.11.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on: 8538
Blocks:
Reported: 2020-11-11 14:06 MST by Matt Ezell
Modified: 2021-01-13 16:43 MST (History)
See Also:
Site: ORNL-OLCF
Version Fixed: 20.11.3


Description Matt Ezell 2020-11-11 14:06:30 MST
With 20.11.0-rc1 I tried:

# srun -Z -w $(hostname) hostname
srun: do not allocate resources
srun: error: Task launch for StepId=4294930274.485580765 failed on node lyra17: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

In the slurmd logs:
[2020-11-11T15:57:47.559] launch task StepId=4294930274.485580765 request from UID:0 GID:0 HOST:172.30.140.84 PORT:40136
[2020-11-11T15:57:47.559] task/affinity: lllp_distribution: JobId=4294930274 implicit auto binding: cores, dist 1
[2020-11-11T15:57:47.559] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2020-11-11T15:57:47.559] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [4294930274]: mask_cpu, 0x0000000000000000000000000000000100000000000000000000000000000001
[2020-11-11T15:57:47.560] task rank unavailable due to invalid job credential, step completion RPC impossible
[2020-11-11T15:57:47.564] [4294930274.485580765] error: No gids given in the cred.
[2020-11-11T15:57:47.565] [4294930274.485580765] error: _step_setup: no job returned
[2020-11-11T15:57:47.565] [4294930274.485580765] done with job
[2020-11-11T15:57:47.565] error: slurmstepd return code -1

The faked credential doesn't have ngids set. Since commit da766fb0e4a27f74fbc48b3c0beac3d3daa4dec5, that is an error when slurm_cred_send_gids_enabled() is true. I think _job_fake_cred() needs to set arg.ngids and arg.gids.
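
For illustration only, here is a minimal sketch of what populating those two fields could involve, using the standard getpwuid() and getgrouplist() calls on Linux. It is not the code from Slurm or from the eventual fix: `struct fake_cred_arg` and `fill_fake_cred_gids()` are hypothetical names, and the struct models only the fields mentioned above plus the uid and primary gid needed for the lookup.

```c
#define _DEFAULT_SOURCE /* exposes getgrouplist() on glibc */
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in for the credential argument; not Slurm's type. */
struct fake_cred_arg {
    uid_t uid;    /* user the step runs as */
    gid_t gid;    /* primary group of that user */
    int ngids;    /* number of entries in gids */
    gid_t *gids;  /* heap-allocated group list, freed by the caller */
};

/*
 * Look up the full group list for arg->uid and store it in
 * arg->ngids / arg->gids. Returns 0 on success, -1 on failure.
 */
static int fill_fake_cred_gids(struct fake_cred_arg *arg)
{
    assert(arg != NULL);
    arg->ngids = 0;
    arg->gids = NULL;

    struct passwd *pw = getpwuid(arg->uid);
    if (pw == NULL)
        return -1; /* unknown uid */

    int n = 16; /* initial guess; getgrouplist() reports the real size */
    gid_t *list = malloc((size_t)n * sizeof(*list));
    if (list == NULL)
        return -1;

    if (getgrouplist(pw->pw_name, arg->gid, list, &n) == -1) {
        /* Buffer was too small; n now holds the required count. */
        gid_t *bigger = realloc(list, (size_t)n * sizeof(*list));
        if (bigger == NULL) {
            free(list);
            return -1;
        }
        list = bigger;
        if (getgrouplist(pw->pw_name, arg->gid, list, &n) == -1) {
            free(list);
            return -1;
        }
    }

    arg->ngids = n;
    arg->gids = list;
    return 0;
}
```

With uid 0 and gid 0, as in the launch request logged above, the resulting list would be expected to contain at least gid 0, since getgrouplist() always includes the group passed to it.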
Comment 2 Michael Hinton 2020-11-12 11:13:50 MST
Hey Matt, thanks for the heads up. We'll look into it.

-Michael
Comment 3 Michael Hinton 2021-01-13 16:43:29 MST
Hi Matt,

This was fixed with commits b55caa1, 68aee0d, and 09e2f88, and will land in 20.11.3. See https://github.com/SchedMD/slurm/compare/a99f9880cf16...09e2f88cdb5c. If this does not fix the issue for you, please reopen. 

Thanks!
-Michael