This happens:

    -bash-4.2$ salloc -N1 -c 1 -p sandybridge srun -N1 --pty --preserve-env $SHELL
    salloc: Granted job allocation 119772
    srun: error: task 0 launch failed: Slurmd could not connect IO
    srun: error: task 1 launch failed: Slurmd could not connect IO
    salloc: Relinquishing job allocation 119772

if I have an entry in /etc/group that is larger than ~400 bytes on RHEL 7.2:

    -bash-4.2$ rpm -qa | grep glibc-2 | grep x86_64
    glibc-2.17-106.el7_2.4.x86_64
    compat-glibc-2.12-4.el7.centos.x86_64

If I attach strace to slurmd (with -f, so it also follows the slurmstepd processes it forks), we can somewhat see what's happening:

    [pid 26158] stat("/dev/pts/1", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
    [pid 26158] getuid() = 24712
    [pid 26158] open("/etc/group", O_RDONLY|O_CLOEXEC) = 16
    [pid 26158] fstat(16, {st_mode=S_IFREG|0644, st_size=25951, ...}) = 0
    [pid 26158] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aff576f2000
    [pid 26158] read(16, "[redacted -- bits of /etc/group]"..., 4096) = 4096
    [pid 26158] read(16, "[redacted -- bits of /etc/group]"..., 4096) = 4096
    [pid 26158] close(16) = 0
    [pid 26158] munmap(0x2aff576f2000, 4096) = 0
    [pid 26158] getgid() = 0
    [pid 26158] chown("/dev/pts/1", 24712, 0) = -1 EPERM (Operation not permitted)
    [pid 26158] close(15) = 0

Our authentication system works by pulling information from LDAP (among other places) and filling out /etc/passwd, /etc/group and /etc/shadow (and the Slurm DB). Due to a similar bug we saw on another system, entries in /etc/group are split to be no longer than 512 bytes. Reducing this constant to 400 bytes resolves the problem:

    [root@hl02 ~]# cp group.good /etc/group
    cp: overwrite ‘/etc/group’? y

    -bash-4.2$ salloc -N1 -c 1 -p sandybridge srun -N1 --pty --preserve-env $SHELL
    salloc: Granted job allocation 119774
    bash-4.2$

Groups are split as follows:

    grep mygroup /etc/group
    mygroup:x:1000:list_of_users
    mygroup-1:x:1000:more_users

This allows things like getent group mygroup to return the full list and filesystem permissions to work, and doesn't seem to trip this bug.

I did some digging in libc, and I suspect that the culprit is the call to __getgrnam_r() inside grantpt(), but I didn't isolate the problem to a particular line of code. If you do what some other tools do and set up the pty directly instead of using openpty(), the chown can be avoided (on modern Linux) and this problem doesn't happen. I'm not sure whether this is a good idea to change or not.
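For anyone wanting to reproduce the /etc/group layout described above, here is a rough sketch of the splitting step in C. It is not the actual code from our authentication system: the function name split_group, the LINE_CAP constant, and the "-2", "-3", ... suffixes for further continuation lines are assumptions (only "-1" appears in the example above), and max_len is taken to exclude the trailing newline.

```c
#include <stdio.h>
#include <string.h>

#define LINE_CAP 1024  /* assumed storage size per output line */

/*
 * Hypothetical helper: write /etc/group-style lines for one group so that
 * no line exceeds max_len bytes. The first line uses the plain group name;
 * continuation lines use "name-1", "name-2", ... with the same gid.
 * Returns the number of lines written, or -1 if the input cannot fit.
 */
int split_group(const char *name, unsigned gid,
                const char *const *members, size_t n, size_t max_len,
                char out[][LINE_CAP], size_t max_lines)
{
    size_t lines = 0, i = 0;

    if (max_len >= LINE_CAP)
        return -1;

    do {
        size_t first = i, used;
        char *line;
        int len;

        if (lines == max_lines)
            return -1;
        line = out[lines];

        /* Line prefix: "name:x:gid:" or "name-K:x:gid:" for continuations. */
        if (lines == 0)
            len = snprintf(line, LINE_CAP, "%s:x:%u:", name, gid);
        else
            len = snprintf(line, LINE_CAP, "%s-%zu:x:%u:", name, lines, gid);
        if (len < 0 || (size_t)len > max_len)
            return -1;
        used = (size_t)len;

        /* Append members while the line stays within max_len. */
        while (i < n) {
            size_t mlen = strlen(members[i]);
            size_t need = mlen + (i > first ? 1 : 0); /* +1 for the comma */

            if (used + need > max_len)
                break;
            if (i > first)
                line[used++] = ',';
            strcpy(line + used, members[i]);
            used += mlen;
            i++;
        }

        /* A single member that does not fit on an empty line is an error. */
        if (i == first && i < n)
            return -1;
        lines++;
    } while (i < n);

    return (int)lines;
}
```

With max_len set to 400 this would produce the kind of entries shown above; 512 is the value that still triggered the failure for us.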
I suppose this is also relevant. On the compute node:

    [root@hl02 ~]# mount | grep pts
    devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
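To make the "set up the pty directly" idea concrete, here is a minimal, Linux-only sketch in C. It is not Slurm code and not a proposed patch: open_pty_direct is a hypothetical name, and the sketch assumes /dev/ptmx is usable and that devpts is mounted with gid= and mode= options like the ones above, so the kernel already creates the slave node with the caller's uid, the tty gid and mode 620. Under those assumptions grantpt() is never called, so neither the group lookup nor the chown takes place.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a pty master/slave pair without grantpt().
 * On success returns 0, stores both descriptors and writes the slave
 * path into name. On failure returns -1 with errno set.
 */
int open_pty_direct(int *master, int *slave, char *name, size_t namelen)
{
    int unlock = 0;      /* 0 clears the slave lock, like unlockpt() */
    unsigned int ptn;    /* pty number assigned by the kernel */
    int m, s, saved;

    m = open("/dev/ptmx", O_RDWR | O_NOCTTY);
    if (m < 0)
        return -1;

    /* Unlock the slave side, then ask the kernel for the pty number. */
    if (ioctl(m, TIOCSPTLCK, &unlock) < 0 || ioctl(m, TIOCGPTN, &ptn) < 0)
        goto fail;

    /* devpts names the slave /dev/pts/<number>. */
    snprintf(name, namelen, "/dev/pts/%u", ptn);

    s = open(name, O_RDWR | O_NOCTTY);
    if (s < 0)
        goto fail;

    *master = m;
    *slave = s;
    return 0;

fail:
    saved = errno;       /* keep the original error across close() */
    close(m);
    errno = saved;
    return -1;
}
```

The trade-off is portability: the two ioctls are Linux-specific, and on a system where devpts is not mounted with suitable gid and mode options the slave could end up with the wrong group or permissions, which is the case grantpt() exists to handle.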
What OS is this? It's not clear to me that this is our bug - if the openpty() call is failing that sounds like a libc issue...
I was playing with RHEL 7.x at the time, but it's reproducible on pretty much anything glibc-based, so it's most likely not your bug. I opened the ticket in case you'd like to implement a workaround, and so that the very non-obvious set of symptoms is on the record. I'll also note that Slurm seems to hit this with slightly smaller groups than most other things; I'm not really sure why that is.
Ben - I'm cleaning up some stuff, and going to go ahead and mark this as resolved/invalid. I don't really want to go rebuild a broken syscall within Slurm just to dodge a glibc bug. Feel free to reopen if you disagree, or propose a patch to replace openpty() with a Slurm-specific xopenpty() call. - Tim