Hello,

The slurmctld on our cluster crashed with the following error:

traps: srvcn[821278] trap divide error ip:4d5336 sp:7f4de13d2840 error:0 in slurmctld[400000+110000]
Started Process Core Dump (PID 821279/UID 0).
Process 2683604 (slurmctld) of user 973 dumped core.
systemd-coredump@1-821279-0.service: Succeeded.

addr2line -e /usr/sbin/slurmctld 4d5336
/root/rpmbuild/BUILD/slurm-22.05.5/src/slurmctld/step_mgr.c:2121

2121: cpus_per_task = cpu_cnt / task_cnt;

The Slurm version is 22.05.5.

I was able to reproduce it with the following job submission:

[zs0402@uccn998 slurm]$ srun --partition=multiple --ntasks=1 --pty bash
srun: job 20716 queued and waiting for resources
srun: job 20716 has been allocated resources
srun: error: Unable to create step for job 20716: Zero Bytes were transmitted or received

The partition "multiple" is configured as follows:

PartitionName=multiple OverSubscribe=EXCLUSIVE Nodes=uccn[458-460] DefMemPerCPU=1125 MaxMemPerNode=90000 MaxCPUsPerNode=80 DefaultTime=30 Maxtime=48:00:00 MinNodes=2 MaxNodes=4

I think the problem is the --ntasks=1 passed to srun in conjunction with MinNodes=2 on the partition.

This seems like a bug to me. Is it already fixed in newer versions?

Best Regards,
Pascal
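For illustration only: the "trap divide error" on the statement at step_mgr.c:2121 is what an integer division by zero looks like on x86, so task_cnt was presumably 0 when that line ran. The sketch below is not Slurm code and not the actual fix; safe_cpus_per_task is a hypothetical helper that only shows the kind of guard that avoids the trap, assuming both counts are unsigned 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of Slurm: compute CPUs per task without
 * dividing by zero. Returns 0 when task_cnt is 0 so that the caller can
 * reject the step request instead of crashing the daemon.
 */
uint32_t safe_cpus_per_task(uint32_t cpu_cnt, uint32_t task_cnt)
{
	if (task_cnt == 0)
		return 0;	/* caller must treat 0 as an invalid request */

	assert(task_cnt != 0);	/* invariant for the division below */
	return cpu_cnt / task_cnt;	/* integer division, rounds down */
}
```

With a guard of this kind, a step request that ends up with zero tasks would produce an error for that one job step rather than a core dump of slurmctld.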
Hi. I am looking into this.
Just an update: The division you note in step_mgr.c:2121 looks like it's going to be fixed by a proposed patch for bug 15857 in an upcoming release. I'm keeping my eye on that one to see how it lands and will provide an update on this ticket after that to see if there's any work left to be done with a fix.
(In reply to Chad Vizino from comment #3)
> Just an update: The division you note in step_mgr.c:2121 looks like it's
> going to be fixed by a proposed patch for bug 15857 in an upcoming release.
> I'm keeping my eye on that one to see how it lands and will provide an
> update on this ticket after that to see if there's any work left to be done
> with a fix.

The fix for that one was just checked in to the master branch:

https://github.com/schedmd/slurm/commit/aee252a45e

So I'll close this ticket for now as a partial duplicate of bug 15857.

*** This ticket has been marked as a duplicate of ticket 15857 ***