Created attachment 89 [details]
Log Snippet

Dave Fox: I logged in this morning and found the machine idle. No nodes were down and everything looked fine. The last job that ran was 32K and ended on the 4th at about 17:40; the next job in the queue was also 32K, but Slurm was not starting it. I double-checked that the control system looked OK, then did an sfree on the block to see if that would help. Slurm still would not start a job. In the end, I did a /etc/init.d/slurm stop followed by start, and then Slurm started scheduling again.

Again, on July 5, by Adam Bertsch: Just before sending this message I was clearing up the last few nodes that had been replaced, and Slurm was for some reason not scheduling the next job. We had a 48K job trying to run, and the log made it look like it was trying to run, but it never started. I restarted Slurm, and the job was able to go.

slurmctld.log.thur copied to ~da on seq
This is fixed! It is the same problem that has been seen before for quite some time: if mmcs was hanging onto a job and, while that was happening, some other force (an admin or otherwise) removed the block that was running the job, num_unused_cpus was not updated correctly. This is fixed in this patch: https://github.com/SchedMD/slurm/commit/11e2759f5c390e6d8c0448d6916104c28b4c3344
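To make the failure mode concrete, here is a minimal sketch (not Slurm's actual code; the class and method names are hypothetical) of the accounting pattern described above: a free-CPU counter is debited when a job starts on a block, and if the block is removed out from under the job without crediting the counter back, the scheduler believes the machine is still full even though it is idle.

```python
# Hypothetical sketch of the num_unused_cpus accounting bug; this is an
# illustration of the pattern, not Slurm source code.

class Scheduler:
    def __init__(self, total_cpus):
        self.num_unused_cpus = total_cpus
        self.running = {}  # block name -> CPUs held by the job on that block

    def start_job(self, block, cpus):
        # Jobs are only started if enough CPUs appear to be free.
        if cpus > self.num_unused_cpus:
            raise RuntimeError("not enough free CPUs")
        self.num_unused_cpus -= cpus
        self.running[block] = cpus

    def remove_block(self, block, buggy=False):
        # An admin (or some other force) removes the block while its job
        # is still bound to it.
        cpus = self.running.pop(block, 0)
        if not buggy:
            # Fixed behavior: credit the CPUs back when the block goes away.
            self.num_unused_cpus += cpus
        # Buggy behavior: the CPUs are leaked, num_unused_cpus stays low,
        # and later jobs never pass the free-CPU check.
```

Under this sketch, after a leaked removal the next queued job fails the free-CPU check even though nothing is running, which matches the symptom of an idle machine with a stuck queue; a daemon restart rebuilds the count from scratch, which is presumably why stopping and starting Slurm got scheduling going again.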