Created attachment 24148 [details]
Slurm controller log overlapping the time when this happened

We're seeing a number of compute nodes in the ALLOC state even though no jobs are running. This is interfering with our logic for cleaning up unused computes, which sets them to DRAIN and waits until they become IDLE before removing them. Here are the computes:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
on-demand    up 4-00:00:00      0    n/a
sa*          up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]
spot         up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]

> squeue -a
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<< empty >>
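For reference, the drain-then-remove logic described above can be sketched roughly as follows. This is a hypothetical illustration, not our actual script: the `run` callback, node names, reason string, and polling interval are all assumptions. In production `run` would wrap `subprocess.check_output` around the real `scontrol`/`sinfo` commands; it is injected here so the loop can be exercised without a live cluster.

```python
import time

def drain_and_wait(node, run, poll_interval=30, timeout=3600):
    """Drain `node`, then poll until it is no longer allocated so it
    can be removed.

    `run(cmd)` executes a shell command and returns its stdout; in a
    real deployment it would shell out to Slurm. Injected here so the
    logic can be tested without a cluster.
    """
    # Put the node in DRAIN so no new jobs land on it.
    run(f"scontrol update NodeName={node} State=DRAIN Reason=scale-in")

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # sinfo's %t format prints the compact node state,
        # e.g. idle, alloc, drng (draining), drain (drained).
        state = run(f"sinfo -h -n {node} -o %t").strip()
        if state in ("idle", "drain"):  # drained and no longer allocated
            return True
        time.sleep(poll_interval)
    # Timed out: the node is stuck, e.g. drng with no visible jobs,
    # which is the symptom reported in this ticket.
    return False
```

The failure mode in this ticket is the timeout branch: the node stays in `drng` indefinitely because Slurm still considers it allocated.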
Created attachment 24149 [details]
Slurm configuration
Please run "scontrol show nodes". Please also attach the slurmd.log from one of the compute nodes in this state. We suspect the nodes are stuck in the completing state, and the "show nodes" output would help us confirm that. "sinfo" can be misleading for nodes in the completing state, so its output is not the most useful here. Are these nodes still powered up, and are the slurmds on them still active?
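One way to check for the suspected stuck-completing condition is to look at the `State=` field in the `scontrol show node` output, which lists state flags joined by `+`. A minimal sketch follows; the sample output below is abridged and hypothetical (the exact flag combination on these nodes is what the requested output would reveal), but the `State=FLAG+FLAG` format is how scontrol reports node state.

```python
import re

# Abridged, hypothetical example of `scontrol show node` output.
SAMPLE = """\
NodeName=compute12973 Arch=x86_64 CoresPerSocket=1
   State=ALLOCATED+DRAIN+COMPLETING ThreadsPerCore=1
   Reason=scale-in [root@2022-03-30T01:58:00]
"""

def node_flags(scontrol_output):
    """Return the set of state flags from `scontrol show node` output."""
    m = re.search(r"\bState=(\S+)", scontrol_output)
    return set(m.group(1).split("+")) if m else set()

if "COMPLETING" in node_flags(SAMPLE):
    print("node is stuck completing")
```

A node showing `COMPLETING` here while `squeue` is empty would support the stuck-in-completing theory, since sinfo's compact state can hide that flag.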
In addition to the things Jason asked for, I'm curious whether this script is working in some cases and not others. Or was it working previously and then it stopped working after some sort of change?

Thanks,
Ben
This error happened on a production cluster and we had to manually clean up, so I don't have any more logs for this. We've only seen this happen once so far.
I'm glad to hear this isn't more of an issue for you, but that does make it harder to track down.

Looking at the logs, I see that the script does drain the nodes and mark them as FUTURE successfully a few times. Focusing on compute12973, the most recent log entries for it show that it was only set to DRAINING. One of the last jobs scheduled on the node was 8814326, and it looks like that job, along with others scheduled on that node, was killed:

[2022-03-30T01:58:00.973] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814326 uid 564800353

There are quite a few log entries like this for other jobs, so I assume the script (or another script) is killing jobs. I am curious what sacct shows about this job. Can I have you run the following and send the output?

sacct -j 8814326

Can you explain why there is a script killing jobs so often? Was something failing before that caused you to forcefully remove them?

Thanks,
Ben
I wanted to follow up and see if you've had this happen again, and if so whether you've been able to capture any additional information about what was happening. Let me know if you still need help with this ticket.

Thanks,
Ben
It sounds like this hasn't been a recurring issue, and we have limited information to go on right now. Since it doesn't appear to have happened again in over a month, I'll go ahead and close this ticket. Feel free to update the ticket if it does happen again.

Thanks,
Ben