Ticket 13734

Summary: Computes are showing as ALLOC but there are no jobs
Product: Slurm    Reporter: PDT Partners <customer-pdt>
Component: slurmctld    Assignee: Ben Roberts <ben>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue
Priority: ---    CC: nick
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: PDT Partners
Attachments: Slurm controller log overlapping the time when this happened
Slurm configuration

Description PDT Partners 2022-03-30 11:19:23 MDT
Created attachment 24148 [details]
Slurm controller log overlapping the time when this happened

We're seeing a number of computes in ALLOC state but there are no jobs running.

This issue is interfering with our logic for cleaning up unused computes: we set them to DRAIN and wait until they become IDLE before removing them.

Here are the computes:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
on-demand    up 4-00:00:00      0    n/a
sa*          up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]
spot         up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]

> squeue -a
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
<< empty >>
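For context, a minimal sketch of the drain-and-wait cleanup flow described above (hypothetical; `node_is_removable` and `cleanup_node` are names invented here, not the site's actual script):

```shell
#!/bin/sh
# Hedged sketch of the drain-and-remove cleanup described in this ticket.

# A node is safe to remove once Slurm reports it idle or fully drained.
node_is_removable() {
    case "$1" in
        idle|drained) return 0 ;;
        *) return 1 ;;   # alloc, drng, comp, etc. -> keep waiting
    esac
}

cleanup_node() {
    node="$1"
    # Stop new work from landing on the node.
    scontrol update NodeName="$node" State=DRAIN Reason="autoscale cleanup"
    # Poll sinfo until the node reaches a removable state.
    while ! node_is_removable "$(sinfo -h -n "$node" -o '%T')"; do
        sleep 30
    done
    # ... terminate the cloud instance here ...
}
```

A loop like this never exits if the node stays in a non-idle state with no running jobs, which is exactly the symptom reported.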
Comment 1 PDT Partners 2022-03-30 11:20:51 MDT
Created attachment 24149 [details]
Slurm configuration
Comment 2 Jason Booth 2022-03-30 15:31:16 MDT
Please run "scontrol show nodes". Please also attach the slurmd.log from one of the compute nodes in this state.

We suspect the nodes are stuck in the completing state, and the "show nodes" output would help us confirm that. "sinfo" can be misleading for nodes in the completing state, so its output is not the most useful here.

Are these nodes still powered up, and are the slurmds still active on those nodes?
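To illustrate why "scontrol show nodes" is more informative here: its State= field carries flags (such as COMPLETING) that sinfo's summary column can hide. A sketch of extracting and checking that field (the sample record below is made up for illustration, not from this ticket):

```shell
#!/bin/sh
# Pull the State= token out of a `scontrol show node` record.
node_state() {
    sed -n 's/.*State=\([^ ]*\).*/\1/p' | head -n1
}

# Illustrative sample record (abbreviated, invented here):
sample='NodeName=compute12973 CPUAlloc=16 State=ALLOCATED+COMPLETING+DRAIN Reason=cleanup'

state=$(printf '%s\n' "$sample" | node_state)
case "$state" in
    *COMPLETING*) echo "node still has completing jobs" ;;
    *)            echo "node not completing" ;;
esac
```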
Comment 3 Ben Roberts 2022-03-31 14:30:13 MDT
In addition to the things Jason asked for, I'm curious whether this script works in some cases and not others, or whether it was working previously and stopped after some sort of change.

Thanks,
Ben
Comment 4 PDT Partners 2022-04-01 10:30:58 MDT
This error happened on a production cluster and we had to manually clean up, so I don't have any more logs for this. 

We've only seen this happen once so far.
Comment 5 Ben Roberts 2022-04-01 11:59:35 MDT
I'm glad to hear this isn't more of an issue for you, but that does make it harder to track down.  Looking at the logs, I see that the script does successfully drain the nodes and mark them FUTURE a few times.  Focusing on compute12973, the most recent log entries for it show that it was only set to DRAINING.  One of the last jobs scheduled on the node was '8814326'.  It looks like that job, and others that were scheduled on that node, were killed:

[2022-03-30T01:58:00.973] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814326 uid 564800353

It looks like there are quite a few log entries like this for other jobs, so I assume the script (or another script) is killing jobs.  I'm curious what sacct shows about this job.  Could you run the following and send the output:
sacct -j 8814326

Can you explain why there is a script that is killing jobs so often?  Was there something that was failing before that caused you to forcefully remove them?

Thanks,
Ben
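As a quick triage aid for the question above, one could tally which uids issue REQUEST_KILL_JOB RPCs in slurmctld.log (a sketch; `count_killers` is a name invented here, and the sample lines below mimic the log format quoted in comment 5):

```shell
#!/bin/sh
# Count REQUEST_KILL_JOB log entries per uid (last field of each line).
count_killers() {
    grep 'REQUEST_KILL_JOB' | awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Sample input in the same format as the quoted log line (invented job ids):
count_killers <<'EOF'
[2022-03-30T01:58:00.973] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814326 uid 564800353
[2022-03-30T01:58:01.120] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814327 uid 564800353
EOF
```

A heavily skewed count toward one uid would point at a single script or user doing the killing.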
Comment 6 Ben Roberts 2022-04-27 13:03:15 MDT
I wanted to follow up and see if you've had this happen again, and if so whether you've been able to capture any additional information about what was happening.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 7 Ben Roberts 2022-05-19 10:39:40 MDT
It sounds like this hasn't been a recurring issue and we have limited information to go on right now.  Since it doesn't appear to have happened again in over a month I'll go ahead and close this ticket.  Feel free to update the ticket if it does happen again.

Thanks,
Ben