Created attachment 24148 [details]
Slurm controller log overlapping the time when this happened

We're seeing a number of compute nodes in the ALLOC state even though no jobs are running. This is interfering with our logic for cleaning up unused computes, which sets them to DRAIN and waits until they become IDLE before removing them. Here are the computes:

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
on-demand    up 4-00:00:00      0    n/a
sa*          up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]
spot         up 4-00:00:00     10   drng compute[12973,12979,12987,12993,12995,13000,13004,13026,13029,13081]

> squeue -a
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<< empty >>
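For reference, the drain-then-remove logic described above can be sketched roughly as follows. This is a hypothetical illustration, not our actual script: the `run` callback, node names, reason string, and polling interval are all assumptions. In production `run` would wrap `subprocess.check_output` around the real `scontrol`/`sinfo` commands; it is injected here so the loop can be exercised without a live cluster.

```python
import time

def drain_and_wait(node, run, poll_interval=30, timeout=3600):
    """Drain `node`, then poll until it is no longer allocated so it
    can be removed.

    `run(cmd)` executes a shell command and returns its stdout; in a
    real deployment it would shell out to Slurm. Injected here so the
    logic can be tested without a cluster.
    """
    # Put the node in DRAIN so no new jobs land on it.
    run(f"scontrol update NodeName={node} State=DRAIN Reason=scale-in")

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # sinfo's %t format prints the compact node state,
        # e.g. idle, alloc, drng (draining), drain (drained).
        state = run(f"sinfo -h -n {node} -o %t").strip()
        if state in ("idle", "drain"):  # drained and no longer allocated
            return True
        time.sleep(poll_interval)
    # Timed out: the node is stuck, e.g. drng with no visible jobs,
    # which is the symptom reported in this ticket.
    return False
```

The failure mode in this ticket is the timeout branch: the node stays in `drng` indefinitely because Slurm still considers it allocated.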
Created attachment 24149 [details]
Slurm configuration
Please run "scontrol show nodes". Please also attach the slurmd.log from one of the compute nodes in this state. We suspect the nodes are stuck in the completing state, and the "show nodes" output would help us confirm that. "sinfo" can be misleading for nodes in the completing state, so its output is not the most useful here. Are these nodes still powered up, and are the slurmds on them still active?
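One way to check for the suspected stuck-completing condition is to look at the `State=` field in the `scontrol show node` output, which lists state flags joined by `+`. A minimal sketch follows; the sample output below is abridged and hypothetical (the exact flag combination on these nodes is what the requested output would reveal), but the `State=FLAG+FLAG` format is how scontrol reports node state.

```python
import re

# Abridged, hypothetical example of `scontrol show node` output.
SAMPLE = """\
NodeName=compute12973 Arch=x86_64 CoresPerSocket=1
   State=ALLOCATED+DRAIN+COMPLETING ThreadsPerCore=1
   Reason=scale-in [root@2022-03-30T01:58:00]
"""

def node_flags(scontrol_output):
    """Return the set of state flags from `scontrol show node` output."""
    m = re.search(r"\bState=(\S+)", scontrol_output)
    return set(m.group(1).split("+")) if m else set()

if "COMPLETING" in node_flags(SAMPLE):
    print("node is stuck completing")
```

A node showing `COMPLETING` here while `squeue` is empty would support the stuck-in-completing theory, since sinfo's compact state can hide that flag.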
In addition to the things Jason asked for, I'm curious whether this script is working in some cases and not others. Or was it working previously and then it stopped working after some sort of change?

Thanks,
Ben
This error happened on a production cluster and we had to manually clean up, so I don't have any more logs for this. We've only seen this happen once so far.
I'm glad to hear this isn't more of an issue for you, but that does make it harder to track down.

Looking at the logs, I see that the script does drain the nodes and mark them as FUTURE successfully a few times. Focusing on compute12973, the most recent log entries for it show that it was only set to DRAINING. One of the last jobs scheduled on the node was 8814326, and it looks like that job, along with others scheduled on that node, was killed:

[2022-03-30T01:58:00.973] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814326 uid 564800353

There are quite a few log entries like this for other jobs, so I assume the script (or another script) is killing jobs. I am curious what sacct shows about this job. Can I have you run the following and send the output?

sacct -j 8814326

Can you explain why there is a script killing jobs so often? Was something failing before that caused you to forcefully remove them?

Thanks,
Ben
I wanted to follow up and see if you've had this happen again, and if so whether you've been able to capture any additional information about what was happening. Let me know if you still need help with this ticket.

Thanks,
Ben
It sounds like this hasn't been a recurring issue, and we have limited information to go on right now. Since it doesn't appear to have happened again in over a month, I'll go ahead and close this ticket. Feel free to update the ticket if it does happen again.

Thanks,
Ben