| Summary: | Computes are showing as ALLOC but there are no jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | PDT Partners <customer-pdt> |
| Component: | slurmctld | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | nick |
| Version: | 20.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | PDT Partners | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm controller log overlapping the time when this happened; Slurm configuration | | |
Description

PDT Partners 2022-03-30 11:19:23 MDT

Created attachment 24149 [details]
Slurm configuration
Please run "scontrol show nodes". Please also attach the slurmd.log from one of the compute nodes in this state. We suspect the nodes are stuck in the completing state, and the "show nodes" output would help us confirm that. "sinfo" can be misleading for nodes in the completing state, so in this case its output is not the most useful. Are these nodes still powered up, and is slurmd still active on them?

In addition to the things Jason asked for, I'm curious whether this script works in some cases and not others. Or was it working previously and then stopped working after some sort of change?

Thanks,
Ben

This error happened on a production cluster and we had to clean up manually, so I don't have any more logs for this. We've only seen this happen once so far.

I'm glad to hear this isn't more of an issue for you, but that does make it harder to track down. Looking at the logs, I see that the script does drain the nodes and mark them as FUTURE successfully a few times. Focusing on compute12973, the most recent log entries for it show that it was only set to DRAINING. One of the last jobs scheduled on the node was 8814326. It looks like that job, and others that were scheduled on that node, were killed:

[2022-03-30T01:58:00.973] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=8814326 uid 564800353

There are quite a few log entries like this for other jobs, so I assume the script (or another script) is killing jobs. I'm curious what sacct shows about this job. Can I have you run the following and send the output?

sacct -j 8814326

Can you also explain why there is a script killing jobs so often? Was something failing before that caused you to remove them forcefully?

Thanks,
Ben

I wanted to follow up and see if you've had this happen again, and if so, whether you've been able to capture any additional information about what was happening. Let me know if you still need help with this ticket.
Thanks,
Ben

It sounds like this hasn't been a recurring issue, and we have limited information to go on right now. Since it doesn't appear to have happened again in over a month, I'll go ahead and close this ticket. Feel free to update the ticket if it does happen again.

Thanks,
Ben
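The check suggested in the thread above — running "scontrol show nodes" to see whether nodes are stuck in the completing state — can be sketched as a small parser over that command's output. This is an illustrative sketch only: the sample node record below is invented for the example (it is not output from the ticket), and it assumes Slurm's standard key=value record format with records separated by blank lines.

```python
import re

def completing_nodes(scontrol_output: str) -> list[str]:
    """Return names of nodes whose State field contains COMPLETING."""
    stuck = []
    # `scontrol show nodes` separates node records with blank lines.
    for record in scontrol_output.strip().split("\n\n"):
        name = re.search(r"NodeName=(\S+)", record)
        state = re.search(r"State=(\S+)", record)
        if name and state and "COMPLETING" in state.group(1):
            stuck.append(name.group(1))
    return stuck

# Invented sample output for illustration; a real record has many more fields.
sample = """\
NodeName=compute12973 Arch=x86_64 CoresPerSocket=24
   State=ALLOCATED+COMPLETING+DRAIN ThreadsPerCore=1

NodeName=compute12974 Arch=x86_64 CoresPerSocket=24
   State=IDLE ThreadsPerCore=1
"""

print(completing_nodes(sample))  # -> ['compute12973']
```

In a live investigation one would feed this the actual output of `scontrol show nodes`; here the point is simply that the State field, not `sinfo`'s summary, is what distinguishes a node that is merely draining from one stuck completing.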