| Summary: | Sbatch rejects job while srun/salloc allow it | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Ford <fordste5> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | Tim Wickberg <tim> |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | tim |
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | MSU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 18.08.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm Config File, Slurmctld log, Job submit script, Slurmctld log, Slurmctld log |
Hi,

There are slight differences in how limits are checked depending on the method used to submit a job, especially for multi-partition jobs. But I need more information to understand this better: which partitions did you submit these jobs to? Once we know where exactly the problem is, we will discuss internally what to do about it.

Dominik

The job's partition list is general-short-14,general-short-16,general-short-18,general-long-14,general-long-16,general-long-18,classres-14,classres-16. I suspect the partition causing the rejection in sbatch is general-long-18.

Hi,

I still can't recreate exactly the same behaviour. I tried with other jobs, a partition in the DOWN state, a reservation, and drained nodes. Could you send me the slurmctld log? Maybe I will find some hints there on how to recreate this.

I also recommend updating to the current Slurm 17.11 release, which contains https://github.com/SchedMD/slurm/commit/fef07a409724. This commit doesn't solve this issue, but it prevents a slurmctld segfault and other unexpected problems.

Dominik

Created attachment 7668 [details]
Slurmctld log
Created attachment 7669 [details]
Job submit script
Created attachment 7670 [details]
Slurmctld log
Created attachment 7671 [details]
Slurmctld log
Dominik, I attached the slurmctld logs from a time when I saw this error. Job 24862 was an srun requesting 3 nodes that ran successfully. Job 24862 was an salloc requesting 3 nodes that ran successfully. Immediately after those two jobs finished, I attempted to submit an sbatch job requesting 3 nodes, and it was rejected. I included our job_submit script as well, since we use it to modify jobs' partition lists.

Hi,

It took some time, but I have finally tracked down one possible problem. Are you sure you have access to the "classres" account?

Dominik

Dominik, I'm sorry for not getting back to you sooner. We are now using EnforcePartLimits=ANY, so we're no longer seeing this issue. I'm going to leave this issue be and come back to it if we revert to EnforcePartLimits=ALL.

Thanks,
Steve

Hi,

OK, good to know. If it is not a problem for you, I would like to leave this bug open; we still need a fix/enhancement in this code area.

Dominik

Hi,

This has been fixed in 19.05 by the commit below; it is also in 18.08.6 and above: https://github.com/SchedMD/slurm/commit/233eca355cf1e4d

Closing as resolved/fixed.

Dominik
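For context on the workaround mentioned above, here is a minimal sketch of the relevant slurm.conf setting. The partition line is hypothetical (names and limits are assumptions, not MSU's actual configuration); only the EnforcePartLimits semantics are taken from the discussion:

```
# slurm.conf fragment (illustrative sketch only)
#
# EnforcePartLimits controls submit-time validation of a job against
# its partition list:
#   ALL - reject the job unless EVERY partition in the list can
#         satisfy the request (the setting the reporter started with)
#   ANY - accept the job if AT LEAST ONE listed partition can satisfy
#         it (the workaround the reporter switched to)
#   NO  - do not enforce partition limits at submission time
EnforcePartLimits=ANY

# Hypothetical partition definition for illustration:
PartitionName=general-long-18 Nodes=node[01-18] MaxTime=7-00:00:00
```

With ALL, a single too-small partition anywhere in the job's partition list is enough to reject the whole submission, which is why the reporter suspected general-long-18.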
Created attachment 7482 [details]
Slurm Config File

The following job gets rejected by sbatch with the error 'Requested partition configuration not available now':

```shell
#!/bin/bash
#SBATCH --job-name=aug1test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=30
#SBATCH --output=%x-%j.SLURMout
echo -e 'Aug 1 test.\n'
```

However, if the same resources are requested using salloc/srun, the allocation is granted:

```shell
$ srun --nodes=2 --ntasks-per-node=2 --time=30 hostname
srun: Required node not available (down, drained or reserved)
srun: job 8885 queued and waiting for resources
srun: job 8885 has been allocated resources

$ salloc --nodes=2 --ntasks-per-node=2 --time=30
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 8886
salloc: job 8886 queued and waiting for resources
salloc: job 8886 has been allocated resources
salloc: Granted job allocation 8886
salloc: Waiting for resource configuration
```

It seems that EnforcePartLimits=ALL is not being enforced for srun and salloc the same way it is for sbatch; I'm not sure whether this is expected. Our job submit plugin gives these jobs a partition list that includes a partition with only one node. With that partition in the list and EnforcePartLimits set to ALL, I would expect the allocation to be rejected by sbatch, salloc, and srun alike.
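The asymmetry can be seen by submitting the same multi-node request through each command. This is a sketch, not a verified reproducer: the --partition list below is an assumption based on the reporter's description (a list containing one partition too small for the request), and it requires a cluster running with EnforcePartLimits=ALL:

```shell
# Assumed: general-long-18 cannot satisfy a 2-node request.
# sbatch is rejected up front by the controller's submit-time check
# with the error reported in this bug:
sbatch --partition=general-long-18,general-short-14 \
       --nodes=2 --ntasks-per-node=2 --time=30 job.sh

# The identical request via srun (or salloc) is instead queued and
# eventually allocated, despite EnforcePartLimits=ALL:
srun --partition=general-long-18,general-short-14 \
     --nodes=2 --ntasks-per-node=2 --time=30 hostname
```

The fix referenced in the resolution (merged for 19.05 and backported to 18.08.6) makes the submit-time partition check consistent across the three submission paths.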