| Summary: | Job submission over overlapping partitions | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Paolo Oliveri <paul> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm configuration for our cluster | ||
Created attachment 22727 [details]
Slurm configuration for our cluster

Hello,

We have a cluster with 12 compute nodes plus 1 master node, running CentOS 7.7 and Slurm 18.08.8. I know we should upgrade to a supported version, but we cannot do that right now, and before upgrading we would like to know whether anything other than the Slurm version could explain this bug.

We configured several partitions and QOSes in the system, and some partitions share nodes with others. Each node has 144 threads, 4 of which are reserved for the system, leaving 140 available for jobs. Preemption is done by suspending and gang-scheduling jobs, together with QOS-based prioritization.

The problem is that we are experiencing node overloads: when a node is already running 140 jobs from one partition, another partition can still submit new jobs to it, even though Slurm should not accept any more work on that node.

For example, node hpc-01 is fully allocated by jobs from the "normal" partition:

```
scontrol show node hpc-01
NodeName=hpc-01 Arch=x86_64 CoresPerSocket=18 CPUAlloc=140 CPUTot=144 CPULoad=117.54
   AvailableFeatures=infiniband
   ActiveFeatures=infiniband
   Gres=(null)
   NodeAddr=hpc-01 NodeHostName=hpc-01 Version=18.08
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=515633 AllocMem=430080 FreeMem=484762 Sockets=4 Boards=1
   CoreSpecCount=2 CPUSpecList=106-107,142-143 MemSpecLimit=6144
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=20 Owner=N/A MCS_label=N/A
   Partitions=interactive,normal,fast,big
   BootTime=2021-07-30T18:27:35 SlurmdStartTime=2021-07-30T18:29:19
   CfgTRES=cpu=144,mem=515633M,billing=144
   AllocTRES=cpu=140,mem=420G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```

Submitting a new job to hpc-01 from the "normal" partition queues as expected:

```
salloc -n 1 --partition normal --qos normal --nodelist hpc-01 --no-shell
salloc: Pending job allocation 1469203
salloc: job 1469203 queued and waiting for resources
```

But submitting a new job from another partition, "big", whose nodes overlap with the "normal" partition (including hpc-01), is granted immediately:

```
salloc -n 1 --partition big --qos low --nodelist hpc-01 --no-shell
salloc: Granted job allocation 1469204
salloc: Waiting for resource configuration
salloc: Nodes hpc-01 are ready for job
```

Does anyone know whether there is a problem in our Slurm configuration? I have attached it for inspection.

Sincerely,
Paolo Oliveri