| Summary: | cannot run parallel job steps - srun ignores -c when submitting a job step | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Enrico Tagliavini <enrico.tagliavini> |
| Component: | Regression | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Linux Distro: | CentOS | Machine Name: | |
| Attachments: | slurm.conf | ||
|
Description
Enrico Tagliavini
2021-07-05 06:34:57 MDT
Actually, it looks like the -c 1 option is also ignored, and a single step is billed for all 20 CPUs.
Updated to 20.11.8; the problem persists.
Adding --overlap to the srun command seems to help, but it should not be required.
For the steps, the ReqMem field from sacct still says 22Gn, but the billing confirms the memory is 1G, as specified by srun. However, the billing for the CPUs is wrong: it is still set to 20, despite -n 1 -c 1.
Billing of salloc: billing=20,cpu=20,mem=22G,node=1
Billing of step .0: cpu=20,mem=1G,node=1
Billing of step .1: cpu=20,mem=1G,node=1
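The mismatch between the request and the billed CPUs can be checked mechanically. The following is a minimal sketch, assuming the TRES strings are copied by hand from the accounting output as shown in this report; `parse_tres` and `cpu_overallocation` are hypothetical helper names, not part of Slurm.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Split a TRES string such as 'cpu=20,mem=1G,node=1' into a dict."""
    return dict(item.split("=", 1) for item in tres.split(",") if item)


def cpu_overallocation(step_tres: str, ntasks: int, cpus_per_task: int) -> int:
    """Return how many CPUs the step holds beyond ntasks * cpus_per_task."""
    allocated = int(parse_tres(step_tres)["cpu"])
    return allocated - ntasks * cpus_per_task


# Values copied from this report; the srun request was -n 1 -c 1.
job_tres = "billing=20,cpu=20,mem=22G,node=1"
step_tres = "cpu=20,mem=1G,node=1"

print("CPUs in the job allocation:", parse_tres(job_tres)["cpu"])
print("CPUs billed beyond the request:", cpu_overallocation(step_tres, 1, 1))
```

A result of 0 would mean the step was billed exactly for what -n and -c asked for; any positive value is the surplus described above.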
Created attachment 20245: slurm.conf
Attached slurm.conf. We don't use a CLI filter plugin, so this cannot be the source of the issue.
I added DebugFlags=Steps to slurm.conf and found the following:

[2021-07-06T11:58:41.225] STEPS: _pick_step_nodes: JobId=31564 Currently running steps use 0 of allocated 20 CPUs on node compute01
[2021-07-06T11:58:41.225] _pick_step_nodes: step pick 1-1 nodes, avail:compute01 idle:compute01 picked:NONE
[2021-07-06T11:58:41.225] STEPS: _pick_step_nodes: step picked 0 of 1 nodes
[2021-07-06T11:58:41.225] STEPS: Picked nodes compute01 when accumulating from compute01
[2021-07-06T11:58:41.225] STEPS: step alloc on job node 0 (compute01) used 20 of 20 CPUs
[2021-07-06T11:58:41.231] STEPS: _slurm_rpc_job_step_create: JobId=31564 StepId=2 compute01 usec=6982

This confirms that all 20 CPUs are allocated to the step, despite -c and -n being specified and equal to 1.

Adding --exact to the srun commands seems to make it work as intended:

[2021-07-06T12:35:26.004] STEPS: _pick_step_nodes: JobId=31564 Currently running steps use 0 of allocated 20 CPUs on node compute01
[2021-07-06T12:35:26.004] _pick_step_nodes: step pick 1-1 nodes, avail:compute01 idle:compute01 picked:NONE
[2021-07-06T12:35:26.004] STEPS: _pick_step_nodes: step picked 0 of 1 nodes
[2021-07-06T12:35:26.004] STEPS: Picked nodes compute01 when accumulating from compute01
[2021-07-06T12:35:26.004] STEPS: step alloc on job node 0 (compute01) used 1 of 20 CPUs
[2021-07-06T12:35:26.010] STEPS: _slurm_rpc_job_step_create: JobId=31564 StepId=6 compute01 usec=6119

However, may I ask why -c and -n do not imply --exact? This is very confusing, because whatever the user asks for is ignored and replaced with the whole allocation. When srun is called with -c and -n, the calling user might be asking for the --exact behavior.
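For anyone comparing the two runs, the CPU counts in the STEPS debug output can be extracted with a short script. This is only a sketch: it assumes the "step alloc on job node" line keeps the exact wording shown in this report for 20.11.8, and `step_cpu_usage` is a hypothetical helper name, not a Slurm tool.

```python
import re

# Matches lines such as:
#   STEPS: step alloc on job node 0 (compute01) used 20 of 20 CPUs
# The wording is taken from the log excerpt in this report and may differ
# in other Slurm versions.
STEP_ALLOC = re.compile(
    r"step alloc on job node (?P<index>\d+) \((?P<node>[^)]+)\) "
    r"used (?P<used>\d+) of (?P<total>\d+) CPUs"
)


def step_cpu_usage(log_text: str) -> list[tuple[str, int, int]]:
    """Return (node, used_cpus, total_cpus) for every step alloc line."""
    return [
        (m["node"], int(m["used"]), int(m["total"]))
        for m in STEP_ALLOC.finditer(log_text)
    ]


without_exact = (
    "[2021-07-06T11:58:41.225] STEPS: step alloc on job node 0 "
    "(compute01) used 20 of 20 CPUs"
)
with_exact = (
    "[2021-07-06T12:35:26.004] STEPS: step alloc on job node 0 "
    "(compute01) used 1 of 20 CPUs"
)

for label, text in (("without --exact", without_exact), ("with --exact", with_exact)):
    for node, used, total in step_cpu_usage(text):
        print(f"{label}: {node} uses {used} of {total} CPUs")
```

Feeding it the full slurmctld log instead of the two sample lines would list one entry per step creation, which makes it easy to spot steps that took the whole allocation.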