Ticket 10627

Summary: Single step without options not allowed to run in batch step
Product: Slurm Reporter: Marcin Stolarek <cinek>
Component: Other    Assignee: Nate Rini <nate>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: csc-slurm-tickets, gensyshpe, remi.lacroix
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10389
Site: IDRIS
Version Fixed: 20.11.6, 21.08pre1
Attachments: Slurmctld log debug5
Job output
Slurmd log
Slurmctld log with debugflags SelectType/TraceJobs/Steps
Slurmd log with debugflags SelectType/TraceJobs/Steps
debug patch
Slurmd log after mem patch
Slurmctld log after mem patch
test patch
Slurmctld log
Slurmd log
patch to add more logging
Slurmctld log with more logging
Slurmd log with more logging
Patches applied to Slurm v20.02.6
Slurmctld and slurmd logs
patch for 20.02.6 (IDRIS only)

Description Marcin Stolarek 2021-01-14 01:09:45 MST
Created attachment 17469 [details]
slurm.conf

Splitting this from Bug 10474 comment 12

When running:

    sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
    > #!/bin/bash
    > srun hostname
    > EOF

Job output:

    srun: error: Unable to create step for job 506: More processors requested than permitted

The same sbatch command without --exclusive produces the expected result. It also works with --exclusive when using srun directly:

    srun --ntasks=8 --cpus-per-task=10 --hint=nomultithread --qos=qos_cpu-dir --account=xyz --exclusive hostname

---
This may be related to bug 10389
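For scale: the step above asks for 8 × 10 = 80 CPUs, so the error implies slurmctld computed fewer than 80 usable CPUs for the step inside the allocation. A quick sanity check of the arithmetic (the 40-CPU figure is hypothetical, not taken from the logs):

```shell
ntasks=8
cpus_per_task=10
step_cpus=$(( ntasks * cpus_per_task ))
echo "step requests ${step_cpus} CPUs"
# The step is rejected when the job's usable CPU count falls below this,
# e.g. (hypothetically) if only 40 CPUs of the allocation are usable:
alloc_cpus=40
if [ "$step_cpus" -gt "$alloc_cpus" ]; then
    echo "More processors requested than permitted"
fi
```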
Comment 1 Nate Rini 2021-01-14 11:29:26 MST
(In reply to IDRIS System Team from bug#10474 comment #12)
> Hi!
> 
> The patch seems to solve the current issue (job no longer blocked, no
> block_sync_core_bitmap error) but we experience now a problem when using
> exclusive. We don't know if it's related to the patch.
> 
> When running:
> 
>     sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread
> --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
>     > #!/bin/bash
>     > srun hostname
>     > EOF
> 
> Job output:
> 
>     srun: error: Unable to create step for job 506: More processors
> requested than permitted
> 
> The same sbatch command without exclusive produces the expected result. It
> also works with exclusive when using srun directly:
> 
>     srun --ntasks=8 --cpus-per-task=10 --hint=nomultithread
> --qos=qos_cpu-dir --account=xyz --exclusive hostname

Is it possible to get logs of this job (or a repeat of it) from slurmctld?
Comment 2 IDRIS System Team 2021-01-19 02:42:27 MST
Created attachment 17530 [details]
Slurmctld log debug5
Comment 3 Nate Rini 2021-01-19 08:53:27 MST
(In reply to Nate Rini from comment #1)
> (In reply to IDRIS System Team from bug#10474 comment #12)
> >     sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread
> > --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
> >     > #!/bin/bash
> >     > srun hostname
> >     > EOF

Please run this test job instead:
>     sbatch --ntasks=8 --cpus-per-task=10 --hint=nomultithread
> --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
>     > #!/bin/bash
>     > env |grep SLURM
>     > srun -vvvvv --slurmd-debug=debug3 hostname
>     > EOF
Comment 4 IDRIS System Team 2021-01-19 10:11:51 MST
Created attachment 17535 [details]
Job output
Comment 5 Nate Rini 2021-01-19 10:44:32 MST
(In reply to Nate Rini from comment #3)
Please try this test job:
>     sbatch -N2 --ntasks=8 --cpus-per-task=10 --hint=nomultithread --ntasks-per-node=4
> --qos=qos_cpu-dir --account=xyz --exclusive <<EOF
>     > #!/bin/bash
>     > env |grep SLURM
>     > srun -vvvvv --slurmd-debug=debug3 hostname
>     > EOF

Please also attach the slurmd log from the head node of the job.
Comment 6 IDRIS System Team 2021-01-20 01:51:50 MST
Created attachment 17550 [details]
Slurmd log
Comment 8 Nate Rini 2021-01-20 08:41:07 MST
(In reply to IDRIS System Team from comment #6)
> Created attachment 17550 [details]
> Slurmd log

The issue has been reproduced locally. Working on analysis and possible patch.
Comment 9 Nate Rini 2021-01-20 09:59:08 MST
(In reply to Nate Rini from comment #8)
> (In reply to IDRIS System Team from comment #6)
> > Created attachment 17550 [details]
> > Slurmd log
> 
> The issue has been reproduced locally. Working on analysis and possible
> patch.

Please ignore that response; I found another bug that causes the same error (opened bug#10669), but it is probably not this issue.

When was the last time that slurmctld was restarted?

Is it possible to get a copy of the job submit script? Does it modify any of the job parameters?
> JobSubmitPlugins=lua
Comment 10 IDRIS System Team 2021-01-21 03:12:16 MST
slurmctld was restarted a few days ago (2021-01-18T14:54:32).

The job submit script can modify some parameters (partition and account), but we don't think they are relevant here (the conditions are not met).

(In reply to Nate Rini from comment #9)
> (In reply to Nate Rini from comment #8)
> > (In reply to IDRIS System Team from comment #6)
> > > Created attachment 17550 [details]
> > > Slurmd log
> > 
> > The issue has been reproduced locally. Working on analysis and possible
> > patch.
> 
> Please ignore that response, found another bug that causes the same error
> (opened bug#10669) but is probably not this issue.
> 
> When was the last time that slurmctld was restarted?
> 
> Is it possible to get a copy of the job submit script? Does it modify any of
> the job parameters?
> > JobSubmitPlugins=lua
Comment 11 Nate Rini 2021-01-21 11:25:48 MST
(In reply to IDRIS System Team from comment #10)
> slurmctld was restarted few days ago (2021-01-18T14:54:32)

Please activate these debugflags in slurmctld:
> scontrol setdebugflags +SelectType
> scontrol setdebugflags +TraceJobs
> scontrol setdebugflags +Steps

Please resubmit the job again and then attach the slurmd and slurmctld logs. Please deactivate the flags after.
> scontrol setdebugflags -SelectType
> scontrol setdebugflags -TraceJobs
> scontrol setdebugflags -Steps
Comment 12 IDRIS System Team 2021-01-22 03:39:54 MST
Created attachment 17578 [details]
Slurmctld log with debugflags SelectType/TraceJobs/Steps
Comment 13 IDRIS System Team 2021-01-22 03:40:21 MST
Created attachment 17579 [details]
Slurmd log with debugflags SelectType/TraceJobs/Steps
Comment 21 Nate Rini 2021-01-25 17:49:44 MST
Created attachment 17616 [details]
debug patch
Comment 22 Nate Rini 2021-01-25 17:50:27 MST
(In reply to Nate Rini from comment #21)
> Created attachment 17616 [details]
> debug patch

Is it possible to apply this patch to slurmctld and then rerun the test from comment #11?
Comment 23 IDRIS System Team 2021-01-28 03:52:06 MST
Hi!

After applying the patch and running the test, the following line appeared in the slurmctld log:

   bug10627: node_tmp=NULL nodes_needed:2 step_spec->max_nodes:2 pick_node_cnt:2 mem_blocked_cpus:0

(In reply to Nate Rini from comment #22)
> (In reply to Nate Rini from comment #21)
> > Created attachment 17616 [details]
> > debug patch
> 
> Is it possible to apply this patch to slurmctld and then rerun the test from
> comment #11?
Comment 24 Nate Rini 2021-02-08 09:54:49 MST
(In reply to IDRIS System Team from comment #23)
>    bug10627: node_tmp=NULL nodes_needed:2 step_spec->max_nodes:2
> pick_node_cnt:2 mem_blocked_cpus:0

That helped determine where the failure was happening. I'm still working on replicating the issue locally.
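The `mem_blocked_cpus` counter in the debug line hints at the interplay: a CPU is "memory blocked" when the memory available to the step cannot cover its per-CPU memory request. A simplified model of that accounting (hypothetical illustration, not the actual slurmctld code):

```shell
# usable_cpus NODE_CPUS MEM_AVAIL_MB MEM_PER_CPU_MB
# Hypothetical simplification of per-CPU memory accounting during step creation.
usable_cpus() {
    local node_cpus=$1 mem_avail_mb=$2 mem_per_cpu_mb=$3
    if [ "$mem_per_cpu_mb" -le 0 ]; then
        echo "$node_cpus"              # no per-CPU memory limit in effect
        return
    fi
    local mem_limited=$(( mem_avail_mb / mem_per_cpu_mb ))
    if [ "$mem_limited" -lt "$node_cpus" ]; then
        echo "$mem_limited"            # remaining CPUs are memory-blocked
    else
        echo "$node_cpus"
    fi
}

usable_cpus 40 81920 2048   # full 2G-per-CPU allocation on 40 CPUs -> 40
usable_cpus 40 0 2048       # 0 MB recorded for the job -> every CPU blocked -> 0
```

In this model, a job whose allocation records 0 MB of memory would have every CPU memory-blocked, which would reject any step regardless of the CPU count.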
Comment 26 Nate Rini 2021-02-08 12:52:10 MST
Please try this patch:
> https://github.com/SchedMD/slurm/commit/e8a0930931427a2209a8b27296a8c6ce82f77683

I believe it may explain why the requested jobs have half the memory of my test jobs.

If not, please provide the same logs as comment #13 with the patch.
Comment 27 IDRIS System Team 2021-02-11 10:12:34 MST
Created attachment 17895 [details]
Slurmd log after mem patch

The patch does not seem to change anything.
Comment 28 Nate Rini 2021-02-11 11:44:35 MST
(In reply to IDRIS System Team from comment #27)
> Created attachment 17895 [details]
> Slurmd log after mem patch
> 
> The patch does not seem to change anything.

Please also attach your slurmctld log with the SelectType and TraceJobs debugflags active.
Comment 29 IDRIS System Team 2021-02-12 00:56:28 MST
Created attachment 17912 [details]
Slurmctld log after mem patch
Comment 33 Nate Rini 2021-02-12 14:13:22 MST
Created attachment 17934 [details]
test patch

(In reply to IDRIS System Team from comment #27)
> Created attachment 17895 [details]
> Slurmd log after mem patch
> 
> The patch does not seem to change anything.

Please give this patch a try. Please provide the same logs as comment #29 if it does not work.
Comment 34 IDRIS System Team 2021-02-17 01:37:00 MST
Created attachment 17957 [details]
Slurmctld log
Comment 35 IDRIS System Team 2021-02-17 01:37:18 MST
Created attachment 17958 [details]
Slurmd log
Comment 36 Nate Rini 2021-02-17 09:25:01 MST
(In reply to IDRIS System Team from comment #34)
> Created attachment 17957 [details]
> Slurmctld log

I'm going to prepare another patch to add more debug logging; it looks like the job is getting allocated 0 memory.
Comment 37 Nate Rini 2021-02-17 09:56:27 MST
Created attachment 17964 [details]
patch to add more logging

(In reply to Nate Rini from comment #36)
> (In reply to IDRIS System Team from comment #34)
> > Created attachment 17957 [details]
> > Slurmctld log
> 
> I'm going to prepare another patch to add more debug logging; it looks like
> the job is getting allocated 0 memory.

Please apply this patch to slurmctld and run the test job. Please then revert it as it is a verbose patch.
Comment 38 IDRIS System Team 2021-02-18 07:36:44 MST
Created attachment 17981 [details]
Slurmctld log with more logging
Comment 39 IDRIS System Team 2021-02-18 07:37:06 MST
Created attachment 17982 [details]
Slurmd log with more logging
Comment 40 IDRIS System Team 2021-02-18 07:49:02 MST
Created attachment 17983 [details]
Patches applied to Slurm v20.02.6

When applying the logging patch, we noticed that we don't have the same source code. We currently use 3 patches to fix bugs #10474, #9670 and #9724. Unfortunately this information was lost when the current issue was split from #10474. Just in case, here is our diff from slurm-20.02.6.tar.bz2.
Comment 42 Nate Rini 2021-02-19 15:23:31 MST
We are still working on analysis based on the logs provided.
Comment 47 IDRIS System Team 2021-02-23 09:59:08 MST
Created attachment 18065 [details]
Slurmctld and slurmd logs
Comment 55 Nate Rini 2021-02-23 16:14:08 MST
Created attachment 18077 [details]
patch for 20.02.6 (IDRIS only)

After a good bit of time staring at the logs and the current code base, the conclusion is that 20.02.6 is significantly different, and attempting to backport the other changes would likely cause more bugs than it fixes.

This patch includes all the fixes as given in comment #40 but does not have any of the patches from this bug.

Please apply this patchset to a clean version of 20.02.6 and test it. We are generally not patching 20.02 anymore.
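Applying a patchset to a pristine tree generally follows this pattern (demonstrated on a toy tree so the commands are self-contained; the real source comes from slurm-20.02.6.tar.bz2 and the real patch is attachment 18077 — file names here are stand-ins):

```shell
# Toy stand-in for the unpacked slurm-20.02.6 source tree.
mkdir -p slurm-20.02.6/src
echo 'old line' > slurm-20.02.6/src/step_mgr.c

# Toy stand-in for the attached patchset.
cat > bug10627.patch <<'EOF'
--- a/src/step_mgr.c
+++ b/src/step_mgr.c
@@ -1 +1 @@
-old line
+new line
EOF

cd slurm-20.02.6
patch -p1 < ../bug10627.patch        # -p1 strips the leading a/ and b/
grep -q 'new line' src/step_mgr.c && echo "patch applied cleanly"
cd ..
```

After the real patchset applies cleanly, the usual `./configure && make && make install` rebuild of Slurm follows.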
Comment 56 Nate Rini 2021-03-10 11:58:59 MST
(In reply to Nate Rini from comment #55)
> Please apply this patchset to a clean version of 20.02.6 and test it. We are
> generally not patching 20.02 anymore.

Any updates? Is it possible to test this patch?
Comment 60 IDRIS System Team 2021-03-19 07:13:38 MDT
Hi!

The patch seems to fix all the issues we reported so far.

Now we have a new strange behavior: when we request 4 tasks with 10 physical cores each and 4 GPUs per node (i.e. an entire node), Slurm allocates 2 nodes with 2 tasks and 4 GPUs each. The job is charged for 8 GPUs, which is twice as much as it should be. Of course this doesn't happen when we specify "--nodes=1" or "--ntasks-per-node=4". Also, Slurm allocates only 1 node if we don't request GPUs.

Could you tell us if this is an expected behavior or a bug?

$ srun -A abc -n 4 -c 10 --hint=nomultithread --gres=gpu:4 ~/binding_mpi.exe   
srun: job 797 queued and waiting for resources
srun: job 797 has been allocated resources
Hello from level 1: rank= 1, thread level 1= -1, on r7i3n6. (core affinity = 10-19)
Hello from level 1: rank= 0, thread level 1= -1, on r7i3n6. (core affinity = 0-9)
Hello from level 1: rank= 2, thread level 1= -1, on r7i3n7. (core affinity = 0-9)
Hello from level 1: rank= 3, thread level 1= -1, on r7i3n7. (core affinity = 10-19)

$ scontrol show job 797
JobId=797 JobName=binding_mpi.exe
  UserId=user01(1000) GroupId=group01(1000) MCS_label=N/A
  Priority=156250 Nice=0 Account=abc QOS=qos_gpu-t3
  JobState=COMPLETED Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
  RunTime=00:00:01 TimeLimit=00:10:00 TimeMin=N/A
  SubmitTime=2021-03-04T10:55:59 EligibleTime=2021-03-04T10:55:59
  AccrueTime=2021-03-04T10:55:59
  StartTime=2021-03-04T10:55:59 EndTime=2021-03-04T10:56:00 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-04T10:55:59
  Partition=gpu_p13 AllocNode:Sid=jean-zay2-ib0:68449
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=r7i3n[6-7]
  BatchHost=r7i3n6
  NumNodes=2 NumCPUs=80 NumTasks=4 CPUs/Task=10 ReqB:S:C:T=0:0:*:1
  TRES=cpu=80,mem=160G,energy=405,node=2,billing=80,gres/gpu=8
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=10 MinMemoryCPU=2G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/linkhome/user01/binding_mpi.exe
  WorkDir=/linkhome/group/user01
  Power=
  TresPerNode=gpu:4
  MailUser=user01 MailType=NONE
Comment 61 Nate Rini 2021-03-19 10:19:56 MDT
(In reply to IDRIS System Team from comment #60)
> The patch seems to fix all the issues we reported so far.
Great. We will QA it for upstream inclusion.
 
> Now we have a new strange behavior: when we request 4 tasks with 10 physical
> cores and 4 GPU per nodes (i.e. an entire node), Slurm allocates 2 nodes
> with 2 tasks and 4 GPU each. The job is charged for 8 GPU which is twice as
> much as it should be. Of course this doesn't happen when we specify
> "--node=1" or "--task-per-node=4". Also Slurm allocates only 1 node if we
> don't request GPU.
Please open a new bug for this to avoid confusing the issues. It doesn't look related to this bug.
Comment 68 Nate Rini 2021-03-25 16:32:01 MDT
A modified patch is upstream for slurm-20.11.6:
> https://github.com/SchedMD/slurm/commit/8cd6af0c8ff73f1543c53ba4dbceec137ab8ca33

Closing ticket, please reply if any more issues are found.

Thanks,
--Nate