Ticket 8251

Summary: Inconsistent behaviour between sbatch and srun regarding options --ntasks and --ntasks-per-node
Product: Slurm Reporter: Chrysovalantis Paschoulas <c.paschoulas>
Component: User Commands Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: --- CC: cpaschoulas
Version: 19.05.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8328
https://bugs.schedmd.com/show_bug.cgi?id=3079
https://bugs.schedmd.com/show_bug.cgi?id=8733
Site: Jülich
Attachments: Debugging job 141748
Debugging job 141749

Description Chrysovalantis Paschoulas 2019-12-18 09:24:41 MST
We found out that there is an inconsistency between sbatch and srun regarding the submission options --ntasks and --ntasks-per-node.

For sbatch --ntasks-per-node wins over --ntasks.
For srun --ntasks wins over --ntasks-per-node.

I suspect this issue appeared because of the fix for an old issue where the env var SLURM_NTASKS_PER_NODE was not exported to the job script when --ntasks-per-node was given to sbatch.

In any case it doesn't matter much which option should win (I would vote for --ntasks-per-node), but the inconsistency should go away, right?
Comment 2 Nate Rini 2019-12-18 10:33:53 MST
(In reply to Chrysovalantis Paschoulas from comment #0)
> For sbatch --ntasks-per-node wins over --ntasks.
> For srun --ntasks wins over --ntasks-per-node.
Can you please provide an example of this? Do you mean with respect to using environment variables?
Comment 3 Chrysovalantis Paschoulas 2019-12-19 01:41:29 MST
(In reply to Nate Rini from comment #2)
> (In reply to Chrysovalantis Paschoulas from comment #0)
> > For srun --ntasks wins over --ntasks-per-node.
For example:
```
$ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
j3c011
j3c011

```

> > For sbatch --ntasks-per-node wins over --ntasks.
For example:
```
$ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
Submitted batch job 121583

$ cat slurm-121583.out
j3c011
j3c011
j3c011
j3c011
```
> Can you please provide an example of this? Do you mean with respect to using
> environment variables?

No, this has nothing to do with env vars. I only mentioned that this issue/regression was introduced after a fix for the SLURM_NTASKS_PER_NODE env var.
Comment 5 Nate Rini 2019-12-23 09:23:59 MST
(In reply to Chrysovalantis Paschoulas from comment #3)
> (In reply to Nate Rini from comment #2)
> > (In reply to Chrysovalantis Paschoulas from comment #0)
> $ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"

Can you please call these 2 commands instead and provide the output?
> $ srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> $ sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"

I called both commands locally and got a different answer than in comment #3. Please also provide a copy of your slurm.conf.

> > Can you please provide an example of this? Do you mean with respect to using
> > environment variables?
> 
> No this has nothing to do with env vars.
Thanks for the clarification.
Comment 6 Chrysovalantis Paschoulas 2020-01-10 02:30:41 MST
(In reply to Nate Rini from comment #5)
> (In reply to Chrysovalantis Paschoulas from comment #3)
> > (In reply to Nate Rini from comment #2)
> > > (In reply to Chrysovalantis Paschoulas from comment #0)
> > $ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> > $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
> 
> Can you please call these 2 commands instead and provide the output?
> > $ srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
For srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname:
```
srun: defined options
srun: -------------------- --------------------
srun: account             : root
srun: licenses            : project@just,scratch@just,home@just
srun: nodes               : 1
srun: ntasks              : 2
srun: ntasks-per-node     : -2
srun: partition           : batch
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=46187
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 39812
srun: debug:  Entering _msg_thr_internal
srun: debug:  Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug:  Waited 0.300000 sec and still waiting: next sleep for 0.300000 sec
srun: debug:  Waited 0.600000 sec and still waiting: next sleep for 0.400000 sec
srun: debug:  Waited 1.000000 sec and still waiting: next sleep for 0.500000 sec
srun: debug:  Waited 1.500000 sec and still waiting: next sleep for 0.600000 sec
srun: debug:  Waited 2.100000 sec and still waiting: next sleep for 0.700000 sec
srun: debug:  Waited 2.800000 sec and still waiting: next sleep for 0.800000 sec
srun: debug:  Waited 3.600000 sec and still waiting: next sleep for 0.900000 sec
srun: debug:  Waited 4.500000 sec and still waiting: next sleep for 1.000000 sec
srun: Nodes j3c011 are ready for job
srun: jobid 121858: nodes(1):`j3c011', cpu counts: 32(x1)
srun: debug2: creating job with 2 tasks
srun: debug:  requesting job 121858, user 9226, nodes 1 including ((null))
srun: debug:  cpus 2, tasks 2, name hostname, relative 65534
srun: debug2: spank: noturbospank.so: local_user_init = 0
srun: debug2: spank: x11spank.so: local_user_init = 0
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/pspmi
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 33446
srun: debug:  Started IO server thread (47915426354944)
srun: debug:  Entering _launch_tasks
srun: launching 121858.0 on host j3c011, 2 tasks: [0-1]
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 192.168.12.31, node rank 0, sd=17
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.12.31:40758 16
srun: debug2: received task launch
srun: Node j3c011, 2 tasks started
srun: debug2: slurm_send_timeout: Socket no longer there
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
j3c011
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
j3c011
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.12.31:40760 16
srun: debug2: received task exit
srun: Received task exit notification for 2 tasks of step 121858.0 (status=0x0000).
srun: j3c011: tasks 0-1: Completed
srun: debug:  task 0 done
srun: debug:  task 1 done
srun: debug2:   false, shutdown
srun: debug2:   false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2:   false, shutdown
srun: debug:  IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2:   false, shutdown
srun: debug:  Leaving _msg_thr_internal

```
> > $ sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
For sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname":
```
STDERR:
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : root
sbatch: licenses            : project@just,scratch@just,home@just
sbatch: nodes               : 1
sbatch: ntasks              : 2
sbatch: ntasks-per-node     : 4
sbatch: partition           : batch
sbatch: verbose             : 3
sbatch: wrap                : srun hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: psgw_spank.so: init_post_opt = 0
sbatch: debug2: spank: noturbospank.so: init_post_opt = 0
sbatch: debug2: spank: perfparanoidspank.so: init_post_opt = 0
sbatch: debug2: spank: x11spank.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0022
sbatch: debug:  Munge authentication plugin loaded
sbatch: Cray/Aries node selection plugin loaded
Submitted batch job 121859
sbatch: debug2: spank: noturbospank.so: exit = 0
sbatch: debug2: spank: x11spank.so: exit = 0

STDOUT:
j3c011
j3c011
j3c011
j3c011

```
> 
> I called both commands locally and they got a different answer than in
> comment #3. Please also provide a copy of your slurm.conf.
Which config parameters from slurm.conf do you care about?

> 
> > > Can you please provide an example of this? Do you mean with respect to using
> > > environment variables?
> > 
> > No this has nothing to do with env vars.
> Thanks for the clarification.
Comment 7 Nate Rini 2020-01-10 11:18:00 MST
(In reply to Chrysovalantis Paschoulas from comment #6)
> (In reply to Nate Rini from comment #5)
> > I called both commands locally and they got a different answer than in
> > comment #3. Please also provide a copy of your slurm.conf.
> Which config parameters you care for from slurm.conf?

In general, the whole slurm.conf file, along with gres.conf and cgroup.conf if they exist. That avoids me having to repeatedly request different parameters as I test, which slows down the time to solution.

At the very least, please provide the following:
> SelectType
> TaskPlugin
> Priority*
> SchedulerParameters
> SchedulerType
> MpiDefault
> GresTypes
> AccountingStorageEnforce

I don't need any of the filesystem paths, ports or user names. The goal is to get my local Slurm install to duplicate your issue.
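For convenience, a short script along these lines can pull out just the requested keys from a slurm.conf-style file (a sketch only; the helper name and the sample content below are invented for illustration):

```python
import re

# Keys requested above; Priority\w* matches any Priority-prefixed parameter.
WANTED = re.compile(
    r"^(SelectType|TaskPlugin|Priority\w*|SchedulerParameters|"
    r"SchedulerType|MpiDefault|GresTypes|AccountingStorageEnforce)\s*=",
    re.IGNORECASE,
)

def filter_conf(text):
    """Return only the lines whose key matches the requested parameters."""
    return [line for line in text.splitlines() if WANTED.match(line.strip())]

# Invented sample content, standing in for a real slurm.conf:
sample = """\
ClusterName=test
SelectType=select/cons_res
PriorityType=priority/multifactor
SlurmdPort=6818
"""
print(filter_conf(sample))
```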
Comment 8 Chrysovalantis Paschoulas 2020-01-13 02:02:59 MST
Created attachment 12713 [details]
Slurm config
Comment 9 Chrysovalantis Paschoulas 2020-01-13 02:03:23 MST
Created attachment 12714 [details]
Gres config file
Comment 10 Chrysovalantis Paschoulas 2020-01-13 02:04:21 MST
I attached our slurm.conf and gres.conf.
Comment 11 Nate Rini 2020-01-13 10:20:50 MST
(In reply to Chrysovalantis Paschoulas from comment #10)
> I attached our slurm.conf and gres.conf.

Tagged both files as private. I will try to replicate your issue.
Comment 12 Nate Rini 2020-01-13 10:43:33 MST
(In reply to Chrysovalantis Paschoulas from comment #10)
> I attached our slurm.conf and gres.conf.

Which plugins are being loaded via plugstack.conf?

> Gres=mem112:no_consume:1
Is there a reason to use Gres instead of a node feature flag for this?
Comment 15 Nate Rini 2020-03-16 11:23:57 MDT
Timing this ticket out. Please respond to have it reopened automatically.
Comment 16 Chrysovalantis Paschoulas 2020-03-16 11:49:00 MDT
(In reply to Nate Rini from comment #15)
> Timing this ticket out. Please respond to have it reopened automatically.

Please reopen.

Our plugins on one of the big production clusters are:
```
# cat /etc/slurm/plugstack.conf 
required	/opt/parastation/lib64/slurm/psgw_spank.so
required	globresspank.so
required	showglobresspank.so
```

The memXXX entries are needed as GRESes for various reasons; we also have extra node features describing the memory size of each node type.
Comment 17 Nate Rini 2020-03-16 13:42:09 MDT
(In reply to Chrysovalantis Paschoulas from comment #16)
> (In reply to Nate Rini from comment #15)
> > Timing this ticket out. Please respond to have it reopened automatically.
> 
> Please reopen.
Done

> Our plugins on one of the big production clusters is:
> ```
> # cat /etc/slurm/plugstack.conf 
> required	/opt/parastation/lib64/slurm/psgw_spank.so
Is this https://github.com/ParaStation ?

> required	globresspank.so
> required	showglobresspank.so
Is this a local plugin?
> ```
> 
> The memXXX are needed as GRESs for some reasons, we have also extra features
> regarding the memory size for each node type.
Do you see the inconsistency without these spank plugins?
Comment 18 Chrysovalantis Paschoulas 2020-03-17 02:25:04 MDT
(In reply to Nate Rini from comment #17)
> (In reply to Chrysovalantis Paschoulas from comment #16)
> > (In reply to Nate Rini from comment #15)
> > > Timing this ticket out. Please respond to have it reopened automatically.
> > 
> > Please reopen.
> Done
> 
> > Our plugins on one of the big production clusters is:
> > ```
> > # cat /etc/slurm/plugstack.conf 
> > required	/opt/parastation/lib64/slurm/psgw_spank.so
> Is this https://github.com/ParaStation ?
> 
> > required	globresspank.so
> > required	showglobresspank.so
> Is this a local plugin?
> > ```
> > 
> > The memXXX are needed as GRESs for some reasons, we have also extra features
> > regarding the memory size for each node type.
> Do you see the inconsistency without these spank plugins?

Yes, these are local plugins (on a small test cluster we have more plugins). They export some env vars, modify the licenses, and talk to other daemons, but I believe none of them has anything to do with the current issue. They shouldn't affect the behavior of sbatch and srun in such a way.

Did you try to reproduce the behaviour I described in a test system?
Comment 19 Nate Rini 2020-03-19 14:09:16 MDT
(In reply to Chrysovalantis Paschoulas from comment #18)
> Did you try to reproduce the behaviour I described in a test system?

Yes, when the ticket was first opened. The observed behavior was not replicated in the test system with your configuration. We are going to need to get some debug logs to determine where the task math is having issues.

Please call the following on your controller:
> scontrol setdebug debug4
> scontrol setdebugflags +TraceJobs
> scontrol setdebugflags +Steps
> scontrol setdebugflags +SelectType

Please make sure to submit the test jobs as noted in comment #0. Please attach logs once done.

To reverse changes:
> scontrol setdebug info
> scontrol setdebugflags -TraceJobs
> scontrol setdebugflags -Steps
> scontrol setdebugflags -SelectType
Comment 20 Chrysovalantis Paschoulas 2020-03-20 04:16:21 MDT
Created attachment 13440 [details]
Debugging job 141748

Logs for job 141748
Comment 21 Chrysovalantis Paschoulas 2020-03-20 04:17:04 MDT
Created attachment 13441 [details]
Debugging job 141749

Logs for job 141749
Comment 22 Chrysovalantis Paschoulas 2020-03-20 04:18:47 MDT
(In reply to Nate Rini from comment #19)
> (In reply to Chrysovalantis Paschoulas from comment #18)
> > Did you try to reproduce the behaviour I described in a test system?
> 
> Yes, when the ticket was first opened. The observed behavior was not
> replicated in the test system with your configuration. We are going to need
> to get some debug logs to determine where the task math is having issues.
> 
> Please call the following on your controller:
> > scontrol setdebug debug4
> > scontrol setdebugflags +TraceJobs
> > scontrol setdebugflags +Steps
> > scontrol setdebugflags +SelectType
> 
> Please make sure to submit the test jobs as noted in comment #0. Please
> attach logs once done.
> 
> To reverse changes:
> > scontrol setdebug info
> > scontrol setdebugflags -TraceJobs
> > scontrol setdebugflags -Steps
> > scontrol setdebugflags -SelectType

I have created 2 attachments for each job.

Job 141748:
```
$ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
j3c011
j3c011

```

Job 141749:
```
$ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
Submitted batch job 141749

$ cat slurm-141749.out
j3c011
j3c011
j3c011
j3c011

```

Best Regards,
Valantis
Comment 24 Nate Rini 2020-03-25 14:25:13 MDT
There currently appear to be 2 bugs:

1. srun is not parsing "--ntasks-per-node=4": 
(In reply to Chrysovalantis Paschoulas from comment #6)
> For srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname:
> srun: ntasks-per-node     : -2
This is clearly wrong but it ends up with the correct number of tasks run (by coincidence).

I have replicated this issue locally and will work on a patch.

2. Something is interfering with "--ntasks-per-node=4" with batch jobs:
The logs show that sbatch is correctly parsing the input. The logs provided show that the controller gets the correct request:
> num_tasks=2 ntasks_per_node=4
Then for reasons unclear, the job is incorrectly started by slurmstepd.

I suspect #2 is caused by a local plugin or change as I don't see it with my test system. Can you please provide a copy of your lua job_submit script and spank plugins?
Comment 27 Nate Rini 2020-03-25 16:16:46 MDT
(In reply to Nate Rini from comment #24)
> 1. srun is not parsing "--ntasks-per-node=4": 
> #(In reply to Chrysovalantis Paschoulas from comment #6)
> > For srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname:
> > srun: ntasks-per-node     : -2
> This is clearly wrong but it ends up with the correct number of tasks run
> (by coincidence).
> 
> I have replicated this issue locally and will work on a patch.

Looking at the code, this behavior is intentional since ntasks is less than ntasks-per-node. (https://github.com/SchedMD/slurm/commit/daacf5afee9)
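A side note on the "-2" shown in the srun option dump above: Slurm uses 0xfffe (NO_VAL16) as the "unset" sentinel for 16-bit option fields, and that value reads as -2 when interpreted as a signed 16-bit integer, which is most likely what the dump shows after srun discards --ntasks-per-node. A small illustration (the helper function is mine):

```python
NO_VAL16 = 0xFFFE  # Slurm's "unset" sentinel for 16-bit option values

def as_signed16(value):
    """Reinterpret a 16-bit unsigned value as signed two's complement."""
    return value - 0x10000 if value >= 0x8000 else value

print(as_signed16(NO_VAL16))  # -> -2, matching "ntasks-per-node : -2"
```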
Comment 28 Chrysovalantis Paschoulas 2020-03-26 02:38:50 MDT
(In reply to Nate Rini from comment #24)
> There currently appear to be 2 bugs:
> 
> 1. srun is not parsing "--ntasks-per-node=4": 
> #(In reply to Chrysovalantis Paschoulas from comment #6)
> > For srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname:
> > srun: ntasks-per-node     : -2
> This is clearly wrong but it ends up with the correct number of tasks run
> (by coincidence).
> 
> I have replicated this issue locally and will work on a patch.
> 
> 2. Something is interfering with "--ntasks-per-node=4" with batch jobs:
> The logs show that sbatch is correctly parsing the input. The logs provided
> show that the controller gets the correct request:
> > num_tasks=2 ntasks_per_node=4
> Then for reasons unclear, the job is incorrectly started by slurmstepd.
> 
> I suspect #2 is caused by a local plugin or change as I don't see it with my
> test system. Can you please provide a copy of your lua job_submit script and
> spank plugins?

The SPANK plugins just export some env vars and don't touch those parameters. The lua submission filter also checks/handles other submission parameters like GRES, etc., and it doesn't touch num_tasks and ntasks_per_node.

Given the source code, could you please tell me what the intended/correct behavior should be? Should num_tasks always win over ntasks_per_node in both cases? So is the second example, with sbatch, the wrong one?
Comment 29 Nate Rini 2020-03-26 10:57:05 MDT
(In reply to Chrysovalantis Paschoulas from comment #28)
> The SPANK plugins just export some env vars and they don't touch those
> parameters. The lua submission filter also checks/handles other submission
> parameters like GRES, etc and it doesn't touch the num_tasks and
> ntasks_per_node.
> 
> According to the source code, could you plz tell me what should be the
> intended/correct behavior?
With #1, since the number of nodes is 1, the ntasks-per-node is ignored as redundant. There used to be a warning for this, and I'm looking at making it at least a debug message.

As for #2:

> In both cases num_tasks should always win over
> ntasks_per_node? So the second example with sbatch is the wrong one?

The man page (https://slurm.schedmd.com/srun.html) explains the expected behavior:
> --ntasks-per-node=<ntasks>
> Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node.
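As an illustration only (not Slurm source), the precedence documented in the man page can be modeled like this, where the function name and simplifications are mine:

```python
def resolve_task_count(ntasks, ntasks_per_node, nnodes):
    """Sketch of the documented precedence: --ntasks wins, and
    --ntasks-per-node acts only as a per-node maximum."""
    if ntasks is not None:
        if ntasks_per_node is not None:
            # per-node cap: never run more than ntasks_per_node * nnodes
            return min(ntasks, ntasks_per_node * nnodes)
        return ntasks
    if ntasks_per_node is not None:
        return ntasks_per_node * nnodes
    return nnodes  # default: one task per node

# The ticket's reproducer: -N 1 -n2 --ntasks-per-node=4
print(resolve_task_count(2, 4, 1))  # -> 2, the documented behavior
```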

(In reply to Nate Rini from comment #23)
> > Job 141749:
> > $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
> > Submitted batch job 141749
> > j3c011
> > j3c011
> > j3c011
> > j3c011
From the logs:
> > num_tasks=2 ntasks_per_node=4 ntasks_per_socket=-1 ntasks_per_core=-1 Nodes=1-[1]
ntasks_per_node should act as a per-node limit, with num_tasks being the number of tasks executed. There should only be 2x "j3c011" from this command. For reasons unknown, my test system setup with the same config is not replicating the issue.

Can you please call this job again but with this command:
> $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun --slurmd-debug=debug5 hostname"
Please also attach the slurmd logs from the execution node.

Can you please also verify that the `slurmd -V` version on the node is the same as on the controller.
Comment 30 Chrysovalantis Paschoulas 2020-03-27 08:05:55 MDT
I am really sorry; I tested it in another setup and couldn't reproduce it.

I am pretty sure this has something to do with our setup/environment.

Please close this ticket!

Thank you,
Valantis
Comment 31 Nate Rini 2020-03-27 09:56:35 MDT
(In reply to Chrysovalantis Paschoulas from comment #30)
> I am pretty sure this has something to do with our setup/environment.

If this becomes an issue again, please respond to this ticket and we can continue our debugging efforts.

> Please close this ticket!

Closing ticket.

--Nate