Description
Chrysovalantis Paschoulas
2019-12-18 09:24:41 MST
(In reply to Chrysovalantis Paschoulas from comment #0)
> For sbatch --ntasks-per-node wins over --ntasks.
> For srun --ntasks wins over --ntasks-per-node.

Can you please provide an example of this? Do you mean with respect to using environment variables?

(In reply to Nate Rini from comment #2)
> (In reply to Chrysovalantis Paschoulas from comment #0)
> > For sbatch --ntasks-per-node wins over --ntasks.

For example:
```
$ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
Submitted batch job 121583
$ cat slurm-121583.out
j3c011
j3c011
j3c011
j3c011
```

> > For srun --ntasks wins over --ntasks-per-node.

For example:
```
$ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
j3c011
j3c011
```

> Can you please provide an example of this? Do you mean with respect to using
> environment variables?

No, this has nothing to do with env vars. I only mention it because this issue/regression was introduced after a fix for the SLURM_NTASKS_PER_NODE env var.

(In reply to Chrysovalantis Paschoulas from comment #3)
> $ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"

Can you please call these 2 commands instead and provide the output?

> $ srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> $ sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"

I called both commands locally and got a different result than in comment #3. Please also provide a copy of your slurm.conf.

> > Can you please provide an example of this? Do you mean with respect to using
> > environment variables?
>
> No, this has nothing to do with env vars.

Thanks for the clarification.
(In reply to Nate Rini from comment #5)
> > $ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
> > $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
>
> Can you please call these 2 commands instead and provide the output?
>
> > $ srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname

For `srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname`:
```
srun: defined options
srun: -------------------- --------------------
srun: account         : root
srun: licenses        : project@just,scratch@just,home@just
srun: nodes           : 1
srun: ntasks          : 2
srun: ntasks-per-node : -2
srun: partition       : batch
srun: verbose         : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=46187
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 39812
srun: debug:  Entering _msg_thr_internal
srun: debug:  Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: debug:  Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec
srun: debug:  Waited 0.300000 sec and still waiting: next sleep for 0.300000 sec
srun: debug:  Waited 0.600000 sec and still waiting: next sleep for 0.400000 sec
srun: debug:  Waited 1.000000 sec and still waiting: next sleep for 0.500000 sec
srun: debug:  Waited 1.500000 sec and still waiting: next sleep for 0.600000 sec
srun: debug:  Waited 2.100000 sec and still waiting: next sleep for 0.700000 sec
srun: debug:  Waited 2.800000 sec and still waiting: next sleep for 0.800000 sec
srun: debug:  Waited 3.600000 sec and still waiting: next sleep for 0.900000 sec
srun: debug:  Waited 4.500000 sec and still waiting: next sleep for 1.000000 sec
srun: Nodes j3c011 are ready for job
srun: jobid 121858: nodes(1):`j3c011', cpu counts: 32(x1)
srun: debug2: creating job with 2 tasks
srun: debug:  requesting job 121858, user 9226, nodes 1 including ((null))
srun: debug:  cpus 2, tasks 2, name hostname, relative 65534
srun: debug2: spank: noturbospank.so: local_user_init = 0
srun: debug2: spank: x11spank.so: local_user_init = 0
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/pspmi
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 33446
srun: debug:  Started IO server thread (47915426354944)
srun: debug:  Entering _launch_tasks
srun: launching 121858.0 on host j3c011, 2 tasks: [0-1]
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 1
srun: debug2: Activity on IO listening socket 15
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving  io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving  io_init_msg_validate
srun: debug2: Validated IO connection from 192.168.12.31, node rank 0, sd=17
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.12.31:40758 16
srun: debug2: received task launch
srun: Node j3c011, 2 tasks started
srun: debug2: slurm_send_timeout: Socket no longer there
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
j3c011
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
j3c011
srun: debug2: Leaving  _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 192.168.12.31:40760 16
srun: debug2: received task exit
srun: Received task exit notification for 2 tasks of step 121858.0 (status=0x0000).
srun: j3c011: tasks 0-1: Completed
srun: debug:  task 0 done
srun: debug:  task 1 done
srun: debug2: false, shutdown
srun: debug2: false, shutdown
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: false, shutdown
srun: debug:  IO thread exiting
srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
srun: debug2: false, shutdown
srun: debug:  Leaving _msg_thr_internal
```

> > $ sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"

For `sbatch -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"`:
```
STDERR:
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account         : root
sbatch: licenses        : project@just,scratch@just,home@just
sbatch: nodes           : 1
sbatch: ntasks          : 2
sbatch: ntasks-per-node : 4
sbatch: partition       : batch
sbatch: verbose         : 3
sbatch: wrap            : srun hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: psgw_spank.so: init_post_opt = 0
sbatch: debug2: spank: noturbospank.so: init_post_opt = 0
sbatch: debug2: spank: perfparanoidspank.so: init_post_opt = 0
sbatch: debug2: spank: x11spank.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0022
sbatch: debug:  Munge authentication plugin loaded
sbatch: Cray/Aries node selection plugin loaded
Submitted batch job 121859
sbatch: debug2: spank: noturbospank.so: exit = 0
sbatch: debug2: spank: x11spank.so: exit = 0

STDOUT:
j3c011
j3c011
j3c011
j3c011
```

> I called both commands locally and got a different result than in
> comment #3. Please also provide a copy of your slurm.conf.

Which config parameters do you care about from slurm.conf?

(In reply to Chrysovalantis Paschoulas from comment #6)
> Which config parameters do you care about from slurm.conf?

In general, the whole slurm.conf file, along with gres.conf and cgroup.conf if they exist. That avoids me having to repeatedly request different parameters as I test, which slows down the time to solution. At the very least, please provide the following:

> SelectType
> TaskPlugin
> Priority*
> SchedulerParameters
> SchedulerType
> MpiDefault
> GresTypes
> AccountingStorageEnforce

I don't need any of the filesystem paths, ports or user names. The goal is to get my local Slurm install to duplicate your issue.

Created attachment 12713 [details]
Slurm config
Created attachment 12714 [details]
Gres config file
I attached our slurm.conf and gres.conf.

(In reply to Chrysovalantis Paschoulas from comment #10)
> I attached our slurm.conf and gres.conf.

Tagged both files as private. I will try to replicate your issue.

(In reply to Chrysovalantis Paschoulas from comment #10)
> I attached our slurm.conf and gres.conf.

Which plugins are being loaded via plugstack.conf?

> Gres=mem112:no_consume:1

Is there a reason to use Gres instead of a node feature flag for this?

Timing this ticket out. Please respond to have it reopened automatically.

(In reply to Nate Rini from comment #15)
> Timing this ticket out. Please respond to have it reopened automatically.

Please reopen. Our plugins on one of the big production clusters are:
```
# cat /etc/slurm/plugstack.conf
required /opt/parastation/lib64/slurm/psgw_spank.so
required globresspank.so
required showglobresspank.so
```
The memXXX GRESes are needed for various reasons; we also have extra features for the memory size of each node type.

(In reply to Chrysovalantis Paschoulas from comment #16)
> Please reopen.

Done

> # cat /etc/slurm/plugstack.conf
> required /opt/parastation/lib64/slurm/psgw_spank.so

Is this https://github.com/ParaStation ?

> required globresspank.so
> required showglobresspank.so

Is this a local plugin?

> The memXXX GRESes are needed for various reasons; we also have extra
> features for the memory size of each node type.

Do you see the inconsistency without these spank plugins?

(In reply to Nate Rini from comment #17)
> Is this a local plugin?
> Do you see the inconsistency without these spank plugins?

Yes, these are local plugins (on a small test cluster we have more) that export some env vars, modify the licenses, and talk to other daemons, but I believe none of them has anything to do with the current issue. They shouldn't affect the behavior of sbatch and srun in such a way. Did you try to reproduce the behaviour I described on a test system?
Debugging job 141748
Logs for job 141748
Created attachment 13441 [details]
Debugging job 141749
Logs for job 141749
(In reply to Nate Rini from comment #19)
> Please make sure to submit the test jobs as noted in comment #0. Please
> attach logs once done.

I have created an attachment for each of the 2 jobs.

Job 141748:
```
$ srun -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname
j3c011
j3c011
```

Job 141749:
```
$ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
Submitted batch job 141749
$ cat slurm-141749.out
j3c011
j3c011
j3c011
j3c011
```

Best Regards,
Valantis

There currently appear to be 2 bugs:

1. srun is not parsing "--ntasks-per-node=4":

(In reply to Chrysovalantis Paschoulas from comment #6)
> For srun -vvv -N 1 -n2 --ntasks-per-node=4 -A root -p batch hostname:
> srun: ntasks-per-node : -2

This is clearly wrong, but it ends up running the correct number of tasks (by coincidence). I have replicated this issue locally and will work on a patch.

2. Something is interfering with "--ntasks-per-node=4" with batch jobs:

The logs show that sbatch is correctly parsing the input and that the controller gets the correct request:
> num_tasks=2 ntasks_per_node=4

Then, for reasons unclear, the job is incorrectly started by slurmstepd. I suspect #2 is caused by a local plugin or change, as I don't see it with my test system.
Can you please provide a copy of your lua job_submit script and spank plugins?

(In reply to Nate Rini from comment #24)
> 1. srun is not parsing "--ntasks-per-node=4":
> > srun: ntasks-per-node : -2
> This is clearly wrong, but it ends up running the correct number of tasks
> (by coincidence). I have replicated this issue locally and will work on a
> patch.

Looking at the code, this behavior is intentional, since ntasks is less than ntasks-per-node. (https://github.com/SchedMD/slurm/commit/daacf5afee9)

(In reply to Nate Rini from comment #24)
> I suspect #2 is caused by a local plugin or change, as I don't see it with
> my test system. Can you please provide a copy of your lua job_submit script
> and spank plugins?

The SPANK plugins just export some env vars; they don't touch those parameters. The lua submission filter also checks and handles other submission parameters like GRES, etc., but it doesn't touch num_tasks and ntasks_per_node.

According to the source code, could you please tell me what the intended/correct behavior should be? Should num_tasks always win over ntasks_per_node in both cases? So the second example, with sbatch, is the wrong one?

(In reply to Chrysovalantis Paschoulas from comment #28)
> According to the source code, could you please tell me what the
> intended/correct behavior should be?

With #1, since the number of nodes is 1, ntasks-per-node is ignored as redundant. There used to be a warning for this, and I'm looking at making it at least a debug message.

As for #2:

> Should num_tasks always win over ntasks_per_node in both cases? So the
> second example, with sbatch, is the wrong one?

The man page (https://slurm.schedmd.com/srun.html) explains the expected behavior:

> --ntasks-per-node=<ntasks>
>     Request that ntasks be invoked on each node. If used with the --ntasks
>     option, the --ntasks option will take precedence and the
>     --ntasks-per-node will be treated as a maximum count of tasks per node.

(In reply to Nate Rini from comment #23)
> > Job 141749:
> > $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun hostname"
> > Submitted batch job 141749
> > j3c011
> > j3c011
> > j3c011
> > j3c011

From the logs:

> num_tasks=2 ntasks_per_node=4 ntasks_per_socket=-1 ntasks_per_core=-1 Nodes=1-[1]

ntasks_per_node should act as a limit per node, with num_tasks being the number of tasks executed. There should only be 2x "j3c011" from this command. For reasons unknown, my test system with the same config is not replicating the issue.

Can you please call this job again, but with this command:

> $ sbatch -N 1 -n2 --ntasks-per-node=4 -A root -p batch --wrap="srun --slurmd-debug=debug5 hostname"

Please also attach the slurmd logs from the execution node. Can you please also verify that the `slurmd -V` version on the node is the same as the controller's?
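The precedence the man page describes can be sketched as a small helper. This is illustrative only, not Slurm source; `expected_task_count` is a hypothetical name introduced here to make the rule concrete:

```python
def expected_task_count(nodes, ntasks=None, ntasks_per_node=None):
    """Total tasks a job should launch under the documented precedence:
    --ntasks wins; --ntasks-per-node then acts only as a per-node cap."""
    if ntasks is not None:
        # --ntasks takes precedence; ntasks-per-node only bounds it.
        if ntasks_per_node is not None and ntasks > nodes * ntasks_per_node:
            raise ValueError("ntasks exceeds nodes * ntasks-per-node cap")
        return ntasks
    if ntasks_per_node is not None:
        return nodes * ntasks_per_node
    return nodes  # default: one task per node

# The commands in this ticket use -N 1 -n2 --ntasks-per-node=4, so both
# srun and sbatch should launch 2 tasks under this rule.
assert expected_task_count(nodes=1, ntasks=2, ntasks_per_node=4) == 2
```

By this rule the srun output (2x "j3c011") is correct, and the sbatch job printing "j3c011" four times is the anomalous case.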
I am really sorry, I tested it in another setup and I couldn't reproduce. I am pretty sure this has something to do with our setup/environment. Please close this ticket!

Thank you,
Valantis

(In reply to Chrysovalantis Paschoulas from comment #30)
> I am pretty sure this has something to do with our setup/environment.

If this becomes an issue again, please respond to this ticket and we can continue our debugging efforts.

> Please close this ticket!

Closing ticket.

--Nate