Hello,

If a user specifies a range of nodes, e.g., sbatch -N 200-600, and then combines that with --ntasks-per-node, slurm assumes the ntasks should be the minimum number of nodes multiplied by the minimum number of nodes. This prevents srun from operating without some extra work for the user:

dmj@nid00021:~> sinfo --format="%b %t %D" | grep haswell | grep idle
haswell idle 52
dmj@nid00021:~> sbatch -N 49 --wrap "sleep 1000" -C haswell -q regular
Submitted batch job 1004507
dmj@nid00021:~> sinfo --format="%b %t %D" | grep haswell | grep idle
haswell idle 3
dmj@nid00021:~> salloc -C haswell -N 2-4 --ntasks-per-node=24
salloc: Granted job allocation 1004508
salloc: Waiting for resource configuration
salloc: Nodes nid00[445-447] are ready for job
dmj@nid00445:~> srun hostname | wc -l
srun: Warning: can't honor --ntasks-per-node set to 24 which doesn't match the requested tasks 48 with the number of requested nodes 3. Ignoring --ntasks-per-node.
48
dmj@nid00445:~> echo $SLURM_NTASKS
48
dmj@nid00445:~> echo $SLURM_NPROCS
48
dmj@nid00445:~> unset SLURM_NTASKS
dmj@nid00445:~> unset SLURM_NPROCS
dmj@nid00445:~> srun hostname | wc -l
72
dmj@nid00445:~> exit
salloc: Relinquishing job allocation 1004508
dmj@nid00021:~>

In this case SLURM_NTASKS and SLURM_NPROCS are incorrect and need to be unset for srun to work. Tracing the code a bit, it looks like job_ptr->details->ntasks needs to be updated by the job scheduler at job launch time if ntasks_per_node is specified. Another thing to check with the final resolution of this bug is that SLURM_TASKS_PER_NODE is then set correctly (which I think it will be once ntasks is updated). Note this issue is present in slurm 17.11 as well.

Thanks,
Doug
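The arithmetic behind the transcript above can be illustrated with plain shell. This is a hypothetical sketch of the logic, not Slurm source code; the variable names are made up for illustration.

```shell
# Hypothetical sketch (not Slurm source): how the task count in the
# transcript above goes wrong. Request: salloc -N 2-4 --ntasks-per-node=24;
# the scheduler actually grants 3 nodes (nid00[445-447]).
min_nodes=2
granted_nodes=3
ntasks_per_node=24

# Bug: ntasks is fixed at submit time from the *minimum* of the node range.
buggy_ntasks=$((min_nodes * ntasks_per_node))
# Expected: ntasks recomputed from the nodes actually granted.
correct_ntasks=$((granted_nodes * ntasks_per_node))

echo "SLURM_NTASKS as exported: $buggy_ntasks"   # the 48 seen above
echo "tasks actually launched:  $correct_ntasks" # the 72 seen above
```

This is why unsetting SLURM_NTASKS/SLURM_NPROCS in the transcript lets srun fall back to the allocation's real size and launch 72 tasks.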
Doug,

We have confirmed the issue and are working on determining how best to correct it.

--Nate
I just re-read my opening line of the bug and I miswrote; it should have read:

If a user specifies a range of nodes, e.g., sbatch -N 200-600, and then combines that with --ntasks-per-node, slurm assumes the ntasks should be the minimum number of nodes multiplied by the specified ntasks-per-node.

I suspect you understood though since the bug was confirmed =)
(In reply to Doug Jacobsen from comment #6)
> I suspect you understood though since the bug was confirmed =)

Can you provide the value of SLURM_TASKS_PER_NODE from inside of the job?
Doug,

Here is the patch correcting the issue:
https://github.com/SchedMD/slurm/commit/459cfa6fa916f21bdd0020e4dceda4b3053bd98e

It should be in the 18.08.4 tagged release.

Thanks,
--Nate
This patch introduces more serious problems. I've re-opened the ticket and upgraded it to sev-3. Logs from regression tests 15.24 and 38.4 using the current head of the slurm-18.08 branch are appended.

jette@jette:~/Desktop/SLURM/slurm.git/testsuite/expect$ ./test38.4
============================================
TEST: 38.4
################################################################
Salloc packjob and verify output from scontrol show job
################################################################
spawn /home/jette/Desktop/SLURM/install.cray/bin/salloc --cpus-per-task=4 --mem-per-cpu=10 --ntasks=1 -t1 : --cpus-per-task=2 --mem-per-cpu=2 --ntasks=1 -t1 : --cpus-per-task=1 --mem-per-cpu=6 --ntasks=1 -t1 env
salloc: Pending job allocation 340540
salloc: job 340540 queued and waiting for resources
salloc: job 340540 has been allocated resources
salloc: Granted job allocation 340540
CLUTTER_IM_MODULE=xim XDG_MENU_PREFIX=gnome- LANG=en_US.UTF-8 MANAGERPID=1676 DISPLAY=:0 OLDPWD=/home/jette/Desktop/SLURM/install.cray/bin INVOCATION_ID=b6af48bb86154929a1e493c1360c2ab5 UNITY_DEFAULT_PROFILE=unity COMPIZ_CONFIG_PROFILE=ubuntu GTK2_MODULES=overlay-scrollbar GTK_CSD=0 COLORTERM=truecolor USERNAME=jette SSH_AUTH_SOCK=/run/user/1001/keyring/ssh MANDATORY_PATH=/usr/share/gconf/unity.mandatory.path USER=jette DESKTOP_SESSION=unity QT4_IM_MODULE=xim TEXTDOMAINDIR=/usr/share/locale/ GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/4f921601_f088_4895_9eee_8b1574ee67a5 DEFAULTS_PATH=/usr/share/gconf/unity.default.path PWD=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect HOME=/home/jette JOURNAL_STREAM=9:32287 TEXTDOMAIN=im-config QT_ACCESSIBILITY=1 XDG_SESSION_TYPE=x11 COMPIZ_BIN_PATH=/usr/bin/ XDG_DATA_DIRS=/usr/share/unity:/usr/local/share:/usr/share:/var/lib/snapd/desktop:/var/lib/snapd/desktop XDG_SESSION_DESKTOP=unity SSH_AGENT_LAUNCHER=gnome-keyring GTK_MODULES=gail:atk-bridge:unity-gtk-module WINDOWPATH=2 GNOME_SESSION_XDG_SESSION_PATH= TERM=xterm-256color
SHELL=/bin/bash VTE_VERSION=5202 QT_IM_MODULE=ibus XMODIFIERS=@im=ibus IM_CONFIG_PHASE=2 XDG_CURRENT_DESKTOP=Unity:Unity7:ubuntu GPG_AGENT_INFO=/run/user/1001/gnupg/S.gpg-agent:0:1 GNOME_TERMINAL_SERVICE=:1.105 UNITY_HAS_3D_SUPPORT=true SHLVL=2 GDMSESSION=unity GNOME_DESKTOP_SESSION_ID=this-is-deprecated LOGNAME=jette DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus XDG_RUNTIME_DIR=/run/user/1001 XAUTHORITY=/run/user/1001/gdm/Xauthority XDG_CONFIG_DIRS=/etc/xdg/xdg-unity:/etc/xdg PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:. LD_PRELOAD=libgtk3-nocsd.so.0 SESSION_MANAGER=local/jette:@/tmp/.ICE-unix/1954,unix/jette:/tmp/.ICE-unix/1954 GTK_IM_MODULE=ibus _=./test38.4 SLURM_SUBMIT_DIR=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect SLURM_SUBMIT_HOST=jette SLURM_PACK_SIZE=3 SLURM_JOB_ID=340540 SLURM_JOB_ID_PACK_GROUP_0=340540 SLURM_JOB_NAME_PACK_GROUP_0=sh SLURM_JOB_NUM_NODES_PACK_GROUP_0=1 SLURM_JOB_NODELIST_PACK_GROUP_0=nid00001 SLURM_NODE_ALIASES_PACK_GROUP_0=(null) SLURM_JOB_PARTITION_PACK_GROUP_0=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0=4 SLURM_MEM_PER_CPU_PACK_GROUP_0=10 SLURM_JOBID_PACK_GROUP_0=340540 SLURM_NNODES_PACK_GROUP_0=1 SLURM_NODELIST_PACK_GROUP_0=nid00001 SLURM_TASKS_PER_NODE_PACK_GROUP_0=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_0=test SLURM_JOB_QOS_PACK_GROUP_0=normal SLURM_NTASKS_PACK_GROUP_0=65534 SLURM_NPROCS_PACK_GROUP_0=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_0=4 SLURM_JOB_ID_PACK_GROUP_1=340541 SLURM_JOB_NAME_PACK_GROUP_1=sh SLURM_JOB_NUM_NODES_PACK_GROUP_1=1 SLURM_JOB_NODELIST_PACK_GROUP_1=nid00004 SLURM_NODE_ALIASES_PACK_GROUP_1=(null) SLURM_JOB_PARTITION_PACK_GROUP_1=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_1=2 SLURM_MEM_PER_CPU_PACK_GROUP_1=2 SLURM_JOBID_PACK_GROUP_1=340541 SLURM_NNODES_PACK_GROUP_1=1 SLURM_NODELIST_PACK_GROUP_1=nid00004 SLURM_TASKS_PER_NODE_PACK_GROUP_1=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_1=test SLURM_JOB_QOS_PACK_GROUP_1=normal SLURM_NTASKS_PACK_GROUP_1=65534 
SLURM_NPROCS_PACK_GROUP_1=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_1=2 SLURM_JOB_ID_PACK_GROUP_2=340542 SLURM_JOB_NAME_PACK_GROUP_2=sh SLURM_JOB_NUM_NODES_PACK_GROUP_2=1 SLURM_JOB_NODELIST_PACK_GROUP_2=nid00002 SLURM_NODE_ALIASES_PACK_GROUP_2=(null) SLURM_JOB_PARTITION_PACK_GROUP_2=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_2=1 SLURM_MEM_PER_CPU_PACK_GROUP_2=6 SLURM_JOBID_PACK_GROUP_2=340542 SLURM_NNODES_PACK_GROUP_2=1 SLURM_NODELIST_PACK_GROUP_2=nid00002 SLURM_TASKS_PER_NODE_PACK_GROUP_2=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_2=test SLURM_JOB_QOS_PACK_GROUP_2=normal SLURM_NTASKS_PACK_GROUP_2=65534 SLURM_NPROCS_PACK_GROUP_2=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_2=1 SLURM_CLUSTER_NAME=cray
salloc: Relinquishing job allocation 340540
Job 340540 is DONE (COMPLETED)
spawn cat test38.4.out
salloc: Pending job allocation 340540
salloc: job 340540 queued and waiting for resources
salloc: job 340540 has been allocated resources
salloc: Granted job allocation 340540
CLUTTER_IM_MODULE=xim XDG_MENU_PREFIX=gnome- LANG=en_US.UTF-8 MANAGERPID=1676 DISPLAY=:0 OLDPWD=/home/jette/Desktop/SLURM/install.cray/bin INVOCATION_ID=b6af48bb86154929a1e493c1360c2ab5 UNITY_DEFAULT_PROFILE=unity COMPIZ_CONFIG_PROFILE=ubuntu GTK2_MODULES=overlay-scrollbar GTK_CSD=0 COLORTERM=truecolor USERNAME=jette SSH_AUTH_SOCK=/run/user/1001/keyring/ssh MANDATORY_PATH=/usr/share/gconf/unity.mandatory.path USER=jette DESKTOP_SESSION=unity QT4_IM_MODULE=xim TEXTDOMAINDIR=/usr/share/locale/ GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/4f921601_f088_4895_9eee_8b1574ee67a5 DEFAULTS_PATH=/usr/share/gconf/unity.default.path PWD=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect HOME=/home/jette JOURNAL_STREAM=9:32287 TEXTDOMAIN=im-config QT_ACCESSIBILITY=1 XDG_SESSION_TYPE=x11 COMPIZ_BIN_PATH=/usr/bin/ XDG_DATA_DIRS=/usr/share/unity:/usr/local/share:/usr/share:/var/lib/snapd/desktop:/var/lib/snapd/desktop XDG_SESSION_DESKTOP=unity SSH_AGENT_LAUNCHER=gnome-keyring GTK_MODULES=gail:atk-bridge:unity-gtk-module WINDOWPATH=2
GNOME_SESSION_XDG_SESSION_PATH= TERM=xterm-256color SHELL=/bin/bash VTE_VERSION=5202 QT_IM_MODULE=ibus XMODIFIERS=@im=ibus IM_CONFIG_PHASE=2 XDG_CURRENT_DESKTOP=Unity:Unity7:ubuntu GPG_AGENT_INFO=/run/user/1001/gnupg/S.gpg-agent:0:1 GNOME_TERMINAL_SERVICE=:1.105 UNITY_HAS_3D_SUPPORT=true SHLVL=2 GDMSESSION=unity GNOME_DESKTOP_SESSION_ID=this-is-deprecated LOGNAME=jette DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus XDG_RUNTIME_DIR=/run/user/1001 XAUTHORITY=/run/user/1001/gdm/Xauthority XDG_CONFIG_DIRS=/etc/xdg/xdg-unity:/etc/xdg PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:. LD_PRELOAD=libgtk3-nocsd.so.0 SESSION_MANAGER=local/jette:@/tmp/.ICE-unix/1954,unix/jette:/tmp/.ICE-unix/1954 GTK_IM_MODULE=ibus _=./test38.4 SLURM_SUBMIT_DIR=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect SLURM_SUBMIT_HOST=jette SLURM_PACK_SIZE=3 SLURM_JOB_ID=340540 SLURM_JOB_ID_PACK_GROUP_0=340540 SLURM_JOB_NAME_PACK_GROUP_0=sh SLURM_JOB_NUM_NODES_PACK_GROUP_0=1 SLURM_JOB_NODELIST_PACK_GROUP_0=nid00001 SLURM_NODE_ALIASES_PACK_GROUP_0=(null) SLURM_JOB_PARTITION_PACK_GROUP_0=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0=4 SLURM_MEM_PER_CPU_PACK_GROUP_0=10 SLURM_JOBID_PACK_GROUP_0=340540 SLURM_NNODES_PACK_GROUP_0=1 SLURM_NODELIST_PACK_GROUP_0=nid00001 SLURM_TASKS_PER_NODE_PACK_GROUP_0=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_0=test SLURM_JOB_QOS_PACK_GROUP_0=normal SLURM_NTASKS_PACK_GROUP_0=65534 SLURM_NPROCS_PACK_GROUP_0=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_0=4 SLURM_JOB_ID_PACK_GROUP_1=340541 SLURM_JOB_NAME_PACK_GROUP_1=sh SLURM_JOB_NUM_NODES_PACK_GROUP_1=1 SLURM_JOB_NODELIST_PACK_GROUP_1=nid00004 SLURM_NODE_ALIASES_PACK_GROUP_1=(null) SLURM_JOB_PARTITION_PACK_GROUP_1=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_1=2 SLURM_MEM_PER_CPU_PACK_GROUP_1=2 SLURM_JOBID_PACK_GROUP_1=340541 SLURM_NNODES_PACK_GROUP_1=1 SLURM_NODELIST_PACK_GROUP_1=nid00004 SLURM_TASKS_PER_NODE_PACK_GROUP_1=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_1=test 
SLURM_JOB_QOS_PACK_GROUP_1=normal SLURM_NTASKS_PACK_GROUP_1=65534 SLURM_NPROCS_PACK_GROUP_1=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_1=2 SLURM_JOB_ID_PACK_GROUP_2=340542 SLURM_JOB_NAME_PACK_GROUP_2=sh SLURM_JOB_NUM_NODES_PACK_GROUP_2=1 SLURM_JOB_NODELIST_PACK_GROUP_2=nid00002 SLURM_NODE_ALIASES_PACK_GROUP_2=(null) SLURM_JOB_PARTITION_PACK_GROUP_2=debug SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_2=1 SLURM_MEM_PER_CPU_PACK_GROUP_2=6 SLURM_JOBID_PACK_GROUP_2=340542 SLURM_NNODES_PACK_GROUP_2=1 SLURM_NODELIST_PACK_GROUP_2=nid00002 SLURM_TASKS_PER_NODE_PACK_GROUP_2=65534 SLURM_JOB_ACCOUNT_PACK_GROUP_2=test SLURM_JOB_QOS_PACK_GROUP_2=normal SLURM_NTASKS_PACK_GROUP_2=65534 SLURM_NPROCS_PACK_GROUP_2=65534 SLURM_CPUS_PER_TASK_PACK_GROUP_2=1 SLURM_CLUSTER_NAME=cray
salloc: Relinquishing job allocation 340540
FAILURE: output of env incorrect
matches: SLURM_NTASKS_PACK_GROUP_0=1 (0 != 1)

jette@jette:~/Desktop/SLURM/slurm.git/testsuite/expect$ ./test15.24
============================================
TEST: 15.24
spawn /home/jette/Desktop/SLURM/install.cray/bin/salloc --ntasks=10 --overcommit -N1 -t1 ./test15.24.input
salloc: Granted job allocation 340543
OLDPWD=/home/jette/Desktop/SLURM/install.cray/bin PWD=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect SLURM_CLUSTER_NAME=cray SLURM_JOB_ACCOUNT=test SLURM_JOB_CPUS_PER_NODE=1 SLURM_JOB_ID=340543 SLURM_JOBID=340543 SLURM_JOB_NAME=test15.24.input SLURM_JOB_NODELIST=nid00001 SLURM_JOB_NUM_NODES=1 SLURM_JOB_PARTITION=debug SLURM_JOB_QOS=normal SLURM_MEM_PER_NODE=1000 SLURM_NNODES=1 SLURM_NODE_ALIASES=(null) SLURM_NODELIST=nid00001 SLURM_NPROCS=65534 SLURM_NTASKS=65534 SLURM_OVERCOMMIT=1 SLURM_SUBMIT_DIR=/home/jette/Desktop/SLURM/slurm.git/testsuite/expect SLURM_SUBMIT_HOST=jette SLURM_TASKS_PER_NODE=65534
srun: error: Unable to create step for job 340543: Task count specification invalid
salloc: Relinquishing job allocation 340543
FAILURE: Did not set desired allocation env vars
FAILURE: Did not get proper number of tasks: 10, 0
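For reference, the 65534 values in the dumps above look like an unset sentinel rather than a real task count: Slurm's slurm.h defines a 16-bit "not set" sentinel NO_VAL16 as 0xfffe (this mapping is my inference from the logs; the snippet below is just shell arithmetic, not Slurm code).

```shell
# 0xfffe is 65534, which matches SLURM_NTASKS=65534 etc. in the logs --
# consistent with a 16-bit "value not set" sentinel leaking into the
# job environment instead of a computed task count.
printf 'NO_VAL16 = %d\n' $((0xfffe))
```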
(In reply to Nate Rini from comment #28)
> Doug,
>
> Here is the patch correcting the issue:
> https://github.com/SchedMD/slurm/commit/459cfa6fa916f21bdd0020e4dceda4b3053bd98e
>
> It should be in 18.08.4 tagged release.
>
> Thanks
> --Nate

Doug,

Please see this patch that fixes a regression related to this bug:
https://github.com/SchedMD/slurm/commit/5272103c0f9860a358a6d75128fd32346aeed676

Thanks,
--Nate
It looks like this is busted again, at least in the master branch. Here are logs from a couple of the failing tests:

TEST: 1.59
spawn /home/jette/Desktop/SLURM/install.cray/bin/salloc -N3 -v -t2 ./test1.59.input
salloc: defined options for program `salloc'
...
salloc: Granted job allocation 344808
(%|#|$|]|[^>]>) *(|[^ ]* *)$/home/jette/Desktop/SLURM/install.cray/bin/srun -l -O printenv SLURMD_NODENAME
srun: error: invalid number of tasks (-n -2)
(%|#|$|]|[^>]>) *(|[^ ]* *)$
FAILURE: node names not set from previous srun
test1.59 FAILURE

TEST: 1.87
spawn /home/jette/Desktop/SLURM/install.cray/bin/salloc -N4 ./test1.87.input
salloc: Pending job allocation 344849
salloc: job 344849 queued and waiting for resources
salloc: job 344849 has been allocated resources
salloc: Granted job allocation 344849
QA_PROMPT: /home/jette/Desktop/SLURM/install.cray/bin/srun -l printenv SLURMD_NODENAME
srun: error: invalid number of tasks (-n -2)
QA_PROMPT:
FAILURE: Did not get hostname of task 0
FAILURE: Did not get hostname of task 1
FAILURE: Did not get hostname of task 2
FAILURE: Did not get hostname of task 3
test1.87 FAILURE
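The "-n -2" in these failures plausibly traces back to the same 0xfffe sentinel seen in the earlier env dumps: reinterpreted as a signed 16-bit two's-complement value, 0xfffe is -2. This connection is my inference from the logs, not something confirmed in the ticket; the snippet below is only illustrative arithmetic.

```shell
# 0xfffe as an unsigned 16-bit value is 65534; as a signed 16-bit
# (two's complement) value it is 65534 - 65536 = -2, matching the
# "invalid number of tasks (-n -2)" errors in the test logs.
echo $((0xfffe - 0x10000))
```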
Hi Doug,

The regression that Moe mentioned is being tracked through Bug 6008 and the patch there. I will go ahead and close this one again.
So, am I supposed to apply the patch that was provided, or not? I can't see bug 6008, so I won't be able to follow it.
Doug, sorry for all the noise; there is nothing for you to do at this point. All the comments on this bug from today should be ignored.
So my read of this is that 459cfa6fa916f21bdd0020e4dceda4b3053bd98e and then 5272103c0f9860a358a6d75128fd32346aeed676 can be applied to my 18.08 builds, and that there is a problem on master that I don't need to worry about. Is that right?
Correct, those two commits are the only things you have to worry about.