We have run into an issue with large jobs on our SGI UV. A user attempted to run a job with 512 processes, which tried to start with:

srun -n 512 --cpu_bind=cores /home/cuiz/421_dm/bin/cp.x

but ended up failing with:

slurmd[hecate]: execve(): /home/cuiz/421_dm/bin/cp.x: Argument list too long

(and many more such messages). After some debugging we found that the issue was with the SLURM_CPU_BIND_LIST and SLURM_CPU_BIND environment variables. Once they grow beyond MAX_ARG_STRLEN = 32*PAGESIZE = 32*4096 = 131072 bytes, execve starts failing. This is a limit imposed within the kernel and one that we cannot change without recompiling the kernel (something difficult/risky to do on this machine).

If you wonder why they grow so big, here is some sample output (masks abbreviated here; in reality each one is hundreds of hex zeros with a single bit set):

SLURM_CPU_BIND_LIST=0x0000...0004...0000,0x0000...0001...0000,0x0000...0004...0000,0x0000...0001...0000,...
SLURM_CPU_BIND=quiet,mask_cpu:0x0000...0004...0000,...

$ wc -m /proc/343511/environ
59155 /proc/343511/environ
$ strings /proc/343511/environ | grep ^SLURM_CPU_BIND= | wc -m
24798

And this is for a job that is using only 66 cores. I see in the code that there is currently no way to disable the setting of these variables. I know that one can unset some variables via TaskProlog, but that approach cannot work here, as the execve of the task prolog fails for the same reason... Suggestions?
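For anyone who wants to verify the kernel limit independently of Slurm, here is a minimal standalone reproducer (illustrative only; the buffer just needs to exceed 131072 bytes):

/* A single environment string longer than MAX_ARG_STRLEN
 * (32 * PAGESIZE = 131072 bytes on 4K-page kernels) makes
 * execve() fail with E2BIG ("Argument list too long"). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t len = 32 * 4096 + 8;          /* just over the limit */
    char *big = malloc(len + 1);
    memset(big, '0', len);
    memcpy(big, "BIG=", 4);              /* looks like a normal variable */
    big[len] = '\0';

    char *argv[] = { "/bin/true", NULL };
    char *envp[] = { big, NULL };
    execve("/bin/true", argv, envp);     /* only returns on failure */
    perror("execve");                    /* -> Argument list too long */
    return 1;
}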
Should we build a version of Slurm that comments out:

if (setenvf(&env->env, "SLURM_CPU_BIND_LIST", str_bind_list)) {
        error("Unable to set SLURM_CPU_BIND_LIST");
        rc = SLURM_FAILURE;
}
if (setenvf(&env->env, "SLURM_CPU_BIND", str_bind)) {
        error("Unable to set SLURM_CPU_BIND");
        rc = SLURM_FAILURE;
}

or do you have other suggestions/ideas? (And yes, we absolutely need to bind to CPUs or the performance is quite literally TERRIBLE.)
These bitmaps identify which CPUs are associated with each task/rank of the application, all in a single environment variable. That works fine for most systems, but not so well for nodes with huge CPU counts. The solution that comes to mind for me is to create a separate environment variable for each rank:

SLURM_CPU_BIND_RANK_1
SLURM_CPU_BIND_RANK_2
etc.

That means Slurm will need to set a large number of environment variables, but it's far more scalable. There are equivalent bitmaps for memory binding that would need the same sort of modification.
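A rough sketch of what setting the per-rank variables might look like, reusing the existing setenvf() helper (the rank_masks array and loop structure here are hypothetical):

/* Hypothetical sketch: one binding mask per task rank instead of one
 * giant comma-separated list.  Assumes rank_masks[] holds the hex
 * mask string for each of the ntasks ranks. */
for (int rank = 0; rank < ntasks; rank++) {
    char name[64];
    snprintf(name, sizeof(name), "SLURM_CPU_BIND_RANK_%d", rank);
    if (setenvf(&env->env, name, "%s", rank_masks[rank])) {
        error("Unable to set %s", name);
        rc = SLURM_FAILURE;
    }
}

Each variable then stays far below MAX_ARG_STRLEN no matter how many CPUs the node has.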
That sounds great.

If you don't have time to incorporate that solution in the next 14.* release, we'd also be happy with something like:

if ((strlen(str_bind_list) + strlen("SLURM_CPU_BIND_LIST=")) >= MAX_ARG_STRLEN) {
        info("SLURM_CPU_BIND_LIST too big, skipping adding to environment");
} else {
        /* define it */
}
...
(In reply to Josko Plazonic from comment #2)
> That sounds great.
>
> If you don't have time to incorporate that solution in the next 14.* release,
> we'd also be happy with something like:
>
> if ((strlen(str_bind_list) + strlen("SLURM_CPU_BIND_LIST=")) >= MAX_ARG_STRLEN) {
>         info("SLURM_CPU_BIND_LIST too big, skipping adding to environment");
> } else {
>         /* define it */
> }
> ...

I doubt we will have time to get the per-rank solution into the next release, but here is a patch along the lines you describe above that at least prevents the exec from failing. It will be in the next release.

https://github.com/SchedMD/slurm/commit/34a883e0557638774132cb68b54ab88f74ec7e79.patch
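The essence of the change is a length guard in setenvf(). A simplified sketch (see the linked commit for the exact code; MAX_ENV_STRLEN here stands in for the kernel's 32 * 4096 = 131072 byte limit):

/* Simplified sketch of the new guard in setenvf(): reject any
 * "name=value" string that would exceed the kernel's per-string
 * limit instead of exporting it. */
#define MAX_ENV_STRLEN (32 * 4096)

if ((strlen(name) + strlen(value) + 2) > MAX_ENV_STRLEN) {
        error("environment variable %s is too long", name);
        return ENOMEM;
}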
Almost. Please correct me if I am reading this wrong, but with this change setenvf will now return ENOMEM, which is non-zero, and therefore

if (setenvf(&env->env, "SLURM_CPU_BIND_LIST", str_bind_list)) {
        error("Unable to set SLURM_CPU_BIND_LIST");
        rc = SLURM_FAILURE;
}

will now be triggered and we'll get a SLURM_FAILURE. In our view this is a non-fatal error, at least for these variables, and we need the execution to continue...
(In reply to Josko Plazonic from comment #4)
> Almost. Please correct me if I am reading this wrong, but with this change
> setenvf will now return ENOMEM, which is non-zero, and therefore
>
> if (setenvf(&env->env, "SLURM_CPU_BIND_LIST", str_bind_list)) {
>         error("Unable to set SLURM_CPU_BIND_LIST");
>         rc = SLURM_FAILURE;
> }
>
> will now be triggered and we'll get a SLURM_FAILURE. In our view this is a
> non-fatal error, at least for these variables, and we need the execution to
> continue...

That code only logs the event, so the user sees it, but the job will still run (albeit without those specific environment variables).
Created attachment 738 [details]
Move size test earlier in setenvf or it doesn't work
We tested 14.03 today on our SGI UV and we still couldn't launch large-scale jobs (300+ CPUs); same error. It turns out the overflow test needs to move earlier, right at the start of setenvf. Once it is there, everything works: despite the warning, and exactly as you told us, jobs start and run. The attached patch is what we used.
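For context, the shape of the fix (illustrative only; it assumes setenvf() first expands its printf-style arguments into a buffer, and the exact code is in the attached patch):

#define MAX_ENV_STRLEN (32 * 4096)   /* stand-in for the kernel limit */

int setenvf(char ***envp, const char *name, const char *fmt, ...)
{
        char value[MAX_ENV_STRLEN * 2];
        va_list ap;

        va_start(ap, fmt);
        vsnprintf(value, sizeof(value), fmt, ap);
        va_end(ap);

        /* The oversize check must be the first thing done with the
         * expanded value, before "name=value" is built or exported;
         * otherwise an oversized string can still reach execve. */
        if ((strlen(name) + strlen(value) + 2) > MAX_ENV_STRLEN) {
                error("environment variable %s is too long", name);
                return ENOMEM;   /* callers log this but keep going */
        }

        /* ... build "name=value" and insert it into *envp ... */
        return 0;
}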
Thanks (hvala), Josko, this is the right fix I think. It will be in release 14.03.1 as commit 407a814b4b5. David