Ticket 644

Summary: On UV, SLURM_CPU_BIND_LIST and SLURM_CPU_BIND are too big
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: slurmd Assignee: David Bigagli <david>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da
Version: 2.6.6   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE)
Version Fixed: 14.03.1
Attachments: Move size test earlier in setenvf or it doesn't work

Description Josko Plazonic 2014-03-14 15:28:06 MDT
We have run into an issue with large jobs on our SGI UV.  A user attempted to run a job with 512 processes, started with:

srun -n 512 --cpu_bind=cores /home/cuiz/421_dm/bin/cp.x

but ended up failing with:

slurmd[hecate]: execve(): /home/cuiz/421_dm/bin/cp.x: Argument list too long

(and lots more of such messages).

After some debugging we found that the issue was with the SLURM_CPU_BIND_LIST and SLURM_CPU_BIND environment variables. Once they grow beyond MAX_ARG_STRLEN = 32*PAGESIZE = 32*4096 = 131072, execve starts failing. This is a limit imposed within the kernel, and one that we cannot change without recompiling the kernel (something difficult/risky to do on this machine).
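
For reference, here is a minimal standalone reproducer of the kernel limit (plain C, not Slurm code; the 32*4096 value assumes 4 KB pages).  A single environment string one byte past MAX_ARG_STRLEN is enough to make execve() fail with "Argument list too long":

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* MAX_ARG_STRLEN = 32 * PAGE_SIZE = 131072 bytes with 4 KB pages. */
        size_t limit = 32 * 4096;

        /* Build one "BIG=xxxx..." string a byte past the limit. */
        char *big = malloc(limit + 2);
        if (big == NULL)
                return 1;
        memset(big, 'x', limit + 1);
        memcpy(big, "BIG=", 4);
        big[limit + 1] = '\0';

        char *argv[] = { "/bin/true", NULL };
        char *envp[] = { big, NULL };
        execve("/bin/true", argv, envp);

        /* Only reached if execve() failed; errno should be E2BIG. */
        fprintf(stderr, "execve: %s\n", strerror(errno));
        return 1;
}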

If you wonder why they are growing so big, here is some sample output:
SLURM_CPU_BIND_LIST=0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,0x00000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000,0x00000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000,0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,........................
SLURM_CPU_BIND=quiet,mask_cpu:0x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,.....................

wc -m /proc/343511/environ 
59155 /proc/343511/environ

strings /proc/343511/environ | grep ^SLURM_CPU_BIND=|wc -m
24798

And this is for a job that is using only 66 cores. 

I see in the code that there is currently no way to disable setting these variables.  I know that one can unset some vars via TaskProlog, but that approach cannot work because the execve of the task prolog fails for the same reason...

Suggestions? Should we build a version of slurm that comments out:

                        if (setenvf(&env->env, "SLURM_CPU_BIND_LIST",
                                    str_bind_list)) {
                                error("Unable to set SLURM_CPU_BIND_LIST");
                                rc = SLURM_FAILURE;
                        }
                        if (setenvf(&env->env, "SLURM_CPU_BIND", str_bind)) {
                                error("Unable to set SLURM_CPU_BIND");
                                rc = SLURM_FAILURE;
                        }

or do you have other suggestions/ideas (and yes, we absolutely need to bind to CPUs or the performance is quite literally TERRIBLE).
Comment 1 Moe Jette 2014-03-15 01:27:41 MDT
These bitmaps identify which CPUs are associated with each task/rank of the application, all in a single environment variable. That works fine for most systems, but not so well for nodes with huge CPU counts. The solution that comes to mind for me is to create a separate environment variable for each rank. That means Slurm will need to set a large number of environment variables, but it's far more scalable.

SLURM_CPU_BIND_RANK_1
SLURM_CPU_BIND_RANK_2
etc.

There are also equivalent bitmaps for memory binding, which would need the same sort of modification.
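
A minimal sketch of what that per-rank scheme could look like, written with plain POSIX setenv() rather than Slurm internals (the task count and per-rank mask strings below are made up for illustration; the variable names follow the proposal above):

#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Fake per-rank CPU masks; in Slurm these would come from the
         * task affinity layer for each task on the node. */
        const char *mask[] = { "0x1", "0x2", "0x4", "0x8" };
        int ntasks = 4;

        for (int i = 0; i < ntasks; i++) {
                char name[64];
                snprintf(name, sizeof(name), "SLURM_CPU_BIND_RANK_%d", i);
                if (setenv(name, mask[i], 1) != 0)
                        perror(name);
        }
        /* Each rank's binding now lives in its own short variable instead
         * of one comma-separated list that grows with the CPU count. */
        return 0;
}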
Comment 2 Josko Plazonic 2014-03-16 12:06:42 MDT
That sounds great.

If you don't have time to incorporate that solution in the next 14.* release, we'd also be happy with something like:

if ((strlen(str_bind_list) + strlen("SLURM_CPU_BIND_LIST=")) >= MAX_ARG_STRLEN) {
        notice("SLURM_CPU_BIND_LIST too big, skipping adding to environment");
} else {
        // define it
}
...
Comment 3 Moe Jette 2014-03-16 13:13:50 MDT
(In reply to Josko Plazonic from comment #2)
> That sounds great.
> 
> If you don't have time to incorporate that solution in the next 14.* release
> we'd be also happy with something like:
> 
> if ((strlen(str_bind_list)+strlen("SLURM_CPU_BIND_LIST="))>=MAX_ARG_STRLEN))
> {
> notice("SLURM_CPU_BIND_LIST too big, skipping adding to environment");
> } else {
> // define it
> }
> ...

I doubt we will have time to get the per-rank variables into the next release, but here is a patch along the lines you describe above to at least prevent the exec from failing. It will be in the next release.

https://github.com/SchedMD/slurm/commit/34a883e0557638774132cb68b54ab88f74ec7e79.patch
Comment 4 Josko Plazonic 2014-03-16 15:02:30 MDT
Almost - please correct me if I am reading this wrong - but with this change setenvf will now return ENOMEM, which is non-zero, and therefore
                        if (setenvf(&env->env, "SLURM_CPU_BIND_LIST",
                                    str_bind_list)) {
                                error("Unable to set SLURM_CPU_BIND_LIST");
                                rc = SLURM_FAILURE;
                        } 
will be triggered and we'll get a SLURM_FAILURE.  In our view this is a non-fatal error, at least with these vars, and we need the execution to continue...
Comment 5 Moe Jette 2014-03-17 01:09:35 MDT
(In reply to Josko Plazonic from comment #4)
> Almost - please correct me if I am reading this wrong - but with this change
> setenvf will now return ENOMEM which is non zero and therefore
>                         if (setenvf(&env->env, "SLURM_CPU_BIND_LIST",
>                                     str_bind_list)) {
>                                 error("Unable to set SLURM_CPU_BIND_LIST");
>                                 rc = SLURM_FAILURE;
>                         } 
> will now be triggered and we'll have a SLURM_FAILURE.  In our view this is a
> non fatal error, at least with these vars, and we need the execution to
> continue...

This code only logs the event, so the user sees it, but the job will still run (albeit without specific environment variables).
Comment 6 Josko Plazonic 2014-04-08 13:25:07 MDT
Created attachment 738 [details]
Move size test earlier in setenvf or it doesn't work
Comment 7 Josko Plazonic 2014-04-08 13:27:34 MDT
We tested 14.03 today on our SGI UV and we still couldn't launch large-scale jobs (300+ CPUs).  Same error.  It turns out that we need to move the overflow test earlier - right at the start of setenvf.  Once it is there it works and, despite the warning, exactly as you told us, jobs start and work.  The attached patch is what we used.
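
For anyone else hitting this, here is a small self-contained sketch of the idea behind the attached patch (hypothetical names, not the actual Slurm source): the size test has to be the very first thing in setenvf(), before any formatting or allocation, so the oversized variable is simply skipped, the caller logs it, and the job still launches:

#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant; the value mirrors the kernel's MAX_ARG_STRLEN. */
#define ENV_MAX_STRLEN (32 * 4096)

/* Toy stand-in for Slurm's setenvf(); only the placement of the size
 * test matters here. */
static int setenvf_sketch(const char *name, const char *value)
{
        /* Reject an oversized "name=value" string before doing any work. */
        if (strlen(name) + strlen(value) + 2 > ENV_MAX_STRLEN)
                return ENOMEM;   /* caller logs "Unable to set ..." and continues */
        return setenv(name, value, 1);   /* toy body; the real setenvf edits env->env */
}

int main(void)
{
        static char big[ENV_MAX_STRLEN + 2];
        memset(big, 'x', sizeof(big) - 1);
        big[sizeof(big) - 1] = '\0';

        if (setenvf_sketch("SLURM_CPU_BIND_LIST", big) == ENOMEM)
                fprintf(stderr, "SLURM_CPU_BIND_LIST too long, not set\n");
        return 0;
}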
Comment 8 David Bigagli 2014-04-09 08:25:02 MDT
Thank you Jozko, this is the right fix I think. It will be in release 14.03.1,
commit 407a814b4b5.

David