Ticket 1159 - [818258] - after (aprun -B) patch installed - salloc -N1 only allocate 1PE/node
Summary: [818258] - after (aprun -B) patch installed - salloc -N1 only allocate 1PE/node
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cray ALPS
Version: 14.03.7
Hardware: Linux
Importance: 4 - Minor Issue
Assignee: Danny Auble
 
Reported: 2014-10-11 04:46 MDT by Jason Coverston
Modified: 2014-11-10 04:58 MST
CC: 2 users

See Also:
Site: CSCS - Swiss National Supercomputing Centre
Version Fixed: 14.03.10 14.11.0rc3


Attachments
Testing summary from customer (4.00 KB, text/plain)
2014-10-11 04:47 MDT, Jason Coverston
Details
attachment-7248-0.html (2.54 KB, text/html)
2014-10-11 05:04 MDT, Danny Auble
Details
spreadsheet showing differences from site. (13.32 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2014-10-14 18:34 MDT, Jason Coverston
Details
spreadsheet showing desired behavior. (14.96 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2014-10-30 09:44 MDT, Jason Coverston
Details
Patch to give ALPS ntasks-per-node if tasks are specified and ntasks-per-node isn't (752 bytes, patch)
2014-10-30 11:48 MDT, Danny Auble
Details | Diff
remove APRUN_DEFAULT_MEMORY from being set (3.00 KB, patch)
2014-10-30 11:55 MDT, Danny Auble
Details | Diff

Description Jason Coverston 2014-10-11 04:46:29 MDT
Created attachment 1319 [details]
attachment-7248-0.html

After the patch for bug #817268 was installed, "salloc -N1" (or -n1) no longer works.

On a system running Slurm v14.03.7 with this node definition in slurm.conf:

NodeName=DEFAULT Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32768 State=UNKNOWN

Launching "salloc -N1" should give me the entire node, and "aprun -n8" should work, but after the patch "salloc -N1" only gives 1 PE/node, which is wrong.

Summary of using salloc (with no options, with -N1, or with -n1) on 3 different systems:

1.) Slurm 2.5.4 running on an XE6 with 24 PEs:

salloc with no options allocates 24 PEs, so "aprun -n2 hostname" works
salloc -N1 or -n1: ALPS still allocates 24 PEs; no problem asking for 2 or more PEs to run on 1 node

2.) 14.03.7 with the patch, Haswell (48 CPUs/24 CUs):

salloc with no options allocates 1 PE, so "aprun -n2 hostname" fails with "claim exceeds reservation's node-count"
salloc -N1 or -n1: same result, "aprun -n2 hostname" fails with "claim exceeds reservation's node-count"

3.) 14.03.7 without the patch, Sandy Bridge (16 CPUs/8 CUs):

salloc with no options allocates 8 PEs at 8 PEs per node; "aprun -n2 hostname" works
salloc -N1 or -n1 also works

See attached file for details.
Comment 1 Jason Coverston 2014-10-11 04:47:27 MDT
Created attachment 1317 [details]
Testing summary from customer
Comment 2 Jason Coverston 2014-10-11 04:47:56 MDT
Note, with the patch installed you can get the desired behavior by specifying --ntasks-per-node.

# salloc -N1 will allocate 1 PE

On a 24 core system:

# salloc -N1 --ntasks-per-node=24 will allocate the entire node.
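Putting the workaround above into one session sketch (commands are taken from the ticket; the 24-core node size is the example's, so adjust for the actual hardware, and the apstat check is only a suggested way to confirm the reservation width):

```shell
# With the patch installed, request the full node explicitly.
# Assumes a 24-core node, as in the example above.
salloc -N1 --ntasks-per-node=24

# Inside the allocation, verify the ALPS reservation width;
# the PEs column should now show 24 rather than 1.
apstat -r

# Multi-PE launches should work again:
aprun -n2 hostname
```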
Comment 3 Jason Coverston 2014-10-11 04:48:14 MDT
From customer:

So this is new - we didn't have to specify anything in the past.
Have you seen the test case that exhibits the difference in behavior?
Comment 4 Jason Coverston 2014-10-11 04:48:33 MDT
Running "salloc" with no options on 2 different systems with the same CLE and Slurm versions; see the "apstat -r" output.

1.) without the patch: salloc with no options allocates 8 tasks/node

nina@santis01:~ $ salloc
salloc: Granted job allocation 64
salloc: Waiting for resource configuration
salloc: Nodes nid00014 are ready for job
nina@santis01:~ $ apstat -r
  ResId   ApId From     Arch PEs N d Memory State
  39603 140829 batch:64   XT   8 8 1      1 NID list,conf
nina@santis01:~ $ aprun -n2 hostname | wc -l
3

2.) with the patch: salloc with no options only gives 1 task/node, so aprun -n2 fails:

nina@brisi01:~ $ salloc
salloc: Granted job allocation 1150
salloc: Waiting for resource configuration
salloc: Nodes nid00057 are ready for job
nina@brisi01:~ $ apstat -r
  ResId   ApId From       Arch PEs N d Memory State
  38193 135232 batch:1150   XT   1 1 1  65536 NID list,conf
nina@brisi01:~ $ aprun -n2 hostname | wc -l
apsched: claim exceeds reservation's node-count
0

Can you pass this on to SchedMD?
Comment 5 Danny Auble 2014-10-11 05:04:27 MDT
Are you saying that without the option turned on you get different behaviour than before?  That seems very strange.  Without the option turned on there shouldn't be any code change.  This is the result I would expect if the option was turned on.

Comment 6 Jason Coverston 2014-10-11 05:09:23 MDT
(In reply to Danny Auble from comment #5)
> Are you saying without the option turned on you get different behaviour than
> before?  That seems very strange.  Without the option turned on there
> shouldn't be any code change.  This is the result I would expect of the
> option was turned on. 

Hi Danny,

No, they turned off the option in the cray.conf file on one system and the one use case of "salloc -N1" now works as before (they get the entire node). The other use case of course then fails with the option turned off:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1

aprun -B hostname
Comment 7 Jason Coverston 2014-10-14 18:33:09 MDT
There is definitely a difference between "salloc -N1" and "sbatch --nodes=1"; see the attached spreadsheet for more details.

- with " salloc -N1"

nina@brisi01:~/rep-slurm $ salloc -N1
salloc: Granted job allocation 1286
salloc: Waiting for resource configuration
salloc: Nodes nid00057 are ready for job
nina@brisi01:~/rep-slurm $ aprun -n 2 hostname
apsched: claim exceeds reservation's node-count



- with sbatch " #SBATCH --nodes=1"
jobscript:

 #!/bin/bash -l
 #SBATCH --nodes=1
  aprun -n 2 hostname

nina@brisi01:~ $ sbatch nn
Submitted batch job 1285
nina@brisi01:~ $ more slurm-1285.out

nid00057
nid00057
Application 135541 resources: utime ~0s, stime ~0s, Rss ~5900, inblocks ~32, outblocks ~28
Comment 8 Jason Coverston 2014-10-14 18:34:18 MDT
Created attachment 1336 [details]
spreadsheet showing differences from site.
Comment 9 Danny Auble 2014-10-16 05:54:11 MDT
Jason, I have made another commit (edae4a811c) that should give them what they are looking for.  I was able to reproduce all the scenarios in your spreadsheet; at least, they appear correct now.  Please reopen if you find anything out of the ordinary.  If all is well we would like to tag a 14.03.9 with all these patches, so if you could test and report ASAP that would be great :).
Comment 10 Danny Auble 2014-10-16 06:03:47 MDT
Ignore that last commit; it wasn't pushed and has since changed.  The correct commit is 2c95e2d22f3e6.
Comment 11 Jason Coverston 2014-10-18 01:30:03 MDT
Hi Danny.

Thanks for the patch! All the salloc behavior looks correct!! But there is some curious behavior when using sbatch, which appears to be ALPS related.

Here is my full update that went into the Cray bug. I am asking Jim, an ALPS developer, to comment. Let me know if you have any ideas as well.

Thanks again!

Jason

With sbatch:

(Jim, please take a look at items 4, 5, and 6 below, thanks).

snake-p1:/tmp # cat sbatch.sc
#!/bin/sh
apstat -r
aprun -n1 -N1 hostname
aprun -n2 hostname
aprun -B hostname
snake-p1:/tmp #

1) sbatch

snake-p1:/tmp # /opt/slurm/default/bin/sbatch sbatch.sc

snake-p1:/tmp # cat slurm-17.out
No resource reservations are present
nid00036
Application 9465 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~24, outblocks ~28
nid00036
nid00036
Application 9466 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~32, outblocks ~28
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 9467 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~272, outblocks ~28
snake-p1:/tmp #

2) sbatch -N1

snake-p1:/tmp # /opt/slurm/default/bin/sbatch -N1 sbatch.sc

snake-p1:/tmp # cat slurm-19.out
  ResId ApId From     Arch PEs N d Memory State
  27807 9471 batch:19   XT  32 0 1    256 NID list,conf
nid00036
Application 9472 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~24, outblocks ~28
nid00036
nid00036
Application 9473 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~32, outblocks ~28
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 9474 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~272, outblocks ~28
snake-p1:/tmp #

3) sbatch --ntasks=1

snake-p1:/tmp # /opt/slurm/default/bin/sbatch --ntasks=1 sbatch.sc

snake-p1:/tmp # cat slurm-20.out
  ResId ApId From     Arch PEs N d Memory State
  27809 9479 batch:20   XT  32 0 1    256 NID list,conf
nid00036
Application 9476 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~24, outblocks ~28
nid00036
nid00036
Application 9477 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~32, outblocks ~28
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 9478 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~272, outblocks ~28
snake-p1:/tmp #

4) sbatch --ntasks=1 --ntasks-per-node=1

*** this is different; notice that aprun -n2 succeeds even though it should fail.

snake-p1:/tmp # /opt/slurm/default/bin/sbatch --ntasks=1 --ntasks-per-node=1 sbatch.sc

snake-p1:/tmp # cat slurm-23.out
  ResId ApId From     Arch PEs N d Memory State
  27812 9486 batch:23   XT   1 1 1   8192 NID list,conf
nid00036
Application 9487 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~24, outblocks ~28
nid00036
nid00036
Application 9488 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~32, outblocks ~28
nid00036
Application 9489 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~24, outblocks ~28
snake-p1:/tmp #

More info with salloc:

snake-p1:/tmp # /opt/slurm/default/bin/salloc --ntasks=1 --ntasks-per-node=1
salloc: Granted job allocation 28
snake-p1:/tmp # apstat -rvvv
  ResId ApId From     Arch PEs N d Memory State
  27818 9498 batch:28   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9498, pagg 0x1d00000018, resId 27818, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 28
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:47:06 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
snake-p1:/tmp #
snake-p1:/tmp # aprun -n2 hostname
apsched: claim exceeds reservation's node-count

And with sbatch:

snake-p1:/tmp # cat slurm-31.out
  ResId ApId From     Arch PEs N d Memory State
  27821 9503 batch:31   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9503, pagg 0x1d00000024, resId 27821, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 31
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:50:01 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
nid00036
nid00036
Application 9504 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~32, outblocks ~28
snake-p1:/tmp #

They are the exact same reservation. I am not sure why apsched is allowing the aprun inside of the sbatch job to run when resFullNode is not set:

snake-p1:/tmp # cat /etc/opt/cray/alps/alps.conf | grep resFull
#	resFullNode	1
snake-p1:/tmp #

Further, I can use the entire node:

snake-p1:/tmp # cat slurm-34.out
  ResId ApId From     Arch PEs N d Memory State
  27825 9507 batch:34   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9507, pagg 0x1d00000026, resId 27825, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 34
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:54:10 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 9508 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~272, outblocks ~28
snake-p1:/tmp #

But not with salloc:

snake-p1:/tmp # /opt/slurm/default/bin/salloc --ntasks=1 --ntasks-per-node=1
salloc: Granted job allocation 35
snake-p1:/tmp #
snake-p1:/tmp # apstat -rvvv
  ResId ApId From     Arch PEs N d Memory State
  27826 9509 batch:35   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9509, pagg 0x1d00000025, resId 27826, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 35
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:56:38 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
snake-p1:/tmp #
snake-p1:/tmp # aprun -n32 hostname
apsched: claim exceeds reservation's node-count
snake-p1:/tmp #
snake-p1:/tmp # apstat -rvvv
  ResId ApId From     Arch PEs N d Memory State
  27826 9509 batch:35   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9509, pagg 0x1d00000025, resId 27826, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 35
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:56:38 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
snake-p1:/tmp #

Adding Jim for comment.

5) sbatch --nodes=1 --ntasks=1 --ntasks-per-node=1 

Same thing with these options. Jim, look at item 5 in comment 13.

snake-p1:/tmp # /opt/slurm/default/bin/sbatch --nodes=1 --ntasks=1 --ntasks-per-node=1 sbatch.sc

snake-p1:/tmp # cat slurm-36.out
  ResId ApId From     Arch PEs N d Memory State
  27827 9510 batch:36   XT   1 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9510, pagg 0x1d00000027, resId 27827, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 36
  Reservation flags = 0x840000
  Created at Sat Oct 18 07:59:19 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 1 -N 1 -j 0, 8192MB, XT, nodes 1, shared
  Reservation list entries: 1
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
nid00036
Application 9511 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~24, outblocks ~28
nid00036
nid00036
Application 9512 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~32, outblocks ~28
nid00036
Application 9513 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~24, outblocks ~28
snake-p1:/tmp #

6) sbatch --ntasks=2 --ntasks-per-node=1

snake-p1:/tmp # /opt/slurm/default/bin/sbatch --ntasks=2 --ntasks-per-node=1 sbatch.sc

snake-p1:/tmp # cat slurm-38.out
  ResId ApId From     Arch PEs N d Memory State
  27830 9517 batch:38   XT   2 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9517, pagg 0x1d00000029, resId 27830, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 38
  Reservation flags = 0x840000
  Created at Sat Oct 18 08:02:14 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 2 -N 1 -j 0, 8192MB, XT, nodes 2, shared
  Reservation list entries: 2
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
    PE 1, cmd 0, nid 37, CPU 0x1, map 0x3, accels 0
nid00036
Application 9518 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~24, outblocks ~28
nid00037
nid00037
Application 9519 resources: utime ~0s, stime ~0s, Rss ~3808, inblocks ~32, outblocks ~28
nid00036
nid00037
Application 9520 resources: utime ~0s, stime ~0s, Rss ~3916, inblocks ~48, outblocks ~56
snake-p1:/tmp #

Trying aprun -n32:

snake-p1:/tmp # cat slurm-39.out
  ResId ApId From     Arch PEs N d Memory State
  27831 9521 batch:39   XT   2 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9521, pagg 0x1d0000002a, resId 27831, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 39
  Reservation flags = 0x840000
  Created at Sat Oct 18 08:03:51 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 2 -N 1 -j 0, 8192MB, XT, nodes 2, shared
  Reservation list entries: 2
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
    PE 1, cmd 0, nid 37, CPU 0x1, map 0x3, accels 0
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 9522 resources: utime ~0s, stime ~1s, Rss ~3916, inblocks ~272, outblocks ~28
snake-p1:/tmp #

That should fail as it does with salloc:

snake-p1:/tmp # /opt/slurm/default/bin/salloc --ntasks=2 --ntasks-per-node=1
salloc: Granted job allocation 40
snake-p1:/tmp #
snake-p1:/tmp # apstat -rvvv
  ResId ApId From     Arch PEs N d Memory State
  27832 9523 batch:40   XT   2 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 9523, pagg 0x1d0000002b, resId 27832, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 40
  Reservation flags = 0x840000
  Created at Sat Oct 18 08:05:03 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 2 -N 1 -j 0, 8192MB, XT, nodes 2, shared
  Reservation list entries: 2
    PE 0, cmd 0, nid 36, CPU 0x1, map 0x3, accels 0
    PE 1, cmd 0, nid 37, CPU 0x1, map 0x3, accels 0
snake-p1:/tmp #
snake-p1:/tmp # aprun -n32 hostname
apsched: claim exceeds reservation's node-count
snake-p1:/tmp #
Comment 12 Jason Coverston 2014-10-30 04:52:44 MDT
Hi Danny,

Ok, we have identified the issue:

> salloc was setting APRUN_DEFAULT_MEMORY=8192 in the shell that it
> spawned; that's why salloc and
> sbatch behave differently.  When I 'unset APRUN_DEFAULT_MEMORY'
> in the salloc shell, salloc and sbatch worked identically.

Can you comment on why this is different?

Reopening.
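For reference, the workaround quoted above can be sketched as a session like the following (a hypothetical illustration; APRUN_DEFAULT_MEMORY is the variable named in the quoted analysis, and the 8192 value is the one reported there):

```shell
# Inside an salloc shell on the affected system, check whether
# salloc exported the variable (reported as 8192 on the patched system):
echo $APRUN_DEFAULT_MEMORY

# Clearing it makes aprun behave the same under salloc and sbatch,
# per the analysis quoted above:
unset APRUN_DEFAULT_MEMORY
aprun -n2 hostname
```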
Comment 13 Danny Auble 2014-10-30 06:02:22 MDT
It appears this happened in commit 1ac178c4d708, which was in 2.5.5.  Before that, CRAY_AUTO_APRUN_OPTIONS was set with -m$value.  But it only appears to be set in salloc (or if they use srun), not in sbatch.  The old CRAY_AUTO_APRUN_OPTIONS option is still used in sbatch (a bug).  I can easily change that though.

Just to be clear, you like the way salloc works, correct?  I could fix that as I stated (or they could use srun instead of aprun, which would fix it as well :)).

If you like the way sbatch works, then the question is: if we unset APRUN_DEFAULT_MEMORY, does the correct thing happen or not?
Comment 14 Jason Coverston 2014-10-30 09:10:56 MDT
(In reply to Danny Auble from comment #13)
> It appears this happened in commit 1ac178c4d708 when was in 2.5.5.  It
> appears before that CRAY_AUTO_APRUN_OPTIONS was set with -m$value.  But it
> only appears to be set in salloc (or if they use srun) and not in sbatch. 
> The old CRAY_AUTO_APRUN_OPTIONS option is used in sbatch still (bug).  I can
> easily change that though.
> 
> Just to be clear you like the way salloc works correct?  I could fix that as
> I stated (or they could you srun instead of aprun which would fix it as well
> :)).
> 
> If you like the way sbatch works then the question is if we unset the
> APRUN_DEFAULT_MEMORY does the correct thing happen or not?

Hi Danny,

Ok, I think I finally have this straightened out. 

1) We, Cray, are requesting that the APRUN_DEFAULT_MEMORY env variable no longer be set by the salloc command. It causes some undesirable effects when the ALPS resFullNode feature is enabled; that feature was designed so users can place applications of varying shapes within their reservation with minimum hassle.

2) We found another issue with this patch. Requesting --ntasks=1 allocates the entire node. The customer wants only 1 PE (or whatever they ask for) reserved with ALPS when only specifying --ntasks on the command line. 

From the BUG:

The customer is expecting the following aprun -B command to return 1 PE. Right now it's allocating the entire node. 

snake-p1:~ # /opt/slurm/default/bin/salloc --ntasks=1
salloc: Granted job allocation 85
snake-p1:~ #
snake-p1:~ # apstat -r
  ResId  ApId From     Arch PEs N d Memory State
  28416 10157 batch:85   XT  32 0 1    256 NID list,conf
snake-p1:~ #

snake-p1:~ # aprun -B hostname
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
nid00036
Application 10159 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~272, outblocks ~28
snake-p1:~ #
Comment 15 Danny Auble 2014-10-30 09:37:08 MDT
So removing APRUN_DEFAULT_MEMORY will not cause other issues?  I can easily take it out.  I thought it was needed though when doing gang scheduling.  Otherwise couldn't the user use more memory than was granted them?

On the -n1 front, according to the spreadsheet don't they want aprun -B hostname to return 24?  The title of the bug seems to state they don't want what you said in comment 14.

I am sort of confused on what they want.

I can reproduce all that is in the spreadsheet with current code.  It sounds like they want things different though.
Comment 16 Jason Coverston 2014-10-30 09:43:57 MDT
(In reply to Danny Auble from comment #15)
> So removing APRUN_DEFAULT_MEMORY will not cause other issues?  I can easily
> take it out.  I thought it was needed though when doing gang scheduling. 
> Otherwise couldn't the user use more memory than was granted them?
> 

I checked with the ALPS developer who did gang scheduling. But I will double check again. I am worried what issues this may cause by removing it...



> On the -n1 front, according to the spreadsheet don't they want aprun -B
> hostname to return 24?  The title of the bug seems to state they don't want
> what you said in comment 14.
> 
> I am sort of confused on what they want.
> 
> I can reproduce all that is in the spreadsheet with current code.  It sounds
> like they want things different though.

I am confused as well. It's taking a lot of time to just get everything straight.

The title of the BUG is -N1, nodes. They do want 24 returned with aprun -B for this case. But for -n1 (tasks) they only want 1PE.

I have an updated spreadsheet. Uploading now.
Comment 17 Jason Coverston 2014-10-30 09:44:44 MDT
Created attachment 1388 [details]
spreadsheet showing desired behavior.
Comment 18 Jason Coverston 2014-10-30 09:52:46 MDT
> 
> The title of the BUG is -N1, nodes. They do want 24 returned with aprun -B
> for this case. But for -n1 (tasks) they only want 1PE.
> 
> I have an updated spreadsheet. Uploading now.

And for the case where no options are requested they want the entire node.

If you have looked at the spreadsheet you will notice that for the ntasks case the customer's expectation is now different than what they were used to with 2.5.4.
Comment 19 Danny Auble 2014-10-30 09:56:27 MDT
(In reply to Jason Coverston from comment #18)
> > 
> > The title of the BUG is -N1, nodes. They do want 24 returned with aprun -B
> > for this case. But for -n1 (tasks) they only want 1PE.
> > 
> > I have an updated spreadsheet. Uploading now.
> 
> And for the case where no options are requested they want the entire node.
> 
> If you have looked at the spreadsheet you will notice that for the ntasks
> case the customer's expectation is now different than what they were used to
> with 2.5.4.

I was just going to point that out as well :).

I would try to point out to them (again) if they use srun all of these should work as expected with no code change.

I am scared to take away the APRUN_DEFAULT_MEMORY env var, but can, I just don't want to have another bug open that says "Hey we need APRUN_DEFAULT_MEMORY set now".  You know what I mean?
Comment 20 Jason Coverston 2014-10-30 10:13:52 MDT
(In reply to Danny Auble from comment #19)
> (In reply to Jason Coverston from comment #18)
> > > 
> > > The title of the BUG is -N1, nodes. They do want 24 returned with aprun -B
> > > for this case. But for -n1 (tasks) they only want 1PE.
> > > 
> > > I have an updated spreadsheet. Uploading now.
> > 
> > And for the case where no options are requested they want the entire node.
> > 
> > If you have looked at the spreadsheet you will notice that for the ntasks
> > case the customer's expectation is now different than what they were used to
> > with 2.5.4.
> 
> I was just going to point that out as well :).
> 
> I would try to point out to them (again) if they use srun all of these
> should work as expected with no code change.

I asked this again. This is what I got back this morning from the Cray onsite analyst:

I have been told by (customer) a number of times that using "srun" is not an option for (the customer). Migrating users to srun takes time and needs to be planned in advance.


> 
> I am scared to take away the APRUN_DEFAULT_MEMORY env var, but can, I just
> don't want to have another bug open that says "Hey we need
> APRUN_DEFAULT_MEMORY set now".  You know what I mean?


I know exactly what you mean!! Lets hold off until I hear back again from the ALPS developer regarding the removal of the env variable.

His opinion before was that he had no opinion. Either remove it or add it to sbatch. I'm of the opinion to not have it added since I think things like this should be explicitly requested by a user or added by a queue or something.

One way to solve this so it won't break anything would be to add it to sbatch, so salloc and sbatch are the same. Then make it configurable in the cray.conf file. Or, even better, is there some predefined env list, like --export=<environment variables>? Kind of like the pbs_environment file?
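As a sketch of that last idea (assuming nothing about whether the ALPS select plugin would honor it), sbatch already accepts an explicit environment list via --export, so a site or user could control the variable per job; the jobscript name here is a placeholder:

```shell
# Propagate the caller's environment plus an explicit value, so
# salloc- and sbatch-launched jobs see the same setting:
sbatch --export=ALL,APRUN_DEFAULT_MEMORY=8192 jobscript.sh

# Or propagate everything except site defaults, leaving the
# variable unset inside the job:
sbatch --export=ALL jobscript.sh
```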
Comment 21 Jason Coverston 2014-10-30 11:13:07 MDT
> 
> > 
> > I am scared to take away the APRUN_DEFAULT_MEMORY env var, but can, I just
> > don't want to have another bug open that says "Hey we need
> > APRUN_DEFAULT_MEMORY set now".  You know what I mean?
> 
> 
> I know exactly what you mean!! Lets hold off until I hear back again from
> the ALPS developer regarding the removal of the env variable.
> 
> His opinion before was that he had no opinion. Either remove it or add it to
> sbatch. I'm of the opinion to not have it added since I think things like
> this should be explicitly requested by a user or added by a queue or
> something.
> 
> One way to solve this so it won't break anything would be to add it to
> sbatch. So salloc and sbatch are the same. Then make it configurable in the
> cray.conf file. Or, even better, is there some predefined env list, like
> --export=<environment variables ? Kind of like the pbs_environment file?

From the ALPS team:

This may have been an issue with the old gang scheduling code,
but shouldn't be an issue with suspend/resume; the defaults will
give an app a reasonable amount of memory.

This should not be an issue since gang scheduling is no longer being used ...
Comment 22 Danny Auble 2014-10-30 11:18:57 MDT
(In reply to Jason Coverston from comment #21)
> > 
> > > 
> > > I am scared to take away the APRUN_DEFAULT_MEMORY env var, but can, I just
> > > don't want to have another bug open that says "Hey we need
> > > APRUN_DEFAULT_MEMORY set now".  You know what I mean?
> > 
> > 
> > I know exactly what you mean!! Lets hold off until I hear back again from
> > the ALPS developer regarding the removal of the env variable.
> > 
> > His opinion before was that he had no opinion. Either remove it or add it to
> > sbatch. I'm of the opinion to not have it added since I think things like
> > this should be explicitly requested by a user or added by a queue or
> > something.
> > 
> > One way to solve this so it won't break anything would be to add it to
> > sbatch. So salloc and sbatch are the same. Then make it configurable in the
> > cray.conf file. Or, even better, is there some predefined env list, like
> > --export=<environment variables ? Kind of like the pbs_environment file?
> 
> From the ALPS team:
> 
> This may have been an issue with the old gang scheduling code,
> but shouldn't be an issue with suspend/resume; the defaults will
> give an app a reasonable amount of memory.
> 
> This should not be an issue since gang scheduling is no longer be used ...

I will take it out then.  But I will not be happy if it breaks something the ALPS team had forgotten :(.  Since it didn't work with sbatch, perhaps no real issue will come of it.
Comment 23 Danny Auble 2014-10-30 11:48:44 MDT
Created attachment 1389 [details]
Patch to give ALPS ntasks-per-node if tasks are specified and ntasks-per-node isn't

(In reply to Jason Coverston from comment #20)
> (In reply to Danny Auble from comment #19)
> > (In reply to Jason Coverston from comment #18)
> > > > 
> > > > The title of the BUG is -N1, nodes. They do want 24 returned with aprun -B
> > > > for this case. But for -n1 (tasks) they only want 1PE.
> > > > 
> > > > I have an updated spreadsheet. Uploading now.
> > > 
> > > And for the case where no options are requested they want the entire node.
> > > 
> > > If you have looked at the spreadsheet you will notice that for the ntasks
> > > case the customer's expectation is now different than what they were used to
> > > with 2.5.4.
> > 
> > I was just going to point that out as well :).

It turns out commit 2c95e2d22f3 actually took this kind of functionality away.  I have modified the original code there to give more of what I think they expect.

Ask them to try the attached patch and see if it gives them what they want.  If so, I will commit it.  In my limited testing it appears to do what the new spreadsheet wants.
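
For anyone following along, here is a rough shell sketch of how I read the patch's intent. The real change is in Slurm's C code; the function name and the round-up spreading below are my assumptions for illustration, not taken from the diff:

```shell
# Hypothetical sketch only: if --ntasks was given but --ntasks-per-node
# was not, derive a per-node task count to hand to the BASIL reservation,
# so aprun -B sees a sensible -N value.
derive_ntasks_per_node() {
    ntasks=$1           # --ntasks value, empty string if unset
    ntasks_per_node=$2  # --ntasks-per-node value, empty string if unset
    nnodes=${3:-1}      # number of nodes in the allocation
    if [ -n "$ntasks" ] && [ -z "$ntasks_per_node" ]; then
        # spread the tasks across the allocated nodes, rounding up
        echo $(( (ntasks + nnodes - 1) / nnodes ))
    else
        echo "$ntasks_per_node"
    fi
}
```

For example, `derive_ntasks_per_node 2 "" 2` yields 1, and an explicit `--ntasks-per-node` value is passed through unchanged.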

> > 
> > I would try to point out to them (again) if they use srun all of these
> > should work as expected with no code change.
> 
> I asked this again. This is what I got back this morning from the Cray
> onsite analyst:
> 
> I have been told by (customer) a number of times that using "srun" is not an
> option for (the customer). To migrate users to srun takes time and needs to
> be planned in advance.

That is sad, as life would be so much easier for them if they actually used all of Slurm instead of just part of it.  They need to start this kind of transition, though, as we really want people to begin running natively, and we can foresee a time when the ALPS interface will no longer be supported.
Comment 24 Danny Auble 2014-10-30 11:55:33 MDT
Created attachment 1390 [details]
remove APRUN_DEFAULT_MEMORY from being set

This patch stops APRUN_DEFAULT_MEMORY from being set.  If it does what is expected, I will commit it.  Please test it out.
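
If some workflow does turn out to need the variable after this patch, a site can opt back in per job rather than having Slurm force it. A minimal sketch, assuming the 8192 MB value shown by apstat elsewhere in this ticket:

```shell
# Hypothetical opt-in replacement for the removed behavior: export the
# variable yourself in the batch script before calling aprun.
# The 8192 (MB) value is illustrative only.
export APRUN_DEFAULT_MEMORY=8192
printenv APRUN_DEFAULT_MEMORY   # confirm it is visible to child processes
```

This keeps the choice with the user (or a queue's prolog) instead of baking it into salloc/sbatch.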
Comment 25 Jason Coverston 2014-10-30 12:00:14 MDT
> That is sad as life would be so much easier for them if they actually used
> all of Slurm instead of just part of it.  They need to start this kind of
> transition though as we really want people to begin running native and can
> see a time in the future when the ALPS interface will not be supported.

I think we can start this conversation once we get through acceptance.

I'll test this out now.
Comment 26 Jason Coverston 2014-10-30 12:32:29 MDT
Hi Danny,

Thanks!! So far this looks great!

I updated the Cray BUG. Should we let this digest with them tomorrow before you integrate the changes?

Here is my update:

SchedMD provided me with an initial diff to test. So far this looks good. Please confirm before I tell them it's ok to integrate it.

1) salloc --ntasks=2 --ntasks-per-node=1

snake-p1:~ # /opt/slurm/default/bin/salloc --ntasks=2 --ntasks-per-node=1
salloc: Granted job allocation 98
snake-p1:~ # printenv | grep APRUN_DEFAULT_MEMORY
snake-p1:~ #
snake-p1:~ # apstat -rvv
  ResId  ApId From     Arch PEs N d Memory State
  28430 10186 batch:98   XT   2 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 10186, pagg 0x1d00000018, resId 28430, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 98
  Reservation flags = 0x940000
  Created at Thu Oct 30 19:11:36 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 2 -N 1 -j 0, 8192MB, XT, nodes 2, exclusive
  Reservation list entries: 2
  Reservation list: 36-37
snake-p1:~ #
snake-p1:~ # aprun -n1 hostname
nid00036
Application 10187 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~24, outblocks ~28
snake-p1:~ #
snake-p1:~ # aprun -n2 hostname
nid00036
nid00036
Application 10188 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~32, outblocks ~28

** this is correct because resFullNode is set within ALPS, and the default behavior of ALPS is to pack as many PEs onto a node as possible unless otherwise specified. If you want your application placed as requested with salloc (--ntasks-per-node=1), then either use -N1 on the aprun line or use aprun -B:

snake-p1:~ # aprun -B hostname
nid00036
nid00037
Application 10189 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~48, outblocks ~56
snake-p1:~ #
snake-p1:~ #


2) sbatch --ntasks=2 --ntasks-per-node=1 sbatch.sc

snake-p1:/tmp # cat sbatch.sc
#!/bin/bash

sleep 1
printenv | grep APRUN_DEFAULT_MEMORY
apstat -rvv

aprun -n1 hostname
aprun -n2 hostname
aprun -B hostname

snake-p1:/tmp # /opt/slurm/default/bin/sbatch --ntasks=2 --ntasks-per-node=1 sbatch.sc
Submitted batch job 2

snake-p1:/tmp # cat slurm-2.out
  ResId  ApId From    Arch PEs N d Memory State
  28432 10191 batch:2   XT   2 1 1   8192 NID list,conf

Reservation detail
Res[0]: apid 10191, pagg 0x1d00000023, resId 28432, user root, NID list,
       gid 0, account 0, time 0, normal
  Batch System ID = 2
  Reservation flags = 0x940000
  Created at Thu Oct 30 19:22:06 2014
  Number of commands 1, control network fanout 32
  Cmd[0]: BASIL -n 2 -N 1 -j 0, 8192MB, XT, nodes 2, exclusive
  Reservation list entries: 2
  Reservation list: 36-37
nid00036
Application 10192 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~24, outblocks ~28
nid00037
nid00037
Application 10193 resources: utime ~0s, stime ~0s, Rss ~3804, inblocks ~32, outblocks ~28
nid00036
nid00037
Application 10194 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~48, outblocks ~56
snake-p1:/tmp #

salloc and sbatch now behave the same.


Testing the spreadsheet items on lines 28 and 29:

3) salloc --ntasks=1

snake-p1:/tmp # /opt/slurm/default/bin/salloc --ntasks=1
salloc: Granted job allocation 4
snake-p1:/tmp #
snake-p1:/tmp # apstat -r
  ResId  ApId From    Arch PEs N d Memory State
  28434 10198 batch:4   XT   1 1 1   8192 NID list,conf
snake-p1:/tmp #
snake-p1:/tmp # aprun -n1 hostname
nid00036
Application 10199 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~24, outblocks ~28
snake-p1:/tmp #
snake-p1:/tmp # aprun -n2 hostname
nid00036
nid00036
Application 10200 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~32, outblocks ~28
snake-p1:/tmp #
snake-p1:/tmp # aprun -B hostname
nid00036
Application 10201 resources: utime ~0s, stime ~0s, Rss ~3816, inblocks ~24, outblocks ~28
snake-p1:/tmp #

Note: aprun -n2 does not fail as requested in the spreadsheet, but this is correct because resFullNode is set to yes. If the customer wants it to fail, they need to set resFullNode to no inside ALPS.
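
Restating the resFullNode behavior above as logic — a hedged sketch of the rule as described in this ticket, not ALPS source, and `launch_allowed` is a made-up name:

```shell
# Sketch of the described placement rule (not ALPS code):
#   resFullNode=yes -> a launch may use up to the full node width
#   resFullNode=no  -> a launch is capped at the reserved PE count
launch_allowed() {
    requested_pes=$1
    reserved_pes=$2
    node_width=$3      # cores per node, e.g. 24 on this system
    res_full_node=$4   # "yes" or "no"
    if [ "$res_full_node" = "yes" ]; then
        [ "$requested_pes" -le "$node_width" ]
    else
        [ "$requested_pes" -le "$reserved_pes" ]
    fi
}
```

Under this reading, `launch_allowed 2 1 24 yes` succeeds (the aprun -n2 case above inside a 1-PE reservation), while with resFullNode=no the same launch would be rejected, matching the spreadsheet's expectation.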
Comment 27 Jason Coverston 2014-10-31 03:56:32 MDT
Hi Danny,

All looks good. The customer has confirmed this as well!

Please commit and pass along the hashes so I can generate patches. 

Thanks!!

Jason
Comment 28 Danny Auble 2014-10-31 05:18:32 MDT
Excellent,

the hashes are

2e2de6a4d1d
18fb57b73ae

They will be in the next (most likely the last) 14.03 release.

If we were to tag a 14.03.10, would it be possible to install that instead of having them carry all these patches around?  At the bare minimum, they could upgrade to 14.03.9 with the few patches that 14.03.10 has in it.  I hope that if anything was learned here, it is that upgrading, or at least testing new versions, is a good way to prevent unwelcome changes from arriving all at once when they are forced to upgrade years after the fact.

Hopefully this bug can be closed now.
Comment 29 Jason Coverston 2014-10-31 08:30:43 MDT
(In reply to Danny Auble from comment #28)
> Excellent,
> 
> the hashes are
> 
> 2e2de6a4d1d
> 18fb57b73ae
> 
> They will be in the next (most likely the last) 14.03 release.
> 
> If we were to tag a 14.03.10, would it be possible to install that instead
> of having them carry all these patches around?  At the bare minimum, they
> could upgrade to 14.03.9 with the few patches that 14.03.10 has in it.  I
> hope that if anything was learned here, it is that upgrading, or at least
> testing new versions, is a good way to prevent unwelcome changes from
> arriving all at once when they are forced to upgrade years after the fact.
> 
> Hopefully this bug can be closed now.

FYI, one of the above wasn't the correct hash. Here are the correct ones:

2e2de6a4d1dda41ffc98a63d7216afe407ac296c
d62896db865bdf654bb544b128be313938cc1c75

I will push for an upgrade to 14.03.10 after we get this and [1148] sorted. And then breathe for a second :)
Comment 30 Jason Coverston 2014-11-10 04:58:32 MST
FYI, I have closed the Cray BUG as well.

Thanks for all your work!

Jason