Ticket 1366

Summary: Using nonzero gres count fails
Product: Slurm
Reporter: David Gloe <david.gloe>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
Severity: 2 - High Impact
CC: brian, da
Version: 15.08.x
Hardware: Linux
OS: Linux
Site: CRAY
Version Fixed: 15.08.0pre2

Description David Gloe 2015-01-14 07:09:30 MST
On the latest 15.08 code I can't specify a nonzero value for gres.

dgloe@opal-p2:~> srun -n 1 --gres=gpu:1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

dgloe@opal-p2:~> srun -n 1 --gres=craynetwork:1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

dgloe@opal-p2:~> grep Gres /etc/opt/slurm/slurm.conf
GresTypes=craynetwork,gpu
NodeName=nid000[24-27] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=32768 (set by FastSchedule 0)
NodeName=nid000[32-35] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu:1 # RealMemory=32768 (set by FastSchedule 0)
NodeName=nid000[20-23,40-43,48-51,56-59] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536 (set by FastSchedule 0)

The slurmctld log shows:
[2015-01-14T14:55:32.816] Invalid gres job specification gpu:1
[2015-01-14T14:55:32.816] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

[2015-01-14T14:55:58.931] Invalid gres job specification craynetwork:1
[2015-01-14T14:55:58.931] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
Comment 1 David Gloe 2015-01-14 07:10:40 MST
I also had a slurmctld core dump in gres code:

Core was generated by `/opt/slurm/15.08.0-1.0000.68b34ea.0.0.ari/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0  0x00007f177823f885 in raise () from /lib64/libc.so.6
(gdb) bt full
#0  0x00007f177823f885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f1778240e61 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000000004e66f6 in slurm_xmalloc (size=34359738352, clear=true, file=0x662e29 "gres.c", line=2906, func=0x662e28 "") at xmalloc.c:90
        new = 0x7fff6c5fed27
        p = 0x0
        total_size = 34359738368
#3  0x00000000005f3dec in gres_plugin_job_state_unpack (gres_list=0x7fff6c5fef08, buffer=0x89dce0, job_id=832, protocol_version=7424) at gres.c:2904
        i = 0
        rc = 0
        magic = 1133130964
        plugin_id = 4047587904
        utmp32 = 1
        rec_cnt = 0
        has_more = 1 '\001'
        gres_ptr = 0x4ffde4 <unpackstr_array+100>
        gres_job_ptr = 0x917010
        __PRETTY_FUNCTION__ = "gres_plugin_job_state_unpack"
#4  0x0000000000449634 in _load_job_state (buffer=0x89dce0, protocol_version=7424) at job_mgr.c:1411
...
(gdb) frame 3
#3  0x00000000005f3dec in gres_plugin_job_state_unpack (gres_list=0x7fff6c5fef08, buffer=0x89dce0, job_id=832, protocol_version=7424) at gres.c:2904
2904    gres.c: No such file or directory.
(gdb) print *gres_job_ptr
$1 = {type_model = 0x916fc0 "\001x00", gres_cnt_alloc = 4294967296, node_cnt = 4294967294, gres_bit_alloc = 0x0, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x0}
Comment 2 David Bigagli 2015-01-14 07:13:11 MST
The most recent code on master changes the gres count variables
from 32 to 64 bits (commit 166a4eb87418f0d). This could be the cause
of the failure. Let me investigate, but for now you can check out an
earlier commit.

David
Comment 3 David Gloe 2015-01-14 07:15:53 MST
The previous version I had running was from December 3, so any change since then could have caused this.
Comment 4 David Bigagli 2015-01-14 08:17:44 MST
David, the commit e32bf4dc671075a0 in 15.08.0pre2 should get you 
going. Please let me know if you still have problems.

David
Comment 5 David Bigagli 2015-01-20 04:30:42 MST
Please reopen if necessary.

David