| Summary: | Using nonzero gres count fails | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 15.08.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.0pre2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
David Gloe
2015-01-14 07:09:30 MST
I also had a slurmctld core dump in gres code:
Core was generated by `/opt/slurm/15.08.0-1.0000.68b34ea.0.0.ari/sbin/slurmctld'.
Program terminated with signal 6, Aborted.
#0 0x00007f177823f885 in raise () from /lib64/libc.so.6
(gdb) bt full
#0 0x00007f177823f885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f1778240e61 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00000000004e66f6 in slurm_xmalloc (size=34359738352, clear=true, file=0x662e29 "gres.c", line=2906, func=0x662e28 "") at xmalloc.c:90
new = 0x7fff6c5fed27
p = 0x0
total_size = 34359738368
#3 0x00000000005f3dec in gres_plugin_job_state_unpack (gres_list=0x7fff6c5fef08, buffer=0x89dce0, job_id=832, protocol_version=7424) at gres.c:2904
i = 0
rc = 0
magic = 1133130964
plugin_id = 4047587904
utmp32 = 1
rec_cnt = 0
has_more = 1 '\001'
gres_ptr = 0x4ffde4 <unpackstr_array+100>
gres_job_ptr = 0x917010
__PRETTY_FUNCTION__ = "gres_plugin_job_state_unpack"
#4 0x0000000000449634 in _load_job_state (buffer=0x89dce0, protocol_version=7424) at job_mgr.c:1411
...
(gdb) frame 3
#3 0x00000000005f3dec in gres_plugin_job_state_unpack (gres_list=0x7fff6c5fef08, buffer=0x89dce0, job_id=832, protocol_version=7424) at gres.c:2904
2904 gres.c: No such file or directory.
(gdb) print *gres_job_ptr
$1 = {type_model = 0x916fc0 "\001x00", gres_cnt_alloc = 4294967296, node_cnt = 4294967294, gres_bit_alloc = 0x0, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x0}
The most recent code in master changed the gres variable length from 32 to 64 bits (commit 166a4eb87418f0d). This could be the cause of the failure. Let me investigate, but for now you can check out an earlier commit.
David

The previous version I had running was from December 3, so any change since then could have caused this.

David, the commit e32bf4dc671075a0 in 15.08.0pre2 should get you going. Please let me know if you still have problems.
David

Please reopen if necessary.
David
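
For readers unfamiliar with this class of failure, here is a minimal sketch of how a field-width change can corrupt state that was saved by an older build. It is not Slurm's pack/unpack code: `put_u32`, `get_u64`, and `array_size_ok` are hypothetical helpers, and the byte layout is an assumption made for illustration. Two adjacent 32-bit fields holding 1 and 0, when read back as a single 64-bit field, yield 4294967296, which happens to match the `gres_cnt_alloc` value in the dump. The size check shows one way a reader could reject an implausible count before allocating; in the trace, `node_cnt` (4294967294) times 8 bytes equals the 34359738352 bytes requested from `slurm_xmalloc`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Slurm function): append a 32-bit value
 * in big-endian order and return the new write offset. */
static size_t put_u32(uint8_t *buf, size_t off, uint32_t v)
{
    buf[off]     = (uint8_t)(v >> 24);
    buf[off + 1] = (uint8_t)(v >> 16);
    buf[off + 2] = (uint8_t)(v >> 8);
    buf[off + 3] = (uint8_t)v;
    return off + 4;
}

/* Hypothetical helper: read a 64-bit big-endian value at *off.
 * Returns false if fewer than 8 bytes remain in the buffer. */
static bool get_u64(const uint8_t *buf, size_t len, size_t *off, uint64_t *out)
{
    if (*off > len || len - *off < 8)
        return false;

    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[*off + i];

    *off += 8;
    *out = v;
    return true;
}

/* Hypothetical guard: true only if count * elem_size fits within
 * max_bytes, computed without overflowing. */
static bool array_size_ok(uint64_t count, size_t elem_size, uint64_t max_bytes)
{
    return elem_size != 0 && count <= max_bytes / elem_size;
}
```

If an "old" writer calls `put_u32(buf, 0, 1)` followed by `put_u32(buf, 4, 0)` and a "new" reader calls `get_u64` on the same eight bytes, the reader sees 4294967296 rather than 1. With a cap of, say, 1 GiB, `array_size_ok(4294967294, 8, 1ULL << 30)` returns false, so such a reader could report a corrupt state file instead of aborting in the allocator.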