Ticket 4376 - Core specialization doesn't work in interactive job
Summary: Core specialization doesn't work in interactive job
Status: RESOLVED DUPLICATE of ticket 4003
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.02.7
Hardware: Cray XC Linux
Severity: 4 - Minor Issue
Assignee: Brian Christiansen
Reported: 2017-11-13 11:56 MST by Jason Repik
Modified: 2017-12-11 10:20 MST

Site: Sandia National Laboratories


Attachments
slurm.conf (4.80 KB, text/plain)
2017-11-13 14:15 MST, Jason Repik

Description Jason Repik 2017-11-13 11:56:42 MST
We are seeing the following behaviour in an interactive job:

jjrepik@mutrino-int:~> salloc -N1 --time=00:20:00 -S 1
salloc: Granted job allocation 2461767
jjrepik@nid00012:~> srun -n 1 hostname
slurmstepd: error: job_set_corespec(20673, 1) failed: Invalid argument
slurmstepd: error: core_spec_g_set: Invalid argument
nid00012
jjrepik@nid00012:~> 

We cannot replicate this error on LANL's system (17.02.9).

Core settings in our slurm.conf:

CoreSpecPlugin=cray
NodeName=nid00[012-047,076-127,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell,compute #RealMemory=124928
NodeName=nid00[192-311] Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 Gres=craynetwork:4 Feature=knl,compute State=UNKNOWN #RealMemory=92160


Their system slurm.conf:

CoreSpecPlugin=cray
NodeName=nid00[012-047,076-111,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell #RealMemory=124928
NodeName=nid00[192-291] Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 Gres=craynetwork:4 Feature=knl State=UNKNOWN #RealMemory=92160

I'm not sure what's different between the two systems.
Could you point me where I might look?
Comment 1 Moe Jette 2017-11-13 14:06:59 MST
job_set_corespec(20673, 1) is the Cray API call and arguments. The first argument is the container ID and the second is the core count. Since a core count of 1 really should be good, I'll guess the problem is with the container ID. What is your configured ProctrackType value? It should be "proctrack/cray". If that is what you already have, then please attach your slurm.conf file.
Comment 2 Jason Repik 2017-11-13 14:14:23 MST
sdb:/etc/opt/slurm # grep -i proc slurm.conf
ProctrackType=proctrack/cray
sdb:/etc/opt/slurm #
Comment 3 Jason Repik 2017-11-13 14:15:37 MST
Created attachment 5553 [details]
slurm.conf
Comment 4 Brian Christiansen 2017-11-17 16:37:27 MST
This is the same error as seen in Bug 4008. The following patch was added to 17.02.8 to address the issue. 

https://github.com/SchedMD/slurm/commit/525cde12e8d4ea771ca73aec01102a924bc369ca

This is most likely why you aren't seeing the issue on LANL's system -- they are running 17.02.9, which has the patch.

Can you apply the patch or upgrade and confirm the patch fixes it for you?
Comment 5 Jason Repik 2017-11-20 07:54:07 MST
We will be installing 17.02.9 next week. I'll run the test case again after the update. Thanks.
Comment 6 Brian Christiansen 2017-12-11 10:11:47 MST
Did the upgrade happen and were you able to test?
Comment 7 Jason Repik 2017-12-11 10:16:20 MST
Yes, the upgrade did happen last week and I apologize for not updating the case.

mutrino:~/yaml/yaml-cpp-master/build> salloc -N1 --time=00:20:00 -S 1
salloc: Granted job allocation 3272739
nid00109:~/yaml/yaml-cpp-master/build> srun -n 1 hostname
nid00109
nid00109:~/yaml/yaml-cpp-master/build> 

Everything seems to work as expected.
Comment 8 Brian Christiansen 2017-12-11 10:20:38 MST
No problem. Good to hear. I'll close the bug.

Thanks,
Brian

*** This ticket has been marked as a duplicate of ticket 4003 ***