We are seeing the following behaviour in an interactive job:

jjrepik@mutrino-int:~> salloc -N1 --time=00:20:00 -S 1
salloc: Granted job allocation 2461767
jjrepik@nid00012:~> srun -n 1 hostname
slurmstepd: error: job_set_corespec(20673, 1) failed: Invalid argument
slurmstepd: error: core_spec_g_set: Invalid argument
nid00012
jjrepik@nid00012:~>

We cannot replicate this error on LANL's system (17.02.9).

Core settings in our slurm.conf:

CoreSpecPlugin=cray
NodeName=nid00[012-047,076-127,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell,compute #RealMemory=124928
NodeName=nid00[192-311] Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 Gres=craynetwork:4 Feature=knl,compute State=UNKNOWN #RealMemory=92160

Their system's slurm.conf:

CoreSpecPlugin=cray
NodeName=nid00[012-047,076-111,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell #RealMemory=124928
NodeName=nid00[192-291] Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 Gres=craynetwork:4 Feature=knl State=UNKNOWN #RealMemory=92160

I'm not sure what's different between the two systems. Could you point me to where I might look?
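As an aside, since the two node definitions differ only subtly, one quick way to spot the delta is to tokenize each NodeName line into key/value pairs and diff them. This is a throwaway sketch (the helper is mine, not part of Slurm), using the Haswell lines from the two configs above:

```python
# Hypothetical helper to diff Slurm NodeName definitions; not part of Slurm.
def parse_node_line(line):
    """Split a 'NodeName=... Key=Val ...' line into a dict (trailing comments dropped)."""
    fields = {}
    for tok in line.split("#", 1)[0].split():
        key, _, val = tok.partition("=")
        fields[key] = val
    return fields

ours = parse_node_line(
    "NodeName=nid00[012-047,076-127,140-147,160-179] Sockets=2 "
    "CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell,compute"
)
theirs = parse_node_line(
    "NodeName=nid00[012-047,076-111,140-147,160-179] Sockets=2 "
    "CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell"
)
for key in sorted(set(ours) | set(theirs)):
    if ours.get(key) != theirs.get(key):
        print(f"{key}: ours={ours.get(key)!r} theirs={theirs.get(key)!r}")
```

Running this shows the only differences are the node ranges and the Feature tags, so the hardware topology (Sockets/CoresPerSocket/ThreadsPerCore) matches between the two sites.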
job_set_corespec(20673, 1) is the Cray API call and its arguments. The first argument is the container ID and the second is the core count. Since a core count of 1 should certainly be valid, my guess is that the problem is with the container ID. What is your configured ProctrackType value? It should be "proctrack/cray". If that is what you already have, then please attach your slurm.conf file.
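For triage it can help to pull those two arguments straight out of the log line. A throwaway parser (my own helper, not part of Slurm) applied to the error quoted above:

```python
import re

# Throwaway triage helper (not part of Slurm): extract the container ID and
# core count from the slurmstepd error line quoted earlier in this ticket.
LOG = "slurmstepd: error: job_set_corespec(20673, 1) failed: Invalid argument"

m = re.search(r"job_set_corespec\((\d+),\s*(\d+)\)", LOG)
cont_id, core_cnt = int(m.group(1)), int(m.group(2))
print(f"container ID = {cont_id}, core count = {core_cnt}")
```

Here the core count is 1 (matching the -S 1 request), so the suspect argument is the container ID, 20673.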
sdb:/etc/opt/slurm # grep -i proc slurm.conf
ProctrackType=proctrack/cray
sdb:/etc/opt/slurm #
Created attachment 5553 [details] slurm.conf
This is the same error as seen in Bug 4008. The following patch, added in 17.02.8, addresses the issue:

https://github.com/SchedMD/slurm/commit/525cde12e8d4ea771ca73aec01102a924bc369ca

This is most likely why you aren't seeing the issue on LANL's systems, since they are running 17.02.9, which has the patch. Can you apply the patch or upgrade and confirm that it fixes the issue for you?
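Since the fix landed in 17.02.8, deciding whether a given installed version already includes it is just a version-tuple comparison. A small sketch (the helper name is made up, purely illustrative):

```python
# Hypothetical helper: True if a Slurm version string is >= 17.02.8,
# the release where the job_set_corespec fix landed.
def has_corespec_fix(version: str) -> bool:
    return tuple(int(p) for p in version.split(".")) >= (17, 2, 8)

print(has_corespec_fix("17.02.7"))  # False: predates the patch
print(has_corespec_fix("17.02.9"))  # True: LANL's version, includes the patch
```

This kind of check is why LANL's 17.02.9 system does not reproduce the error while an older 17.02.x install does.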
We will be installing 17.02.9 next week. I'll run the test case again after the update. Thanks.
Did the upgrade happen and were you able to test?
Yes, the upgrade did happen last week, and I apologize for not updating the case.

mutrino:~/yaml/yaml-cpp-master/build> salloc -N1 --time=00:20:00 -S 1
salloc: Granted job allocation 3272739
nid00109:~/yaml/yaml-cpp-master/build> srun -n 1 hostname
nid00109
nid00109:~/yaml/yaml-cpp-master/build>

Everything seems to work as expected.
No problem. Good to hear. I'll close the bug. Thanks, Brian *** This ticket has been marked as a duplicate of ticket 4003 ***