| Summary: | Slurm tries to run syscfg on non-KNL nodes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Christopher Samuel <chris> |
| Component: | KNL | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex |
| Version: | 17.11.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4027 https://bugs.schedmd.com/show_bug.cgi?id=4461 | ||
| Site: | Swinburne | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.11.1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Christopher Samuel
2017-12-07 17:33:39 MST
Just noting that I see KNL features being enabled on non-KNL nodes too.

Dec 11 10:24:54 skylake1 slurmd[99390]: AllowMCDRAM=cache,hybrid,flat,equal,auto AllowNUMA=a2a,snc2,hemi
Dec 11 10:24:54 skylake1 slurmd[99390]: AllowUserBoot=ALL
Dec 11 10:24:54 skylake1 slurmd[99390]: BootTIme=300
Dec 11 10:24:54 skylake1 slurmd[99390]: DefaultMCDRAM=cache DefaultNUMA=a2a
Dec 11 10:24:54 skylake1 slurmd[99390]: McPath=/sys/devices/system/edac/mc
Dec 11 10:24:54 skylake1 slurmd[99390]: SyscfgPath=/apps/syscfg/14b6/bin/syscfg

Hi

Despite several attempts, I couldn't reproduce this error.
Could you send me slurmd.log?

Dominik

(In reply to Dominik Bartkiewicz from comment #3)
> Hi

Hi Dominik,

> Despite several attempts, I couldn't reproduce this error.
> Could you send me slurmd.log?

Unfortunately there isn't one, the site wants no local logs for slurmd so it all gets sent to a single syslog at the moment.

One thing you may need to do is set:

SlurmdSyslogDebug = info

it may be that it's not getting logged at your level.

Here's a complete log of the startup just now of slurmd on a node.

Dec 11 22:57:55 skylake1 slurmd[10332]: error: Domain socket directory /var/spool/slurmd: No such file or directory
Dec 11 22:57:55 skylake1 slurmd[10332]: Message aggregation disabled
Dec 11 22:57:55 skylake1 slurmd[10332]: gpu device number 0(/dev/nvidia0):c 195:0 rwm
Dec 11 22:57:55 skylake1 slurmd[10332]: gpu device number 1(/dev/nvidia1):c 195:1 rwm
Dec 11 22:57:55 skylake1 slurmd[10332]: CPU frequency setting not configured for this node
Dec 11 22:57:56 skylake1 systemd: PID file /var/run/slurmd.pid not readable (yet?) after start.
Dec 11 22:57:56 skylake1 slurmd[10337]: slurmd version 17.11.0 started
Dec 11 22:57:56 skylake1 slurmd[10337]: AllowMCDRAM=cache,hybrid,flat,equal,auto AllowNUMA=a2a,snc2,hemi
Dec 11 22:57:56 skylake1 slurmd[10337]: AllowUserBoot=ALL
Dec 11 22:57:56 skylake1 slurmd[10337]: BootTIme=300
Dec 11 22:57:56 skylake1 slurmd[10337]: DefaultMCDRAM=cache DefaultNUMA=a2a
Dec 11 22:57:56 skylake1 slurmd[10337]: McPath=/sys/devices/system/edac/mc
Dec 11 22:57:56 skylake1 slurmd[10337]: SyscfgPath=/apps/syscfg/14b6/bin/syscfg
Dec 11 22:57:56 skylake1 slurmd[10337]: SyscfgTimeout=1000 msec
Dec 11 22:57:56 skylake1 slurmd[10337]: SystemType=Dell
Dec 11 22:57:56 skylake1 slurmd[10337]: UmeCheckInterval=0
Dec 11 22:57:56 skylake1 slurmd[10337]: slurmd started on Mon, 11 Dec 2017 22:57:56 +1100
Dec 11 22:57:56 skylake1 slurmd[10337]: node_features_p_node_state: syscfg program not found, can not get KNL modes
Dec 11 22:57:56 skylake1 slurmd[10337]: CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=191908 TmpDisk=360896 Uptime=7619 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Here is "scontrol show node" for this node.

NodeName=skylake1 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUErr=0 CPUTot=36 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:p100:2
   NodeAddr=skylake1 NodeHostName=skylake1 Version=17.11
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=191000 AllocMem=0 FreeMem=187682 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=skylake,debug
   BootTime=2017-12-11T20:50:57 SlurmdStartTime=2017-12-11T22:57:56
   CfgTRES=cpu=36,mem=191000M,billing=36
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=138 LowestJoules=555430 ConsumedJoules=23597
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2017-12-11T21:24:15]

I don't know where the error for /var/spool/slurmd is coming from, that's not in the config anywhere.
cheers,
Chris

Here's the knl_generic.conf file too:

# Sample knl_generic.conf
SyscfgPath=/apps/syscfg/14b6/bin/syscfg
DefaultNUMA=a2a # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=cache # MCDRAM=cache
SystemType=Dell

The syscfg program isn't actually present.

Hi

Good news: I found spot which caused "error: Domain socket directory..." and it will be fixed in the next release of 17.11.

I still can't recreate main problem, in attached slurmd.log I also don't see this error.
If you attach slurmd log which contain error, this will give me some clues how to recreate and fix this issue.

Dominik

(In reply to Dominik Bartkiewicz from comment #7)
> Hi

Hi Dominik,

> Good news: I found spot which caused "error: Domain socket directory..." and
> it will be fixed in the next release of 17.11.

Great, thanks so much!

> I still can't recreate main problem, in attached slurmd.log I also don't see
> this error.
> If you attach slurmd log which contain error, this will give me some clues
> how to recreate and fix this issue.

As far as I can tell *none* of this from the log you replied to should be present on a standard (non-KNL) node (apart from the "slurmd started" line). The last line has the error that it cannot find the binary.
Dec 11 22:57:56 skylake1 slurmd[10337]: AllowUserBoot=ALL
Dec 11 22:57:56 skylake1 slurmd[10337]: BootTIme=300
Dec 11 22:57:56 skylake1 slurmd[10337]: DefaultMCDRAM=cache DefaultNUMA=a2a
Dec 11 22:57:56 skylake1 slurmd[10337]: McPath=/sys/devices/system/edac/mc
Dec 11 22:57:56 skylake1 slurmd[10337]: SyscfgPath=/apps/syscfg/14b6/bin/syscfg
Dec 11 22:57:56 skylake1 slurmd[10337]: SyscfgTimeout=1000 msec
Dec 11 22:57:56 skylake1 slurmd[10337]: SystemType=Dell
Dec 11 22:57:56 skylake1 slurmd[10337]: UmeCheckInterval=0
Dec 11 22:57:56 skylake1 slurmd[10337]: slurmd started on Mon, 11 Dec 2017 22:57:56 +1100
Dec 11 22:57:56 skylake1 slurmd[10337]: node_features_p_node_state: syscfg program not found, can not get KNL modes

The reason the original error might have gone away is that I had started the slurmd processes up when there was the Intel (not Dell) syscfg binary there and that is a 32-bit program (even though Intel say they no longer support 32-bit distros with it) and our nodes have no 32-bit compatibility libraries installed. This results in an error trying to run it.

When I restarted slurmd after removing the binary (as then I found I needed a different one from Dell which I don't currently have) it looks like it reported that it couldn't find the binary so I'm guessing that it then doesn't try and run it - or at least doesn't complain as before.

So whilst the error might have been suppressed there is still the issue of why is the knl_generic config being used on nodes that do not have the knl feature set on them?

All the best,
Chris

(In reply to Christopher Samuel from comment #8)
> why is the knl_generic config being used on nodes that do not have the knl
> feature set on them?

Oops - I meant knl_generic plugin, not config, then!

All the best,
Chris

Hi

Now I understand how this could happen :)
I tried to find code path which gives this error without syscfg.

Currently slurmd base on syscfg availability.
https://github.com/SchedMD/slurm/commit/ea2a0d25d11

I will check if it is possible to improve this behavior.

Dominik

(In reply to Dominik Bartkiewicz from comment #10)
> Hi

Hi Dominik,

> Now I understand how this could happen :)
> I tried to find code path which gives this error without syscfg.
>
> Currently slurmd base on syscfg availability.
>
> https://github.com/SchedMD/slurm/commit/ea2a0d25d11

Thanks so much! That explains why it changed.

> I will check if it is possible to improve this behavior.

Could I suggest it should only happen on nodes that have the "knl" feature set? That's certainly the behaviour I would expect from the documentation here:

https://slurm.schedmd.com/intel_knl.html

# The node features plugin manages the available and active features
# information available for each KNL node.

All the best,
Chris

I received a direct email from Chris regarding the bug. Responding here:

I am not sure how to reliably determine if a node is KNL.

Most sites assign the nodes a feature of "knl", but I am not sure that all do, especially if all nodes in the cluster are KNL. I'm not fond of the idea, but we avoid using the plugin under some situations, perhaps something like this:
1. No syscfg program found
2. No feature of "knl" AND
3. CPU count under 256

There was a missing "could" in my message.
Regarding Dominik's idea of using /proc/cpuinfo: We do some development/testing of the knl_generic plugin on non-KNL nodes, so the information in /proc/cpuinfo may not say "Xeon Phi" (at least in a development/test environment).
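For illustration only, here is a minimal sketch of what a /proc/cpuinfo check along these lines might look like. It is not the code from any Slurm commit. The function names and the `force` override (standing in for the development/test case described above) are hypothetical, and the model-name strings in the usage note are illustrative samples rather than output captured from the nodes in this report.

```python
def cpuinfo_mentions_xeon_phi(cpuinfo_text: str) -> bool:
    """Return True if any 'model name' line in the text contains 'Xeon Phi'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name" and "Xeon Phi" in value:
            return True
    return False


def node_looks_like_knl(path: str = "/proc/cpuinfo", force: bool = False) -> bool:
    """Guess whether this node is KNL from the cpuinfo file at `path`.

    `force` is a hypothetical override for development/test hosts whose
    cpuinfo does not say "Xeon Phi"; it is an assumption of this sketch.
    """
    if force:
        return True
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return cpuinfo_mentions_xeon_phi(handle.read())
    except OSError:
        # An unreadable or missing cpuinfo is treated as "not KNL".
        return False
```

Usage: `cpuinfo_mentions_xeon_phi("model name\t: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz\n")` is true, while a "Xeon(R) Gold" model name is not matched. A string match like this is only as reliable as the model name the kernel reports, which is the limitation raised in the comment above.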
> I received a direct email from Chris regarding the bug. Responding here:
>
> I am not sure how to reliably determine if a node is KNL.
>
> Most sites assign the nodes a feature of "knl", but I am not sure that all
> do, especially if all nodes in the cluster are KNL. I'm not fond of the
> idea, but we **could** avoid using the plugin under some situations, perhaps something
> like this:
> 1. No syscfg program found
> 2. No feature of "knl" AND
> 3. CPU count under 256
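One way to read the three conditions quoted above is "skip the plugin only when all of them hold". The sketch below encodes that reading. It is an assumption about the proposal, not the logic of the actual fix, and `NodeFacts`, `should_skip_knl_plugin`, and `trust_syscfg` are hypothetical names introduced for illustration. The `trust_syscfg` switch covers sites where, as a later comment in this thread points out, the presence of syscfg says nothing about whether the node is KNL.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Set

KNL_CPU_THRESHOLD = 256  # "CPU count under 256" from the proposal


@dataclass
class NodeFacts:
    """Hypothetical container for what is known about a node."""
    syscfg_path: str
    cpu_count: int
    features: Set[str] = field(default_factory=set)


def should_skip_knl_plugin(
    facts: NodeFacts,
    syscfg_exists: Callable[[str], bool] = os.path.isfile,
    trust_syscfg: bool = True,
) -> bool:
    """Return True when every condition in the proposal holds."""
    no_knl_feature = "knl" not in facts.features
    few_cpus = facts.cpu_count < KNL_CPU_THRESHOLD
    if not trust_syscfg:
        # Ignore condition 1 when syscfg is installed on every node.
        return no_knl_feature and few_cpus
    no_syscfg = not syscfg_exists(facts.syscfg_path)
    return no_syscfg and no_knl_feature and few_cpus
```

Usage: for the node in this report (36 CPUs, no features, syscfg binary absent), `should_skip_knl_plugin(NodeFacts("/apps/syscfg/14b6/bin/syscfg", 36))` returns True on a host where that path does not exist. With a shared syscfg installed everywhere, passing `trust_syscfg=False` reduces the decision to conditions 2 and 3.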
It's a good idea, and it's Alex's, not mine.

Dominik

(In reply to Moe Jette from comment #13)
> I am not sure how to reliably determine if a node is KNL.
>
> Most sites assign the nodes a feature of "knl", but I am not sure that all
> do, especially if all nodes in the cluster are KNL.

From the documentation I'd assumed that the plugin was only loaded on nodes with the knl feature set, but it appears to not be the case.

> I'm not fond of the
> idea, but we avoid using the plugin under some situations, perhaps something
> like this:
>
> 1. No syscfg program found

All our nodes (KNL and non-KNL) will have identical images and the syscfg program will live on a shared filesystem, so finding the program is not a reliable way of determining that it's a KNL node. Plus the Intel syscfg program is used on non-KNL hardware for BIOS settings and so has a legitimate purpose on non-KNL nodes.

> 2. No feature of "knl" AND
> 3. CPU count under 256

That would work for us.

All the best,
Chris

Hi

Commit https://github.com/SchedMD/slurm/commit/412846edceb2 adds extra check to ensure node is KNL.
I hope this will work for you.

Dominik

On 20/12/17 3:22 am, bugs@schedmd.com wrote:
> adds extra check to ensure node is KNL.
> I hope this will work for you.

Thanks so much Dominik (and my thanks to Moe too), that looks very promising.

I won't get a chance to test this out until I'm back from the US in February (as I fly tomorrow) but one of the first things I plan when back is to upgrade to the latest release which will have this in.

Have a happy festive time all!

All the best,
Chris

Hi

Any news?

Dominik

Hi Dominik,

On 05/01/18 03:54, bugs@schedmd.com wrote:
> Any news?

I'm still on leave in the USA until February, so I won't get a chance to check this until I return, I'm afraid.

All the best,
Chris

Hi

I'm going to go ahead and mark this as Resolved, please feel free to re-open this or open new bugs if there's anything else we can help with.

Dominik |