Hi Mohsin. Sorry for the delay on this; I'm currently on a two-week on-site training and so have intermittent time to address bugs. I or another support engineer will be looking into this and come back to you as soon as possible. Thanks.

Hi,
Is there a reason for setting ThreadsPerCore=1 for nodes in slurm.conf.TDS? Could you send me slurmd.log with the debug log level enabled?
Dominik

Mohsin,
Can you please provide your cgroup.conf for production and TDS.
Can you run your test job on the TDS with slurmd in debug mode. This can be done by calling slurmd with "-vvvvv" or setting this in slurm.conf:
> SlurmdDebug=debug5
Slurmd will need to be restarted if slurm.conf is changed. Please revert this setting after the test to avoid filling log partitions (or syslog).
Can you run your test job with the following arguments:
> srun -vvvvv --mem-bind=verbose --cpu-bind=verbose $TESTJOB
Please dump the output of 'env' or the job environment from your srun.
Please attach all the logs to this ticket, preferably as a compressed tarball.
Thanks,
--Nate

Created attachment 9500 [details]
slurmd logs and cgroup.conf files from tds and production systems
Hi Nate,
I have run the tests you requested. The logs are in the attached tarball slurmd_logs_bug6552.tar
(In reply to Mohsin Ahmed from comment #7)
> Created attachment 9500 [details]
> slurmd logs and cgroup.conf files from tds and production systems
>
> Hi Nate,
> I have run the tests you requested. The logs are in the attached tarball
> slurmd_logs_bug6552.tar

We are reviewing your logs now.

(In reply to Nate Rini from comment #8)
> We are reviewing your logs now.

The logs appear to be for a single-task job:
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1

Can we get a log for "-n 24" as in the original comment?

Created attachment 9501 [details]
srun env dump on 24 numtasks.
Hi Nate,
Added the env dump with -n 24 from TDS.
Regards
Mohsin
The slurmd_logs.tds also shows a different job? The core assignment, however, looks like it's getting rotated around:

> Rank 0 thread 0 on nid00036 core 0
> Rank 4 thread 0 on nid00036 core 1
> Rank 8 thread 0 on nid00036 core 2
> Rank 12 thread 0 on nid00036 core 3
> Rank 16 thread 0 on nid00036 core 4
> Rank 20 thread 0 on nid00036 core 5
> Rank 1 thread 0 on nid00036 core 6
> Rank 5 thread 0 on nid00036 core 7
> Rank 9 thread 0 on nid00036 core 8
> Rank 13 thread 0 on nid00036 core 9
> Rank 17 thread 0 on nid00036 core 10
> Rank 21 thread 0 on nid00036 core 11
> Rank 2 thread 0 on nid00036 core 24
> Rank 6 thread 0 on nid00036 core 25
> Rank 10 thread 0 on nid00036 core 26
> Rank 14 thread 0 on nid00036 core 27
> Rank 18 thread 0 on nid00036 core 28
> Rank 22 thread 0 on nid00036 core 29
> Rank 3 thread 0 on nid00036 core 30
> Rank 7 thread 0 on nid00036 core 31
> Rank 11 thread 0 on nid00036 core 32
> Rank 15 thread 0 on nid00036 core 33
> Rank 19 thread 0 on nid00036 core 34
> Rank 23 thread 0 on nid00036 core 35

Breaking up the SLURM_CPU_BIND_LIST variable, we see that the tasks are getting handed out to single threads (without repeats):

> 0x000000000001
> 0x000000000002
> 0x000000000004
> 0x000000000008
> 0x000000000010
> 0x000000000020
> 0x000000000040
> 0x000000000080
> 0x000000000100
> 0x000000000200
> 0x000000000400
> 0x000000000800
> 0x000001000000
> 0x000002000000
> 0x000004000000
> 0x000008000000
> 0x000010000000
> 0x000020000000
> 0x000040000000
> 0x000080000000
> 0x000100000000
> 0x000200000000
> 0x000400000000
> 0x000800000000

Looks like it ran on nid00036, which has MaxCPUsPerNode=48 from your slurm.conf, which explains why the CPUs are not all in order.
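[Editor's note] The hex masks above can be decoded mechanically. A minimal sketch, not part of the original ticket (`mask_to_cpus` is a hypothetical helper, not a Slurm utility): each mask is a bitmap in which bit N set means logical CPU N is part of the binding.

```python
# Decode Slurm CPU-bind hex masks into the logical CPU ids they select.
# Bit N set in the mask means logical CPU N is in the binding.
def mask_to_cpus(mask_hex):
    value = int(mask_hex, 16)
    return [cpu for cpu in range(value.bit_length()) if value & (1 << cpu)]

# A few masks from the SLURM_CPU_BIND_LIST above:
for mask in ("0x000000000001", "0x000000000800",
             "0x000001000000", "0x000800000000"):
    print(mask, "->", mask_to_cpus(mask))
# prints [0], [11], [24], [35] respectively; the full 24-mask list
# decodes to CPUs 0-11 and 24-35, one CPU per task with no repeats,
# matching the "without repeats" observation above.
```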
Can you please run this job again and then call the following using srun:
> cat /proc/self/status
> lstopo --of console --taskset
> numactl -s
> lscpu

The slurm.conf has this setting for Magnus:
> SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_CORE_Memory,other_cons_res

But the Chaos slurm.conf has this setting:
> SelectTypeParameters=CR_CORE_Memory,other_cons_res

Slurm should only be configured to schedule by cores with either CR_Core_Memory or CR_ONE_TASK_PER_CORE, depending on whether you want users to be able to choose to schedule by threads. It is interesting that this broke on Chaos with 18.08.

I also note the TDS slurmd log (which looks like the srun log) has this:
> srun: threads-per-core : 1
Was this set manually, or was CR_ONE_TASK_PER_CORE set when this job was run?

Hi,
I can't find the slurmd log, only logs from srun. Have you intentionally set ThreadsPerCore=1 for nodes in slurm.conf.TDS?
Dominik

So when you say,
> Looks like it ran on nid00036 which has MaxCPUsPerNode=48 from your
> slurm.conf which explains why the cpus are not all in order.
should we take it as saying that, because the "acceptance" queue
doesn't explicitly specify a MaxCPUsPerNode value, which the workq
and gpuq do (MaxCPUsPerNode=24 and MaxCPUsPerNode=8 respectively),
the "acceptance" queue will get whatever "default" that
SLURM gets back from interrogating the OS?
As in, the job IS NOT getting MaxCPUsPerNode=48 from our slurm.conf ?
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
> Have you intentionally set ThreadsPerCore=1 for nodes in slurm.conf.TDS?
Yes.
Here's what happens.
When we do the automated slurm.conf generation, as part of the install,
we get the following on the TDS, where ThreadsPerCore=2 because Cray
seem unable to turn off hyperthreading in their BIOS.
#NodeName=nid000[32-35] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
#NodeName=nid000[36-39] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
#NodeName=nid000[13-15] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
#NodeName=nid000[16-19] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
#NodeName=nid000[24-27] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu # RealMemory=32768
As you will see from the TDS slurm.conf, we comment those definitions out, and
even though no-one here believes it ever did anything, we explicitly put in
ThreadsPerCore=1
as a default in the line
NodeName=DEFAULT Sockets=2 ThreadsPerCore=1 Gres=craynetwork:4
before going on to define other node-specific values, viz:
NodeName=nid000[32-35] CoresPerSocket=10 Feature=ivybridge RealMemory=64394
NodeName=nid000[36-39] CoresPerSocket=12 Feature=haswell RealMemory=64298
NodeName=nid000[13-15] CoresPerSocket=10 Feature=ivybridge RealMemory=64394
NodeName=nid000[16-19] CoresPerSocket=12 Feature=haswell RealMemory=64298
NodeName=nid000[24-27] CoresPerSocket=8 Feature=sandybridge,gpu,tesla RealMemory=32154 Sockets=1 Gres=craynetwork:4,gpu
that DON'T, however, override the Default.
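[Editor's note] The effect of that default can be sketched numerically. This sketch is not part of the original ticket (`node_cpus` is a hypothetical helper); it rests on the fact that, when CPUs is not set explicitly, slurm.conf derives a node's CPU count as the product Sockets x CoresPerSocket x ThreadsPerCore:

```python
# Slurm's derived CPU count for a node definition:
# CPUs = Sockets * CoresPerSocket * ThreadsPerCore (when CPUs is unset).
def node_cpus(sockets, cores_per_socket, threads_per_core):
    return sockets * cores_per_socket * threads_per_core

# nid000[36-39] as the hardware reports it
# (Sockets=2 CoresPerSocket=12 ThreadsPerCore=2):
print(node_cpus(2, 12, 2))  # prints 48 logical CPUs

# ...and as defined with the ThreadsPerCore=1 default override:
print(node_cpus(2, 12, 1))  # prints 24
```

So the override halves the logical CPU count Slurm believes each node has, which is why the hardware-mismatched definition interacts badly with the scheduler's CPU-to-task mapping.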
Hope the provenance of that setting is useful.
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Created attachment 9502 [details]
revised_logs run on same node with slurmd logs included.
Hello Dominik,
I have added a revised log tarball containing the srun outputs, the slurmd logs, the node environment, and the cgroup.conf from both the TDS and the production system.
The slurmd logs are trimmed to capture the logs for the jobID of interest.
Regards
Mohsin
Hi,
Could you generate slurmd.log with a higher debug level? See Nate's comment 6.
NodeName=DEFAULT works. Could you set a configuration in slurm.conf that matches the node hardware, and then run another test? I think the proper configuration should work fine.
Dominik

Created attachment 9503 [details]
srun with slurm_debug set to level 6
Hi Dominik,
Apologies for not setting the required slurmd debug level.
I have repeated the run with the following command line this time (instead of making changes to slurm.conf and restarting the daemon):
srun -n 24 --export=all --slurmd-debug=-vvvvv --mem-bind=verbose --cpu-bind=verbose ./xthi | sort -k 2 -n
regards
Mohsin
Hi,
Sorry for bugging you about slurmd.log, but this time you have only attached the slurmstepd log. Could you send me both the slurmd and slurmstepd logs? Did you have a chance to test the config with ThreadsPerCore=2?
Dominik

Created attachment 9531 [details]
nid00036.log

On 2019/03/12 17:03, bugs@schedmd.com wrote:

> Sorry for bugging you about slurmd.log, but this time you have only attached
> slurmstepd log, could you send me both slurmd and slurmstepd log?

Mohsin is away today (Tue 12th), hence the lack of updates from him on this.

I have just seen your latest comment as I was about to leave, so, to try and progress this ...

... find attached the slurmd log from the job for which Mohsin has already supplied the slurmstepd log. Note that this is the full log, not just the section from Mohsin's last run, as I don't have time to cut it down.

> Did you have a chance to test config with ThreadsPerCore=2?

Just of interest though, what's the thinking there?

Surely, if that parameter actually does anything at all, then having it set to 1, as we have on the TDS, should limit the Threads Per Core to 1?

The suggestion that we set ThreadsPerCore=2, so that we can get SLURM to only use 1 Thread Per Core, seems a bit obtuse, but maybe it is supposed to be?

Kevin

(In reply to Kevin Buckley from comment #21)
> ... find attached the slurmd log from the job for which Mohsin
> has already supplied the slurmstepd log. Note that this is the
> full log, not just the section from Mohsin's last run, as I
> don't have time to cut it down.
Hi,
This is still not what we asked for; we need the slurmd log with debug enabled. To check how slurmd translates abstract cores from slurmctld into physical CPUs on the node, we need at least debug2; debug4 should show us a full mapping between abstract and physical CPUs. See comment 6 and comment 17.

> > Did you have a chance to test config with ThreadsPerCore=2?
>
> Just of interest though, what's the thinking there?
>
> Surely, if that parameter actually does anything at all, then
> having it set to 1, as we have on the TDS, should limit the
> Threads Per Core to 1 ?
>
> The suggestion that we set ThreadsPerCore=2, so that we can
> get SLURM to only use 1 Thread Per Core seems a bit obtuse,
> but maybe it is supposed to be?

This is just the node definition, and it should match the hardware on the nodes; otherwise, Slurm has problems assigning the right CPUs to jobs. Slurm provides mechanisms for allocating jobs with only one task per core: CR_ONE_TASK_PER_CORE, or the task/affinity plugin option --hint/SLURM_HINT.
Dominik

On 2019/03/12 19:13, bugs@schedmd.com wrote:

> This is still not what we asked for, we need slurmd log with enabled debug.
> To check how slurmd translates abstract core from slurmctld to physical CPUs
> on the node we need at least debug2.
> debug4 should show us a full mapping between abstract and physical CPUs.
> Check: comment 6, comment 17

So, just to be clear, Mohsin's invocation of the job

>>> srun -n 24 --export=all --slurmd-debug=-vvvvv --mem-bind=verbose --cpu-bind=verbose ./xthi | sort -k 2 -n

hasn't altered the debug level: you need the slurmd started with those settings in the SLURM config?

> This is just nodes definition and it should match to hardware on nodes.
> otherwise, slurm has a problem with right assigning cpus to job.
> Slurm provides mechanisms for allocating job with only one task per core
> CR_ONE_TASK_PER_CORE, or task/affinity plugin option --hint/SLURM_HINT.

I'm still none the wiser there but, you're the expert, so I guess we can set it up as you suggest.

(In reply to Kevin Buckley from comment #23)
> So, just to be clear, Mohsin's invocation of the job
>
> >>> srun -n 24 --export=all --slurmd-debug=-vvvvv --mem-bind=verbose --cpu-bind=verbose ./xthi | sort -k 2 -n
>
> hasn't altered the debug level: you need the slurmd started with those
> settings in the SLURM config ?

Yes, this argument only takes effect for job steps. To get the logs we need, we either need slurmd to be started with "-vvvvv" or for "SlurmdDebug=debug5" to be set in the slurm.conf and a SIGHUP sent to the slurmd daemon. The slurm.conf change can be limited to the single node and should then be reversed once the logs have been retrieved.

(In reply to Kevin Buckley from comment #23)
> I'm still none the wiser there but, you're the expert, so I guess
> we can set it up as you suggest.

Slurm can be configured to use the core count in the CPUs field with CR_Core_Memory, or the thread count in the CPUs field with CR_ONE_TASK_PER_CORE. CR_ONE_TASK_PER_CORE lets users choose to schedule by threads if they request it, but accounting will be done by threads. Either option should work; please decide which one best fits your site's needs. The ThreadsPerCore field should always reflect the value returned by calling "slurmd -C".
On 2019/03/13 12:01, bugs@schedmd.com wrote:

> Yes, this argument only takes effect for job steps. To get the logs we need, we
> either need slurmd to be started with "-vvvvv" or for "SlurmdDebug=debug5" to
> be set in the slurm.conf and SIGHUP sent to the slurmd daemon. The slurm.conf
> change can be limited to the single node and then should be reversed once the
> logs have been retrieved.

Understood. We are now running with debug5 and ThreadsPerCore=2:

49c49
< SlurmdDebug=info
---
> SlurmdDebug=debug5
51c51
< SlurmdSyslogDebug=info
---
> SlurmdSyslogDebug=debug5
134c134
< NodeName=DEFAULT Sockets=2 ThreadsPerCore=1 Gres=craynetwork:4
---
> NodeName=DEFAULT Sockets=2 ThreadsPerCore=2 Gres=craynetwork:4

Mohsin should be running his job again shortly (now 1245 AWST).

Kevin

Created attachment 9554 [details]
slurmd and slurmstepd logs with very verbose debug levels.

Hi Dominik,
Please see attached, which, to my understanding, should be aligned with what was instructed in Comment#6 by Nate.
Regards
Mohsin

Hi,
Thank you for the slurmd log. Could you tell me the specification of the job present in this log? Does the SLURM_HINT env work correctly with the new configuration? Have you had a chance to test CR_ONE_TASK_PER_CORE?
Dominik

> Have you had a chance to test CR_ONE_TASK_PER_CORE?

We have not, as yet, looked into what, if any, differences we might get in operation, were we to alter the TDS config to have:

SelectTypeParameters=CR_ONE_TASK_PER_CORE

instead of what it currently has, viz:

SelectTypeParameters=CR_CORE_Memory

as this is not considered relevant to the Hyperthreading issue that we reported in this ticket and which SchedMD have now solved for us.

To recap:

We were using our TDS to investigate the effects of upgrading SLURM from 17 to 18, as there is a desire to do so on our production systems. The production system (Magnus) has a SLURM config that does not give rise to Hyperthreading when running under SLURM 17.
The TDS had a SLURM config that did not give rise to Hyperthreading when we were running it under SLURM 17. When we upgraded the TDS SLURM to 18, without changing its config, we saw Hyperthreading.

SchedMD's inspection of our configs pointed out that the TDS config had had the SLURM-determined ThreadsPerCore value overridden in an attempt to control Hyperthreading, whereas Magnus's config had not, although neither value gave rise to Hyperthreading when both were running under SLURM 17.

SchedMD's suggestions have shown us that, once we removed the change we made to the SLURM-determined ThreadsPerCore value, we no longer see any Hyperthreading with the TDS running SLURM 18. This has been confirmed, both in our test program output and by inspection of an increased level of run-time diagnostics.

The Hyperthreading issue is, therefore, closed to our satisfaction.

Thanks for the time spent on this, and for bearing with us as we got our heads around the effects of tweaking SLURM-determined config parameters, whose names imply they are giving the user something to tweak, when, in fact, tweaking them may cause problems for the operation of SLURM.

As regards the concern expressed by SchedMD in comment 12, that our production system's config specifies two values that SchedMD believe to be orthogonal, that is a matter we intend to look into, and we will open a new issue with SchedMD once we have started that investigation.

Hi,
Thanks for the info. I'm going to close this ticket now. If I have misunderstood you, please reopen.
Dominik
Created attachment 9236 [details]
Tarball with slurm.conf files and a test code.

Hi,

On our production system we use SLURM version 17.11.9, and we are now testing version 18.08.5 on our Cray test and development system (TDS) so it can be rolled out to the production environment.

On the production system running 17.11.9, we were managing hyperthreading via the SLURM environment variable SLURM_HINT=nomultithread. This works fine with the attached configuration file slurm.conf.magnus. When tested with a C code that queries the affinity of the CPU set, we get the following output on a compute node of our production system:

mshaikh@nid00011:/group/pawsey0001/mshaikh/Application_testsuite/pawseyapplicationtestsuite/resourcesdir/slurm/xthi> srun -n 24 ./xthi | sort
Rank 0, thread 0, on nid00011. core = 0.
Rank 1, thread 0, on nid00011. core = 12.
Rank 10, thread 0, on nid00011. core = 5.
Rank 11, thread 0, on nid00011. core = 17.
Rank 12, thread 0, on nid00011. core = 6.
Rank 13, thread 0, on nid00011. core = 18.
Rank 14, thread 0, on nid00011. core = 7.
Rank 15, thread 0, on nid00011. core = 19.
Rank 16, thread 0, on nid00011. core = 8.
Rank 17, thread 0, on nid00011. core = 20.
Rank 18, thread 0, on nid00011. core = 9.
Rank 19, thread 0, on nid00011. core = 21.
Rank 2, thread 0, on nid00011. core = 1.
Rank 20, thread 0, on nid00011. core = 10.
Rank 21, thread 0, on nid00011. core = 22.
Rank 22, thread 0, on nid00011. core = 11.
Rank 23, thread 0, on nid00011. core = 23.
Rank 3, thread 0, on nid00011. core = 13.
Rank 4, thread 0, on nid00011. core = 2.
Rank 5, thread 0, on nid00011. core = 14.
Rank 6, thread 0, on nid00011. core = 3.
Rank 7, thread 0, on nid00011. core = 15.
Rank 8, thread 0, on nid00011. core = 4.
Rank 9, thread 0, on nid00011. core = 16.

This is the expected behaviour, where MPI tasks are distributed across sockets in round-robin fashion, as seen from the core IDs of each MPI task.
When testing on the TDS running 18.08.5, we see a different behaviour, where each physical core is running two logical CPUs; hence each core is oversubscribed by two MPI tasks on the same socket. The output from a compute node of the test and development system is as follows:

mshaikh@chaos-int:/group/pawsey0001/mshaikh/Application_testsuite/pawseyapplicationtestsuite/resourcesdir/slurm/xthi> srun -n 24 ./xthi | sort
Rank 0, thread 0, on nid00036. core = 0.
Rank 1, thread 0, on nid00036. core = 6.
Rank 10, thread 0, on nid00036. core = 26.
Rank 11, thread 0, on nid00036. core = 32.
Rank 12, thread 0, on nid00036. core = 3.
Rank 13, thread 0, on nid00036. core = 9.
Rank 14, thread 0, on nid00036. core = 27.
Rank 15, thread 0, on nid00036. core = 33.
Rank 16, thread 0, on nid00036. core = 4.
Rank 17, thread 0, on nid00036. core = 10.
Rank 18, thread 0, on nid00036. core = 28.
Rank 19, thread 0, on nid00036. core = 34.
Rank 2, thread 0, on nid00036. core = 24.
Rank 20, thread 0, on nid00036. core = 5.
Rank 21, thread 0, on nid00036. core = 11.
Rank 22, thread 0, on nid00036. core = 29.
Rank 23, thread 0, on nid00036. core = 35.
Rank 3, thread 0, on nid00036. core = 30.
Rank 4, thread 0, on nid00036. core = 1.
Rank 5, thread 0, on nid00036. core = 7.
Rank 6, thread 0, on nid00036. core = 25.
Rank 7, thread 0, on nid00036. core = 31.
Rank 8, thread 0, on nid00036. core = 2.
Rank 9, thread 0, on nid00036. core = 8.

As is evident from the above output, the MPI tasks on the TDS are running on hyperthreads. This can be confirmed by looking at the CPU IDs, some of which are greater than 23, whereas the node has only 24 physical cores (IDs 0-23). In this case SLURM_HINT did not have any effect, and the same output is seen whether SLURM_HINT is set to nomultithread or multithread.

I have attached the slurm.conf for both systems for reference.
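[Editor's note] The hyperthread check described above can be done mechanically on the xthi output. This is a sketch, not part of the original ticket (`cpu_id` and `on_hyperthread` are hypothetical helpers); it assumes the usual Cray numbering in which logical CPUs 24-47 are the second hardware threads of physical cores 0-23 on this 2-socket, 12-cores-per-socket node.

```python
# Flag xthi lines whose CPU id lands on a hyperthread sibling,
# assuming logical CPUs >= 24 are second hardware threads of cores 0-23.
PHYSICAL_CORES = 24

def cpu_id(xthi_line):
    """Parse the CPU id from a line like 'Rank 2, thread 0, on nid00036. core = 24.'"""
    return int(xthi_line.rstrip(".").split("core = ")[1])

def on_hyperthread(cpu, physical_cores=PHYSICAL_CORES):
    """True if this logical CPU is a second hardware thread under our assumption."""
    return cpu >= physical_cores

for line in ("Rank 2, thread 0, on nid00036. core = 24.",
             "Rank 4, thread 0, on nid00036. core = 1."):
    cpu = cpu_id(line)
    if on_hyperthread(cpu):
        print(f"{line}  <- hyperthread sibling of core {cpu % PHYSICAL_CORES}")
```

On the TDS output above, this flags the twelve ranks bound to CPUs 24-35, which is exactly the oversubscription being reported.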
Prompt guidance on this matter would be highly appreciated, as we need to resolve this issue before we can decide on migrating to the new SLURM version on our production system.