| Summary: | Slurm Configuration settings | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akshaynath <akshaynath.17> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | jacob |
| Version: | 19.05.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi, I am unable to figure out how to configure the GPUs on my servers. I have 2 servers, each with 8 GPUs and 64 CPUs. The details of my slurm.conf and gres.conf are below. I am getting this error: `gres/gpu count too low (0 < 8), Low socket*core*thread count`.

```
[root@cl-gpusrv1 slurm]# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                              REASON
          cl-gpusrv1       0/0/64/64      drain     192072          debug*           gpu:8 gres/gpu count too low (0 < 8), Low
          cl-gpusrv2       0/64/0/64       idle     192072          debug*           gpu:8                                none
```

```
[root@cl-gpusrv1 slurm]# scontrol show nodes
NodeName=cl-gpusrv1 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8
   NodeAddr=cl-gpusrv1 NodeHostName=cl-gpusrv1 Version=18.08
   OS=Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018
   RealMemory=192072 AllocMem=0 FreeMem=189856 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2019-02-11T16:01:49 SlurmdStartTime=2019-02-11T16:58:12
   CfgTRES=cpu=64,mem=192072M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=gres/gpu count too low (0 < 8), Low socket*core*thread count, Low CPUs [slurm@2019-02-08T12:52:57]

NodeName=cl-gpusrv2 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=0.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8
   NodeAddr=cl-gpusrv2 NodeHostName=cl-gpusrv2
   RealMemory=192072 AllocMem=0 FreeMem=189696 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=64,mem=192072M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```

slurm.conf

```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=cl-gpusrv1
ControlAddr=10.192.12.28
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
DebugFlags=CPU_Bind,gres
ClusterName=buhpc
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
NodeName=cl-gpusrv1 Gres=gpu:8 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN RealMemory=192072
#NodeName=buhpc3 NodeAddr=128.197.115.176 CPUs=64 State=UNKNOWN
#NodeName=cl-gpusrv1 Gres=gpu:8 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN RealMemory=192072
NodeName=cl-gpusrv2 Gres=gpu:8 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN RealMemory=192072
PartitionName=debug Nodes=cl-gpusrv[1-2] Default=YES MaxTime=INFINITE State=UP
```

gres.conf

```
Name=gpu File=/dev/nvidia0 Cores=0,1
Name=gpu File=/dev/nvidia1 Cores=0,1
Name=gpu File=/dev/nvidia2 Cores=2,3
Name=gpu File=/dev/nvidia3 Cores=2,3
Name=gpu File=/dev/nvidia4 Cores=4,5
Name=gpu File=/dev/nvidia5 Cores=4,5
Name=gpu File=/dev/nvidia6 Cores=6,7
Name=gpu File=/dev/nvidia7 Cores=6,7
```

```
[root@cl-gpusrv1 slurm]# tail -20 /var/log/slurmctld.log
[2019-02-11T17:07:39.333] gres_bit_alloc:NULL
[2019-02-11T17:07:39.333] gres_used:(null)
[2019-02-11T17:07:39.333] gres/gpu: state for cl-gpusrv2
[2019-02-11T17:07:39.333] gres_cnt found:TBD configured:8 avail:8 alloc:0
[2019-02-11T17:07:39.333] gres_bit_alloc:NULL
[2019-02-11T17:07:39.333] gres_used:(null)
[2019-02-11T17:07:39.333] Recovered state of 0 reservations
[2019-02-11T17:07:39.333] _preserve_plugins: backup_controller not specified
[2019-02-11T17:07:39.333] Running as primary controller
[2019-02-11T17:07:39.334] No parameter for mcs plugin, default values set
[2019-02-11T17:07:39.334] mcs: MCSParameters = (null). ondemand set.
[2019-02-11T17:07:43.352] error: Node cl-gpusrv1 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-02-11T17:07:43.352] gres/gpu: state for cl-gpusrv1
[2019-02-11T17:07:43.352] gres_cnt found:0 configured:8 avail:8 alloc:0
[2019-02-11T17:07:43.352] gres_bit_alloc:NULL
[2019-02-11T17:07:43.352] gres_used:(null)
[2019-02-11T17:07:43.352] error: Setting node cl-gpusrv1 state to DRAIN
[2019-02-11T17:07:43.352] drain_nodes: node cl-gpusrv1 state set to DRAIN
[2019-02-11T17:07:43.352] error: _slurm_rpc_node_registration node=cl-gpusrv1: Invalid argument
[2019-02-11T17:08:39.549] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
```

ERRORS:
1. `gres/gpu count too low (0 < 8), Low socket*core*thread count`
2. `Setting node cl-gpusrv1 state to DRAIN`

---

Requirement: We need to create two job submission queues. Jobs in one queue (call it "GPU_queue") can make use of the GPUs in the system. A number of CPU cores equal to the number of GPUs in the given server (call this number "g") should be reserved for jobs in this queue. Jobs in the other queue (call it "CPU_queue") can only use CPUs, and they can make use of at most "n" cores, where n = t - g and "t" is the total number of CPU cores in the given server. Can you please see if you can implement such a setup?
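For what it is worth, the two-queue requirement could be sketched as two additional partitions in slurm.conf. This is an untested sketch, not a confirmed configuration: the partition names are taken from the requirement, and the value MaxCPUsPerNode=56 is derived from the numbers in this report (n = t - g = 64 - 8). Note that per-core limits like this only take effect with a consumable-resource selector (e.g. SelectType=select/cons_res with SelectTypeParameters=CR_Core), not with the select/linear plugin shown in the slurm.conf above, which allocates whole nodes.

```
# Hypothetical partition layout for the two-queue requirement (untested sketch).
# t = 64 total cores per node, g = 8 GPUs per node, so n = t - g = 56.

# GPU_queue: jobs here may request GPUs (e.g. --gres=gpu:N at submit time).
PartitionName=GPU_queue Nodes=cl-gpusrv[1-2] MaxTime=INFINITE State=UP

# CPU_queue: CPU-only jobs, capped at n = 56 cores per node, which
# effectively keeps g = 8 cores free for jobs in GPU_queue.
PartitionName=CPU_queue Nodes=cl-gpusrv[1-2] MaxCPUsPerNode=56 MaxTime=INFINITE State=UP
```

Strictly reserving cores *for* GPU jobs (rather than just capping CPU jobs) is a different problem; the cap above only guarantees that CPU_queue jobs cannot consume more than n cores per node.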
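Separately, regarding the DRAIN error: the log above shows both `gres_cnt found:0 configured:8` for cl-gpusrv1 and a warning that cl-gpusrv1 has a different slurm.conf than the slurmctld, which suggests the node's slurmd is not reading the same slurm.conf/gres.conf as the controller. A plausible recovery sequence, assuming the standard /etc/slurm paths and systemd units (adjust to your installation), would be:

```shell
# Hypothetical recovery steps based on the log above (paths/units are assumptions).
# 1. Ensure slurm.conf and gres.conf are identical on the controller and all nodes:
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf cl-gpusrv2:/etc/slurm/

# 2. Restart the daemons so they re-read the configuration:
systemctl restart slurmctld   # on the controller (cl-gpusrv1)
systemctl restart slurmd      # on each compute node

# 3. Once slurmd registers all 8 GPUs, clear the DRAIN state:
scontrol update NodeName=cl-gpusrv1 State=RESUME
```

If `found:0` persists after the restart, it is worth checking that gres.conf sits next to slurm.conf on the node itself and that the /dev/nvidia0-7 device files exist and are readable by slurmd.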