| Summary: | nodes staying offline despite "ReturnToService=2" | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | James Powell <James.Powell> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 15.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CSIRO | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description

James Powell
2016-06-29 19:54:47 MDT

I forgot to show how the GPU node is defined in slurm.conf:

```
cm01:~ # grep g001 /etc/slurm/slurm.conf
NodeName=g001 CoresPerSocket=8 Sockets=2 ThreadsPerCore=1 Gres=gpu:3 Feature=gpu_ex,gpu_pro,gpu_ex_pro,gpu_def
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=00:10:00 MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=512 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO State=UP Nodes=g001
```

Tim Wickberg

If I understand correctly, slurmd is starting up with a gres.conf that is missing the devices? slurmd does not re-read that file at any point until it is sent a SIGHUP. (Or until an 'scontrol reconfigure' is issued for the cluster, though I don't believe that would work while the node is still marked down.) In that case the node does not register properly, and ReturnToService will not allow a node that is missing resources to start running jobs.

It sounds like the real fix would be to make the slurmd service on the node depend on Bright finishing all of its setup tasks.

James Powell

Hi Tim,

Thanks for clarifying how ReturnToService works in this instance. I'll pursue starting the slurmd service later as the solution. Please consider this ticket closed.

Cheers,
James
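On a systemd-based node, the fix Tim suggests (have slurmd wait for Bright to finish provisioning) could be sketched as a drop-in that orders slurmd after the provisioning unit. This is only a sketch: `bright-node-setup.service` is a placeholder name, not a real Bright Cluster Manager unit, and would need to be replaced with whatever unit actually completes the node setup on your image.

```
# /etc/systemd/system/slurmd.service.d/after-bright.conf
# Hypothetical drop-in: delay slurmd startup until node provisioning
# (placeholder unit "bright-node-setup.service") has finished, so that
# gres.conf and the GPU device entries exist before slurmd reads them.
[Unit]
After=bright-node-setup.service
Wants=bright-node-setup.service
```

After installing the drop-in, run `systemctl daemon-reload` so systemd picks it up. Alternatively, if slurmd has already started against an incomplete gres.conf, sending it a SIGHUP (e.g. `pkill -HUP slurmd`) forces a re-read of the file, as Tim notes above.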