| Summary: | Help needed with taming a misbehaving node | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jurij Pečar <jurij.pecar> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | EMBL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Jurij Pečar
2021-11-12 01:33:06 MST
Would you run the following commands on the scheduler and on bn05, then send back the output?

> nslookup bn05
> ping bn05

Please also attach your slurm.conf.

> agent/is_node_resp: node:bn05 RPC:REQUEST_ACCT_GATHER_UPDATE : Can't find an address, check slurm.conf

I am assuming this is from the slurmctld.log? Would you please also attach your slurmctld and slurmd logs to this issue as well?

Funny enough, after sleeping things over, next day this node works as expected.
Must have been some negative dns cache somewhere ...
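
For anyone hitting the same "Can't find an address" message, the nslookup/ping check requested above can be approximated with a small script run on both the scheduler and the node. This is only a sketch and not part of Slurm: the function name `resolve_node` is made up for illustration, and port 6818 is assumed to be the slurmd port (adjust it if SlurmdPort is set differently in slurm.conf). It goes through the system resolver, so a stale negative cache entry on the host would show up here as a failed lookup.

```python
import socket


def resolve_node(name, port=6818):
    """Return the sorted addresses that `name` resolves to, or [] on failure.

    `port` is an assumption (the usual slurmd port); it does not affect
    whether the name resolves.
    """
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The resolver could not map the name to any address.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for node in ["bn05"]:
        addrs = resolve_node(node)
        print(node, "->", ", ".join(addrs) if addrs else "no address found")
```

If the script reports no address on the scheduler but the same name resolves on another host, that would point at local resolver or cache state rather than at slurm.conf.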
Btw, while browsing the code, I saw lots of error('...') lines that looked like they intend to print out information that I would be interested in. But I didn't see any of them in the logs, even at debug5 level. How do I get these error lines to reach some stdout/stderr/log file?
> Funny enough, after sleeping things over, next day this node works as expected.
> Must have been some negative dns cache somewhere ...

Great.

> Btw, while browsing the code, I saw lots of error('...') lines that looked like they intend to print out information that I would be interested in. But I didn't see any of them in the logs, even at debug5 level. How do I get these error lines to reach some stdout/stderr/log file?

Would you highlight the log lines you are interested in? You can also copy and paste the git line number. For example:
https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/slurmctld/node_scheduler.c#L320

For example, a line like
https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/common/forward.c#L360
From what I could see, slurm_conf_get_addr was returning SLURM_ERROR here, but I couldn't see what exactly it was passing for "name" to verify that it's something reasonable and expected.

Anyway, you can now close this issue.

Resolving
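
The wish expressed above, seeing which "name" was handed to the lookup when it failed, can be illustrated outside of Slurm with a thin wrapper that logs its argument on failure. This is a Python analogy only, not Slurm's C code: `get_addr_logged` and the `lookup` parameter are hypothetical names, and `socket.gethostbyname` merely stands in for whatever lookup is being wrapped.

```python
import logging
import socket

log = logging.getLogger("lookup_sketch")


def get_addr_logged(name, lookup=socket.gethostbyname):
    """Call `lookup(name)`; on failure, log the exact name and return None.

    `lookup` is a stand-in (hypothetical) for the real address lookup.
    """
    try:
        return lookup(name)
    except OSError as exc:
        # %r shows the name exactly as passed, including empty strings
        # or stray whitespace that would be easy to miss otherwise.
        log.error("address lookup failed for name=%r: %s", name, exc)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    print(get_addr_logged("bn05"))
```

Logging the argument with `%r` is a deliberate choice here: it makes an unexpected value visible in the log line itself, which is the detail that was missing in this case.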