| Summary: | Prolog error | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Torkil Svensgaard <torkil> |
| Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | cinek, rkv, tripiana |
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8398 | ||
| Site: | DRCMR | ||
| Attachments: | Where add attachment link is; Log from bigger7; Bigger7 slurm.conf; Slurm.conf from controller; Slurmctld.log; Bigger13 slurmd.log | ||
Description

Torkil Svensgaard 2021-06-02 01:02:58 MDT

On the controller there's only this:

> [2021-06-02T08:05:53.398] error: validate_node_specs: Prolog or job env setup failure on node bigger7, draining the node
> [2021-06-02T08:05:53.398] drain_nodes: node bigger7 state set to DRAIN

Mvh. Torkil

Marcin Stolarek:

Generally, slurmd logs should contain some information about the cause of the error. What is SlurmdDebug set to? Did you confirm that slurmd logs to the /var/log/slurmd.log file? Could you please share the log (slurmd side) with us?

cheers, Marcin

Torkil Svensgaard:

Hi Marcin

How do I go about sending you guys a large log file? From the conf file:

SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log

Mvh. Torkil

Marcin Stolarek:

Created attachment 19775 [details]
Where add attachment link is

> How do I go about sending you guys a large log file?

It's usually fine to attach a compressed file to the case (use the "Add an attachment" link; screenshot attached).

> SlurmdDebug=debug
> SlurmdLogFile=/var/log/slurmd.log

Looks good. Did you check whether new logs are appearing there?

cheers, Marcin

Torkil Svensgaard:

Created attachment 19793 [details]
Log from bigger7

Hi Marcin

Log uploaded. The log is active, yes.

Mvh. Torkil

Marcin Stolarek:

Torkil,
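Marcin's note above about attaching a compressed file can be sketched as a quick shell how-to. The log file here is a stand-in created for the example; on the node it would be /var/log/slurmd.log:

```shell
# Stand-in for /var/log/slurmd.log so the sketch is self-contained
printf 'example slurmd log line\n' > slurmd.log

# gzip -c writes the compressed copy to stdout, leaving the original log
# in place for slurmd to keep writing to
gzip -c slurmd.log > slurmd.log.gz

# Verify the archive before attaching it to the case
gunzip -t slurmd.log.gz && echo "archive OK"
```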
Looking at the log file I see that, starting from 2 June, you are getting step-level messages like:
>[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
From the configuration you shared with us some time ago I see that the node may be running configless. Could you please check the content of /run/slurm/conf/slurm.conf on the node?
Was there any configuration change around that date?
cheers,
Marcin
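A minimal sketch of the kind of check Marcin asks for above. The two files here are stand-ins created for the example; on a configless node you would compare /run/slurm/conf/slurm.conf against the controller's slurm.conf:

```shell
# Stand-in configs so the sketch is self-contained
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > node-slurm.conf
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > controller-slurm.conf

# In configless mode slurmd caches the config it fetched from slurmctld;
# a clean diff means the node picked up the last reconfigure
if diff -q node-slurm.conf controller-slurm.conf >/dev/null; then
    echo "config in sync"
else
    echo "config differs: node may be running a stale configless cache"
fi
```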
Created attachment 19796 [details]
Bigger7 slurm.conf
Created attachment 19797 [details]
Slurm.conf from controller
(In reply to Marcin Stolarek from comment #7)
> Looking at the log file I see that starting from 2 June you're getting a
> step level messages like:
> >[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
>
> From the configuration you shared with us some time ago I see that it may be
> running config less, could you please check the content of
> /run/slurm/conf/slurm.conf on the node?

Seems to be up to date.

> Was there any configuration change around that date?

Yes, though not specifically on bigger7: we added GPUs to other nodes and therefore had to change slurm.conf (and run reconfigure).

The cluster was fully loaded that day, and the prolog file sits on NFS. Could that be a problem if NFS is getting hammered, even though the file is cached in a lot of places? One other node also hit this prolog error right after I created the ticket.

Mvh. Torkil

Marcin Stolarek:

Torkil,

Looking at the sources I realized that the message:

> [2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'

is just misleading and has no impact on batch step logging. It logs the level used for stderr, which is only meaningful for srun-created steps, since those are connected to the stderr of slurmstepd; the batch step logs to a file only.

I've actually found the error message related to the node being drained:

> error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

This error means that the request to run the batch step timed out waiting for the corresponding launch-prolog RPC to start processing in slurmd. Unfortunately, we can't say more about why that happened. Did you notice any connection errors to the bigger7 node logged on the slurmctld side?

There was a very rare condition where this happened on versions prior to 20.02.6. I'm adding that bug as See Also, since you may find a few hypotheses there on when this can happen.
cheers, Marcin

Torkil Svensgaard:

(In reply to Marcin Stolarek from comment #11)
> I've actually found the error message related to the node being drained,
> it's:
> >error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

I just checked bigger13, which also had the same problem at roughly the same time. Same thing:

> [2021-06-02T13:53:48.689] error: Waiting for JobId=217868 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
> [2021-06-02T13:53:48.689] error: Waiting for JobId=217871 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
> [2021-06-02T13:53:48.699] error: Waiting for JobId=217870 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
> [2021-06-02T13:53:48.774] error: Waiting for JobId=217872 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

Uploading the full log from that node.

> This error means that the request to run batch step timed-out waiting for
> appropriate launch prolog RPC start processing by the slurmd. Unfortunately,
> we can't say more about the reason for that to happen. Did you notice any
> connection errors logged on slurmctld side to bigger7 node?

I didn't at the time. Uploading slurmctld.log.

> There was a very rare condition where it was happening on the versions prior
> to 20.02.6 I'm adding the bug as See Also, since may find a few hypothesis
> on when this can happen.

Thanks, Torkil

Created attachment 19808 [details]
Slurmctld.log
Created attachment 19809 [details]
Bigger13 slurmd.log
Marcin Stolarek:

Torkil,

Looking at the slurmctld logs I see that it was under high pressure at the time the issue occurred:

> [2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt

I'd recommend setting MessageTimeout=30 and eventually considering SchedulerParameters=max_rpc_cnt=100. Looking at your configuration, you're running with the default MessageTimeout=10. The prolog error:

> REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

shows 20 s, which is 2 x MessageTimeout. This value may be too low for a high-throughput environment and can result in errors during bursts of RPCs.

cheers, Marcin

Torkil Svensgaard:

(In reply to Marcin Stolarek from comment #16)

Hi Marcin

> Looking at slurmctld logs I see that it was under high pressure during the
> time when the issue occured:
> >[2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
> >[2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> I'd recommend setting MessageTimeout=30 and eventualy consinder setting
> SchedulerParameters=max_rpc_cnt=100.

Thanks, I've changed the configuration. Feel free to close the ticket; I'll get back to you if it happens again.

Mvh. Torkil
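The two recommendations from the comment above, as a minimal slurm.conf fragment. The values are exactly those suggested in the thread; all other settings are site-specific and omitted:

```
# slurm.conf fragment (controller side)

# Default MessageTimeout is 10 s; the prolog-launch wait is 2 x MessageTimeout,
# so raising this to 30 widens the window from 20 s to 60 s
MessageTimeout=30

# Throttles scheduling when this many RPCs are pending (see the
# "consider configuring max_rpc_cnt" hints in the slurmctld log above)
SchedulerParameters=max_rpc_cnt=100
```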