| Summary: | A non superuser tried to complete batch JobId | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmd | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7361 | | |
| Site: | Stanford | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Sherlock | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sh-07-08 slurmd log | | |
Description
Kilian Cavalotti
2019-07-11 15:58:34 MDT
Comment 1 (Michael Hinton)

Hey Kilian,

Thanks for the bug report. I'm looking into it.

-Michael

---

Comment 2 (Michael Hinton)

Hey Kilian,

(In reply to Kilian Cavalotti from comment #0)
> (In reply to Nate Rini from bug 7361 comment #20)
> > That error is dumped by the handler for the REQUEST_COMPLETE_BATCH_SCRIPT
> > RPC which is sent by slurmstepd.
> >
> > I suggest looking at the node's slurmd log for this message:
> > > sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d
>
> Indeed, there's quite a number of occurrences (see attached).
> Here's that job info:
>
> # sacct -Xj 45564838 -o jobid,alloccpus,state,exitcode,start,nodelist
>        JobID  AllocCPUS      State ExitCode               Start        NodeList
> ------------ ---------- ---------- -------- ------------------- ---------------
> 45555232_158          8 CANCELLED+      0:0 2019-07-07T19:29:00        sh-07-08
>
> > Either the user is running slurmd as themselves, or there is some
> > kind of bug causing the RPC to be sent after the slurmd has forked into
> > slurmstepd as the user.
>
> Given the number of occurrences of the "sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d" message in the nodes'
> slurmd logs, I'd tend to think it's the latter:
>
> # cat /var/log/slurm/d/* | grep -c "sending REQUEST_COMPLETE_BATCH_SCRIPT, error"
> 238063
>
> That's 238,000 occurrences in about a day's worth of slurmd logs from about
> 1,400 nodes.

I think that's a red herring. I don't think that number of "sending
REQUEST_COMPLETE_BATCH_SCRIPT" printouts is necessarily indicative of an
error. This is the code that prints out this statement, from
slurmd/slurmstepd/mgr.c#_send_complete_batch_script_msg():

    info("sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%u status %d",
         err, status);

So "sending REQUEST_COMPLETE_BATCH_SCRIPT, error" is going to be printed
every time a REQUEST_COMPLETE_BATCH_SCRIPT RPC is sent from the slurmd when
a batch script finishes, even if error = 0.
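[Editor's note: a minimal sketch of how one could separate the benign error:0 sends from the failing ones when grepping a slurmd log. The log lines here are a fabricated sample shaped like the messages quoted above, and the /tmp path is arbitrary; this is illustrative only, not a SchedMD-provided procedure.]

```shell
#!/bin/sh
# Build a tiny sample log shaped like the slurmstepd messages above.
cat > /tmp/sample_slurmd.log <<'EOF'
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4027 status 0
Jul  9 11:21:00 sh-07-08 slurmstepd[14000]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Jul  9 11:22:00 sh-07-08 slurmstepd[14010]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
EOF

# Every batch-script completion logs this line, so count the total...
total=$(grep -c 'sending REQUEST_COMPLETE_BATCH_SCRIPT' /tmp/sample_slurmd.log)
# ...but only sends with a nonzero error code indicate a real problem.
failing=$(grep -c 'sending REQUEST_COMPLETE_BATCH_SCRIPT, error:[1-9]' /tmp/sample_slurmd.log)
echo "total=$total failing=$failing"
```

On the sample above this prints `total=3 failing=1`, which is the point Michael makes: the raw count of "sending REQUEST_COMPLETE_BATCH_SCRIPT" lines says nothing by itself.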
However, here is what I do think is important: grepping your slurmctld log
from bug 7361, it appears that only one user (322658) is involved with the
"non superuser" error:

    Jul  9 11:30:02 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564891
    Jul  9 11:30:05 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838
    Jul  9 11:30:17 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564891
    ...
    Jul  9 12:20:24 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838
    Jul  9 12:20:39 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838

Have you seen this "non superuser" error with any other users? Are you still
seeing this error in your slurmctld logs? If it's just with that user, do
you know what's different about their setup?

It appears that user 322658 is associated with the jobs they are trying to
complete, so I don't suspect anything malicious.
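[Editor's note: the rejection Michael describes can be sketched as a tiny shell model. This is a hypothetical simplification, not actual Slurm source: the function name, and the assumption that SlurmUser is root (uid 0), are made up for illustration. The idea is that slurmctld honors a batch-complete RPC only when it arrives with privileged credentials, because slurmstepd is expected to send it before dropping privileges to the job's user.]

```shell
#!/bin/sh
# Assumption for this sketch: SlurmUser is root, so only uid 0 is privileged.
SLURM_USER_UID=0

# Hypothetical stand-in for the controller's check on a batch-complete RPC.
validate_batch_complete() {
    sender_uid=$1
    job_id=$2
    if [ "$sender_uid" -ne 0 ] && [ "$sender_uid" -ne "$SLURM_USER_UID" ]; then
        # The path this controller hit on every retry from the stepd:
        echo "error: A non superuser $sender_uid tried to complete batch JobId=$job_id"
    else
        echo "JobId=$job_id completed"
    fi
}

validate_batch_complete 322658 45564838   # rejected: sent as the job's user
validate_batch_complete 0 45564838        # accepted: sent as root
```

Under this model, a stepd that has wrongly stayed switched to the user's uid will be rejected on every attempt, matching the repeated errors in the slurmctld log.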
Grepping for `13638` (the PID of the stepd for job 45564838) and `45564838`
in the slurmd log, I see this:

    $ rg 13638\|45564838 sh-07-08.log
    Jul  7 19:29:01 sh-07-08 slurmstepd[13631]: task/cgroup: /slurm/uid_322658/job_45564838: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
    Jul  7 19:29:01 sh-07-08 slurmstepd[13631]: task/cgroup: /slurm/uid_322658/job_45564838/step_extern: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
    Jul  7 19:29:01 sh-07-08 slurmd[3071]: Launching batch job 45564838 for UID 322658
    Jul  7 19:29:01 sh-07-08 slurmstepd[13638]: task/cgroup: /slurm/uid_322658/job_45564838: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
    Jul  7 19:29:01 sh-07-08 slurmstepd[13638]: task/cgroup: /slurm/uid_322658/job_45564838/step_batch: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
    Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: error: *** JOB 45564838 STEPD TERMINATED ON sh-07-08 AT 2019-07-09T11:20:34 DUE TO JOB NOT RUNNING ***
    Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: error: unlink(/var/spool/slurmd/job45564838/slurm_script): Permission denied
    Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4027 status 0
    Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294
    ... (repeats for MAX_RETRY (240) times in _send_complete_batch_script_msg())
    Jul  9 12:20:39 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294
    Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: Unable to send job complete message: Zero Bytes were transmitted or received
    Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: rmdir(/var/spool/slurmd/job45564838): Permission denied
    Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: Unable to unlink domain socket: Permission denied
    Jul  9 12:20:57 sh-07-08 slurmstepd[13638]: done with job
    Jul  9 12:20:58 sh-07-08 slurmd[36005]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job45564838/slurm_script

Somehow, after the `STEPD TERMINATED ON ... DUE TO JOB NOT RUNNING` error
(2 days after submission), the stepd must have erroneously switched from the
superuser to user 322658, and subsequently lost permission to perform Slurm
operations, including file operations. The stepd then sent the
REQUEST_COMPLETE_BATCH_SCRIPT RPC 240 times until it gave up, because the
controller rejected it every time with the "non superuser" error. (That is
consistent with the log: 240 retries between 11:20:35 and 12:20:54 works out
to roughly one attempt every 15 seconds for an hour.) I counted 230 "non
superuser" errors, so that alone accounts for 240 * 230 = 55,200 of the
238,000 occurrences of "sending REQUEST_COMPLETE_BATCH_SCRIPT".

That's all I've got for now. If you have any ideas on how to possibly
reproduce this, let me know. I'll continue to investigate. I wonder if this
has something to do with the "job not running" error.

Thanks!
-Michael

---

Comment 3 (Michael Hinton)

(In reply to Michael Hinton from comment #2)
> I counted 230 "non superuser" errors, so that alone accounts for
> 240 * 230 = 55,200 of the 238,000 occurrences of "sending
> REQUEST_COMPLETE_BATCH_SCRIPT".

Scratch that, that's not true. For every "sending
REQUEST_COMPLETE_BATCH_SCRIPT" and "Retrying job complete RPC for ..."
printout in the slurmd, there should be only one corresponding "non
superuser" error in the ctld.

---

Comment 4 (Kilian Cavalotti)

Hi Michael,

Thanks for investigating this.

(In reply to Michael Hinton from comment #2)
> Have you seen this "non superuser" error with any other users? Are you
> still seeing this error in your slurmctld logs? If it's just with that
> user, do you know what's different about their setup?

It seems to be happening for a variety of users, although not all of them:

    # zgrep "A non superuser" /var/log/slurm/ctld/archives/sh-sl01.log* | awk '{print $10}' | sort | uniq -c
          6
        482 249868
       3358 254732
       1687 273089
         86 274805
        662 283008
       5784 293564
       6810 297147
       2651 298495
       2410 320934
       3134 322104
       1205 322658
        322 323472
        723 326230
        482 326846
        241 329911
        241 333654
        482 336627
        723 340570
        241 344984
        708 345492
        241 361204
       3133 61594

> Somehow, after the `STEPD TERMINATED ON ...
> DUE TO JOB NOT RUNNING` error (2 days after submission), the stepd must
> have erroneously switched from the superuser to user 322658, and
> subsequently lost permission to do Slurm stuff, including file operations.

Is that uid switching something that is expected?

Cheers,
--
Kilian

---

Comment 5 (Michael Hinton)

(In reply to Kilian Cavalotti from comment #4)
> Is that uid switching something that is expected?

The stepd changes from root to the user (and I think back) with setuid() and
friends, so yes, this is expected. This is how Slurm runs processes on
behalf of the user without making them root. My guess is that somewhere this
got messed up (maybe we handled a failure incorrectly, and instead of
exiting, we continued as the user instead of root?). But at any rate, it
appears that the batch complete RPC was sent with the user's UID when it
should have been sent with the root UID. I'll keep looking into it and let
you know what I find.

Thanks,
-Michael

---

Comment 6 (Michael Hinton)

Hey Kilian,

I'm still trying to figure out how to reproduce this issue. Is it possible
to run slurmd at log level debug or debug2 when some of these jobs are
terminated? Are these always jobs that are cancelled by the user? Also, are
you seeing any nodes drain with the reason "Job is not running"?

Thanks,
-Michael

---

Comment 7 (Kilian Cavalotti)

Hi Michael,

(In reply to Michael Hinton from comment #6)
> I'm still trying to figure out how to reproduce this issue. Is it possible
> to run slurmd at log level debug or debug2 when some of these jobs are
> terminated?

I'm not too sure how to approach this, other than mass-enabling debug on all
our nodes' slurmd and waiting for those messages to appear.

> Are these always jobs that are cancelled by the user?

That doesn't seem to be the case. Actually, most of the jobs that generated
these messages exited on TIMEOUT. I tried to explore that lead, but it
doesn't look like jobs that exit on timeout will necessarily generate this
error.

> Also, are you seeing any nodes drain with the reason "Job is not running"?
Nope, I haven't seen nodes drained with that reason for a long time.

Cheers,
--
Kilian

---

Comment 8 (Michael Hinton)

(In reply to Kilian Cavalotti from comment #7)
> I'm not too sure how to approach this, other than mass-enabling debug on
> all our nodes' slurmd and waiting for those messages to appear.

I could prepare a debug patch that only prints information in the slurmd at
info level when this error occurs, if you were willing to run with it. And
you could selectively choose which slurmds to run the patch on.

> > Also, are you seeing any nodes drain with the reason "Job is not
> > running"?
>
> Nope, I haven't seen nodes drained with that reason for a long time.

That's really strange, because as far as I can tell, the code that is being
hit also tries to set the node to drain. Here's the code I'm looking at:

https://github.com/SchedMD/slurm/blob/0a6cd4121e403994695481a0c685bea59c4ef2ad/src/slurmd/slurmstepd/step_terminate_monitor.c#L145-L155

Maybe the node is failing to be drained due to the same permission errors,
which would explain why the update RPCs don't go through. But I would have
expected you to at least see some failed node updates in the controller
logs, and I don't remember seeing any.

Thanks,
-Michael

---

Comment 9 (Michael Hinton)

Hey Kilian,

Apparently, every batch job that encounters this error has a job step with
an oom-kill event due to the job exceeding cgroup memory restrictions. So
the job step dies, and this somehow puts the batch step in a weird state
when it finishes up. Knowing this, hopefully I'll be able to reproduce the
issue.

-Michael

---

Comment 10 (Michael Hinton)

Hi Kilian,

I apologize for the delay on this issue. Have you continued to see this
issue while you were on 18.08, and do you still see it in 19.05 after your
recent upgrade?

Thanks,
-Michael

---

Comment 11 (Kilian Cavalotti)

Hi Michael,

(In reply to Michael Hinton from comment #10)
> Have you continued to see this issue while you were on 18.08, and do you
> still see it in 19.05 after your recent upgrade?
Thanks for following up! Actually, I can't find any "A non superuser ...
tried to complete batch ..." messages anymore after the 19.05 update. I'll
continue checking, but this issue can probably be closed for now. I'll
reopen it if I see those messages again.

Thanks!
--
Kilian

---

Comment 12 (Michael Hinton)

Excellent! Please do reopen if you see this again.

Thanks,
-Michael