Ticket 7400

Summary: A non superuser tried to complete batch JobId
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmd Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7361
Site: Stanford Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: Sherlock CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: sh-07-08 slurmd log

Description Kilian Cavalotti 2019-07-11 15:58:34 MDT
Created attachment 10882 [details]
sh-07-08 slurmd log

This is branching off #7361

(In reply to Nate Rini from comment #20)
> That error is dumped by the handler for the REQUEST_COMPLETE_BATCH_SCRIPT
> RPC which is sent by slurmstepd. 
> 
> I suggest looking at the node's slurmd log for this message:
> > sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d

Indeed, there are quite a number of occurrences (see attached).
Here's that job info:

# sacct -Xj 45564838 -o jobid,alloccpus,state,exitcode,start,nodelist
       JobID  AllocCPUS      State ExitCode               Start        NodeList
------------ ---------- ---------- -------- ------------------- ---------------
45555232_158          8 CANCELLED+      0:0 2019-07-07T19:29:00        sh-07-08

> Either user is running slurmd as themselves or there is some
> kind of bug causing the RPC to be sent after the slurmd has forked into
> slurmstepd as the user.

Given the number of occurrences of the "sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d" in the nodes' slurmd logs, I'd tend to think it's the latter:

# cat /var/log/slurm/d/* | grep -c "sending REQUEST_COMPLETE_BATCH_SCRIPT, error"
238063

That's 238,000 occurrences in about a day's worth of slurmd logs from about 1,400 nodes.

Thanks!
-- 
Kilian
Comment 1 Michael Hinton 2019-07-22 09:22:07 MDT
Hey Kilian,

Thanks for the bug report. I'm looking into it.

-Michael
Comment 2 Michael Hinton 2019-07-22 15:43:24 MDT
Hey Kilian,

(In reply to Kilian Cavalotti from comment #0)
> (In reply to Nate Rini from bug 7361 comment #20)
> > That error is dumped by the handler for the REQUEST_COMPLETE_BATCH_SCRIPT
> > RPC which is sent by slurmstepd. 
> > 
> > I suggest looking at the node's slurmd log for this message:
> > > sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d
> 
> Indeed, there's quite a number of occurrences (see attached).
> Here's that job info:
> 
> # sacct -Xj 45564838 -o jobid,alloccpus,state,exitcode,start,nodelist
>        JobID  AllocCPUS      State ExitCode               Start       
> NodeList
> ------------ ---------- ---------- -------- -------------------
> ---------------
> 45555232_158          8 CANCELLED+      0:0 2019-07-07T19:29:00       
> sh-07-08
> 
> > Either user is running slurmd as themselves or there is some
> > kind of bug causing the RPC to be sent after the slurmd has forked into
> > slurmstepd as the user.
> 
> Given the number of occurrences of the "sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:%d status:%d" in the nodes' slurmd
> logs, I'd tend to think it's the latter:
> 
> # cat /var/log/slurm/d/* | grep -c "sending REQUEST_COMPLETE_BATCH_SCRIPT,
> error
> 238063
> 
> That's 238,000 occurrences in about a day worth of slurmd logs from about
> 1,400 nodes.
I think that’s a red herring. I don't think that number of "sending REQUEST_COMPLETE_BATCH_SCRIPT" printouts is necessarily indicative of an error.

This is the code that prints out this statement, from slurmd/slurmstepd/mgr.c#_send_complete_batch_script_msg():

	info("sending REQUEST_COMPLETE_BATCH_SCRIPT, error:%u status %d",
	     err, status);

So "sending REQUEST_COMPLETE_BATCH_SCRIPT, error" is going to be printed out every time a REQUEST_COMPLETE_BATCH_SCRIPT RPC is sent from the slurmd as a batch script finishes, even if error = 0.
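As a quick sanity check, the routine error:0 sends can be filtered out so only failing ones are counted. This is a rough sketch, not a definitive diagnostic; the sample lines below are hypothetical stand-ins for the real logs under /var/log/slurm/d/:

```shell
# Sketch: count only REQUEST_COMPLETE_BATCH_SCRIPT sends with a non-zero
# error code, assuming the "error:<n> status <n>" format shown above.
# The sample file is a hypothetical stand-in for the real slurmd logs.
cat > /tmp/slurmd_sample.log <<'EOF'
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4027 status 0
Jul  9 11:21:00 sh-07-08 slurmstepd[14000]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
EOF
grep "sending REQUEST_COMPLETE_BATCH_SCRIPT" /tmp/slurmd_sample.log \
    | grep -vc "error:0 "
```

On the real logs, replacing the sample file with /var/log/slurm/d/* would count genuinely failing sends rather than all sends.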

However, here is what I do think is important:

Grepping your slurmctld log from bug 7361, it appears that there is only one user (322658) involved with the "non superuser" error:

Jul  9 11:30:02 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564891
Jul  9 11:30:05 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838
Jul  9 11:30:17 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564891
...
Jul  9 12:20:24 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838
Jul  9 12:20:39 sh-sl01 slurmctld[181294]: error: A non superuser 322658 tried to complete batch JobId=45564838

Have you seen this “non superuser” error with any other users? Are you still seeing this error in your slurmctld logs? If it’s just with that user, do you know what’s different about their setup?

It appears that user 322658 is associated with the jobs that they are trying to complete, so I don't suspect anything malicious. Grepping for `13638` (the PID of the stepd for job 45564838) and `45564838` from the slurmd log, I see this:


$ rg 13638\|45564838 sh-07-08.log
Jul  7 19:29:01 sh-07-08 slurmstepd[13631]: task/cgroup: /slurm/uid_322658/job_45564838: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
Jul  7 19:29:01 sh-07-08 slurmstepd[13631]: task/cgroup: /slurm/uid_322658/job_45564838/step_extern: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
Jul  7 19:29:01 sh-07-08 slurmd[3071]: Launching batch job 45564838 for UID 322658
Jul  7 19:29:01 sh-07-08 slurmstepd[13638]: task/cgroup: /slurm/uid_322658/job_45564838: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
Jul  7 19:29:01 sh-07-08 slurmstepd[13638]: task/cgroup: /slurm/uid_322658/job_45564838/step_batch: alloc=32000MB mem.limit=32000MB memsw.limit=32000MB
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: error: *** JOB 45564838 STEPD TERMINATED ON sh-07-08 AT 2019-07-09T11:20:34 DUE TO JOB NOT RUNNING ***
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: error: unlink(/var/spool/slurmd/job45564838/slurm_script): Permission denied
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4027 status 0
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294

...repeats for MAX_RETRY (240) times in _send_complete_batch_script_msg()

Jul  9 12:20:39 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294
Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: Unable to send job complete message: Zero Bytes were transmitted or received
Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: rmdir(/var/spool/slurmd/job45564838): Permission denied
Jul  9 12:20:54 sh-07-08 slurmstepd[13638]: error: Unable to unlink domain socket: Permission denied
Jul  9 12:20:57 sh-07-08 slurmstepd[13638]: done with job
Jul  9 12:20:58 sh-07-08 slurmd[36005]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job45564838/slurm_script


Somehow, after the `STEPD TERMINATED ON ... DUE TO JOB NOT RUNNING` error (2 days after submission), the stepd must have erroneously switched from the superuser to user 322658, and subsequently lost permission to do Slurm stuff, including file operations.

The stepd sent the REQUEST_COMPLETE_BATCH_SCRIPT RPC 240 times until it gave up, because the controller was rejecting it every time with the "non superuser" error. I counted 230 "non superuser" errors, so that alone accounts for 240*230 = 55,200 out of the 238,000 occurrences of "sending REQUEST_COMPLETE_BATCH_SCRIPT". 
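To see how the retries distribute per job, the retry lines can be tallied by job/step ID and compared against the MAX_RETRY (240) cap. A minimal sketch, with hypothetical sample lines standing in for sh-07-08.log:

```shell
# Tally "Retrying job complete RPC" lines per job/step ID. The last
# awk field ($NF) is the jobid.stepid token at the end of each line.
# Sample lines are hypothetical stand-ins for the attached slurmd log.
cat > /tmp/retry_sample.log <<'EOF'
Jul  9 11:20:35 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294
Jul  9 11:20:50 sh-07-08 slurmstepd[13638]: Retrying job complete RPC for 45564838.4294967294
Jul  9 11:21:05 sh-07-08 slurmstepd[13640]: Retrying job complete RPC for 45564891.4294967294
EOF
grep "Retrying job complete RPC" /tmp/retry_sample.log \
    | awk '{print $NF}' | sort | uniq -c
```

On the real log, a job whose count hits 240 is one where the stepd exhausted its retries and gave up.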

That's all I got for now. If you have any ideas on how to possibly reproduce this, let me know. I'll continue to investigate. I wonder if this has something to do with the "job not running" error.

Thanks!
-Michael
Comment 3 Michael Hinton 2019-07-22 15:54:37 MDT
(In reply to Michael Hinton from comment #2)
> I counted 230 "non superuser" errors, so that alone
> accounts for 240*230 = 55,200 out of the 238,000 occurrences of "sending
> REQUEST_COMPLETE_BATCH_SCRIPT". 
Scratch that, that's not true. For every "sending REQUEST_COMPLETE_BATCH_SCRIPT" and "Retrying job complete RPC for ..." printout in the slurmd, there should be only one corresponding "non superuser" error in the ctld.
Comment 4 Kilian Cavalotti 2019-07-22 16:02:31 MDT
Hi Michael, 

Thanks for investigating this.

(In reply to Michael Hinton from comment #2)

> Have you seen this “non superuser” error with any other users? Are you still
> seeing this error in your slurmctld logs? If it’s just with that user, do
> you know what’s different about their setup?

It seems to be happening for a variety of users, although not all of them:

# zgrep "A non superuser" /var/log/slurm/ctld/archives/sh-sl01.log* | awk '{print $10}' | sort | uniq -c
      6
    482 249868
   3358 254732
   1687 273089
     86 274805
    662 283008
   5784 293564
   6810 297147
   2651 298495
   2410 320934
   3134 322104
   1205 322658
    322 323472
    723 326230
    482 326846
    241 329911
    241 333654
    482 336627
    723 340570
    241 344984
    708 345492
    241 361204
   3133 61594

> Somehow, after the `STEPD TERMINATED ON ... DUE TO JOB NOT RUNNING` error (2
> days after submission), the stepd must have erroneously switched from the
> superuser to user 322658, and subsequently lost permission to do Slurm
> stuff, including file operations.

Is that uid switching something that is expected?

Cheers,
-- 
Kilian
Comment 5 Michael Hinton 2019-07-23 10:49:29 MDT
(In reply to Kilian Cavalotti from comment #4)
> Is that uid switching something that is expected?
The stepd changes from root to the user (and I think back) with setuid() and friends, so yes, this is expected. This is how Slurm runs processes on behalf of the user without making them root. It's my guess that somewhere this got messed up (maybe we handled a failure wrongly, and instead of exiting, we continued as the user instead of root?). But at any rate, it appears that the BATCH COMPLETE RPC was sent with the user UID when it should have been the root UID.

I'll keep looking into it and let you know what I find.

Thanks,
-Michael
Comment 6 Michael Hinton 2019-07-26 17:09:40 MDT
Hey Kilian,

I'm still trying to figure out how to reproduce this issue. Is it possible to run slurmd at log level debug or debug2 when some of these jobs are terminated? Are these always with jobs that are cancelled by the user?

Also, are you seeing any nodes drain with the reason "Job is not running”?

Thanks,
-Michael
Comment 7 Kilian Cavalotti 2019-07-26 17:37:59 MDT
Hi Michael, 

(In reply to Michael Hinton from comment #6)
> I'm still trying to figure out how to reproduce this issue. Is it possible
> to run slurmd at log level debug or debug2 when some of these jobs are
> terminated? 

I'm not too sure how to approach this, other than mass-enabling debug on all our nodes' slurmd and waiting for those messages to appear?

> Are these always with jobs that are cancelled by the user?

That doesn't seem to be the case. Actually, most of the jobs that generated these messages exited on TIMEOUT.

I tried to explore that lead, but it doesn't look like jobs that exit on timeout will necessarily generate this error.

> Also, are you seeing any nodes drain with the reason "Job is not running”?

Nope, I haven't seen nodes drained with that reason for a long time.

Cheers,
-- 
Kilian
Comment 8 Michael Hinton 2019-07-29 17:34:00 MDT
(In reply to Kilian Cavalotti from comment #7)
> I'm not too sure how to approach this, other than mass-enabling debug on all
> our nodes' slurmd, and wait for those messages to appear?
I could prepare a debug patch that only prints stuff out in the slurmd at info level when this error occurs, if you were willing to run with it. And you could selectively choose which slurmds to run the patch on.

> > Also, are you seeing any nodes drain with the reason "Job is not running”?
> 
> Nope, I haven't seen nodes drained with that reason for a long time.
That's really strange, because the code that is being hit also tries to set the node to drain, as far as I can tell.

Here’s the code I’m looking at: https://github.com/SchedMD/slurm/blob/0a6cd4121e403994695481a0c685bea59c4ef2ad/src/slurmd/slurmstepd/step_terminate_monitor.c#L145-L155

Maybe the node is failing to be drained due to the same permission errors, which would explain why the update RPCs don't go through. But I thought you would at the very least have seen some failed node updates in the controller logs, and I don't remember seeing any.
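For reference, a drain with that reason should be visible in `sinfo -R` (which lists drain reasons per node). A small self-contained sketch; the heredoc below mimics hypothetical `sinfo -R` output so it runs anywhere, and is not real output from Sherlock:

```shell
# Check for nodes drained with a "Job is not running" reason.
# On a live cluster this would be:  sinfo -R | grep -ic "not running"
# The sample file below is a hypothetical stand-in for that output.
cat > /tmp/sinfo_R_sample.txt <<'EOF'
REASON               USER      TIMESTAMP           NODELIST
Job is not running   root      2019-07-09T11:20:35 sh-07-08
Low RealMemory       root      2019-07-01T08:00:00 sh-03-12
EOF
grep -ic "not running" /tmp/sinfo_R_sample.txt
```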

Thanks,
-Michael
Comment 9 Michael Hinton 2019-07-30 14:19:05 MDT
Hey Kilian,

Apparently, every batch job that is encountering this error has a job step with an `oom-kill event` due to the job exceeding cgroup memory restrictions. So the job step dies, and this somehow puts the batch step in a weird state when it finishes things up. Knowing this, hopefully I'll be able to reproduce the issue.

-Michael
Comment 10 Michael Hinton 2019-10-18 10:39:50 MDT
Hi Kilian, I apologize for the delay on this issue.

Have you continued to see this issue while you were on 18.08, and do you still see this in 19.05 after your recent upgrade?

Thanks,
-Michael
Comment 11 Kilian Cavalotti 2019-10-18 10:54:25 MDT
Hi Michael, 

(In reply to Michael Hinton from comment #10)
> Hi Kilian, I apologize for the delay on this issue.
> 
> Have you continued to see this issue while you were on 18.08, and do you
> still see this in 19.05 after your recent upgrade?

Thanks for following up!

Actually, I can't find any "A non superuser ... tried to complete batch..." message anymore after the 19.05 update.

I'll continue checking, but that issue can probably be closed for now. I'll reopen if I see those again.

Thanks!
-- 
Kilian
Comment 12 Michael Hinton 2019-10-18 10:59:38 MDT
Excellent! Please do reopen if you see this again.

Thanks,
-Michael