Description
Stephane Thiell
2019-03-21 18:57:34 MDT
slurmctld binary and core dumps made available at https://stanford.box.com/s/o2thc6wx92igd9zmvp46b4anqd5qbaw7

Let me know if you need anything else. My colleague Kilian should be online shortly. Thanks for your assistance.

Service logs show:

Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458398 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458398 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458399 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458399 Assoc=13174
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458402 Assoc=3796
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458403 StepId=0
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458403 Assoc=2932
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39448495_3(39459510) Assoc=7821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_895(39458407) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_895(39458407) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_896(39458408) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_897(39458409) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_897(39458409) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_898(39458410) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39019164_899(39459610) StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39019164_899(39459610) Assoc=9368
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: recovered JobId=39458411 StepId=Extern
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:51:55 sh-sl01.int slurmctld[163629]: Recovered JobId=39458414 Assoc=11509
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service: main process exited, code=dumped, status=6/ABRT
Mar 21 17:51:55 sh-sl01.int systemd[1]: Unit slurmctld.service entered failed state.
Mar 21 17:51:55 sh-sl01.int systemd[1]: slurmctld.service failed.

Created attachment 9662 [details]
"thread apply all bt full" output
It looks like the crash happens in GRES functions, on job 39458415. The thing is that job didn't request any GPU; here's the submission script:

#!/bin/bash
#
#BATCH --job-name=test
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=2G

srun /usr/bin/python2.7/python.exe ./SDDLPY/main_in.py

Yet, in the "scontrol show job" output we recorded in our prolog, it seems like it has TresPerTask=gpu:1

JobId=39458415 JobName=my_bash
   UserId=gaddiel(326207) GroupId=rzia(324517) MCS_label=N/A
   Priority=11442 Nice=0 Account=rzia QOS=normal
   JobState=CONFIGURING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-21T16:15:29 EligibleTime=2019-03-21T16:15:29
   AccrueTime=2019-03-21T16:15:30
   StartTime=2019-03-21T17:09:37 EndTime=2019-03-21T17:19:37 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-21T17:09:37
   Partition=normal AllocNode:Sid=sh-ln07:53306
   ReqNodeList=(null) ExcNodeList=(null) NodeList=sh-101-36
   BatchHost=sh-101-36
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=sh-101-36 CPU_IDs=4 Mem=2048 GRES_IDX=gpu(IDX:)
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/gaddiel/my_bash
   WorkDir=/home/users/gaddiel
   StdErr=/home/users/gaddiel/slurm-39458415.out
   StdIn=/dev/null
   StdOut=/home/users/gaddiel/slurm-39458415.out
   Power=
   TresPerTask=gpu:1

In any case, we're in urgent need of a way to restart the controller... Thanks!
-- Kilian

> The thing is that job didn't request any GPU, here's the submission script:

It does, right here:

> #SBATCH --gpus-per-task=1

Which then matches up correctly with:

> Yet, in the "scontrol show job" output we recorded in our prolog, it seems
> like it has TresPerTask=gpu:1

We're looking into how to patch around this without dropping the queue.

(In reply to Tim Wickberg from comment #5)
> > The thing is that job didn't request any GPU, here's the submission script:
>
> It does, right here:
>
> > #SBATCH --gpus-per-task=1

Argh, sorry I missed that, I'm too used to seeing --cpus-per-task; that's pretty unusual for our users. Especially since the partition this was submitted to does not have any GRES, I wasn't expecting that.

> We're looking into how to patch around this without dropping the queue.

Thanks!!
Cheers,
-- Kilian
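For comparison, the script the user presumably intended differs only in the one directive discussed above; this corrected version is an assumption based on that exchange, not something taken from the user's actual submission:

    #!/bin/bash
    #
    #SBATCH --job-name=test
    #
    #SBATCH --time=10:00
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=2G

    # Same workload as before; only --gpus-per-task was swapped for
    # --cpus-per-task (and the misspelled #BATCH directive corrected).
    srun /usr/bin/python2.7/python.exe ./SDDLPY/main_in.py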
Created attachment 9663 [details]
workaround initial crash
Can you apply this and restart the slurmctld?
This should move where the crash happens if nothing else, although I'll caution that you may very well still see another crash.
Logs from slurmctld would be nice to have.
And did you upgrade slurmctld or other parts of your system since that job was submitted? I have not been able to reproduce this yet, and I'm wondering if this job snuck in under an older slurmctld version.
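In practice, applying the attached workaround would look roughly like the following. This is a sketch only: the patch file name and patch level are placeholders, and the build/install steps depend entirely on how the site deploys Slurm.

    # Sketch only -- patch file name and -p level are assumptions,
    # and build/install steps vary by site (assumes a configured source tree).
    cd slurm-18.08.6
    patch -p1 < workaround_initial_crash.patch   # attachment 9663
    make && make install
    systemctl restart slurmctld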
Hi Tim,

(In reply to Tim Wickberg from comment #9)
> Created attachment 9663 [details]
> workaround initial crash
>
> Can you apply this and restart the slurmctld?

Thanks, doing that now.

> This should move where the crash happens if nothing else, although I'll
> caution that you may very well still see another crash.
>
> Logs from slurmctld would be nice to have.

Ok, will send those back if it crashes again.

> And did you upgrade slurmctld or other parts of your system since that job
> was submitted? I have not been able to reproduce this yet, and I'm wondering
> if this job snuck in under an older slurmctld version.

Nope, we've moved to 18.08.6 pretty much on its release date.

And I just realized: --gpus-per-task is not a 18.08.x option, is it? It's not in the sbatch/srun man page. I assume it's coming with 19.05? It's clearly a typo from the user, given the lack of documentation and the absence of GPUs in their partition, but nonetheless, the job went in anyway.

Anyway, patching now.
Cheers,
-- Kilian

All right, patch applied (with "return 0;", because SLURM_SUCCESS is apparently not defined there), and the controller was able to start! It logged this:

Mar 21 20:10:08 sh-sl01 slurmctld[170018]: _sync_nodes_to_comp_job: JobId=39458415 in completing state
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_resources_node_inx_to_cpu_inx: Invalid node_inx
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: error: job_update_tres_cnt: problem getting offset of JobId=39458415
Mar 21 20:10:08 sh-sl01 slurmctld[170018]: cleanup_completing: JobId=39458415 completion process took 10825 seconds

And the same thing for a few other job ids. It looks like things are working now, thanks! Anything else we should look for? Should we cancel those jobs if they risk making the controller crash in other places?

Thanks thanks thanks!
-- Kilian

> Nope, we've moved to 18.08.6 pretty much on its release date.
> And I just realized: --gpus-per-task is not a 18.08.x option, is it? It's
> not in the sbatch/srun man page. I assume it's coming with 19.05?

Yeah... it's used for cons_tres, which is coming in 19.05, but had been added into 18.08 at one point (and was intentionally left undocumented). A typo between 'c' and 'g' makes sense there.

> It's clearly a typo from the user, given the lack of documentation and
> the absence of GPUs in their partition, but nonetheless, the job went in
> anyway.

Ah... no GPUs available in the partition might be a good lead; I did not test that when quickly trying to reproduce.

Someone will follow up with you further tomorrow during normal hours - slurmctld logs from right before the crash would likely be of use if you can upload them sometime.

- Tim

(In reply to Tim Wickberg from comment #12)
> Someone will follow up with you further tomorrow during normal hours -
> slurmctld logs from right before the crash would likely be of use if you can
> upload them sometime.

Great, thanks again Tim, you saved our day! I've dropped the severity and I'll send the logs shortly.

Cheers,
-- Kilian
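As a side note on the question above about spotting other jobs in the same state, the TresPerTask field shown in the scontrol output earlier can be grepped for directly; a rough sketch (the exact filtering is an assumption, not a SchedMD recommendation):

    # List jobs whose scontrol record carries a per-task GPU request,
    # using the same TresPerTask field seen in the output above.
    scontrol -o show job | grep "TresPerTask=gpu" | grep -o "JobId=[0-9]*"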
Created attachment 9664 [details]
slurmctld logs
Here's how things happened:
1. a slurmctld restart was initiated around 17:32, to remove a few nodes from the configuration (the sh-04-xx nodes referenced later on in the log)
Mar 21 17:34:06 sh-sl01 slurmctld[162518]: slurmctld version 18.08.6-2 started on cluster sherlock
2. slurmctld stopped successfully, and the problematic job (39458415) was already in the queue: it had been submitted about an hour earlier, at 16:15:29
3. at 17:34:16, during the job recovery phase, slurmctld crashed with the backtrace provided earlier
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: recovered JobId=39458411 StepId=Extern
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458411 Assoc=16821
Mar 21 17:34:16 sh-sl01 slurmctld[162518]: Recovered JobId=39458414 Assoc=11509
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: slurmctld version 18.08.6-2 started on cluster sherlock
The next line in the log is a subsequent restart attempt.
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: job_submit.lua: job_submit: initialized
Mar 21 17:48:51 sh-sl01 slurmctld[163005]: error: _shutdown_bu_thread:send/recv sh-sl02: Connection refused
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: No memory enforcing mechanism configured.
Mar 21 17:48:56 sh-sl01 slurmctld[163005]: layouts: no layout to initialize
4. we tried a few more restarts after putting sh-04-xx back in the config, but the error was the same.
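Putting the details above together, a reproduction attempt would presumably look something like the following. This is an untested sketch: the partition name is a placeholder, and it assumes an 18.08.x test system whose partition has no GRES configured and that sbatch accepts --gpus-per-task on the command line as it did in the batch script.

    # Untested sketch of the apparent trigger described above:
    # a --gpus-per-task request on a partition with no GRES configured,
    # followed by a slurmctld restart while the job is in the system.
    sbatch --partition=normal --ntasks=1 --gpus-per-task=1 --wrap "sleep 600"
    # once the job has started:
    systemctl restart slurmctld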
Hi,

Could you send me slurmd.log from sh-06-28?

Dominik

Created attachment 9669 [details]
sh-06-28 log

Hi Dominik,

> Could you send me slurmd.log from sh-06-28?

Sure, here it is!
Cheers,
-- Kilian

Hi,

I am able to reproduce this and will now be working on understanding what is going on there.

Dominik

Awesome, thanks!
Cheers,

Hi Dominik,

I'm wondering if you have any update on this bug, specifically whether your patch has been merged. We currently apply it locally on 18.08.6 and would like to know if we still need to carry this patch over to 18.08.7.

Thanks!
-- Kilian

Hi,

Sorry, but it hasn't been merged into 18.08.7. This commit also fixes this issue, but it is only in master:
https://github.com/SchedMD/slurm/commit/89fdeaede7c

Dominik

(In reply to Dominik Bartkiewicz from comment #25)
> Sorry, but it hasn't been merged into 18.08.7. This commit also fixes this
> issue, but it is only in master:
> https://github.com/SchedMD/slurm/commit/89fdeaede7c

No worries, thanks for the info! I'll keep your patch from this ticket on 18.08.7 for now, then.

Cheers,
-- Kilian

Hi,

This commit fixes this issue in 18.08:
https://github.com/SchedMD/slurm/commit/4c48a84a6edb

I'm closing this bug as resolved/fixed. Please reopen if you have additional bugs/problems.

Dominik

(In reply to Dominik Bartkiewicz from comment #28)
> This commit fixes this issue in 18.08:
> https://github.com/SchedMD/slurm/commit/4c48a84a6edb
> I'm closing this bug as resolved/fixed. Please reopen if you have additional
> bugs/problems.

Thank you!
Cheers,
-- Kilian

*** Ticket 6923 has been marked as a duplicate of this ticket. ***
*** Ticket 8006 has been marked as a duplicate of this ticket. ***
*** Ticket 8591 has been marked as a duplicate of this ticket. ***