| Summary: | Jobs Scheduled But Isn't | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | Scheduling | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 16.05.10 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | --- | Version Fixed: | 17.02.4, 17.11.0-pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf, slurm log, Log for holy2a18306 | | |
Description
Paul Edmon 2017-04-07 09:25:30 MDT
The delayed state is still ongoing, so it appears to be the default behavior of the scheduler now. We do run a slurmctld prolog, but it is fast and its results appear normal. For instance, I currently have several jobs submitted:
[root@holy2a23301 kovac_lab]# /usr/bin/squeue -u root
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
85391328 kovac runscrip root R 2:54 1 holy2a23310
85391309 kovac runscrip root R 3:06 1 holy2a23310
85391322 kovac runscrip root R 3:06 1 holy2a23310
85391300 kovac runscrip root R 3:20 1 holy2a23310
85391295 kovac runscrip root R 3:27 1 holy2a23310
85391288 kovac runscrip root R 3:43 1 holy2a23310
85391282 kovac runscrip root R 4:18 1 holy2a23310
85391263 kovac runscrip root R 4:30 1 holy2a23310
85391248 kovac runscrip root R 4:32 1 holy2a23310
85391246 kovac runscrip root R 4:50 1 holy2a23310
85391242 kovac runscrip root R 4:59 1 holy2a23310
85391232 kovac runscrip root R 5:03 1 holy2a23310
85391224 kovac runscrip root R 5:05 1 holy2a23310
85391194 kovac runscrip root R 5:54 1 holy2a23310
but the node itself shows nothing running, nor have any logs from the jobs been written:
[root@holy2a23301 kovac_lab]# ls
runscript
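The mismatch above (squeue reporting jobs as running while the node shows nothing) can be spot-checked mechanically. A minimal sketch, using sample lines copied from the squeue output above; the parsing is the illustrative part:

```shell
# List job IDs that squeue reports as running (state column "R").
# The input lines are copied from the squeue output above.
running_jobs() {
  awk '$5 == "R" { print $1 }'
}

printf '%s\n' \
  '85391328 kovac runscrip root R 2:54 1 holy2a23310' \
  '85391194 kovac runscrip root R 5:54 1 holy2a23310' | running_jobs
```

Comparing this list against `ps` output on the node (for example, over ssh) would show which "running" jobs have no corresponding processes.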
Even though our prolog script has run and printed out the runscript as it was supposed to:
[root@holy-slurm02 328]# ls -ltr
total 16
-rw-r--r-- 1 slurm slurm_users 9187 Apr 7 10:57 85389328
-rw-r--r-- 1 slurm slurm_users 0 Apr 7 11:03 85390328
-rw-r--r-- 1 slurm slurm_users 2338 Apr 7 11:39 85391328
The timestamp there is the time the job was originally processed by the scheduler, so some delay is occurring between scheduling and the job actually running on the node.
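To put a number on that gap, one could compare the prolog file's mtime with the moment the job actually appears on the node. A small sketch using GNU date; the 11:39 timestamp is the prolog file mtime from the listing above, while the 12:04 launch time is purely hypothetical:

```shell
# Compute the scheduling-to-launch delay from two timestamps.
# 11:39 is the prolog file mtime above; 12:04 is a hypothetical launch time.
prolog_ts=$(date -d '2017-04-07 11:39:00' +%s)
launch_ts=$(date -d '2017-04-07 12:04:00' +%s)
echo "delay: $(( (launch_ts - prolog_ts) / 60 )) minutes"
```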
Just an update: we decided to downgrade our Slurm version from 16.05.10-2 to 16.05.9. We'll let you know if it reoccurs, but we are pretty convinced it is a bug in 16.05.10-2.

There are a few things that may explain this; the possible culprits between those two point releases are relatively few and far between. Would it be possible to get a larger chunk of the logs, and an updated version of your slurm.conf? The one I have on file is a bit dated, and I'd like to rule out a few possible complications. - Tim

Created attachment 4321 [details]
slurm.conf
Created attachment 4322 [details]
slurm log
Added our conf and a log of the last part of today, which includes the period during which there were delays. So far as we can tell, nothing obvious showed up in the logs regarding the delays; slurmctld saw everything operating fine. There was just a delay between when slurmctld said it launched the job and the job appearing on the node, as I excerpted before.

Paul, has this happened since the downgrade?

Nope. 16.05.9 is working as expected and no significant delays have been detected, so it is definitely something in 16.05.10-2 that was causing the issue. Whether it was a bug or a weird interaction with our setup, I can't really say; it seemed periodic, as it wasn't a constant effect. Really strange. It's entirely possible that this is unique to our environment and 16.05.10-2. I'm satisfied with running on 16.05.9 as a fix, as it seems to be working fine. Our plan was to upgrade to 17.02.x at our next maintenance window in June/July, so we will bypass the rest of 16.05 completely; as such, 16.05.10-2 won't be a concern for us anymore.

That sounds good, Paul. I don't think there is much in 16.05.10-2 that you will be missing out on. Would you mind if we drop the severity of the bug now that you at least have a workaround? I am assuming you downgraded your whole cluster (not just the slurmds) to 16.05.9 as well; is that correct? I would be interested in finding out what the cause of this situation is, though, as 17.02 will most likely contain that code. I believe I have narrowed it down to three commits that are at least suspect: f6d42fdbb, b988531d6, e58c22828. Do you happen to know whether the jobs that were delayed were interactive jobs (salloc) or batch jobs (sbatch)? Could you send your slurmd log from holy2a18306?

Created attachment 4340 [details]
Log for holy2a18306
Sure, I've just added the log for holy2a18306. It should cover the same period as the slurmctld log I sent. The lag affected both srun and sbatch; it didn't matter which. The delay, though, was only temporary: it would last for around 20-30 minutes, then disappear, then reappear seemingly at random. Since there were no errors in the logs and no obvious problems, I couldn't get a fix on why it was happening. There were no issues with our network either. When we downgraded to 16.05.9 we did it across the cluster, so slurmd, slurmctld, and slurmdbd are all now at 16.05.9. You can downgrade this bug report now, as we have a fix, which was swapping back to 16.05.9.

Paul, I am wondering what your upgrade plans for 17.02 are. I would be interested to see whether this is fixed there or not. As we haven't been able to reproduce this, we are at a loss for why it is happening; only so many lines changed from 16.05.9 to .10. A slightly less intrusive approach might be to patch your .9 with the commits mentioned in comment 9, one at a time, to see if they are indeed suspect or not. At the moment I am leaning towards f6d42fdbb or e58c22828. If you can reproduce on a test system, that would also be helpful, as you would be able to test 17.02 to see if things are fixed or not. Thanks!

As it stands, I was going to roll 17.02.3 on our test cluster some time this week. I will see if I can replicate the problem there. Our full production upgrade will probably be in a month or so.

Excellent, I will await your tests. I am hoping you can recreate it with 16.05.10 and then not with 17.02.3.

Paul, did you trip up on this one with 17.02?

Not in the testing I have done. So we are going to go ahead with the upgrade, and if it comes up again we will let you know.
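The one-commit-at-a-time bisection suggested above can be sketched as follows. This is a self-contained demo in a throwaway repository; on a real Slurm 16.05.9 tree the commits to cherry-pick would be the suspect hashes named above (f6d42fdbb, b988531d6, e58c22828), and the file and branch names here are purely illustrative:

```shell
# Demo of applying one suspect commit at a time onto a stable base.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email test@example.com
git config user.name test

# Base commit stands in for the 16.05.9 tree.
echo base > sched.c
git add sched.c && git commit -qm base
git branch v16.05.9

# A later commit stands in for one suspect change from 16.05.10.
echo suspect-change >> sched.c
git commit -qam "suspect commit"
suspect=$(git rev-parse HEAD)

# Back on the stable branch, apply just that one change and rebuild/test.
git checkout -q v16.05.9
git cherry-pick "$suspect" >/dev/null
grep -c suspect-change sched.c
```

After each cherry-pick one would rebuild and run the workload; the first commit that reintroduces the delay is the culprit.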
-Paul Edmon-

Sweet! Would you be OK with closing this bug and reopening it if it shows up again?

Yup, go ahead. -Paul Edmon-

Sounds good.