| Summary: | Hundreds of "runaway" jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Mark Bartelt <mark> |
| Component: | slurmdbd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | naveed |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7383, https://bugs.schedmd.com/show_bug.cgi?id=8131 | | |
| Site: | Caltech | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Output of "sdiag" command | | |
Description
Mark Bartelt
2019-12-17 11:37:23 MST
Marshall Garey:

There are multiple possibilities for how database updates might be dropped. A few:

- slurmdbd losing its connection to slurmctld
- mysql timeouts
- slurmdbd crashes
- slurmctld crashes
- Slurm bugs (*)

(*) One bug that caused runaways was fixed in 19.05.1, and I'm not currently aware of any other bug that causes runaways (though of course that doesn't mean there are none).

Could you upload the slurmctld and slurmdbd logs for a period of time when these runaways occurred, and let me know some of the runaway job IDs from that period? Can you also upload the output of sdiag?

In particular, if you ever see this message in the slurmctld log file:

"slurmdbd: agent queue filling (%d), RESTART SLURMDBD NOW"

things are in trouble, and if you ever see this message in the slurmctld log file:

"slurmdbd: agent queue is full (%u), discarding %s:%u request"

you've definitely lost database updates.

Mark Bartelt:

Re your request to "upload the slurmctld and slurmdbd logs for a period of time where these runaways occurred": that's not possible, since we have logs going back only as far as the early morning of December 9th, but the most recent runaway jobs date from the 3rd.

Well, unless older logs are getting copied somewhere and then deleted from /var/log; I don't think that's happening, but I'll ask our Slurm guru to confirm.

None of the logs that we do have contain any messages about "agent queue filling" or "agent queue full".

I'll attach the output of "sdiag".

A question or two: When I do "sacctmgr show RunawayJobs", it asks whether I'd like to fix those runaway jobs, and it goes on to say:

"This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed."

Of the three, wouldn't the start time always be the latest?
Anyway, the main thing that I wanted to confirm with you is: If, when I do "sacctmgr show RunawayJobs", I answer "y" to the question it asks, can we be guaranteed that the runaway jobs are the only ones which will have their accounting info mucked with? I.e., all other jobs' information will remain as it is?

Created attachment 12597 [details]
Output of "sdiag" command
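The two slurmctld log messages Marshall warns about can be watched for with a simple grep. A minimal sketch, assuming the log lives at /var/log/slurmctld.log (the path is an assumption; adjust for your site):

```shell
# Scan the slurmctld log for the two slurmdbd agent-queue warnings
# quoted above. The log path below is an assumption, not a Slurm default.
LOG="${1:-/var/log/slurmctld.log}"

if grep -E 'agent queue (filling|is full)' "$LOG" 2>/dev/null; then
    echo "WARNING: slurmdbd agent queue problems found in $LOG" >&2
else
    echo "no agent queue warnings in $LOG"
fi
```

The pattern matches both the "agent queue filling" warning (restart slurmdbd now) and the "agent queue is full" message (updates already lost), so either one trips the alert.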
Marshall Garey:

(In reply to Mark Bartelt from comment #4)

> Re your request to "upload the slurmctld and slurmdbd logs for a period of time where these runaways occurred": that's not possible, since we have logs going back only as far as the early morning of December 9th, but the most recent runaway jobs date from the 3rd.
>
> Well, unless older logs are getting copied somewhere and then deleted from /var/log; I don't think that's happening, but I'll ask our Slurm guru to confirm.

That's too bad. Next time you see some new runaways occur, save the slurmctld and slurmdbd logs from that time period (if they still exist). That will help us find out what's causing them.

> None of the logs that we do have contain any messages about "agent queue filling" or "agent queue full".

Good!

> I'll attach the output of "sdiag".

Thanks.

> A question or two: When I do "sacctmgr show RunawayJobs", it asks whether I'd like to fix those runaway jobs, and it goes on to say:
>
> "This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed."
>
> Of the three, wouldn't the start time always be the latest?

That actually used to be the behavior. However, it was a bug: the start time is set to 0 until the job starts, and if the job is ineligible (for example, held or dependent), the eligible time is also set to 0. If an ineligible pending job becomes runaway - for example, it gets cancelled but the job cancel update to the database gets dropped - then both the job's start and eligible times are 0. If you fix runaways in that instance, then the <clustername>_last_ran_table in the database is set to 0 and the database re-rolls from time 0 (the UNIX epoch). There was actually one site where this happened, and it took a couple of months to re-roll their database. The job submit time is never zero, however, so we include it in that list.
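The end-time rule described above ("latest out of the start, eligible, or submit times") is just a max over three timestamps, and the reason submit time is included is exactly the zero-timestamp case. A minimal sketch with made-up illustrative values (not real job data):

```shell
# Pick the end time for a runaway job the way sacctmgr describes:
# the latest of its start, eligible, and submit times. For an
# ineligible pending job that went runaway, start and eligible are
# both 0, so the submit time (never zero) keeps the end time off
# the UNIX epoch. The values below are illustrative assumptions.
submit=1575360000
eligible=0   # job was held/dependent, so never became eligible
start=0      # job never started

end=$submit
if [ "$eligible" -gt "$end" ]; then end=$eligible; fi
if [ "$start" -gt "$end" ]; then end=$start; fi

echo "end time: $end"
```

With the old start-time-only behavior, this job's end time would have been 0, which is what dragged `<clustername>_last_ran_table` back to the epoch.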
> Anyway, the main thing that I wanted to confirm with you is: If, when I do "sacctmgr show RunawayJobs", I answer "y" to the question it asks, can we be guaranteed that the runaway jobs are the only ones which will have their accounting info mucked with? I.e., all other jobs' information will remain as it is?

Yes, when the next rollup happens. When fixing runaways, we figure out when we need to re-roll usage from. Then we delete usage from the various assoc and wckey usage tables in the database. When the database re-rolls (which happens every hour), that usage is recalculated and filled into the proper database tables. So just wait until the next hourly rollup and you should see all the usage.

If the usage isn't properly re-rolled for some reason, then that's a bug. If that happens (I don't expect it to), I'll give you a few mysql commands to run to get the usage re-rolled.

Marshall Garey:

Did the usage re-roll correctly? (*)

(*) The fixed runaway jobs will have 0 as their usage; the rest of the jobs should contribute to usage normally.

sdiag didn't show anything strange. I'm afraid that without logs I can't find out what might have caused the runaways. Did more runaways happen? If so, can you upload slurmctld and slurmdbd logs? If not, can we close this bug as timedout? You can re-open it if runaways happen again.

Mark Bartelt:

Things seem OK after using "sacctmgr show RunawayJobs" to get rid of all those runaway jobs. We haven't had any recurrence of this problem. I also have a short script that runs as an hourly cron job to check whether any new runaway jobs have appeared. So if this problem occurs again, we'll know shortly afterward, and we'll be back in touch. But for now, I think it's OK to close this one.

Marshall Garey:

Sounds good. I'll close it as cannotreproduce for now.
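A watchdog like the one Mark describes can be a one-screen script driven by cron. His actual script isn't shown, so everything below is an assumed sketch: the install path, the mail recipient, and the output-parsing heuristic are all guesses, not Slurm-documented behavior.

```shell
# check_runaways.sh - hourly watchdog in the spirit of Mark's cron job
# (an assumed sketch, not his actual script). Install with a crontab
# entry such as:
#   0 * * * * /usr/local/sbin/check_runaways.sh
ADMIN="root"    # assumed mail recipient

# "sacctmgr show runawayjobs" prompts before fixing anything, so feed
# it "n" to inspect the list without touching the database.
out=$(printf 'n\n' | sacctmgr show runawayjobs 2>/dev/null || true)

# Heuristic (an assumption about the output format): any line that
# starts with a digit is taken to be a runaway job row.
if printf '%s\n' "$out" | grep -q '^[0-9]'; then
    printf '%s\n' "$out" | mail -s "runaway Slurm jobs detected" "$ADMIN" || true
fi
```

Answering "n" is the important design choice: it keeps the check read-only, so an admin still decides interactively whether to let sacctmgr rewrite the runaway jobs' end times.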