A user is reporting jobs blocked by AssocGrpNodeRunMinutes even though they shouldn't be. There are a lot of similar bug reports, with a lot of different solutions:

https://bugs.schedmd.com/show_bug.cgi?id=4653 - solved by changing partition priority tiers (setting bf_continue might have worked, too).

https://bugs.schedmd.com/show_bug.cgi?id=5425 - issue was possibly due to changing sacctmgr settings while jobs were running. Running `scontrol reconfigure` seemed to help. Marked as a duplicate of 5375, though it's not clear to me that either was root-caused.

https://bugs.schedmd.com/show_bug.cgi?id=5375 - workaround was to run `scontrol reconfigure` regularly to clean up bad data in sshare. Resulted in a bugfix for hetjobs (1da9c5d03c97b0a1411491d911a6e7b10eeb13cb), but it's not clear this helped with the original issue at all.

https://lists.schedmd.com/pipermail/slurm-users/2020-May/005395.html - the problem got better after cleaning up runaway jobs, but didn't go away completely.

We've been using the following commands to debug:

* scontrol show assoc_mgr
* SlurmctldDebug=debug2

We're reading a bunch and trying a bunch of things, but the only thing that has helped so far is raising the GrpTRESRunMins limits for the user and/or account in question. From reading the related issues linked above, it seems to me as if there is a bug that causes old "runmins" data to not be cleaned up properly.

We are currently running 19.05 and are planning to upgrade to 20.02, but we don't expect this issue to go away, since nothing in the linked issues above or in the NEWS suggests it has been root-caused or fixed.

1. Have I correctly summarized the existing information about this issue? Am I missing anything or characterizing it unfairly?
2. What is the currently recommended workaround for this? Running `scontrol reconfigure` regularly?
3. Is there any information we can provide which may help to find+fix the bug?
Hi Luke,

This is something that we've run across before, but we haven't been able to track down the root cause. If you are able to reproduce this reliably, it would be very helpful to know the steps you take to do so. If you don't have a way to reproduce it, then it would be helpful to see the 'debug2' logs you were able to collect. Were these logs covering a time when a problem job was first submitted?

When you see the problem, do jobs immediately show this as the Reason, or does it take some time for it to show up? Do the jobs eventually run while they show this as the reason, or do you have to take some action for them to start? Running 'scontrol reconfigure' has worked in the past; have you found it to be effective in your case?

If you don't have exact steps to reproduce the problem, then I would like to see the output of 'sacctmgr show assoc tree' (or enough to show the information for any users you can use as examples). Then if you can keep the logs at 'debug2' long enough to catch a job submission when this happens, and while the job is still queued run 'scontrol show jobs' and 'scontrol show nodes', that would be helpful.

Thanks,
Ben
Ok, Ben. We're scaling up our efforts to debug this. We'll reply here if we have anything interesting to show for it.
> Running 'scontrol reconfigure' has worked in the past, have you found it to be effective in your case?

I just verified that yes, this does help. We had a user with non-zero tresrunmins in sshare. After running 'scontrol reconfigure', the value dropped back down to zero.

We're now monitoring all users, looking for anyone with non-zero tresrunmins who is not currently running a job. When we get some leaky data again, we should be able to correlate it with the job that caused it. I'll keep you posted.
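For the record, the monitoring check is roughly the following sketch. The two here-docs hold canned sample data (hypothetical users) so the parsing logic can be shown standalone; on a live cluster you'd replace them with the real 'sshare -a --format=User,TRESRunMins -nP' and 'squeue -h -t RUNNING --format=%u | sort -u' output.

```shell
#!/bin/sh
# Flag any user whose sshare TRESRunMins shows a non-zero node= value
# while no jobs of theirs are currently running. Sample data below is
# made up for illustration; real clusters feed in live sshare/squeue output.

sshare_out='alice|cpu=0,node=4096,mem=0
bob|cpu=0,node=0,mem=0
carol|cpu=120,node=90,mem=0'

running_users='carol'

echo "$sshare_out" | while IFS='|' read -r user tres; do
    # Extract the node= value from the TRESRunMins string
    mins=$(echo "$tres" | sed -n 's/.*node=\([0-9]*\).*/\1/p')
    if [ "${mins:-0}" -gt 0 ] && ! echo "$running_users" | grep -qx "$user"; then
        echo "suspect: $user has node=$mins TRESRunMins but no running jobs"
    fi
done
```

With the sample data, only 'alice' is flagged: she has leaked node-minutes and no running jobs, while 'carol' has a legitimate non-zero value because her job is still running.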
Thanks Luke, that sounds like a good starting point to catch it.

I would still be curious to see how the limits are set with the 'sacctmgr show assoc tree' output. If you're using a QOS as well, then I would also like to see that information with 'sacctmgr show qos'.

Thanks,
Ben
Created attachment 14969 [details]
sshare data before+during+after job 406075

In the past 12 hours, we have observed data leaking from 4 of the 1004 jobs which have finished. In one case, it's unclear which job contributed to the leaky data. Here's my investigation on the other 3:

>>> Job 406075

$ grep 406075 /var/log/slurm/slurmctld.log
Jul 09 05:03:29 slurm-01 slurmctld[450]: backfill: JobId=406075 time limit changed from 240 to 176

$ sacct -X -j 406075 --format=nnodes,timelimit,elapsed,state
  NNodes  Timelimit    Elapsed      State
-------- ---------- ---------- ----------
      64   02:56:00   02:56:26    TIMEOUT

The original timelimit for this job was 4h, but was changed to 2h56m. This wasn't done by the user - I'm not sure why this happened. The job ended in TIMEOUT, so all 2h56m were used. Which leaves a difference of 64m between the actual runtime and the original timelimit.

I've attached a screenshot of our monitoring of this association's sshare data before+during+after this job. I'm pulling 'node=([0-9]+)' from 'sshare --format=tresrunmins' to show the "NodeRunMins" data.

Two things about this data seem buggy to me. (1) When the job starts, the NodeRunMins value is 15360 (64*240), which shows it's using the original timelimit value rather than the updated value. And (2), the graph shows that 4096 minutes are left "leaked" in the accounting data after the job stops. (4h-2h56m)*64nodes = 4096m. QED.

>>> Job 404987

$ sacct -X -j 404987 --format=nnodes,timelimit,elapsed,state
  NNodes  Timelimit    Elapsed      State
-------- ---------- ---------- ----------
      32   08:00:00   03:22:30  COMPLETED

The timelimit for this job was always 8h (never changed). Just before the job finished, sshare showed 9600 remaining NodeRunMins for the job. After the job ended, it dropped down to 691 (not zero). Maybe the data from 2 or 3 nodes was leaked? Not clear.

>>> Job 404981

$ sacct -X -j 404981 --format=nnodes,timelimit,elapsed,state
  NNodes  Timelimit    Elapsed      State
-------- ---------- ---------- ----------
      32   08:00:00   01:26:37 CANCELLED+

The timelimit for this job was always 8h (never changed). Just before the job finished, sshare showed 12960 remaining NodeRunMins for the job. After the job ended, it dropped down to 349 (not zero). Maybe the data from 1 node was leaked? Not clear.
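The 4096-node-minute figure for job 406075 falls straight out of the arithmetic; a quick sanity check of that case:

```shell
# Job 406075: backfill cut the limit from 240 to 176 minutes, the job
# timed out (used all 176), and ran on 64 nodes. The leftover accounting
# is the gap between the original and reduced limits, per node.
leak=$(( (240 - 176) * 64 ))
echo "job 406075 expected leak: $leak node-minutes"
# prints: job 406075 expected leak: 4096 node-minutes
```

The other two jobs don't reduce this cleanly, which is why I flagged them as "not clear" above.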
To add to the previous, yesterday we upgraded to slurm 20.02.3 and switched to select/cons_tres, so the data Luke showed are for that combination.
Thanks for collecting that data. The first scenario you described seemed to me like it had a high probability to allow this issue to be reproduced. I spent time this afternoon trying to reproduce the behavior but I haven't succeeded yet. I do have a few more ideas of things I could try to trigger it. I will try again tomorrow, but wanted to provide a status update this afternoon. If you're able to find any more information that might help to reproduce it I would be happy to hear it. Thanks, Ben
Hi Luke,

I still haven't been able to reproduce this behavior in my attempts this morning, but I have some follow-up questions for you.

You show for a couple jobs that sacct shows the jobs in a COMPLETED or CANCELLED state. Is there any kind of delay between the time the job completes and when that is reflected by sacct? I'm wondering if there is a communication delay and if that causes the remaining time to get stuck. For the second and third examples you give, can you look at the slurmd logs for the nodes the jobs were running on to look for any kind of error when the jobs completed?

I'm trying to reproduce this with some other workload on my test system, but it may require more load than I'm generating. How many jobs do you have queued on average? Do you notice any difference in when this happens, like you see it more when there is a large influx of jobs?

Is there a possibility of collecting debug2 slurmctld logs for a long enough period to capture relevant data for a job with this problem?

Thanks,
Ben
> I still haven't been able to reproduce this behavior in my attempts this morning

Bummer! Yes, it seems hard to reproduce on my end, too. Smells like a race condition somewhere.

I see at least 10 instances of leaked accounting data since 07/10. I'll dig into those today, taking a look at these things based on your recommendations:

* Is there any kind of delay between the time the job completes and when that is reflected by sacct?
* Can you look at the slurmd logs for the nodes the jobs were running on to look for any kind of error when the jobs completed?

> How many jobs do you have queued on average? Do you notice any difference in when this happens, like you see it more when there is a large influx of jobs?

On the cluster which is exhibiting the most leakage, we have a pretty steady queue size of 100-200 jobs, with a throughput of ~30 jobs/hr. The leakage seems pretty random - not obviously correlated with queue size or job throughput.
> can you look at the slurmd logs for the nodes the jobs were running on to look for any kind of error when the jobs completed?

I looked but didn't find anything particularly interesting. Here's a complete log from the first of 16 nodes allocated to a job which timed out, and which leaked data. I don't see anything concerning here. And I looked at the logs on all the other 15 nodes and didn't see anything interesting there, either.

[2020-07-12T20:41:48.180] task_p_slurmd_batch_request: 405824
[2020-07-12T20:41:48.180] task/affinity: job 405824 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
[2020-07-12T20:41:48.180] task/affinity: job 405824 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
[2020-07-12T20:42:46.879] _run_prolog: run job script took usec=58708324
[2020-07-12T20:42:46.879] _run_prolog: prolog with lock for job 405824 ran for 58 seconds
[2020-07-12T20:42:46.879] prolog for job 405824 ran for 58 seconds
[2020-07-12T20:42:46.879] Launching batch job 405824 for UID 22201
[2020-07-12T20:42:46.902] [405824.batch] task/cgroup: /slurm/uid_22201/job_405824: alloc=0MB mem.limit=1546833MB memsw.limit=unlimited
[2020-07-12T20:42:46.902] [405824.batch] task/cgroup: /slurm/uid_22201/job_405824/step_batch: alloc=0MB mem.limit=1546833MB memsw.limit=unlimited
[2020-07-12T20:42:47.010] [405824.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-12T20:42:47.087] launch task 405824.0 request from UID:22201 GID:1007 HOST:10.0.1.3 PORT:26242
[2020-07-12T20:42:47.087] lllp_distribution jobid [405824] manual binding: none
[2020-07-12T20:42:47.104] [405824.0] pyxis: skipping duplicate mount entry: /fs/fs1/user1 /fs/fs1/user1 x-create=auto,rbind
[2020-07-12T20:42:47.104] [405824.0] pyxis: skipping duplicate mount entry: /fs/fs1/user2 /fs/fs1/user2 x-create=auto,rbind
[2020-07-12T20:42:48.687] [405824.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2020-07-12T20:42:48.688] [405824.0] task/cgroup: /slurm/uid_22201/job_405824: alloc=0MB mem.limit=1546833MB memsw.limit=unlimited
[2020-07-12T20:42:48.688] [405824.0] task/cgroup: /slurm/uid_22201/job_405824/step_0: alloc=0MB mem.limit=1546833MB memsw.limit=unlimited
[2020-07-12T20:42:48.765] [405824.0] pyxis: importing docker image ...
[2020-07-12T20:43:07.338] [405824.0] pyxis: creating container filesystem ...
[2020-07-12T20:43:14.432] [405824.0] pyxis: starting container ...
[2020-07-12T20:43:15.873] [405824.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-07-12T20:43:15.874] [405824.0] task_p_pre_launch: Using sched_affinity for tasks
(the previous line repeats at 20:43:15.874, interleaved, once per remaining task)
[2020-07-12T23:18:53.762] [405824.batch] error: *** JOB 405824 ON n003 CANCELLED AT 2020-07-12T23:18:53 DUE TO TIME LIMIT ***
[2020-07-12T23:18:53.763] [405824.0] error: *** STEP 405824.0 ON n003 CANCELLED AT 2020-07-12T23:18:53 DUE TO TIME LIMIT ***
[2020-07-12T23:18:55.205] [405824.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2020-07-12T23:18:55.208] [405824.batch] done with job
[2020-07-12T23:19:46.898] [405824.0] pyxis: removing container filesystem ...
[2020-07-12T23:19:50.533] [405824.0] done with job
[2020-07-12T23:22:29.067] epilog for job 405824 ran for 156 seconds
Created attachment 15029 [details]
sshare data before+during+after job 409438

Ok, I have a repro.

1. I noticed that the "time limit changed" log line only happens when a user specifies '--time-min' for their job.
2. I noticed that all users with leaking data are using the '--time-min' flag for their jobs.
3. I was able to "leak" data using this sequence of commands:

# (1) Queue up a job on node1 and wait for it to start
firstjobid=$(sbatch -A admin -p admin -w n097 -t 20:00 --wrap='sleep infinity' | awk '{print $4}')
while ! squeue -ah -j $firstjobid --Format=state | grep -q RUNNING ; do sleep 1 ; done

# (2) Queue up a job on node[1,2] and set it to be the highest priority
bigjobid=$(sbatch -A admin -p admin -N2 -w n[097,098] -t 1:00:00 --wrap='sleep infinity' | awk '{print $4}')
scontrol top $bigjobid

# (3) Queue up a job on node2, setting --time-min
lastjobid=$(sbatch -A admin -p admin -w n098 -t 1:00:00 --time-min=10:00 --wrap='sleep infinity' | awk '{print $4}')

After a bit, the backfill scheduler decides to start the "lastjob" by decreasing its timelimit:

> [2020-07-14T13:54:30.313] backfill: Started JobId=409438 in admin on n098

And now we get the same damning graph I posted earlier (attached).
Oops, this is the interesting log line I meant to post instead: > [2020-07-14T13:54:30.314] backfill: JobId=409438 time limit changed from 60 to 19
Thank you for your work finding that reproducer, that is an excellent lead. I tried reproducing it again this afternoon and I haven't seen it yet, but I haven't exhausted all the things I think might be affecting it either. My attempts so far have just had these jobs on the system and having higher priority jobs queued might be a factor. I've also been testing with 20.02. It's possible that this only affects 19.05 so I'll test with your version as well. I'll look at it some more in the morning, but wanted to send a quick update for now. Thanks, Ben
> I tried reproducing it again this afternoon and I haven't seen it yet

Bummer! The cluster on which the reproducer worked for me is 19.05, with select/linear. I tweaked the reproducer a bit to get it to work on the other cluster (with 20.02 and select/cons_tres). The key is to trigger the 'backfill: JobId=416857 time limit changed from 60 to 18' event.

Here's the updated reproducer which worked for me on the newer cluster. I don't think these tweaks were necessary because of the different version or configuration - I think I just had to tweak things a bit to ensure that the big job was already in the Reason=Resources state and that the scheduling was done in the "background" and not in the "foreground". Or something like that. My understanding of how the backfill scheduler works is not very sophisticated.

firstjobid=$(sbatch -A admin -p admin-debug -w n03 -t 20:00 --wrap='sleep infinity' | awk '{print $4}')
while ! squeue -ah -j $firstjobid --Format=state | grep -q RUNNING ; do sleep 1 ; done

bigjobid=$(sbatch -A admin -p admin-debug -N2 -w n0[3,4] -t 1:00:00 --wrap='sleep infinity' | awk '{print $4}')
lastjobid=$(sbatch -H -A admin -p admin-debug -w n04 -t 1:00:00 --time-min=10:00 --wrap='sleep infinity' | awk '{print $4}')

sleep 10
scontrol top $bigjobid
sleep 10
scontrol release $lastjobid
scontrol top $bigjobid
Hi Luke, That was it, I had to introduce other jobs to make sure this job got scheduled by the backfill scheduler and then I am able to reproduce the problem. Thank you for your help in coming up with a reproducer for this. I'm working on how to fix this now. Thanks, Ben
Created attachment 15179 [details]
job430139

Last week, we pushed out a change to disable the '--time-min' flag for all jobs via job_submit.lua. Unfortunately, this has not completely solved the problem for us. At least 3 more jobs have leaked data. One of them is super interesting.

$ sacct -XnP -j 430139 --format=start,end,timelimit,elapsed,state,nnodes
2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1

Check out the attached screenshot. For the user in question, only one job was running at this time. NodeRunMins starts at 90 for the job - correct. It decreases as expected. Then, at around 4:16pm, the value magically increases by 25min. And finally, when the job ends, 50min are leaked.

That timestamp corresponds with an 'scontrol reconfigure' event, according to the controller log:

Jul 24 16:16:00 slurm-01 slurmctld[450]: Processing RPC: REQUEST_RECONFIGURE from uid=10001

So that's interesting. And potentially a new source of leakage.
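For context, the job_submit.lua change is roughly the following sketch (not our exact plugin - field and constant names follow the Slurm Lua job_submit plugin interface as I understand it, and the user-facing message is made up):

-- Reject any job that sets --time-min, as a workaround for the
-- TRESRunMins leak triggered by backfill time-limit reductions.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.time_min ~= nil and job_desc.time_min ~= slurm.NO_VAL then
        slurm.log_user("--time-min is temporarily disabled on this cluster")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end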
Sorry, I misspoke. 25min of data was leaked, not 50. And I also noticed that when the data jumped up, it jumped up to 90min, which was the original full allocation. It didn't seem to take into account the fact that the job was in-progress. So that seems like a pretty well-defined, separate bug report.
Hi Luke,

I am able to reproduce the second case you're describing. Letting a job run for a while so that some of its time is consumed and then doing a scontrol reconfigure will cause the controller to count the full requested time of the job again. Then when the job completes, the time that was decremented the first time is lost and counts against the user's node run minutes until the controller is restarted or reconfigured again. This scenario is different enough from the first that I think it would be worth putting it in a separate ticket.

For the issue with the time being updated, I have been working on it between other work I have and have been narrowing down where in the code the problem is, but haven't found it yet. I'll keep working on it and let you know as there's progress.

Thanks,
Ben
> This scenario is different enough from the first that I think it would be worth putting it in a separate ticket. Makes sense. Want me to create it?
That would be great if you would. I can create a ticket and add you to the CC field, but can't make you the reporter. I think the description you sent above should be enough and feel free to reference this ticket. Thanks, Ben
Ok, new bug created: https://bugs.schedmd.com/show_bug.cgi?id=9477
Increasing to Sev2. We don't have a good workaround for this.
Hi Luke, Thanks for your patience while I looked into this. I have a proposed fix that will undergo an internal review process. I'll let you know as there is progress. Thanks, Ben
Hi Luke,

The fix for this issue has been checked in.

https://github.com/SchedMD/slurm/commit/2655ec414a43edcf8ac3f6450b0447daf268b8d0

I'm sorry that it didn't quite make it before the 20.02.4 deadline; we were looking into whether there were other edge cases that would be affected in the same way. This will be available in 20.02.5, or if you need this fix sooner you can apply this patch and run with it until the next release is available.

Let me know if you have any additional questions about this.

Thanks,
Ben
Between the fix in bug 9477 and our current workaround for this issue (disabling the '--time-min' flag via job_submit.lua), we are no longer seeing leaky accounting data in TRESRunMins. We are still looking forward to the 20.02.5 release so we can re-enable '--time-min', but I just wanted to follow up and say that I think we've squashed this particular family of bugs for now. Thanks!
Thank you for your e-mail. I am out of office until Aug 25th 2020. I will have limited access to my e-mail during this period but will answer your message as soon as possible. If you have immediate questions or concerns, please contact kiz.hpc-admin@uni-ulm.de

Best regards,
Juergen Salk
Thanks for the update. I'm glad to hear that things are looking better and we appreciate your help in tracking down the source of these problems. Since this fix has been checked in I'll go ahead and close the ticket. If you see issues with the --time-min flag again after you go to 20.02.5 please let us know. Thanks, Ben
When I re-run the script from comment 14, I am unable to reproduce the leaky accounting data with 20.02.5, so that's good. But I'm no longer getting slurmctld log lines like "backfill: JobId=34318 time limit changed from 60 to 19", which makes it hard to verify that I'm hitting the same code path.

Is it expected that those log lines would disappear in 20.02.5? Those seem like helpful messages - I wouldn't want them to disappear. I could read through '_attempt_backfill()' to find out, but it's more than 1200 lines long, so I'd rather not.
Hi Luke,

I looked into why it's not showing the same log message. It's because the time limit is being reset a little earlier than it was before, so it doesn't match the if statement that previously printed the log message. I'm looking into what can be done to still have the message printed.

Thanks,
Ben
Hi Luke, I was looking at this ticket a little again this morning and it really should be split into another ticket. It sounds like you are seeing that the issue with the time being leaked has been resolved, which is good. Would you mind opening a new ticket for the log message not appearing? Please reference this ticket when you do to easily provide context. Thanks, Ben