Hi, we are seeing an increasing number of bad jobs that are missing ReqTRES & alloc_nodes; plus the start of the job is frequently equal to the end of the job (for the main part, not for batch):

sacct -j 5891006 -o 'jobidraw,jobid,start,end,ntasks,ncpus,nnodes,reqmem,reqtres%20,cputime,usercpu,cputimeraw%20,elapsed,node'
JobIDRaw JobID Start End NTasks NCPUS NNodes ReqMem ReqTRES CPUTime UserCPU CPUTimeRAW Elapsed NodeList
------------ ------------ ------------------- ------------------- -------- ---------- -------- ---------- -------------------- ---------- ---------- -------------------- ---------- ---------------
5891006 0_0 2015-10-25T00:13:00 2015-10-25T00:13:00 0 0n 00:00:00 18:46.762 0 00:00:00 della-r1c4n5
5891006.bat+ 0_0.batch 2015-10-24T23:54:04 2015-10-25T00:13:00 1 1 0n INVALID 18:46.762 18446744073709550480 00:18:56 della-r1c4n5

On 10/24 we had over 4000 jobs with such bad records.

While we are at it - can we talk about CPUTime and CPUTimeRAW? It seems like those are always invalid/huge for batch, e.g. for a job that does have ncpus/nnodes/reqtres:

sacct -j 5892006 -o 'jobidraw,jobid,start,end,ntasks,ncpus,nnodes,reqmem,reqtres%20,cputime,usercpu,cputimeraw%20,elapsed,node'
JobIDRaw JobID Start End NTasks NCPUS NNodes ReqMem ReqTRES CPUTime UserCPU CPUTimeRAW Elapsed NodeList
------------ ------------ ------------------- ------------------- -------- ---------- -------- ---------- -------------------- ---------- ---------- -------------------- ---------- ---------------
5892006 5892006 2015-10-25T09:42:58 2015-10-25T09:46:38 1 1 4000Mn cpu=1,mem=4000,node+ 00:03:40 03:35.443 220 00:03:40 della-r3c4n10
5892006.bat+ 5892006.bat+ 2015-10-25T09:42:58 2015-10-25T09:46:38 1 1 4000Mn INVALID 03:35.443 18446744073709551396 00:03:40 della-r3c4n10

Thanks!
Hi, I am working on this. I see that the jobs that don't have TRES have a huge CPUTimeRAW and an invalid CPUTime. David
Well, not entirely - it looks like the INVALID time is happening *always* for the batch portion of the job, e.g.:

5817297 5817297 2015-10-26T17:33:24 2015-10-27T00:48:45 80 4 4000Mc cpu=80,mem=320000,n+ 24-04:28:00 15-05:00:+ 2089680 07:15:21 della-r3c1n9,d+
5817297.bat+ 5817297.bat+ 2015-10-26T17:33:24 2015-10-27T00:48:45 1 1 4000Mc INVALID 00:30.412 18446744073709525495 07:15:21 della-r3c1n9
5817297.0 5817297.0 2015-10-26T17:33:24 2015-10-26T17:35:49 80 80 4 4000Mc 03:13:20 01:24:19 11600 00:02:25 della-r3c1n9,d+
5817297.1 5817297.1 2015-10-26T17:38:26 2015-10-27T00:48:45 80 80 4 4000Mc 23-21:45:20 15-03:35:+ 2065520 07:10:19 della-r3c1n9,d+

So it looks like we have two things happening - the batch portion is not accounted for correctly AND some jobs are missing ReqTRES and other info.
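Incidentally, all of those huge CPUTimeRAW values sit just below 2^64, which is what a small negative number looks like when stored in an unsigned 64-bit counter. For instance, the 5817297.batch value 18446744073709525495 is exactly 2^64 - 26121, and 26121 seconds is precisely the 07:15:21 elapsed time. This is only my guess at the mechanism, not a confirmed diagnosis, but a quick shell check shows the arithmetic lines up:

```shell
#!/usr/bin/env bash
# Guess at the mechanism (not confirmed): the batch step's CPUTimeRAW
# looks like a small negative value wrapped around in a uint64 field.
# 07:15:21 elapsed = 26121 seconds; -26121 viewed as unsigned 64-bit:
elapsed=$(( 7*3600 + 15*60 + 21 ))          # 26121
wrapped=$(printf '%u' $(( 0 - elapsed )))   # two's-complement view
echo "$elapsed $wrapped"
```

The second number printed matches the batch step's CPUTimeRAW from sacct exactly.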
Yes, I was looking at the batch part only, without the steps.
Actually it seems like there is an obvious culprit (linked to our slurmdbd trouble - ticket #2066):

[2015-10-24T23:53:11.301] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-24T23:55:13.666] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-24T23:57:15.464] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-24T23:59:16.118] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:01:18.142] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:03:19.003] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:05:20.096] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:07:21.510] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:09:23.047] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:11:24.628] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:13:26.195] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-25T00:15:00.941] error: slurmdbd: response is not type DBD_RC: DBD_GOT_MULT_MSG(1475)
[2015-10-25T00:15:02.390] error: SlurmDBD connection experienced an error
[2015-10-25T00:15:02.390] error: slurmdbd: Getting response to message type 1472
[2015-10-25T00:15:02.390] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: No error
[2015-10-25T00:15:31.126] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW

which covers the full period from start to end of the job. That probably explains 5891006's bad records (though, clearly, not the missing batch data).
Josko, could you send us the dbd.messages file in your state directory? Not being able to get the dbd to read the messages is most likely the problem in all the tickets submitted today.
I am afraid that file is empty right now - I did restart slurmdbd when I (finally... working on adding notification for further trouble) noticed it was not functioning correctly.

So probably the missing fields can be attributed to the db trouble, but the missing batch data is present even now, after the restart and apparently correct functioning of slurmdbd:

JobIDRaw JobID Start End NTasks NCPUS NNodes ReqMem ReqTRES CPUTime UserCPU CPUTimeRAW Elapsed NodeList End
------------ ------------ ------------------- ------------------- -------- ---------- -------- ---------- -------------------- ---------- ---------- -------------------- ---------- --------------- -------------------
5936393 5936393 2015-10-27T10:50:59 2015-10-27T10:51:59 6 1 4000Mc cpu=6,mem=24000,nod+ 00:06:00 05:59.899 360 00:01:00 della-r2c3n2 2015-10-27T10:51:59
5936393.bat+ 5936393.bat+ 2015-10-27T10:50:59 2015-10-27T10:51:59 1 1 4000Mc INVALID 05:59.899 18446744073709551556 00:01:00 della-r2c3n2 2015-10-27T10:51:59

Freshly run (a 6-core stress job) - still INVALID.

Also, regarding the other ticket I reported (2068) - those messages were appearing before the slurmdbd trouble and are still appearing after the restart.
I can reproduce the missing and invalid data and I did not have any dbd trouble. I am investigating where the problem is. David
Could you stop the slurmctld and grab it if there is something there? Then restart the slurmctld. I am guessing this is the main problem though. Since the slurmctld isn't getting anything to the dbd, all the data will be incomplete until it gets there.
Stopped slurmctld; sadly, dbd.messages still had zero length. Is there any way to increase how much data slurmctld is willing to cache in case of slurmdbd communication trouble - i.e. is there a fixed limit, or is it willing to cache for as long as necessary?
Is the problem still happening? From comment 6 it seems that newly submitted jobs are being entered in the database, aside from the missing or apparently incorrect fields. Is this correct? David
So far, after the restart, things look good and it appears that the ReqTRES field is present - of course, not that many jobs have started and finished since then, so I can't be 100% sure. We will definitely try to keep a closer eye on the nightly accounting processing - that's where anomalies usually surface.
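In case it's useful, something along these lines is what I'm planning to use in the nightly check to flag records with an empty ReqTRES (the pipeline below runs on sample data standing in for sacct's parsable output; in practice the input would come from something like sacct -a -X -P -o jobid,reqtres over the previous day):

```shell
#!/usr/bin/env bash
# Flag job records whose ReqTRES field is empty in sacct parsable output.
# Sample data below mimics `sacct -P -o jobid,reqtres`; the awk filter
# prints the JobID of any record with an empty second field.
printf 'JobID|ReqTRES\n5891006|\n5892006|cpu=1,mem=4000M,node=1\n' |
awk -F'|' 'NR > 1 && $2 == "" { print $1 }'
```

With the sample input above this prints only 5891006, the record missing ReqTRES.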
Good to hear. What exactly do you run in your nightly accounting processing? David
*** Ticket 2068 has been marked as a duplicate of this ticket. ***
Josko, in comment 4, are these all the errors in the log dealing with the DBD (grep "slurmdbd")? Just wondering when they started and ended, and whether the communication failed multiple times or was constant. Also, was the DBD down during this period, or just not responsive for one reason or another?
No, we had numerous errors, but they did not start at the same time that slurmdbd started spewing those logs (that was on the morning of the 19th). The first time della's slurmctld started complaining about slurmdbd was on the 23rd:

[2015-10-23T16:40:26.769] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:42:28.806] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:44:29.239] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:46:32.133] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:48:34.525] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:50:35.246] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:52:37.215] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:54:42.626] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:56:43.101] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:58:45.299] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:00:47.694] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:02:49.673] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:04:52.098] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 440 TRES cpu grp_used_tres_run_secs underflow, tried to remove 6508 seconds when only 6448 remained.
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 440 TRES mem grp_used_tres_run_secs underflow, tried to remove 26032000 seconds when only 25792000 remained.
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 440 TRES node grp_used_tres_run_secs underflow, tried to remove 6508 seconds when only 6448 remained.
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 24 TRES cpu grp_used_tres_run_secs underflow, tried to remove 6508 seconds when only 6448 remained.
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 24 TRES mem grp_used_tres_run_secs underflow, tried to remove 26032000 seconds when only 25792000 remained.
[2015-10-23T17:06:01.823] error: _handle_assoc_tres_run_secs: job 5872667: assoc 24 TRES node grp_used_tres_run_secs underflow, tried to remove 6508 seconds when only 6448 remained.
[2015-10-23T17:06:53.853] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:08:55.333] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:11:03.838] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:13:10.542] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:15:11.921] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:17:12.001] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:19:15.024] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:21:20.434] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:23:30.263] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:25:32.807] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T17:27:46.640] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-24T06:35:36.291] error: _handle_assoc_tres_run_secs: job 5873777: assoc 354 TRES cpu grp_used_tres_run_secs underflow, tried to remove 78327 seconds when only 78267 remained.
[2015-10-24T06:35:36.291] error: _handle_assoc_tres_run_secs: job 5873777: assoc 354 TRES mem grp_used_tres_run_secs underflow, tried to remove 313308000 seconds when only 313068000 remained.
etc.
You can see there that it was constant for a while, then it stopped for many hours, and it resumed complaining at:

[2015-10-24T21:06:15.434] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2015-10-24T21:12:03.186] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-24T21:14:06.173] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW

You can find the full della slurmctld log at:
http://tigress-web.princeton.edu/~plazonic/della-slurmctld.logs.txt
Ok, the error message that points to lost data (which is missing) is

slurmdbd: agent queue is full, discarding request

If you grep for slurmdbd you will see the queue size in the form of messages like

slurmdbd: agent queue size $NUMBER

These are logged at info level instead of error. As long as that number doesn't keep growing, you should be ok. If the database is busy for some reason you may see these messages depending on the activity on the slurmctld, so it isn't strange to see them come and go.

Do you have anything in the slurmdbd log that points to around [2015-10-23T16:40:26.769]?
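To illustrate the difference with a made-up log snippet (the message texts are the real ones; the timestamps and file path are invented for the example):

```shell
#!/usr/bin/env bash
# Distinguish the harmless info message ("agent queue size N") from the
# error that means data was actually thrown away ("discarding request").
# The log contents below are a fabricated sample, not a real log.
log=$(mktemp)
cat > "$log" <<'EOF'
[2015-10-23T16:40:26.769] error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
[2015-10-23T16:41:00.000] slurmdbd: agent queue size 120
[2015-10-23T16:42:00.000] error: slurmdbd: agent queue is full, discarding request
EOF
# Only the "discarding request" line indicates lost data:
grep -c 'discarding request' "$log"
rm -f "$log"
```

Run against your real slurmctld log, a count of zero for "discarding request" means nothing was dropped, however many "agent queue" messages you saw.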
Josko, this appears to be fixed in commits 78098296f35 3132d34a9f1 fd12a0910ad 0804545c345 02db31c29d4 The good news is this appears to only be related to the batch step. The problem is only cosmetic as well. Applying the patches above will fix the problem completely. If you are just looking for a bandaid you only really need commit 0804545c345. It doesn't appear you have lost any messages as noted in Comment 16. Let us know if this fixes your issues or not.
You will also want 14a39998c48f; it fixes a memory leak in fd12a0910adf75.
(In reply to Danny Auble from comment #16)

> Ok, the error message that points to lost data (which is missing) is
>
> slurmdbd: agent queue is full, discarding request

Which we didn't seem to have in this case.

> If you grep for slurmdbd you will see the size in the form of messages like
>
> slurmdbd: agent queue size $NUMBER

We also didn't have this message in the logs - I assume you mean in the slurmctld logs.

> They are info's instead of errors. As long as that number doesn't keep
> growing you should be ok. If the database is busy for some reason you may
> see these messages depending on the activity on the slurmctld. So it isn't
> strange to see them come and go.

Aha, maybe our logging level is too low for these to show up? Though I'd hope that the "discarding request" messages would be an error rather than an info.

> Do you have anything in the slurmdbd log that points to around
> [2015-10-23T16:40:26.769]?

Sadly we lost all slurmdbd logs after the morning of 2015-10-19 :( but if we were supposed to get "discarding request" messages and we didn't, doesn't that mean we really didn't lose any data (from this cluster)?

I am going to apply those patches soon - it looks like only the slurmctld and slurmdbd servers need to be updated, so unless you tell me differently that's where I'll apply the updates first.
(In reply to Josko Plazonic from comment #19)

> > Ok, the error message that points to lost data (which is missing) is
> >
> > slurmdbd: agent queue is full, discarding request
>
> Which we didn't seem to have in this case.

That is good :).

> > If you grep for slurmdbd you will see the size in the form of messages like
> >
> > slurmdbd: agent queue size $NUMBER
>
> We also didn't have this message in logs - I assume you mean in slurmctld
> logs.

Yes, the slurmctld logs. Info is the debug level you would need; if you have the debug level set to error only, you wouldn't see these. They are really just for reference, to tell you how many messages you have in the system waiting to be sent.

> Aha, maybe our logging level is too low for these to show up? Though I'd
> hope that discarding requests messages should be an error rather than info.

Yes, the "discarding request" message is an error message, so you would have seen them if you had them.

> Sadly we lost all slurmdbd logs after morning of 2015-10-19 :( but if we
> were supposed to get "discarding requests" messages and we didn't doesn't
> that mean we really didn't lose any data (from this cluster)?

I don't believe you lost any data for this cluster during this time period; at least there isn't any evidence here that supports it.

> I am going to apply those patches soon - it looks like that only slurmctld
> and slurmdbd's servers need to be updated so unless you tell me differently
> that's where I'll apply updates first.

The patches that really matter are for the slurmctld.
You would only need the slurmdbd patch if you were writing directly to mysql (which you are not ;)). The other patches are for sacct, which will correct the view you have of older jobs.
Josko, how are things going? Can we close this?
Yup, for now everything seems to be behaving, and I can't find any newly finished jobs without ReqTRES and other fields. If they re-appear, we'll re-open. Thanks!
No problem, glad things worked out.