Dear Support,

we have found the following problem:

1. a job was started on a number of nodes
2. one of the nodes became unresponsive
3. the job was requeued
4. the job started again on a completely different set of nodes
5. after a few hours the job was killed because of the down node from point 2

We log to /var/log/messages, so this is what the slurmctld log looks like:

# grep slurmctld /var/log/messages | grep 53507
2018-04-22T23:10:48.766952+02:00 hn slurmctld[147256]: sched: Allocate JobID=53507 NodeList=co[4000-4017,4213-4260,4372,4424-4459,4462-4474,4476-4478,4480-4488] #CPUs=10240 Partition=n0128
2018-04-23T09:42:04.939514+02:00 hn slurmctld[147256]: requeue job 53507 due to failure of node co4016
2018-04-23T09:42:04.940165+02:00 hn slurmctld[147256]: email msg to nezami@mpia.de: SLURM Job_id=53507 Name=ca1F3 Failed, Run time 10:31:16, NODE_FAIL, ExitCode 0
2018-04-23T09:47:04.202102+02:00 hn slurmctld[147256]: cleanup_completing: job 53507 completion process took 299 seconds
2018-04-23T09:47:29.208237+02:00 hn slurmctld[147256]: email msg to XXXX: SLURM Job_id=53507 Name=YYYY Began, Queued time 3-00:22:32
2018-04-23T09:47:29.209672+02:00 hn slurmctld[147256]: backfill: Started JobId=53507 in n0128 on co[3358-3430,3432-3478,3609-3616]
2018-04-23T14:19:38.642341+02:00 hn slurmctld[147256]: Killing job 53507 on DOWN node co4016
2018-04-23T14:19:38.647682+02:00 hn slurmctld[147256]: email msg to XXXX: SLURM Job_id=53507 Name=YYY Failed, Run time 04:32:09, NODE_FAIL, ExitCode 0
2018-04-23T14:19:38.668985+02:00 hn slurmctld[147256]: _sync_nodes_to_comp_job: Job 53507 in completing state

The peculiar line is of course:

2018-04-23T14:19:38.642341+02:00 hn slurmctld[147256]: Killing job 53507 on DOWN node co4016

This is what sacct reports:

# sacct -j 53507 --format jobId,start,end,state
       JobID               Start                 End      State
------------ ------------------- ------------------- ----------
53507        2018-04-23T09:47:29 2018-04-23T14:19:38  NODE_FAIL
53507.batch  2018-04-22T23:10:48 2018-04-23T14:19:38  CANCELLED
53507.0      2018-04-22T23:10:49 2018-04-23T09:47:04     FAILED
53507.1      2018-04-23T09:47:30 2018-04-23T14:19:39     FAILED

Do you need any other info?

Kindest regards,
Oliver Schroeder
> Do you need any other info?

Hi Oliver,

Yes please, can you also attach your slurm.conf file and the slurmd log of node co4016?

Thanks,
Felip
Hi Felip,

thanks for the quick response. Here is the slurmd portion of /var/log/messages from co4016:

2018-04-22T23:10:49.415034+02:00 co4016 slurmd[9419]: launch task 53507.0 request from 36550.12800@10.180.141.1 (port 16110)
2018-04-22T23:10:49.415603+02:00 co4016 slurmd[9419]: lllp_distribution jobid [53507] binding: threads,one_thread, dist 34
2018-04-22T23:10:49.415945+02:00 co4016 slurmd[9419]: _task_layout_lllp_block
2018-04-22T23:10:49.416258+02:00 co4016 slurmd[9419]: _lllp_generate_cpu_bind jobid [53507]: mask_cpu,one_thread, 0x00000000000000000001,0x00000000000000000002,0x00000000000000000004,0x00000000000000000008,0x00000000000000000010,0x00000000000000000020,0x00000000000000000040,0x00000000000000000080,0x00000000000000000100,0x00000000000000000200,0x00000000000000000400,0x00000000000000000800,0x00000000000000001000,0x00000000000000002000,0x00000000000000004000,0x00000000000000008000,0x00000000000000010000,0x00000000000000020000,0x00000000000000040000,0x0
2018-04-22T23:10:49.416574+02:00 co4016 slurmd[9419]: _run_prolog: run job script took usec=15
2018-04-22T23:10:49.418339+02:00 co4016 slurmd[9419]: _run_prolog: prolog with lock for job 53507 ran for 0 seconds
2018-04-23T17:07:10.180463+02:00 co4016 slurmd[5559]: Message aggregation disabled
2018-04-23T17:07:10.187592+02:00 co4016 slurmd[5559]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
2018-04-23T17:07:10.206757+02:00 co4016 slurmd[5559]: Resource spec: Reserved system memory limit not configured for this node
2018-04-23T17:07:10.277190+02:00 co4016 systemd[1]: slurmd.service: PID file /var/run/slurm/slurmd.pid not readable (yet?) after start: No such file or directory
2018-04-23T17:07:10.277596+02:00 co4016 slurmd[5563]: slurmd version 17.02.9 started
2018-04-23T17:07:10.280102+02:00 co4016 slurmd[5563]: slurmd started on Mon, 23 Apr 2018 17:07:10 +0200
2018-04-23T17:07:10.593533+02:00 co4016 slurmd[5563]: CPUs=80 Boards=1 Sockets=2 Cores=20 Threads=2 Memory=192115 TmpDisk=32752 Uptime=62 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

You can see that Slurm only came back online after the job had been killed. I'll try to send you the slurm.conf.

Kindest regards,
Oliver

Dr. Oliver Schröder
Senior IT Consultant
science + computing ag
Phone: +49 (0) 7071 9457 479
E-Mail: oliver.2.schroeder@atos.net
Hagellocher Weg 73
D-72070 Tübingen
Website: de.atos.net/sc
I am missing some information.

Controller:

1. 2018-04-22T23:10 - Job starts
2. 2018-04-23T09:42 - After ~11 hours, the job fails and is requeued due to failure of co4016
3. 2018-04-23T09:47 - It starts again on a different subset of nodes
4. 2018-04-23T14:19 - After ~4.5 hours, job 53507 is killed on node co4016. Job is canceled/failed.

slurmd:

1. 2018-04-22T23:10 - Task launch 53507.0
2. 2018-04-23T17:07 - slurmd started

As you see, from the slurmd side we don't get much information. Could you send me the full /var/log/messages of co4016 from 2018-04-22@23:09 to 2018-04-23@14:20? This way I would see the moments when it is failing, not responding, and so on. Btw, is there any special reason to log into /var/log/messages instead of a dedicated log file?

Regarding the controller, having the full slurmctld log between 2018-04-22@23:09 and 2018-04-23@14:20 would be helpful, since there can be strings and other information beyond the |grep 53507 output that could be useful. If you upload it here and security is a concern, I can mark the files as private.
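As a side note, one way to cut such a time window out of /var/log/messages is a plain awk comparison; this is only a sketch (the sample log lines below are made-up stand-ins), relying on the fact that ISO 8601 timestamps sort correctly as strings:

```shell
#!/bin/sh
# Build a tiny sample log to demonstrate on (stand-in for /var/log/messages).
cat > sample-messages <<'EOF'
2018-04-22T22:00:00+02:00 co4016 slurmd[9419]: before the window
2018-04-23T01:00:00+02:00 co4016 slurmd[9419]: inside the window
2018-04-23T15:00:00+02:00 co4016 slurmd[5563]: after the window
EOF

# ISO 8601 timestamps compare correctly as plain strings, so awk can
# select everything between the two bounds without parsing dates.
awk -v from="2018-04-22T23:09" -v to="2018-04-23T14:20" \
    '$1 >= from && $1 <= to' sample-messages
```

Only the middle line falls inside the window, so only that line is printed.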
Hi Oliver,

Could you get back to me on my last comment? Has the issue happened again?

Thank you
Hi Felip,

thanks for your email. We had some data protection issues to discuss, and I will transfer the tickets directly to the customer. I will keep you posted.

Have a good weekend!

Cheers,
Oliver
Created attachment 6792 [details] /var/log/messages from node co4016 for April 22/23
Created attachment 6793 [details] /var/log/messages with entries of slurmctld from April 22/23
Hi Felip,

I attached two files with excerpts of /var/log/messages: one from the headnode, which contains the slurmctld entries among others, and one from node co4016. I hope this is what you requested. Please mark them as private, as you suggested.

Best regards,
Renate Dohmen (MPCDF)
Hi Renate,

The job is killed just after a reconfigure on slurmctld, while the job was already requeued and the old node was still down.

The node was down during this period:

2018-04-23T09:30:01.388625+02:00 co4016 CRON[165581]: pam_unix(crond:session): session closed for user root
...
2018-04-23T15:06:16.144126+02:00 co4016 systemd[1]: Detected architecture x86-64.
2018-04-23T15:06:16.144143+02:00 co4016 systemd[1]: Set hostname to <co4016>.

The requeue happened here:

2018-04-23T09:42:04.939514+02:00 hn0 slurmctld[147256]: requeue job 53507 due to failure of node co4016
2018-04-23T09:42:04.940165+02:00 hn0 slurmctld[147256]: email msg to xxxxxxxxxx: SLURM Job_id=53507 Name=yyyyy Failed, Run time 10:31:16, NODE_FAIL, ExitCode 0
2018-04-23T09:42:04.942929+02:00 hn0 slurmctld[147256]: error: Nodes co4016 not responding, setting DOWN

The reconfigure happened here:

2018-04-23T14:19:26.655992+02:00 hn0 slurmctld[147256]: _slurm_rpc_submit_batch_job JobId=56017 usec=8518
2018-04-23T14:19:28.485859+02:00 hn0 slurmctld[147256]: Processing RPC: REQUEST_RECONFIGURE from uid=0
2018-04-23T14:19:28.909131+02:00 hn0 slurmctld[147256]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
...
2018-04-23T14:19:38.609339+02:00 hn0 slurmctld[147256]: restoring original state of nodes
2018-04-23T14:19:38.630682+02:00 hn0 slurmctld[147256]: cons_res: select_p_node_init
2018-04-23T14:19:38.631135+02:00 hn0 slurmctld[147256]: cons_res: preparing for 12 partitions
2018-04-23T14:19:38.642341+02:00 hn0 slurmctld[147256]: Killing job 53507 on DOWN node co4016
2018-04-23T14:19:38.647682+02:00 hn0 slurmctld[147256]: email msg to xxxxxe: SLURM Job_id=53507 Name=yyyy Failed, Run time 04:32:09, NODE_FAIL, ExitCode 0
2018-04-23T14:19:38.655455+02:00 hn0 slurmctld[147256]: _sync_nodes_to_jobs updated state of 1 nodes
2018-04-23T14:19:38.668985+02:00 hn0 slurmctld[147256]: _sync_nodes_to_comp_job: Job 53507 in completing state

I am trying to reproduce this, but since you are running 17.02.9 it may already be resolved by a newer patch. I will let you know what I find. In the meantime, could you send me your slurm.conf?
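The timing correlation between the reconfigure RPC and the kill can be checked mechanically; here is a hedged sketch (the two sample log lines are made-up, second-resolution stand-ins for the real slurmctld entries above):

```shell
#!/bin/sh
# Tiny stand-in for the slurmctld portion of /var/log/messages.
cat > sample-slurmctld.log <<'EOF'
2018-04-23T14:19:28+02:00 hn0 slurmctld[147256]: Processing RPC: REQUEST_RECONFIGURE from uid=0
2018-04-23T14:19:38+02:00 hn0 slurmctld[147256]: Killing job 53507 on DOWN node co4016
EOF

# Report the delay between the reconfigure and the kill. Both events fall
# on the same date, so comparing seconds-of-day is enough for this sketch.
awk -F'[T+]' '
  /REQUEST_RECONFIGURE/ { split($2, a, ":"); r = a[1]*3600 + a[2]*60 + a[3] }
  /Killing job/         { split($2, b, ":"); k = b[1]*3600 + b[2]*60 + b[3]
                          print "kill happened", k - r, "seconds after reconfigure" }
' sample-slurmctld.log
```

A kill landing a few seconds after REQUEST_RECONFIGURE, as here, points at the reconfigure's node/job state resync rather than at an independent event.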
Created attachment 6897 [details] slurm.conf file
I have provided the requested slurm.conf file, which was in use at the time the error occurred.
(In reply to Renate Dohmen from comment #11)
> I provided the requested slurm.conf file which was in use at the time the
> error occurred.

Hi Renate,

This is just a note to let you know we are still working on the issue. I have identified the spot where it fails, but not yet the cause. It seems that for some reason the job_ptr->node_bitmap or job_ptr->node_bitmap_cg bit of the failing node was not cleared.

Has the situation happened again for you?
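To illustrate the hypothesis above, here is a toy model in plain shell (not Slurm source; node names are made up): after a requeue, the job's recorded node set should no longer contain the failed node, and a stale leftover entry is exactly what would make a later DOWN-node sweep kill the job.

```shell
#!/bin/sh
# Toy model of the suspected bug, NOT Slurm code: a space-separated node
# list stands in for the job's node bitmap (job_ptr->node_bitmap).
old_nodes="co4015 co4016 co4017"
down_node="co4016"

# A correct requeue rebuilds the allocation without the failed node.
new_nodes=$(printf '%s\n' $old_nodes | grep -v "^${down_node}\$" | tr '\n' ' ')

# The DOWN-node sweep after a reconfigure: kill any job whose node set
# still contains the down node.
case " $new_nodes" in
  *" $down_node "*) echo "stale entry: job killed on DOWN node" ;;
  *)                echo "entry cleared: job survives the sweep" ;;
esac
```

With the entry properly removed the sweep finds no match; the reported behavior corresponds to the first branch, i.e. the down node's bit surviving the requeue.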
Renate, I didn't manage to reproduce the issue, but I am still looking. Did it happen again?
Hi Renate/Oliver,

I haven't found any particular spot that could lead to this situation. Did this problem happen again, or was it a one-time issue?

My guess is that some message got lost in the middle of a communication, but aside from that I don't see anything more. If it is still happening, are you able to reproduce it?
Hi Felip,

> Did this problem happen again? Or just was a one time issue?

At least we did not observe it again.

> I am guessing if some message got lost in the middle of a communication, but
> aside from that I don't see anything more. If it is still happening, are you
> able to reproduce?

We did not try to reproduce it, and we also do not have the resources to set up test cases for problems like this.

Best regards,
Renate
Hi Oliver/Renate,

I am unable to reproduce the situation. If you have only seen the issue once and haven't reproduced it since, I think it has to be something related to a lost message in the communication between slurmd and slurmctld, although even that shouldn't create this situation.

I am inclined to close this bug and wait to see whether it happens again; if it does, we can grab the Slurm logs, system logs, and so on again and look for any correlation with the first occurrence. In that case you could reopen the bug.

What do you think?
Dear Sir or Madam,

I will be out of the office until 02/AUGUST/2018; please contact lsf-support@science-computing.de in case of an urgent problem.

Kindest regards,
Oliver Schroeder
I am closing this, since it does not seem to have happened again and neither I nor the customer can reproduce it. If the issue reappears, don't hesitate to reopen the bug.

Regards,
Felip M