Ticket 5102 - Job gets killed after requeue because of down node
Summary: Job gets killed after requeue because of down node
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.9
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2018-04-26 03:45 MDT by Oliver Schroeder
Modified: 2018-08-07 06:48 MDT
CC: 2 users

See Also:
Site: Max Planck Computing and Data Facility


Description Oliver Schroeder 2018-04-26 03:45:50 MDT
Dear Support,

we have found the following problem:

1. a job was started on a number of nodes
2. one of the nodes became unresponsive
3. the job was requeued
4. the job started again on a completely different set of nodes
5. after a few hours the job was killed because of the down node from point 2

We log to /var/log/messages, so this is what the slurmctld log looks like:

# grep slurmctld /var/log/messages | grep 53507

2018-04-22T23:10:48.766952+02:00 hn slurmctld[147256]: sched: Allocate JobID=53507 NodeList=co[4000-4017,4213-4260,4372,4424-4459,4462-4474,4476-4478,4480-4488] #CPUs=10240 Partition=n0128
2018-04-23T09:42:04.939514+02:00 hn slurmctld[147256]: requeue job 53507 due to failure of node co4016
2018-04-23T09:42:04.940165+02:00 hn slurmctld[147256]: email msg to nezami@mpia.de: SLURM Job_id=53507 Name=ca1F3 Failed, Run time 10:31:16, NODE_FAIL, ExitCode 0
2018-04-23T09:47:04.202102+02:00 hn slurmctld[147256]: cleanup_completing: job 53507 completion process took 299 seconds
2018-04-23T09:47:29.208237+02:00 hn slurmctld[147256]: email msg to XXXX: SLURM Job_id=53507 Name=YYYY Began, Queued time 3-00:22:32
2018-04-23T09:47:29.209672+02:00 hn slurmctld[147256]: backfill: Started JobId=53507 in n0128 on co[3358-3430,3432-3478,3609-3616]
2018-04-23T14:19:38.642341+02:00 hn slurmctld[147256]: Killing job 53507 on DOWN node co4016
2018-04-23T14:19:38.647682+02:00 hn slurmctld[147256]: email msg to XXXX: SLURM Job_id=53507 Name=YYY Failed, Run time 04:32:09, NODE_FAIL, ExitCode 0
2018-04-23T14:19:38.668985+02:00 hn slurmctld[147256]: _sync_nodes_to_comp_job: Job 53507 in completing state

The peculiar line is, of course:

2018-04-23T14:19:38.642341+02:00 hn slurmctld[147256]: Killing job 53507 on DOWN node co4016

This is what sacct reports:

# sacct -j 53507 --format jobId,start,end,state
       JobID               Start                 End      State 
------------ ------------------- ------------------- ---------- 
53507        2018-04-23T09:47:29 2018-04-23T14:19:38  NODE_FAIL 
53507.batch  2018-04-22T23:10:48 2018-04-23T14:19:38  CANCELLED 
53507.0      2018-04-22T23:10:49 2018-04-23T09:47:04     FAILED 
53507.1      2018-04-23T09:47:30 2018-04-23T14:19:39     FAILED 

Do you need any other info?

Kindest regards,
Oliver Schroeder
Comment 1 Felip Moll 2018-04-26 06:47:32 MDT
> Do you need any other info?

Hi Oliver,

Yes please, could you also attach your slurm.conf file and the slurmd log of node co4016?

Thanks,
Felip
Comment 2 Oliver Schroeder 2018-04-26 07:29:27 MDT
Hi Felip,

thanks for the quick response.

Here is the slurmd portion of /var/log/messages from co4016:

2018-04-22T23:10:49.415034+02:00 co4016 slurmd[9419]: launch task 53507.0 request from 36550.12800@10.180.141.1 (port 16110)
2018-04-22T23:10:49.415603+02:00 co4016 slurmd[9419]: lllp_distribution jobid [53507] binding: threads,one_thread, dist 34
2018-04-22T23:10:49.415945+02:00 co4016 slurmd[9419]: _task_layout_lllp_block
2018-04-22T23:10:49.416258+02:00 co4016 slurmd[9419]: _lllp_generate_cpu_bind jobid [53507]: mask_cpu,one_thread, 0x00000000000000000001,0x00000000000000000002,0x00000000000000000004,0x00000000000000000008,0x00000000000000000010,0x00000000000000000020,0x00000000000000000040,0x00000000000000000080,0x00000000000000000100,0x00000000000000000200,0x00000000000000000400,0x00000000000000000800,0x00000000000000001000,0x00000000000000002000,0x00000000000000004000,0x00000000000000008000,0x00000000000000010000,0x00000000000000020000,0x00000000000000040000,0x0
2018-04-22T23:10:49.416574+02:00 co4016 slurmd[9419]: _run_prolog: run job script took usec=15
2018-04-22T23:10:49.418339+02:00 co4016 slurmd[9419]: _run_prolog: prolog with lock for job 53507 ran for 0 seconds
2018-04-23T17:07:10.180463+02:00 co4016 slurmd[5559]: Message aggregation disabled
2018-04-23T17:07:10.187592+02:00 co4016 slurmd[5559]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
2018-04-23T17:07:10.206757+02:00 co4016 slurmd[5559]: Resource spec: Reserved system memory limit not configured for this node
2018-04-23T17:07:10.277190+02:00 co4016 systemd[1]: slurmd.service: PID file /var/run/slurm/slurmd.pid not readable (yet?) after start: No such file or directory
2018-04-23T17:07:10.277596+02:00 co4016 slurmd[5563]: slurmd version 17.02.9 started
2018-04-23T17:07:10.280102+02:00 co4016 slurmd[5563]: slurmd started on Mon, 23 Apr 2018 17:07:10 +0200
2018-04-23T17:07:10.593533+02:00 co4016 slurmd[5563]: CPUs=80 Boards=1 Sockets=2 Cores=20 Threads=2 Memory=192115 TmpDisk=32752 Uptime=62 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

As you can see, slurmd only came back online after the job had been killed.

I'll try to send you the slurm.conf.

Kindest regards, Oliver

Dr. Oliver Schröder
Senior IT Consultant
science + computing ag
Phone: +49 (0) 7071 9457 479
E-Mail: oliver.2.schroeder@atos.net
Hagellocher Weg 73
D-72070 Tübingen
Website: de.atos.net/sc
Comment 3 Felip Moll 2018-04-26 08:09:32 MDT
I am missing some information:

controller:
1. 2018-04-22T23:10 - Job starts
2. 2018-04-23T09:42 - After ~11 hours, the job fails and is requeued due to the failure of co4016
3. 2018-04-23T09:47 - It starts again in a different subset of nodes
4. 2018-04-23T14:19 - After ~4.5 hours, job 53507 is killed on node co4016. The job is canceled/failed.

slurmd:
1. 2018-04-22T23:10 - Task launch 53507.0
2. 2018-04-23T17:07 - Slurmd started


As you can see, we don't get much information from the slurmd side. Could you send me the full /var/log/messages of co4016 from 2018-04-22@23:09 to 2018-04-23@14:20? That way I can see the moments when it was failing, not responding, and so on.

By the way, is there any special reason you log to /var/log/messages instead of a dedicated log file?


Regarding the controller, the full slurmctld log between 2018-04-22@23:09 and 2018-04-23@14:20 would also be helpful, since there may be relevant messages that the "grep 53507" filter misses.



If you upload it here and security is a concern, I can mark files as private.
Comment 4 Felip Moll 2018-05-04 03:49:17 MDT
Hi Oliver,

Could you reply to my last comment?
Has the issue happened again?

Thank you
Comment 5 Oliver Schroeder 2018-05-04 09:18:20 MDT
Hi Felip,

thanks for your email. We had some data protection issues to discuss, and I will transfer the tickets directly to the customer.

I will keep you posted. Have a good weekend!

Cheers, Oliver


Comment 6 Renate Dohmen 2018-05-08 09:15:21 MDT
Created attachment 6792 [details]
/var/log/messages from node co4016 for April 22/23
Comment 7 Renate Dohmen 2018-05-08 09:16:33 MDT
Created attachment 6793 [details]
/var/log/messages with entries of slurmctld from April 22/23
Comment 8 Renate Dohmen 2018-05-08 09:19:32 MDT
Hi, Felip,

I attached two files: excerpts of /var/log/messages from the headnode (which contains the slurmctld entries, among others) and from node co4016. I hope this is what you requested. Please mark them as private, as you suggested.

Best regards,
Renate Dohmen (MPCDF)
Comment 9 Felip Moll 2018-05-14 08:17:33 MDT
Hi Renate,


The job was killed just after a reconfigure of slurmctld, while the job had already been requeued and the old node was still down.

The node was down during this period:

2018-04-23T09:30:01.388625+02:00 co4016 CRON[165581]: pam_unix(crond:session): session closed for user root
...
2018-04-23T15:06:16.144126+02:00 co4016 systemd[1]: Detected architecture x86-64.
2018-04-23T15:06:16.144143+02:00 co4016 systemd[1]: Set hostname to <co4016>.

The requeue happened here:
2018-04-23T09:42:04.939514+02:00 hn0 slurmctld[147256]: requeue job 53507 due to failure of node co4016
2018-04-23T09:42:04.940165+02:00 hn0 slurmctld[147256]: email msg to xxxxxxxxxx: SLURM Job_id=53507 Name=yyyyy Failed, Run time 10:31:16, NODE_FAIL, ExitCode 0
2018-04-23T09:42:04.942929+02:00 hn0 slurmctld[147256]: error: Nodes co4016 not responding, setting DOWN


The reconfigure happened here:
2018-04-23T14:19:26.655992+02:00 hn0 slurmctld[147256]: _slurm_rpc_submit_batch_job JobId=56017 usec=8518
2018-04-23T14:19:28.485859+02:00 hn0 slurmctld[147256]: Processing RPC: REQUEST_RECONFIGURE from uid=0
2018-04-23T14:19:28.909131+02:00 hn0 slurmctld[147256]: TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not use route/topology
...
2018-04-23T14:19:38.609339+02:00 hn0 slurmctld[147256]: restoring original state of nodes
2018-04-23T14:19:38.630682+02:00 hn0 slurmctld[147256]: cons_res: select_p_node_init
2018-04-23T14:19:38.631135+02:00 hn0 slurmctld[147256]: cons_res: preparing for 12 partitions
2018-04-23T14:19:38.642341+02:00 hn0 slurmctld[147256]: Killing job 53507 on DOWN node co4016
2018-04-23T14:19:38.647682+02:00 hn0 slurmctld[147256]: email msg to xxxxxe: SLURM Job_id=53507 Name=yyyy Failed, Run time 04:32:09, NODE_FAIL, ExitCode 0
2018-04-23T14:19:38.655455+02:00 hn0 slurmctld[147256]: _sync_nodes_to_jobs updated state of 1 nodes
2018-04-23T14:19:38.668985+02:00 hn0 slurmctld[147256]: _sync_nodes_to_comp_job: Job 53507 in completing state



I am trying to reproduce this, but since you are on 17.02.9 it may already be resolved by a newer patch.
I'll let you know what I find. In the meantime, could you send your slurm.conf?
Comment 10 Renate Dohmen 2018-05-18 08:27:07 MDT
Created attachment 6897 [details]
slurm.conf file
Comment 11 Renate Dohmen 2018-05-18 08:28:10 MDT
I attached the requested slurm.conf file, which was in use at the time the error occurred.
Comment 13 Felip Moll 2018-06-25 10:07:06 MDT
(In reply to Renate Dohmen from comment #11)
> I provided the requested slurm.conf file which was in use at the time the
> error occurred.

Hi Renate,

This is just a post to let you know we are still working on the issue.
I've identified the spot where it fails but still not the cause.

It seems that, for some reason, the bit for the failing node was not cleared from job_ptr->node_bitmap or job_ptr->node_bitmap_cg.

Did the situation happen again to you?
Comment 14 Felip Moll 2018-07-11 08:01:17 MDT
Renate,

I didn't manage to reproduce the issue, but I am still looking.

Did it happen again?
Comment 15 Felip Moll 2018-07-23 07:44:00 MDT
Hi Renate/Oliver,

I didn't find any particular spot that could lead to this situation.

Did this problem happen again, or was it just a one-time issue?

My guess is that some message got lost in the middle of a communication, but aside from that I don't see anything more. If it is still happening, are you able to reproduce it?
Comment 16 Renate Dohmen 2018-07-24 09:38:28 MDT
Hi, Felip,

> Did this problem happen again, or was it just a one-time issue?
At least, we did not observe it again.

> My guess is that some message got lost in the middle of a communication, but
> aside from that I don't see anything more. If it is still happening, are you
> able to reproduce it?
We did not try to reproduce it, and we also do not have the resources to set up test cases for problems like this.

Best regards,
Renate
Comment 17 Felip Moll 2018-08-01 01:53:10 MDT
Hi Oliver/Renate,

I am unable to reproduce the situation.

If you have only seen the issue once and haven't been able to reproduce it, I think it has to be something related to a lost message in the communication between slurmd and slurmctld, although that shouldn't create this situation.

I am inclined to close this bug and wait to see if it happens again; if it does, we can grab the Slurm logs, system logs, and so on again and look for any correlation with the first occurrence.

If it happens again, you could reopen the bug.

What do you think?
Comment 18 Oliver Schroeder 2018-08-01 01:53:23 MDT
Dear Sir or Madam,

I will be out of office till 02/AUGUST/2018; please contact lsf-support@science-computing.de in case of a problem.

Kindest regards, Oliver Schroeder

---

Dear Sir or Madam,

Unfortunately I will not have access to my emails until 02/AUGUST/2018.

In case of an urgent problem, please contact

lsf-support@science-computing.de

Kind regards,
Oliver Schroeder
Comment 19 Felip Moll 2018-08-07 06:48:08 MDT
I am closing this ticket since the issue seems not to have happened again, and neither I nor the customer can reproduce it.

If the issue reappears, don't hesitate to reopen the bug.

Regards,
Felip M