Ticket 406

Summary: One slurmd process reports: Malformed RPC of type 5016 received
Product: Slurm Reporter: aalba7 <ascanio.alba7>
Component: slurmd    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: da
Version: 2.6.x   
Hardware: Linux   
OS: Linux   
Attachments: 2 node partition

Description aalba7 2013-08-27 01:15:37 MDT
Running any type of Slurm job across two nodes results in one node hanging for 60 seconds. srun -N 1 does not cause this on the sole compute node.
The user processes have ended (i.e., no unkillable processes are left behind).


## run on VM headnode, 2 x VM compute nodes
srun -N 2 hostname


[2013-08-27T21:07:03.974] slurmd version 2.6.1 started
[2013-08-27T21:07:03.975] slurmd started on Tue, 27 Aug 2013 21:07:03 +0800
[2013-08-27T21:07:03.975] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=85013
[2013-08-27T21:07:23.576] launch task 58.0 request from 1252600001.1252600001@172.16.16.1 (port 40333)
[2013-08-27T21:07:23.597] error: Malformed RPC of type 5016 received
[2013-08-27T21:07:23.597] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:07:23.607] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:07:24.609] error: Malformed RPC of type 5016 received
[2013-08-27T21:07:24.609] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:07:24.619] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:07:25.620] error: Malformed RPC of type 5016 received
[2013-08-27T21:07:25.620] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:07:25.631] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:07:26.632] error: Malformed RPC of type 5016 received
[2013-08-27T21:07:26.632] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:07:26.642] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:07:27.644] error: Malformed RPC of type 5016 received
[2013-08-27T21:07:27.644] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:07:27.654] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:08:23.633] [58.0] Abandoning IO 60 secs after job shutdown initiated
[2013-08-27T21:08:26.001] [58.0] done with job
Comment 1 aalba7 2013-08-27 01:16:29 MDT
Also occurs in 2.6.0.
Not observed in 2.5.7.
Comment 2 aalba7 2013-08-27 02:00:48 MDT
I tested 13.12.0-0pre1 and the bug persists:

I forgot to mention that the job itself runs correctly; only the cleanup goes wrong.



[2013-08-27T21:56:30.163] CPU frequency setting not configured for this node
[2013-08-27T21:56:30.164] slurmd version 13.12.0-0pre1 started
[2013-08-27T21:56:30.165] slurmd started on Tue, 27 Aug 2013 21:56:30 +0800
[2013-08-27T21:56:30.165] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=87979
[2013-08-27T21:56:46.167] launch task 5.0 request from 0.0@172.16.16.1 (port 20110)
[2013-08-27T21:56:46.188] [5.0] done with job
[2013-08-27T21:56:49.869] launch task 6.0 request from 0.0@172.16.16.1 (port 22158)
[2013-08-27T21:56:49.894] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:49.894] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:49.904] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:56:50.906] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:50.906] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:50.916] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:56:51.918] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:51.918] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:51.928] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:56:52.930] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:52.930] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:52.940] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:56:53.941] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:53.941] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:53.952] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-27T21:57:49.934] [6.0] Abandoning IO 60 secs after job shutdown initiated
[2013-08-27T21:57:52.001] [6.0] done with job
Comment 3 Moe Jette 2013-08-28 11:36:13 MDT
RPC 5016 is REQUEST_STEP_COMPLETE

There is a 99.99% probability that you have inconsistent configuration files between the nodes or inconsistent versions installed.
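
For context, the error string corresponds to a sanity check on the wire format: the receiver compares the body length declared in the RPC header against the bytes that actually arrived, and rejects the message when the header claims more. Below is a minimal illustrative sketch in C; the type and field names are hypothetical, not Slurm's actual identifiers.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t msg_type;     /* e.g. 5016 = REQUEST_STEP_COMPLETE */
    uint32_t body_length;  /* body size the sender claims to have packed */
} rpc_header_t;

/* Reject a message whose header advertises more body than was received. */
static int check_msg_length(const rpc_header_t *hdr, uint32_t bytes_received)
{
    if (hdr->body_length > bytes_received) {
        fprintf(stderr, "error: Malformed RPC of type %u received\n",
                (unsigned) hdr->msg_type);
        fprintf(stderr,
                "error: Header lengths are longer than data received\n");
        return -1;
    }
    return 0;
}

int main(void)
{
    rpc_header_t hdr = { 5016, 128 };   /* header claims a 128-byte body */
    return check_msg_length(&hdr, 96);  /* but only 96 bytes arrived */
}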
Comment 4 aalba7 2013-08-29 02:34:23 MDT
This occurs on CentOS 6.4 VMs. I have also replicated it on bare metal running Fedora 19 (1 x combined controller/compute node + 1 x compute node).

Pretty sure this falls into the remaining 0.01% :-(. The head node and compute nodes are all running 2.6.1, and the md5sum of slurm.conf is identical everywhere. See logs below.
I am absolutely certain they are all running the same Slurm version and the same slurm.conf. When I revert to 2.5.7 everything works; with 2.6.0, 2.6.1, and 13.12 the issue is repeatable.

This consistently happens on the lexicographically first node, and only when two or more nodes are involved. For example, if I set node n100 DOWN, add a third node n102, and run

srun -N 2 hostname

this time the RPC 5016 / 60-second IO-abandonment problem occurs on n101.

srun -N 1 -n 2 --overcommit hostname doesn't cause a problem.

I saw a slurm-dev report of the same issue from Andy Riebs at HP; I believe you mentioned checking there too that the Slurm version and slurm.conf match. I have double- and triple-checked that the versions are the same.



On head01:
[root@head01 RPMS]# rpm -qi slurm
Name        : slurm                        Relocations: (not relocatable)
Version     : 2.6.1                             Vendor: (none)
Release     : 1.el6                         Build Date: Mon 26 Aug 2013 21:49:48 SGT
[root@head01 slurm]# md5sum /etc/slurm/slurm.conf
d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf


[2013-08-29T21:42:38.057] slurmctld version 2.6.1 started on cluster hpc8888
[2013-08-29T21:42:38.059] No trigger state file (/var/lib/slurm/trigger_state.old) to recover
[2013-08-29T21:42:38.059] error: Incomplete trigger data checkpoint file
[2013-08-29T21:42:38.059] read_slurm_conf: backup_controller not specified.
[2013-08-29T21:42:38.059] Reinitializing job accounting state
[2013-08-29T21:42:38.059] cons_res: select_p_reconfigure
[2013-08-29T21:42:38.059] cons_res: select_p_node_init
[2013-08-29T21:42:38.059] cons_res: preparing for 1 partitions
[2013-08-29T21:42:38.059] Running as primary controller



On compute nodes:

[root@head01 slurm]# pdsh -wn100,n101 md5sum /etc/slurm/slurm.conf
n100: d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf
n101: d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf


[root@head01 RPMS]# pdsh -wn100,n101 rpm -qi slurm
n101: Name        : slurm                        Relocations: (not relocatable)
n101: Version     : 2.6.1                             Vendor: (none)
n101: Release     : 1.el6                         Build Date: Mon 26 Aug 2013 21:49:48 SGT
n100: Name        : slurm                        Relocations: (not relocatable)
n100: Version     : 2.6.1                             Vendor: (none)
n100: Release     : 1.el6                         Build Date: Mon 26 Aug 2013 21:49:48 SGT

[2013-08-29T21:43:54.416] CPU frequency setting not configured for this node
[2013-08-29T21:43:54.418] slurmd version 2.6.1 started
[2013-08-29T21:43:54.418] slurmd started on Thu, 29 Aug 2013 21:43:54 +0800
[2013-08-29T21:43:54.419] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=260023



[2013-08-29T21:43:54.617] CPU frequency setting not configured for this node
[2013-08-29T21:43:54.620] slurmd version 2.6.1 started
[2013-08-29T21:43:54.620] slurmd started on Thu, 29 Aug 2013 21:43:54 +0800
[2013-08-29T21:43:54.620] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=265790
Comment 5 aalba7 2013-08-29 02:46:11 MDT
Created attachment 404
2 node partition

Bare-metal Fedora 19 two-node partition exhibiting this issue.
The Slurm RPMs are locally compiled.
Comment 6 aalba7 2013-08-29 02:48:26 MDT
Replicated on bare-metal Fedora 19 with 2.6.1 (two-node partition, controller on one of the nodes). There is no chance of version skew, as this is the first time these machines have had Slurm installed.



On a two-node job:

[2013-08-29T22:44:13.388] launch task 12.0 request from 0.0@192.168.1.7 (port 7866)
[2013-08-29T22:44:13.388] scaling CPU count by factor of 2
[2013-08-29T22:44:13.401] Received cpu frequency information for 8 cpus
[2013-08-29T22:44:13.401] scaling CPU count by factor of 2
[2013-08-29T22:44:13.527] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:13.527] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:13.537] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-29T22:44:14.539] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:14.539] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:14.549] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-29T22:44:15.551] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:15.552] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:15.562] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-29T22:44:16.564] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:16.564] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:16.574] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-29T22:44:17.576] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:17.576] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:17.586] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2013-08-29T22:45:13.564] [12.0] Abandoning IO 60 secs after job shutdown initiated
[2013-08-29T22:45:16.001] [12.0] done with job
Comment 7 Danny Auble 2013-08-29 04:17:32 MDT
This was fixed a week ago in commit 60a37b5460b4f61a66355318897ab3c6cbd0ff1c. Please test with the latest head, or the head of the 2.6 branch, and I am pretty sure the problem is resolved. It only affects jobacct_gather/none.
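
For illustration only: the usual shape of this class of bug is a pack/unpack asymmetry, where the sender skips a field when a plugin such as jobacct_gather/none is in use but the receiver unpacks it unconditionally, so the declared and consumed lengths disagree. The following is a self-contained, hypothetical sketch of that pattern in C; it is not the actual Slurm code or the actual fix.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint8_t data[64]; size_t off; size_t len; } buf_t;

static void pack32(buf_t *b, uint32_t v)
{
    memcpy(b->data + b->len, &v, sizeof(v));
    b->len += sizeof(v);
}

static int unpack32(buf_t *b, uint32_t *v)
{
    if (b->off + sizeof(*v) > b->len)
        return -1;              /* fewer bytes present than expected */
    memcpy(v, b->data + b->off, sizeof(*v));
    b->off += sizeof(*v);
    return 0;
}

int main(void)
{
    buf_t b = {0};
    bool have_jobacct = false;  /* jobacct_gather/none on the sender */

    /* Sender side: pack the step id, and accounting data only when a
     * real gather plugin is loaded. */
    pack32(&b, 58);             /* step id */
    if (have_jobacct)
        pack32(&b, 1234);       /* stand-in for the jobacct payload */

    /* Receiver side: unconditionally expects both fields, so with
     * jobacct_gather/none it runs past the end of the packed body. */
    uint32_t step_id, jobacct;
    if (unpack32(&b, &step_id) || unpack32(&b, &jobacct)) {
        fprintf(stderr, "error: Malformed RPC of type 5016 received\n");
        return 1;
    }
    return 0;
}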

If you find a bug, it may be better to test against the head of the code before submitting a bug report, since Slurm is constantly being developed and bug fixes go in every day. Looking at the NEWS file is also a good way to know whether something has already been fixed, though in this case you might not have known it was related to jobacct_gather/none.

Thanks
Comment 8 Danny Auble 2013-08-29 04:17:55 MDT
Sorry it was commit 33ff8dbc7d40875dc4e2af1520fc525d865154b2