| Summary: | One slurmd process reports: Malformed RPC of type 5016 received | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | aalba7 <ascanio.alba7> |
| Component: | slurmd | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | da |
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | 2 node partition | ||
|
Description
aalba7
2013-08-27 01:15:37 MDT
Occurs also in 2.6.0; not observed in 2.5.7. I tested 13.12.0-0pre1 and the bug persists. I forgot to mention that the job runs correctly; only something weird happens during the cleanup:

```
[2013-08-27T21:56:30.163] CPU frequency setting not configured for this node
[2013-08-27T21:56:30.164] slurmd version 13.12.0-0pre1 started
[2013-08-27T21:56:30.165] slurmd started on Tue, 27 Aug 2013 21:56:30 +0800
[2013-08-27T21:56:30.165] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=87979
[2013-08-27T21:56:46.167] launch task 5.0 request from 0.0@172.16.16.1 (port 20110)
[2013-08-27T21:56:46.188] [5.0] done with job
[2013-08-27T21:56:49.869] launch task 6.0 request from 0.0@172.16.16.1 (port 22158)
[2013-08-27T21:56:49.894] error: Malformed RPC of type 5016 received
[2013-08-27T21:56:49.894] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-27T21:56:49.904] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
... (the same three errors repeat once per second through 21:56:53) ...
[2013-08-27T21:57:49.934] [6.0] Abandoning IO 60 secs after job shutdown initiated
[2013-08-27T21:57:52.001] [6.0] done with job
```

RPC 5016 is REQUEST_STEP_COMPLETE.

There is a 99.99% probability that you have inconsistent configuration files between the nodes or inconsistent versions installed.

This occurs on CentOS 6.4 VMs. I have also replicated this on bare metal running Fedora 19 (1 x control/node + 1 x node). Pretty sure this falls into the 0.1% :-(. The head node and compute nodes are all running 2.6.1, and an md5sum checksum of slurm.conf shows they are all equal. See the logs below. I am absolutely certain they are all running the same level of Slurm and the same slurm.conf. When I revert to 2.5.7 everything works; with 2.6.0, 2.6.1, and 13.12 I get this repeatable issue.

This consistently happens on the lexicographically first node, and only if two or more nodes are involved. For example, set node n100 DOWN, add a third node n102, and run `srun -N 2 hostname`; this time the RPC 5016 / "Abandoning IO 60 secs" problem occurs on n101. `srun -N 1 -n 2 --overcommit hostname` doesn't cause a problem.

I saw a slurm-dev report of the same issue by Andy Riebs from HP. I believe you also mentioned checking that the Slurm version and slurm.conf match; I doubly and triply checked that the versions are the same.
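The "Header lengths are longer than data received" message means the length field carried in the RPC header advertises more payload than actually arrived on the socket. A minimal sketch of that validation logic (illustrative only, not Slurm's actual code; the `check_msg` helper is an invented name):

```shell
# check_msg CLAIMED RECEIVED: mimic the length check behind slurmd's
# "Header lengths are longer than data received" error. CLAIMED is the
# payload length the message header advertises; RECEIVED is the number
# of bytes that actually arrived.
check_msg() {
    if [ "$1" -gt "$2" ]; then
        echo "malformed: header claims $1 bytes, only $2 received"
    else
        echo "ok"
    fi
}

# A header advertising 10 payload bytes when only 3 arrived is rejected:
check_msg 10 3
```

In the report this check fails for every retransmitted REQUEST_STEP_COMPLETE (type 5016), so the same three errors recur on each retry.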
On head01:

```
[root@head01 RPMS]# rpm -qi slurm
Name        : slurm           Relocations: (not relocatable)
Version     : 2.6.1           Vendor: (none)
Release     : 1.el6           Build Date: Mon 26 Aug 2013 21:49:48 SGT
[root@head01 slurm]# md5sum /etc/slurm/slurm.conf
d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf
```

slurmctld log:

```
[2013-08-29T21:42:38.057] slurmctld version 2.6.1 started on cluster hpc8888
[2013-08-29T21:42:38.059] No trigger state file (/var/lib/slurm/trigger_state.old) to recover
[2013-08-29T21:42:38.059] error: Incomplete trigger data checkpoint file
[2013-08-29T21:42:38.059] read_slurm_conf: backup_controller not specified.
[2013-08-29T21:42:38.059] Reinitializing job accounting state
[2013-08-29T21:42:38.059] cons_res: select_p_reconfigure
[2013-08-29T21:42:38.059] cons_res: select_p_node_init
[2013-08-29T21:42:38.059] cons_res: preparing for 1 partitions
[2013-08-29T21:42:38.059] Running as primary controller
```

On the compute nodes:

```
[root@head01 slurm]# pdsh -wn100,n101 md5sum /etc/slurm/slurm.conf
n100: d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf
n101: d76e0b8b8c513a7e1dc43822e9dbd512  /etc/slurm/slurm.conf
[root@head01 RPMS]# pdsh -wn100,n101 rpm -qi slurm
n101: Name        : slurm           Relocations: (not relocatable)
n101: Version     : 2.6.1           Vendor: (none)
n101: Release     : 1.el6           Build Date: Mon 26 Aug 2013 21:49:48 SGT
n100: Name        : slurm           Relocations: (not relocatable)
n100: Version     : 2.6.1           Vendor: (none)
n100: Release     : 1.el6           Build Date: Mon 26 Aug 2013 21:49:48 SGT
```

slurmd logs:

```
[2013-08-29T21:43:54.416] CPU frequency setting not configured for this node
[2013-08-29T21:43:54.418] slurmd version 2.6.1 started
[2013-08-29T21:43:54.418] slurmd started on Thu, 29 Aug 2013 21:43:54 +0800
[2013-08-29T21:43:54.419] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=260023
[2013-08-29T21:43:54.617] CPU frequency setting not configured for this node
[2013-08-29T21:43:54.620] slurmd version 2.6.1 started
[2013-08-29T21:43:54.620] slurmd started on Thu, 29 Aug 2013 21:43:54 +0800
[2013-08-29T21:43:54.620] Procs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=996 TmpDisk=13646 Uptime=265790
```

Created attachment 404 [details]
2 node partition
Baremetal Fedora 19 2 node partition exhibiting this issue.
The slurm RPMs are locally compiled.
Replicated on bare-metal Fedora 19 with 2.6.1 (2-node partition, control running on one of the nodes). There is no chance of version skew, as this is the first time these machines have had Slurm on them. On a two-node job:

```
[2013-08-29T22:44:13.388] launch task 12.0 request from 0.0@192.168.1.7 (port 7866)
[2013-08-29T22:44:13.388] scaling CPU count by factor of 2
[2013-08-29T22:44:13.401] Received cpu frequency information for 8 cpus
[2013-08-29T22:44:13.401] scaling CPU count by factor of 2
[2013-08-29T22:44:13.527] error: Malformed RPC of type 5016 received
[2013-08-29T22:44:13.527] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2013-08-29T22:44:13.537] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
... (the same three errors repeat once per second through 22:44:17) ...
[2013-08-29T22:45:13.564] [12.0] Abandoning IO 60 secs after job shutdown initiated
[2013-08-29T22:45:16.001] [12.0] done with job
```

This was fixed a week ago, in commit 60a37b5460b4f61a66355318897ab3c6cbd0ff1c. Please test with the latest head, or the head of the 2.6 branch, and I am pretty sure the problem is resolved. It only has to do with jobacct_gather/none.

If you find a bug, it is often better to test against the code head before submitting a bug report, since Slurm is constantly being developed and bug fixes go in every day. Looking at the NEWS file is also a good way to know whether something has been fixed. In this case you might not have known it was related to jobacct_gather/none, though. Thanks.

Sorry, it was commit 33ff8dbc7d40875dc4e2af1520fc525d865154b2.
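To verify that a local build includes the fix, one can check whether the referenced commit is an ancestor of the checkout the RPMs were built from. A sketch (the `has_fix` helper and the `~/src/slurm` path are illustrative, not part of the report):

```shell
# has_fix COMMIT REPO_DIR: succeeds (exit 0) when COMMIT is already in
# the history of the checkout at REPO_DIR.
has_fix() {
    git -C "$2" merge-base --is-ancestor "$1" HEAD
}

# Example, run against the tree the slurmd RPMs were built from:
#   has_fix 33ff8dbc7d40875dc4e2af1520fc525d865154b2 ~/src/slurm \
#       && echo "fix present" || echo "rebuild from the 2.6 branch head"
```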