Ticket 7864

Summary: Annoying "error" message in slurmd log
Product: Slurm
Reporter: NYU HPC Team <hpc-staff>
Component: Other
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Version: 19.05.3
Hardware: Linux
OS: Linux
Site: NYU
Attachments:
- slurm.conf, slurmctld and slurmd logs
- slurmd and slurmctld log files after running "scontrol setdebug debug2"
- /usr/sbin/nhc
- /var/log/nhc-d.log

Description NYU HPC Team 2019-10-04 08:33:30 MDT
Created attachment 11821 [details]
slurm.conf, slurmctld and slurmd logs

Hi Expert, 

We have been seeing this "error" message in the slurmd log.  The operation seems to complete okay regardless.  Thank you!

Best, 
Wensheng
Comment 1 NYU HPC Team 2019-10-04 08:35:33 MDT
[2019-10-04T10:25:53.953] error: unpack_header: protocol_version 7936 not supported
[2019-10-04T10:25:53.953] error: unpacking header
[2019-10-04T10:25:53.953] error: destroy_forward: no init
[2019-10-04T10:25:53.953] error: slurm_receive_msg_and_forward: Message receive failure
[2019-10-04T10:25:53.963] error: service_connection: slurm_receive_msg: Message receive failure
[2019-10-04T10:27:33.300] error: unpack_header: protocol_version 7936 not supported
[2019-10-04T10:27:33.301] error: unpacking header
[2019-10-04T10:27:33.301] error: destroy_forward: no init
[2019-10-04T10:27:33.301] error: slurm_receive_msg_and_forward: Message receive failure
[2019-10-04T10:27:33.311] error: service_connection: slurm_receive_msg: Message receive failure
Comment 3 Ben Roberts 2019-10-04 10:09:08 MDT
Hi Wensheng,

I think there might be a couple of things going on here.  One of the messages that shows up in your slurmctld log indicates that there is a problem with how the nodes are specified in your topology.conf file:
[2019-10-04T10:19:34.542] error: WARNING: Invalid hostnames in switch configuration: ...

Would you mind sending a copy of your topology.conf file for me to review?  

The other messages you referenced in the ticket look like they're related to differences in client versions:
[2019-10-04T10:25:53.953] error: unpack_header: protocol_version 7936 not supported


Did you recently upgrade your system?  Can you check the version of a few binaries for me?  On the head node can you run 'slurmctld -V'?  From one of the compute nodes I'd like to have you run 'slurmd -V' and 'squeue -V'.  

Thanks,
Ben
Comment 4 NYU HPC Team 2019-10-04 12:08:25 MDT
Hi Ben,

We are asking about these messages on the slurmd side:
[2019-10-04T10:25:53.953] error: unpack_header: protocol_version 7936 not supported
[2019-10-04T10:25:53.953] error: unpacking header
[2019-10-04T10:25:53.953] error: destroy_forward: no init
[2019-10-04T10:25:53.953] error: slurm_receive_msg_and_forward: Message receive failure
[2019-10-04T10:25:53.963] error: service_connection: slurm_receive_msg: Message receive failure


We are not concerned about this message from slurmctld:
[2019-10-04T10:19:34.542] error: WARNING: Invalid hostnames in switch configuration: ...
since we isolated three nodes yesterday for testing version 19.05.3 to prepare for an upgrade, and did not bother to change topology.conf. Sorry I did not say that earlier.


The test cluster consists of three nodes c30-[09-11]. slurmctld runs on c30-09.
On c30-09, we have:
$ /opt/slurm/sbin/slurmctld -V
slurm 19.05.3-2

On both c30-10 and c30-11, the versions are as below:
$ /opt/slurm/sbin/slurmd -V
slurm 19.05.3-2
$ squeue -V
slurm 19.05.3-2


Thank you,
Wensheng
Comment 5 Ben Roberts 2019-10-04 13:52:31 MDT
Hi Wensheng,

Thanks for verifying the versions for me and confirming that you're aware of the other log messages.  I ran some jobs through 19.05.3 to verify that this isn't something easily reproducible.  There is some additional information at the debug2 level that should help me see where this is coming from.  Can I have you enable this logging long enough for these errors to show up again, and then turn it back off?  You can enable it like this:
scontrol setdebug debug2

And after you've seen this message show up in the logs, you can turn it off again like this:
scontrol setdebug verbose

Thanks,
Ben
Comment 6 NYU HPC Team 2019-10-07 07:34:03 MDT
Created attachment 11844 [details]
slurmd and slurmctld log files after running "scontrol setdebug debug2"

Hi Ben,

I did as you suggested, and submitted a test job. Please see the attachment.

Thanks
Wensheng
Comment 7 Ben Roberts 2019-10-07 09:50:46 MDT
Hi Wensheng,

Thank you for doing that.  From the logging it looks like this is related to the node health check script that is running.  Is there a chance you could send me a copy of your health check script (/usr/sbin/nhc-slurm-wrapper)?

Thanks,
Ben
Comment 8 NYU HPC Team 2019-10-07 11:06:48 MDT
Created attachment 11858 [details]
/usr/sbin/nhc

Hi Ben,

/usr/sbin/nhc-slurm-wrapper is basically a one line script. Please see below. I attached /usr/sbin/nhc. Thank you!  ~Wensheng

$ cat /usr/sbin/nhc-slurm-wrapper
#!/bin/bash

/usr/sbin/nhc NHC_RM=slurm $*


And this is the NHC version:
$ rpm -qa |grep nhc
lbnl-nhc-1.4.2-1.el7.noarch
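As an aside, the unquoted $* in a wrapper like the one above re-splits any argument that contains whitespace, while "$@" forwards each argument intact.  A minimal sketch (hypothetical helper names, not part of NHC) showing the difference:

```shell
# with_star re-expands its arguments via unquoted $*, so whitespace inside
# a single argument causes it to be split into multiple words.
with_star() { set -- $*; echo $#; }

# with_at forwards arguments with "$@", preserving each one intact.
with_at() { set -- "$@"; echo $#; }

with_star "two words"   # prints 2: the single argument was split
with_at "two words"     # prints 1: quoting preserved
```

By the same logic, a hardened wrapper would end with `exec /usr/sbin/nhc NHC_RM=slurm "$@"`, though with no whitespace-bearing arguments in practice the behavior here is the same.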
Comment 9 Ben Roberts 2019-10-07 12:04:36 MDT
I was curious whether the script had a hard-coded path to a previous version of the slurm commands.  It doesn't look like that's the case, though.  Do your nodes have previous versions of the slurm commands on them to allow you to switch between versions if needed?  Do you have to load a module or something similar to make sure the current version of the commands is used when you log on to the nodes?  If you run the nhc script manually on one of the nodes, does it return an error?  I would also like you to modify your nhc-slurm-wrapper script on one of your nodes to call the nhc script with the following arguments:
-d -l /var/log/nhc.log

This should enable debug mode for the script and log information to /var/log/nhc.log (you can adjust the path if needed).  This just has to be long enough for it to run the health check once and you can revert the changes.  Let me know if this is a possibility.

Thanks,
Ben
Comment 10 NYU HPC Team 2019-10-07 12:32:59 MDT
Created attachment 11859 [details]
/var/log/nhc-d.log

Hi Ben,

I changed the script as you suggested:
# cat /usr/sbin/nhc-slurm-wrapper
#!/bin/bash

/usr/sbin/nhc -d -l /var/log/nhc-d.log NHC_RM=slurm $*


Please see the attached NHC log file "nhc-d.log".  We don't have other versions of Slurm available on the nodes; only the Slurm commands from version 19.05.3 are installed.  We did modify the rpm spec file 'slurm.spec' so that it generates the following file.  Is that a concern?
$ cat /etc/ld.so.conf.d/slurm.conf 
/opt/slurm/lib64
/opt/slurm/lib64/slurm

Thank you!
Comment 11 Ben Roberts 2019-10-07 16:13:53 MDT
The logs you sent don't show that there is any error in the node health check script or much interaction with slurm other than getting the state of the node.  Since that's the case I'm digging a little deeper into what might be causing this error to appear.  While I'm looking, I wonder if you can get me some information about slurmdbd by running this:
ps aux | grep slurmdbd

Thank you, 
Ben
Comment 12 NYU HPC Team 2019-10-07 17:46:22 MDT
Sure, please see below:
[root@c30-09 deng]# ps aux |grep slurmdbd
slurm    32761  0.0  0.0 341308  4676 ?        Sl   09:19   0:00 /opt/slurm/sbin/slurmdbd
root     39662  0.0  0.0 112716   980 pts/0    S+   19:45   0:00 grep --color=auto slurmdbd
Comment 13 NYU HPC Team 2019-10-08 08:59:45 MDT
On a compute node I commented out the (last) three NHC-related lines in slurm.conf and restarted slurmd.  I am still seeing the error messages.
Comment 14 Ben Roberts 2019-10-08 13:07:48 MDT
Hello Wensheng,

I'd like to verify that the slurm libraries got updated correctly with the rest of the binaries.  Can I have you go to your SLURM_HOME directory and run the following commands:
strings lib/libslurm.so | grep 19.05
ls -l lib/libslurm.so


Your SLURM_HOME is the destination specified for the installation.  If you're unsure where this is, you can look at your config.log and look for the 'prefix' path or look for the 'libdir' if specified.  
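A broader spot-check along the same lines (a sketch: it assumes the /opt/slurm prefix shown elsewhere in this ticket and that `strings` from binutils is installed) would print the embedded version string of each libslurm library, so a stale leftover from a previous install stands out:

```shell
# Print the first embedded SLURM_VERSION_STRING from each libslurm library
# under the install prefix; a stale library would report an older version.
SLURM_HOME=${SLURM_HOME:-/opt/slurm}
for lib in "$SLURM_HOME"/lib64/libslurm*.so*; do
    [ -e "$lib" ] || { echo "no libraries under $SLURM_HOME/lib64"; break; }
    printf '%s: ' "$lib"
    strings "$lib" | grep -m1 SLURM_VERSION_STRING || echo "(no version string)"
done
```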

Thanks,
Ben
Comment 15 NYU HPC Team 2019-10-08 18:34:53 MDT
Hi Ben,

Please see below:

[deng@c30-10 slurm]$ pwd
/opt/slurm
[deng@c30-10 slurm]$ strings lib64/libslurm.so | grep 19.05
19.05.3-2
PACKAGE_STRING "slurm 19.05"
SLURM_19_05_PROTOCOL_VERSION ((34 << 8) | 0)
PACKAGE_VERSION "19.05"
VERSION "19.05.3"
/home/deng/rpmbuild/BUILD/slurm-19.05.3-2/src/api
SLURM_PROTOCOL_VERSION SLURM_19_05_PROTOCOL_VERSION
SLURM_VERSION_STRING "19.05.3-2"
/home/deng/rpmbuild/BUILD/slurm-19.05.3-2/src/common
__func__.19205
__func__.19105
[deng@c30-10 slurm]$ ls -l lib64/libslurm.so
lrwxrwxrwx 1 root root 18 Oct  4 09:59 lib64/libslurm.so -> libslurm.so.34.0.0


Thanks,
Wensheng
Comment 16 Ben Roberts 2019-10-09 09:36:39 MDT
Hi Wensheng,

Thanks for verifying the libslurm information.  This library does look like it was updated correctly.  I was discussing this issue with a colleague and they said that they've seen similar behavior when something didn't quite work right with make/make install.  There are a lot of other libraries to try and track down, so rather than spend time going down that road I think it would be best to try doing a 'make clean' and reinstalling 19.05.3.  Is this something you can do outside of a maintenance window?  

Thanks,
Ben
Comment 17 NYU HPC Team 2019-10-17 13:18:40 MDT
Hi Ben,

I got a chance to come back to this, and it is understood now.  As mentioned previously, we isolated three nodes, c30-[09-11], from a large cluster to set up a small test cluster for trying the new version of Slurm.  We generated a new Munge key for the test cluster and used one of the three nodes as the master to run the slurmdbd and slurmctld services.

We modified the slurm.conf on the large cluster, setting c30-[09-11] as DownNodes:
DownNodes=c30-[09-11] Reason="Cluster Dev" State=DOWN

We did not expect that the slurmctld in the large cluster would still poll the three nodes, hence the error: unpack_header: protocol_version 7936 not supported.

I had assumed that because the Munge keys are different, slurmctld could not communicate with a slurmd from the other cluster.  Apparently that is not true.


Thanks,
Wensheng
Comment 18 Ben Roberts 2019-10-17 21:50:54 MDT
Thanks for following up on this ticket; this does make sense given what you were seeing.  I kept trying to think of how you might have a different client version somewhere.  To get slurmctld to stop polling the nodes, you would want to take them out of slurm.conf rather than just marking them down.
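For illustration, stopping the polling would mean removing the three nodes from the production slurm.conf node (and partition) definitions entirely rather than listing them as DownNodes.  The host lists below are illustrative, not taken from the actual config:

```
# Before (production slurm.conf): the test nodes are marked down but still
# defined, so slurmctld keeps polling them.
NodeName=c30-[01-11] ...
DownNodes=c30-[09-11] Reason="Cluster Dev" State=DOWN

# After: the test nodes are removed from the definitions entirely.
NodeName=c30-[01-08] ...
```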

Regarding the munge key, it could be that the munge service wasn't restarted after generating a new key.  I tested and verified that it keeps the previous key in memory until the service is restarted.  If the nodes are still in the same state you can verify this by sending a message to c30-[09-11] from the machine running slurmctld to see if it can be decrypted.  That would look something like this:
munge -s "test message" | ssh <user>@c30-09 unmunge


Does it sound like a possible scenario that these machines were not restarted after creating a new key?  If not, did you see any munge errors in addition to the protocol mismatch errors?  

Thanks,
Ben
Comment 19 NYU HPC Team 2019-10-18 07:37:36 MDT
Hi Ben,

After generating the munge key, I did restart the slurmd services on the test cluster, and I tested that exact munge & unmunge command from the slurmctld host on the large cluster against the slurmd test nodes.  The command returned an error, which is expected.
Comment 20 Ben Roberts 2019-10-18 09:28:18 MDT
I looked into the errors that would show up if you had slurmctld trying to communicate to nodes with a different munge key.  The errors that would show up should look something like this:

[2019-10-18T10:03:04.096] error: slurm_unpack_received_msg: MESSAGE_COMPOSITE has authentication error: Invalid authentication credential
[2019-10-18T10:03:04.096] error: slurm_unpack_received_msg: Protocol authentication error
[2019-10-18T10:03:04.106] error: slurm_receive_msg [192.168.0.27:32964]: Unspecified error
[2019-10-18T10:03:04.108] error: Munge decode failed: Invalid credential


These would be in the slurmctld logs of the production system these nodes were taken from.  The slurmctld logs I have look like they're from the test system you set up.  Do you see log entries like those in your main system's logs?  

Thanks,
Ben
Comment 21 NYU HPC Team 2019-10-18 11:53:37 MDT
The slurmctld log from the production system shows:
[2019-10-18T13:44:33.733] error: Invalid Protocol Version 8704 from uid=-1 at 172.16.3.208:6818
[2019-10-18T13:44:33.734] error: slurm_receive_msgs: Incompatible versions of client and server code

The slurmctld and slurmd in this case have different munge keys, and on different Slurm versions (18.08.5-2 & 19.05.3-2).
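For context on the numbers: Slurm encodes its wire protocol version as ((major << 8) | minor), as the SLURM_19_05_PROTOCOL_VERSION ((34 << 8) | 0) string in comment 15 shows.  A quick sanity check of the two values appearing in these logs (mapping major 31 to a specific release is left open; it would be several releases older than 18.08 if the one-increment-per-release convention holds):

```shell
# 19.05's protocol version: (34 << 8) | 0 = 8704, exactly the value the
# 18.08 production slurmctld rejects as "Invalid Protocol Version 8704".
echo $(( (34 << 8) | 0 ))      # prints 8704

# Decode the unsupported value from the slurmd log: major 31, minor 0.
echo $(( 7936 >> 8 ))          # prints 31
echo $(( 7936 & 0xff ))        # prints 0
```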
Comment 22 Ben Roberts 2019-10-18 15:36:25 MDT
Hi Wensheng,

I've looked into where the messages you are seeing are coming from and why you're not seeing the munge errors on your side.  They are coming from the slurm_receive_msgs function in src/common/slurm_protocol_api.c:
https://github.com/SchedMD/slurm/blob/55cf1583d506f8c209a5886aea6679aaa149fac9/src/common/slurm_protocol_api.c#L3640-L3648


It's not until after it checks the header information on the message it is handling that it will run the slurm_auth_* checks where it does the munge verification.  This explains why you're seeing these messages and not the munge related ones.  It sounds like I haven't quite reproduced the scenario you're seeing though since I am getting the munge errors too.  

I do think we have a good grasp of what happened to cause these errors in the logs for you though.  Do you have additional questions about this or are you ok to close this ticket at this point?  

Thanks,
Ben
Comment 23 NYU HPC Team 2019-10-18 16:56:45 MDT
Ben, let's close the ticket. We can re-open it when necessary. Thank you for the interaction.