Ticket 17548 - protocol_version 9216 not supported
Summary: protocol_version 9216 not supported
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.02.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Benjamin Witham
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-08-28 08:18 MDT by Chris Holder
Modified: 2023-09-05 10:41 MDT (History)

See Also:
Site: Baylor College of Medicine Molecular and Human Genetics
Version Fixed: 23.02.4


Attachments
slurmctld.log (474.92 KB, text/plain)
2023-08-28 08:18 MDT, Chris Holder
slurmd.log (3.55 KB, application/octet-stream)
2023-08-28 09:55 MDT, Chris Holder
slurmd.log-20230828 (129.13 KB, application/octet-stream)
2023-08-28 09:55 MDT, Chris Holder

Description Chris Holder 2023-08-28 08:18:21 MDT
Created attachment 31990 [details]
slurmctld.log

After upgrading from 20.11.7 to 23.02.4 over the weekend we are getting the below errors.  They are coming from several nodes in the cluster.  We have checked all the nodes and confirmed all the slurm daemons and binaries are the same versions.

[2023-08-28T06:27:30.056] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:30.056] error: unpacking header
[2023-08-28T06:27:30.056] error: destroy_forward: no init
[2023-08-28T06:27:30.057] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56276] Message receive failure
[2023-08-28T06:27:30.067] error: slurm_receive_msg [10.66.4.144:56276]: Message receive failure
[2023-08-28T06:27:45.068] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:45.069] error: unpacking header
[2023-08-28T06:27:45.069] error: destroy_forward: no init
[2023-08-28T06:27:45.069] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56278] Message receive failure
[2023-08-28T06:27:45.080] error: slurm_receive_msg [10.66.4.144:56278]: Message receive failure
[2023-08-28T06:28:00.081] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:28:00.081] error: unpacking header
[2023-08-28T06:28:00.081] error: destroy_forward: no init
[2023-08-28T06:28:00.082] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56280] Message receive failure
[2023-08-28T06:28:00.092] error: slurm_receive_msg [10.66.4.144:56280]: Message receive failure
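
To see which hosts are behind these failures, the log lines above can be grouped by source address. The following is a minimal Python sketch, not a Slurm tool: the function name `failing_senders` is hypothetical, and the regular expression assumes the exact `slurm_receive_msg [IP:port]: Message receive failure` layout shown in this excerpt. It counts receive failures of any kind and does not by itself prove that each one was caused by the protocol_version error.

```python
import re
from collections import Counter

# Matches lines such as:
# [2023-08-28T06:27:30.067] error: slurm_receive_msg [10.66.4.144:56276]: Message receive failure
# Assumption: the layout is exactly the one seen in the excerpt above.
RECV_FAIL = re.compile(
    r"error: slurm_receive_msg \[(?P<ip>\d+\.\d+\.\d+\.\d+):\d+\]: Message receive failure"
)


def failing_senders(log_lines):
    """Count 'Message receive failure' lines per source IP address."""
    counts = Counter()
    for line in log_lines:
        match = RECV_FAIL.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Lines copied from the slurmctld.log excerpt in this ticket.
sample = [
    "[2023-08-28T06:27:30.056] error: unpack_header: protocol_version 9216 not supported",
    "[2023-08-28T06:27:30.067] error: slurm_receive_msg [10.66.4.144:56276]: Message receive failure",
    "[2023-08-28T06:27:45.068] error: unpack_header: protocol_version 9216 not supported",
    "[2023-08-28T06:27:45.080] error: slurm_receive_msg [10.66.4.144:56278]: Message receive failure",
]

if __name__ == "__main__":
    for ip, count in failing_senders(sample).most_common():
        print(ip, count)
```

On a real log, the list of lines could come from `open("slurmctld.log")` instead of the inline sample.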
Comment 1 Benjamin Witham 2023-08-28 09:41:34 MDT
Hello Chris, 

After the upgrade, did you restart slurmctld and all of your slurmd daemons?

Where does this IP come from: 10.66.4.144 ?
Comment 2 Benjamin Witham 2023-08-28 09:43:04 MDT
The below documentation has a step-by-step procedure for upgrades:

> https://slurm.schedmd.com/quickstart_admin.html#upgrade
Comment 3 Chris Holder 2023-08-28 09:44:20 MDT
Yes, all of the slurmd daemons have been restarted.  That IP address is one of our Compute nodes.  I am tailing the slurmctld logs right now and I am getting the same protocol error on a different node.

[2023-08-28T10:42:23.380] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T10:42:23.380] error: unpacking header
[2023-08-28T10:42:23.380] error: destroy_forward: no init
[2023-08-28T10:42:23.381] error: slurm_unpack_received_msg: [[mhgcp-g00.grid.bcm.edu]:40490] Message receive failure
[2023-08-28T10:42:23.391] error: slurm_receive_msg [10.66.4.214:40490]: Message receive failure

Thanks,
Chris

Comment 4 Chris Holder 2023-08-28 09:47:11 MDT
Just to be safe, I restarted the slurmd on the G00 node.  I am getting this in the slurmd.log on the compute node:

[2023-08-28T10:45:39.345] slurmd version 23.02.4 started
[2023-08-28T10:45:39.346] slurmd started on Mon, 28 Aug 2023 10:45:39 -0500
[2023-08-28T10:45:39.347] CPUs=128 Boards=1 Sockets=4 Cores=16 Threads=2 Memory=1160259 TmpDisk=102427 Uptime=9058577 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-08-28T10:45:39.354] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.354] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.355] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.355] error: stepd_signal_container: invalid protocol_version 9216

Thanks,
Chris

Comment 5 Benjamin Witham 2023-08-28 09:48:37 MDT
How do you start your slurmds? Do you use a systemd script?
Comment 6 Chris Holder 2023-08-28 09:49:33 MDT
systemctl <start|stop> slurmd

Thanks,
Chris

Comment 7 Benjamin Witham 2023-08-28 09:50:25 MDT
Could I get the slurmd logs from that slurmd you just restarted?
Comment 8 Chris Holder 2023-08-28 09:55:09 MDT
Created attachment 31991 [details]
slurmd.log

Here you go.  I started a new log with this current restart, so you have the current log and the one immediately preceding it.

Thanks,
Chris

Comment 9 Chris Holder 2023-08-28 09:55:10 MDT
Created attachment 31992 [details]
slurmd.log-20230828
Comment 10 Benjamin Witham 2023-08-28 12:15:18 MDT
Have you checked the environment for older binaries? There could be a profile setting such as .bashrc or modules being loaded that add older binaries from slurm into the $PATH.

Are the slurmstepds referencing the correct binary?
Comment 11 Chris Holder 2023-08-28 12:27:10 MDT
How can I verify the version of slurmstepd?

[root@mhgcp-g00 ~]# whereis slurmstepd
slurmstepd: /usr/sbin/slurmstepd /usr/share/man/man8/slurmstepd.8.gz



Thanks,
Chris

Comment 12 Benjamin Witham 2023-08-28 12:36:10 MDT
Does running the command 

> ps aux | grep slurm

give you any information on the slurmstepd version?
Comment 13 Chris Holder 2023-08-28 12:37:52 MDT
Does not appear to…

[root@mhgcp-g00 ~]# ps aux | grep slurm
root     102165  0.0  0.0 283144  3476 ?        Sl   10:03   0:00 slurmstepd: [1024766.extern]
root     102176  0.0  0.0 355060  4024 ?        Sl   10:03   0:00 slurmstepd: [1024766.0]
root     160561  0.0  0.0 293672  4392 ?        Ss   13:00   0:00 /usr/sbin/slurmd -D -s
root     176976  0.0  0.0 271944  2496 ?        Sl   Aug23   2:51 slurmstepd: [1014255.batch]
u250176  176982  0.0  0.0 113296   888 ?        S    Aug23   0:00 /bin/bash /var/spool/slurmd/job1014255/slurm_script
root     191875  0.0  0.0 271836  2480 ?        Sl   Aug22   3:43 slurmstepd: [1011787.batch]
232499   191884  0.0  0.0 113292   908 ?        S    Aug22   0:00 /bin/bash /var/spool/slurmd/job1011787/slurm_script 19D14Mac
root     214383  0.0  0.0 113088  1236 pts/3    R+   13:37   0:00 grep --color=auto slurm
root     347822  0.0  0.0 271836  2320 ?        Sl   Aug22   3:45 slurmstepd: [1011632.batch]
232499   347826  0.0  0.0 113292   892 ?        S    Aug22   0:00 /bin/bash /var/spool/slurmd/job1011632/slurm_script 19D007Mac

Thanks,
Chris

Comment 14 Benjamin Witham 2023-08-28 12:57:16 MDT
Did you have any running jobs while you upgraded your cluster? Do you have any scripts that may be referencing older binaries?
Comment 15 Chris Holder 2023-08-28 13:14:37 MDT
They were actively running.  I shut down all the slurmd daemons before the upgrades.

Thanks,
Chris

Comment 16 Benjamin Witham 2023-08-28 13:43:18 MDT
I think that's our issue. The slurmstepds continued to run during the upgrade, and once the jobs finished, the stepds would send a completion message to the slurmctld. The slurmctld can recognize protocols up to two major releases old, but 20.11 is out of that range for your jump to 23.02.

To confirm this, can you run the command:

> sacctmgr show RunawayJobs

> https://slurm.schedmd.com/sacctmgr.html#OPT_RunawayJobs
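
The version window described above can be illustrated with a short Python sketch. It rests on two assumptions that should be checked against the Slurm source for your release: that the release index sits in the high byte of the 16-bit protocol version, and that the index-to-release table below is right (36 for 20.11 through 39 for 23.02). The helper names are hypothetical and are not part of any Slurm API.

```python
# Assumption: the release index is the high byte of the protocol version,
# and this index-to-release table is illustrative, not authoritative.
ASSUMED_RELEASES = {36: "20.11", 37: "21.08", 38: "22.05", 39: "23.02"}


def release_index(protocol_version):
    """Return the high byte of a 16-bit protocol version."""
    return protocol_version >> 8


def describe(protocol_version):
    """Map a protocol version to a release name using the assumed table."""
    return ASSUMED_RELEASES.get(release_index(protocol_version), "unknown")


def is_accepted(sender_version, receiver_version, window=2):
    """True if the sender is at most `window` releases behind the receiver."""
    gap = release_index(receiver_version) - release_index(sender_version)
    return 0 <= gap <= window


if __name__ == "__main__":
    old = 9216          # value reported in the slurmctld error messages
    new = 39 << 8       # assumed protocol version of a 23.02 daemon
    print(describe(old), describe(new), is_accepted(old, new))
```

Under these assumptions, 9216 decodes to index 36, which is three releases behind index 39 and therefore outside a window of two.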
Comment 17 Chris Holder 2023-08-28 13:46:13 MDT
[root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused

Thanks,
Chris

Comment 18 Benjamin Witham 2023-08-28 14:00:34 MDT
Do you use slurmdbd? Is it up right now? Do you have operator permissions with the account you ran the command on?
Comment 19 Chris Holder 2023-08-28 14:02:16 MDT
So here’s a fun one…  I just got an email from one of my PIs, and apparently his jobs (which disappeared from the queue and the reporting during the upgrade) ARE STILL RUNNING.  The output files are still being written…

Thanks,
Chris

Comment 20 Benjamin Witham 2023-08-28 14:12:29 MDT
Okay, it sounds like you do have runaway jobs. You should be able to fix them by using the

> sacctmgr show runawayjobs

command, but it looks like your slurmdbd isn't responding. Is your slurmdbd up right now?
Comment 21 Chris Holder 2023-08-28 14:20:06 MDT
It’s active and running:

[root@mhgcp-h00 packages]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-08-25 21:21:44 CDT; 2 days ago
Main PID: 25946 (slurmdbd)
    Tasks: 6
   Memory: 6.9M
   CGroup: /system.slice/slurmdbd.service
           └─25946 /usr/sbin/slurmdbd -D -s

I don’t know why the compute nodes are trying to connect to localhost for sacct and scontrol commands instead of the controller node…

Running the command on the controller node says no runaway jobs:

[root@mhgcp-h00 packages]# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster mhgcp

(mhgcp-h00 is the head node where slurmctld and slurmdbd are running.  All other hosts are compute nodes in various partitions)

[root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused


Running the

Thanks,
Chris

Comment 22 Chris Holder 2023-08-29 09:37:26 MDT
Yes, slurmDBD is running and from the head node (mhgcp-h00) I can run sacct queries.
Comment 23 Benjamin Witham 2023-08-29 10:18:53 MDT
Okay, great. There are runaway jobs on your system, but they haven't run past their runtime yet. Once the jobs have run longer than their runtime, they will show up in the output of that command, and there will be an option to fix them.

> [root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
> sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent 
> connection to host:localhost:6819: Connection refused
> sacctmgr: error: Sending PersistInit msg: Connection refused

This is due to a configuration error. I noticed that many of your nodes in your logs had different slurm.confs than your slurmctld.

> error: Node mhgcp-c01 appears to have a different slurm.conf than the slurmctld.  This 
> could cause issues with communication and functionality.  Please review both files and 
> make sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.

I would review the conf file that was distributed to the nodes and ensure that it has the correct address of your database.
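
One way to find which nodes carry a divergent copy is to compare a digest of each node's slurm.conf against the controller's. The Python sketch below is only an illustration: the digest is a plain SHA-256 over the non-comment lines, which is not the hash Slurm computes internally, and the function names and sample contents are hypothetical. Collecting the files from the nodes is left out.

```python
import hashlib


def conf_digest(text):
    """SHA-256 of a config with comments and blank lines stripped.

    Assumption: this is a stand-in comparison, not Slurm's own config hash.
    """
    kept = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            kept.append(line)
    return hashlib.sha256("\n".join(kept).encode()).hexdigest()


def mismatched_nodes(controller_conf, node_confs):
    """Return the sorted node names whose config differs from the controller's."""
    reference = conf_digest(controller_conf)
    return sorted(
        name for name, text in node_confs.items()
        if conf_digest(text) != reference
    )


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    controller = "ClusterName=mhgcp\nAccountingStorageHost=mhgcp-h00\n"
    nodes = {
        "mhgcp-c01": "ClusterName=mhgcp\nAccountingStorageHost=localhost\n",
        "mhgcp-g00": "ClusterName=mhgcp\nAccountingStorageHost=mhgcp-h00  # head node\n",
    }
    print(mismatched_nodes(controller, nodes))
```

Stripping comments before hashing means a node whose file differs only in comments is not reported.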
Comment 24 Chris Holder 2023-08-29 11:51:04 MDT
What about the fact that even though all the daemons are working I can’t run any of the sacct or sacctmgr queries on the compute nodes?

What is the proper list of Slurm components (slurmd, slurmctld, slurmdbd, etc.) to be installed on the compute nodes?

Thanks,
Chris

Comment 25 Benjamin Witham 2023-08-29 12:19:10 MDT
sacct reads the location of the accounting database from slurm.conf. If AccountingStorageHost is set to localhost there, the compute nodes will look for the database on themselves rather than on the scheduler node where it actually runs. If that is the case in your slurm.conf, change the value to the address of your database host and sync the change to the compute nodes in your cluster.
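
A quick check for this condition can be sketched in Python. The parser below is deliberately simple and is an assumption about what is needed here: it reads only the first `Key=Value` pair on each line, ignores `#` comments, and treats the key case-insensitively. The function names and the sample file contents are hypothetical; `mhgcp-h00` is the head node named earlier in this ticket.

```python
LOCAL_NAMES = {"localhost", "127.0.0.1", "::1"}


def accounting_host(conf_text):
    """Return the AccountingStorageHost value from slurm.conf text, or None."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "accountingstoragehost":
            return value.strip()
    return None


def points_at_localhost(conf_text):
    """True if AccountingStorageHost is explicitly set to a loopback name."""
    host = accounting_host(conf_text)
    return host is not None and host.lower() in LOCAL_NAMES


if __name__ == "__main__":
    # Hypothetical slurm.conf fragments, for illustration only.
    broken = "ClusterName=mhgcp\nAccountingStorageHost=localhost\n"
    fixed = "ClusterName=mhgcp\nAccountingStorageHost=mhgcp-h00\n"
    print(points_at_localhost(broken), points_at_localhost(fixed))
```

The sketch only reports an explicit loopback setting; it makes no claim about what Slurm does when the parameter is absent.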
Comment 26 Chris Holder 2023-08-29 12:40:15 MDT
That fixed that…

Is there any way for me to kill the runaway jobs?

Thanks,
Chris

Comment 27 Benjamin Witham 2023-08-29 14:28:16 MDT
You can safely clear the runaway jobs once the sacctmgr is able to detect them. Once the jobs have run past their runtime, the sacctmgr will be able to pick them up and kill them in a safe way.
Comment 28 Chris Holder 2023-08-29 15:36:59 MDT
FANTASTIC!!

Thanks,
Chris

Comment 29 Benjamin Witham 2023-09-04 16:21:10 MDT
Hello Chris, were you able to clear your runaway jobs? If so, I'll close this ticket now.
Comment 30 Chris Holder 2023-09-05 10:41:31 MDT
According to my log files, those messages have not been logged since 8/28.  I would consider that closed.