| Summary: | protocol_version 9216 not supported | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Chris Holder <christopher.holder> |
| Component: | slurmctld | Assignee: | Benjamin Witham <benjamin.witham> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | benjamin.witham |
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Baylor College of Medicine Molecular and Human Genetics | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 23.02.4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld.log, slurmd.log, slurmd.log-20230828 | ||
Hello Chris, did you restart slurmctld and all of your slurmd daemons after the upgrade? Where does this IP come from: 10.66.4.144? The documentation below has a step-by-step procedure for upgrades:
> https://slurm.schedmd.com/quickstart_admin.html#upgrade

Yes, all of the slurmd daemons have been restarted. That IP address is one of our compute nodes. I am tailing the slurmctld logs right now and I am getting the same protocol error on a different node.
[2023-08-28T10:42:23.380] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T10:42:23.380] error: unpacking header
[2023-08-28T10:42:23.380] error: destroy_forward: no init
[2023-08-28T10:42:23.381] error: slurm_unpack_received_msg: [[mhgcp-g00.grid.bcm.edu]:40490] Message receive failure
[2023-08-28T10:42:23.391] error: slurm_receive_msg [10.66.4.214:40490]: Message receive failure
Thanks,
Chris

Just to be safe, I restarted the slurmd on the G00 node. I am getting this in the slurmd.log on the compute node:
[2023-08-28T10:45:39.345] slurmd version 23.02.4 started
[2023-08-28T10:45:39.346] slurmd started on Mon, 28 Aug 2023 10:45:39 -0500
[2023-08-28T10:45:39.347] CPUs=128 Boards=1 Sockets=4 Cores=16 Threads=2 Memory=1160259 TmpDisk=102427 Uptime=9058577 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-08-28T10:45:39.354] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.354] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.355] error: stepd_signal_container: invalid protocol_version 9216
[2023-08-28T10:45:39.355] error: stepd_signal_container: invalid protocol_version 9216
Thanks,
Chris

How do you start your slurmds? Do you use a systemd script?

systemctl <start|stop> slurmd
Thanks,
Chris

Could I get the slurmd logs from that slurmd you just restarted?

Created attachment 31991: slurmd.log
Here you go. I started a new log with this current restart, so you have the current log and the one immediately preceding it.
Thanks,
Chris

Created attachment 31992: slurmd.log-20230828

Have you checked the environment for older binaries? A profile setting such as .bashrc, or a loaded module, could be adding older Slurm binaries to the $PATH. Are the slurmstepds referencing the correct binary?

How can I verify the version of slurmstepd?
[root@mhgcp-g00 ~]# whereis slurmstepd
slurmstepd: /usr/sbin/slurmstepd /usr/share/man/man8/slurmstepd.8.gz
Thanks,
Chris

Does running the command

> ps aux | grep slurm

give you any information on the slurmstepd version?

Does not appear to…
[root@mhgcp-g00 ~]# ps aux | grep slurm
root 102165 0.0 0.0 283144 3476 ? Sl 10:03 0:00 slurmstepd: [1024766.extern]
root 102176 0.0 0.0 355060 4024 ? Sl 10:03 0:00 slurmstepd: [1024766.0]
root 160561 0.0 0.0 293672 4392 ? Ss 13:00 0:00 /usr/sbin/slurmd -D -s
root 176976 0.0 0.0 271944 2496 ? Sl Aug23 2:51 slurmstepd: [1014255.batch]
u250176 176982 0.0 0.0 113296 888 ? S Aug23 0:00 /bin/bash /var/spool/slurmd/job1014255/slurm_script
root 191875 0.0 0.0 271836 2480 ? Sl Aug22 3:43 slurmstepd: [1011787.batch]
232499 191884 0.0 0.0 113292 908 ? S Aug22 0:00 /bin/bash /var/spool/slurmd/job1011787/slurm_script 19D14Mac
root 214383 0.0 0.0 113088 1236 pts/3 R+ 13:37 0:00 grep --color=auto slurm
root 347822 0.0 0.0 271836 2320 ? Sl Aug22 3:45 slurmstepd: [1011632.batch]
232499 347826 0.0 0.0 113292 892 ? S Aug22 0:00 /bin/bash /var/spool/slurmd/job1011632/slurm_script 19D007Mac
Thanks,
Chris
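
The `ps aux` listing does not print a version, but its START column does show which slurmstepd processes were launched on an earlier day and so may predate the weekend upgrade. The sketch below is only an illustration: it parses the listing from this comment and assumes the usual procps convention of printing `HH:MM` for recent processes and a date such as `Aug23` for older ones. The function names are hypothetical, not part of Slurm.

```python
import re

# slurmstepd lines copied from the ps output in this ticket.
PS_OUTPUT = """\
root 102165 0.0 0.0 283144 3476 ? Sl 10:03 0:00 slurmstepd: [1024766.extern]
root 102176 0.0 0.0 355060 4024 ? Sl 10:03 0:00 slurmstepd: [1024766.0]
root 160561 0.0 0.0 293672 4392 ? Ss 13:00 0:00 /usr/sbin/slurmd -D -s
root 176976 0.0 0.0 271944 2496 ? Sl Aug23 2:51 slurmstepd: [1014255.batch]
root 191875 0.0 0.0 271836 2480 ? Sl Aug22 3:43 slurmstepd: [1011787.batch]
root 347822 0.0 0.0 271836 2320 ? Sl Aug22 3:45 slurmstepd: [1011632.batch]
"""

STEPD_RE = re.compile(r"slurmstepd: \[(?P<step>[^\]]+)\]")


def stepds(ps_text):
    """Return (pid, start, step) for every slurmstepd line in `ps aux` output."""
    found = []
    for line in ps_text.splitlines():
        fields = line.split(None, 10)  # COMMAND is the 11th column
        if len(fields) < 11:
            continue
        match = STEPD_RE.match(fields[10])
        if match:
            found.append((int(fields[1]), fields[8], match.group("step")))
    return found


def started_on_earlier_day(start):
    """Assumption: START without a colon is a date, not a time of day."""
    return ":" not in start


if __name__ == "__main__":
    for pid, start, step in stepds(PS_OUTPUT):
        if started_on_earlier_day(start):
            print(f"pid {pid} (step {step}) started {start}")
```

On this node the three `.batch` steps from Aug22 and Aug23 are the ones flagged; whether they were really started by the old binaries still has to be confirmed against the upgrade time.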

Did you have any running jobs while you upgraded your cluster? Do you have any scripts that may be referencing older binaries?

They were actively running. I shut down all the slurmd daemons before the upgrades.
Thanks,
Chris

I think that's our issue. The slurmstepds continued to run during the upgrade, and once the jobs finished, the stepds would send a completion message to the slurmctld. The slurmctld can recognize protocols up to two versions old, but 20.11 is out of that range for your jump to 23.02. To confirm this, can you run the command:

> sacctmgr show RunawayJobs
> https://slurm.schedmd.com/sacctmgr.html#OPT_RunawayJobs

[root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused
Thanks,
Chris
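
The two-version window described above can be checked with a little arithmetic. This is a rough sketch, not Slurm's own code: it assumes the protocol version is encoded as a per-release index shifted left by 8 bits, with 20.11 at index 36 and one step per major release after it. Those values are recalled from the Slurm source headers and should be verified against your own tree. The function names are hypothetical.

```python
# Assumption: protocol_version == (release_index << 8); the table below
# is recalled from Slurm's headers and is not taken from this ticket.
RELEASES = {36: "20.11", 37: "21.08", 38: "22.05", 39: "23.02"}


def release_for(protocol_version):
    """Map a protocol_version number to a release name, or None if unknown."""
    return RELEASES.get(protocol_version >> 8)


def is_supported(sender_version, receiver_version, window=2):
    """True if the sender is at most `window` major releases older than the receiver."""
    gap = (receiver_version >> 8) - (sender_version >> 8)
    return 0 <= gap <= window


if __name__ == "__main__":
    controller = 39 << 8  # 23.02 under the assumption above
    print(release_for(9216), is_supported(9216, controller))
```

Under these assumptions 9216 decodes to 20.11, which is three releases behind 23.02 and therefore outside the window, matching the `protocol_version 9216 not supported` errors.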

Do you use slurmdbd? Is it up right now? Do you have operator permissions with the account you ran the command on?

So here’s a fun one… I just got an email from one of my PIs, and apparently his jobs (the ones that disappeared from the queue and the reporting during the upgrade) ARE STILL RUNNING. The output files are still being written…
Thanks,
Chris

Okay, sounds like you do have runaway jobs. You should be able to fix them by using the

> sacctmgr show runawayjobs

command, but it looks like your slurmdbd isn't responding. Is your slurmdbd up right now?
It’s active and running:
[root@mhgcp-h00 packages]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-08-25 21:21:44 CDT; 2 days ago
Main PID: 25946 (slurmdbd)
Tasks: 6
Memory: 6.9M
CGroup: /system.slice/slurmdbd.service
└─25946 /usr/sbin/slurmdbd -D -s
I don’t know why the compute nodes are trying to connect to localhost for sacct and scontrol commands instead of the controller node…
Running the command on the controller node says no runaway jobs:
[root@mhgcp-h00 packages]# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster mhgcp
(mhgcp-h00 is the head node where slurmctld and slurmdbd are running. All other hosts are compute nodes in various partitions)
[root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacctmgr: error: Sending PersistInit msg: Connection refused
Running the
Thanks,
Chris

Yes, slurmdbd is running and from the head node (mhgcp-h00) I can run sacct queries.

Okay, great. There are runaway jobs on your system, but they haven't run past their runtime yet. Once the jobs have run longer than their runtime, they will show up in that command's output, and there will be an option to fix them.

> [root@mhgcp-g00 ~]# sacctmgr show RunawayJobs
> sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
> sacctmgr: error: Sending PersistInit msg: Connection refused

This is due to a configuration error. I noticed in your logs that many of your nodes had different slurm.confs than your slurmctld.

> error: Node mhgcp-c01 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

I would review the conf file that was distributed to the nodes and ensure that it has the correct address of your database.

What about the fact that even though all the daemons are working I can’t run any of the sacct or sacctmgr queries on the compute nodes? What is the proper list of Slurm modules (slurmd, slurmctld, slurmdbd, etc.) to be installed on the compute nodes?
Thanks,
Chris

sacct pulls the location of the accounting database from slurm.conf. If this is set to localhost, the compute nodes will attempt to find the database on themselves rather than at its actual location on the scheduler node. If AccountingStorageHost is set to localhost in your slurm.conf, change it to the address of your database and sync this change to the compute nodes in your cluster.

That fixed that… Is there any way for me to kill the runaway jobs?
Thanks,
Chris
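
The AccountingStorageHost check suggested above could be scripted roughly as follows. This is a minimal sketch that reads slurm.conf text and reports whether the setting points a node back at itself. The helper names and the sample configuration are hypothetical, and the parsing is deliberately simplistic (no `Include` handling).

```python
def accounting_storage_host(conf_text):
    """Return the last AccountingStorageHost value in slurm.conf text, or None."""
    host = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "accountingstoragehost":
            host = value.strip()
    return host


def points_at_self(host):
    """True when the setting would make each node look for slurmdbd on itself."""
    return host is not None and host.lower() in {"localhost", "127.0.0.1", "::1"}


if __name__ == "__main__":
    # Hypothetical fragment; mhgcp-h00 is the head node named in this ticket.
    sample = "ClusterName=mhgcp\nAccountingStorageHost=localhost\n"
    host = accounting_storage_host(sample)
    print(host, "-> fix needed" if points_at_self(host) else "-> ok")
```

Running the same check against the copy of slurm.conf on each compute node would show which nodes still carry the localhost setting.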

You can safely clear the runaway jobs once sacctmgr is able to detect them. Once the jobs have run past their runtime, sacctmgr will be able to pick them up and kill them in a safe way.

FANTASTIC!!
Thanks,
Chris

Hello Chris, were you able to clear your runaway jobs? If so, I'll close this ticket now.

According to my log files, those messages have not been logged since 8/28. I would consider that closed.

Created attachment 31990: slurmctld.log
After upgrading from 20.11.7 to 23.02.4 over the weekend we are getting the errors below. They are coming from several nodes in the cluster. We have checked all the nodes and confirmed that all the Slurm daemons and binaries are the same version.
[2023-08-28T06:27:30.056] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:30.056] error: unpacking header
[2023-08-28T06:27:30.056] error: destroy_forward: no init
[2023-08-28T06:27:30.057] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56276] Message receive failure
[2023-08-28T06:27:30.067] error: slurm_receive_msg [10.66.4.144:56276]: Message receive failure
[2023-08-28T06:27:45.068] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:45.069] error: unpacking header
[2023-08-28T06:27:45.069] error: destroy_forward: no init
[2023-08-28T06:27:45.069] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56278] Message receive failure
[2023-08-28T06:27:45.080] error: slurm_receive_msg [10.66.4.144:56278]: Message receive failure
[2023-08-28T06:28:00.081] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:28:00.081] error: unpacking header
[2023-08-28T06:28:00.081] error: destroy_forward: no init
[2023-08-28T06:28:00.082] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56280] Message receive failure
[2023-08-28T06:28:00.092] error: slurm_receive_msg [10.66.4.144:56280]: Message receive failure
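
Since the errors come from several nodes, it can help to tally which hosts the slurmctld is rejecting messages from. The sketch below is one possible way to do that, using a few lines copied from the logs in this ticket as sample input. It counts every `Message receive failure` line per host, so it would also count failures unrelated to the protocol version. The function name is hypothetical.

```python
import re
from collections import Counter

# Sample lines copied from the slurmctld.log excerpts in this ticket.
LOG = """\
[2023-08-28T06:27:30.056] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:30.057] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56276] Message receive failure
[2023-08-28T06:27:45.068] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T06:27:45.069] error: slurm_unpack_received_msg: [[mhgcp-c01.grid.bcm.edu]:56278] Message receive failure
[2023-08-28T10:42:23.380] error: unpack_header: protocol_version 9216 not supported
[2023-08-28T10:42:23.381] error: slurm_unpack_received_msg: [[mhgcp-g00.grid.bcm.edu]:40490] Message receive failure
"""

SENDER_RE = re.compile(
    r"slurm_unpack_received_msg: \[\[(?P<host>[^\]]+)\]:\d+\] Message receive failure"
)


def failing_senders(log_text):
    """Count 'Message receive failure' lines per sending host."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SENDER_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    for host, n in failing_senders(LOG).most_common():
        print(f"{host}: {n}")
```

The resulting host list is a starting point for checking each node for leftover processes from the previous release.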