Created attachment 8633 [details]
slurm.conf

Hi,

We are having random users experiencing the following error messages. The problem is intermittent. I will attach the slurm.conf.

This is what the client sees:

salloc: Pending job allocation 31077281
salloc: job 31077281 queued and waiting for resources
salloc: error: Security violation, slurm message from uid 6281
salloc: Granted job allocation 31077281
salloc: Waiting for resource configuration
salloc: error: Security violation, slurm message from uid 6281
salloc: error: Job allocation 31077281 has been revoked
salloc: Relinquishing job allocation 31077281

This is what the controller sees:

[2018-12-13T12:10:43.491] sched: _slurm_rpc_allocate_resources JobId=31077281 NodeList=(null) usec=4377
[2018-12-13T12:10:43.558] sched: Allocate JobId=31077281 NodeList=nodef299 #CPUs=1 Partition=campus
[2018-12-13T12:10:43.674] error: slurm_receive_msgs: Zero Bytes were transmitted or received
[2018-12-13T12:10:43.684] Killing interactive JobId=31077281: Communication connection failure
[2018-12-13T12:10:43.684] _job_complete: JobId=31077281 WEXITSTATUS 1
[2018-12-13T12:10:43.844] _job_complete: JobId=31077281 done
[2018-12-13T12:10:46.684] _slurm_rpc_complete_job_allocation: JobId=31077281 error Job/step already completing or completed

These are the error messages in the slurmd logs:

[2018-12-13T12:10:43.788] _run_prolog: prolog with lock for job 31077281 ran for 0 seconds
[2018-12-13T12:10:55.155] _run_prolog: run job script took usec=115182

We have verified that the user id is the same on both the controller and the daemon. Please let me know what else I can provide.

Thanks,
Scott
Scott,

> I will attach the slurm.conf

Is your Slurm compiled with "--enable-multiple-slurmd"? If not, there is no need for the %n in your slurm.conf for the pidfile.

Is there a specific reason why MessageTimeout is set explicitly instead of using the default?

> MessageTimeout=45

Can you please run remunge on the client and the slurmctld nodes?

> remunge

> Here is what the controller sees.
> [2018-12-13T12:10:43.674] error: slurm_receive_msgs: Zero Bytes were transmitted or received

Have you had any configuration changes recently? The problem may be similar to bug #6147.

Could you please make sure your slurm.conf is synced on all nodes and restart all of your slurmctld, slurmdbd, and slurmd daemons?

--Nate
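For reference, a minimal sketch of the munge round-trip this check exercises ("slurmctld-host" below is a placeholder for your controller's hostname): encode a credential on one host and decode it on another. The STATUS must be "Success", and the UID/GID that unmunge reports must match on every host.

```shell
# Encode a null-payload credential and decode it on the same host;
# STATUS must be "Success (0)" and UID/GID must be the expected user.
munge -n | unmunge

# Decode a locally encoded credential on the controller instead.
# "slurmctld-host" is a placeholder for your controller's hostname.
munge -n | ssh slurmctld-host unmunge | grep -E '^(STATUS|UID|GID):'
```

remunge additionally stress-tests the munged daemon with many credentials per second; the single round-trip above is already enough to expose a UID/GID mismatch between hosts.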
1) Yes, Slurm is compiled with "--enable-multiple-slurmd". However, we plan to remove this soon. The config is maybe 10 years old at this point.

2) MessageTimeout is set explicitly to 45. If we did have a reason, we don't remember it, so we could reset it to the default if needed. However, we are not reserving memory for slurmd, and there is a fear that slurmd would get paged out.

3) Yes, we will remunge.

4) Yes, we have been messing around with the configuration, but the dates do not seem to line up. However, it could be that the changes are what is causing the issues. As a result, we are going to re-sync the configuration to all of our nodes, run remunge on all our nodes, and give the slurmd on all our nodes a restart.

Thanks,
Scott
(In reply to Scott Sisco from comment #2)

> 1) Yes, slurm is compiled with "--enable-multiple-slurmd". However, we plan
> to remove this soon. The config is maybe 10 years old at this point.

It shouldn't be needed for a production cluster but it also shouldn't hurt anything.

> 2) Message time is set explicitly to 45. If we do have reason we don't
> remember why we could reset to default if needed. However, we are not
> reserving memory for slurmd and there is a fear that slurmd would get paged
> out.

Then there should be no need to change it.

> 4) Yes, we have been messing around with the configuration, but the dates do
> not seem to line up. However, it could be that the changes are what is
> causing the issues. As a result, we are going to re-sync the configuration
> to all of our nodes. We will run remunge on all our nodes and we are going
> to give all the slurmd's a restart on all our nodes.

Please make sure to restart slurmctld and slurmdbd too.

--Nate
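One way to verify the re-sync actually took is to compare a hash of slurm.conf across the cluster. A sketch, assuming ClusterShell's clush and the default config path (both are assumptions; adjust for your site):

```shell
# Hash slurm.conf on every node, suppress per-node labels (-N), and
# collapse identical lines; more than one distinct hash in the output
# means the configuration is out of sync somewhere.
clush -a -N 'md5sum /etc/slurm/slurm.conf' | sort -u
```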
Here is an update. I have remunged all of our nodes.

Additionally, I checked whether we were running the most recent version of libslurmdb33 and found that many of our nodes were not. This means chef, which we use for configuration management, has been failing to update our nodes, leaving them in a pretty wonky state. I have now resolved that issue, and all nodes are currently running the most recent version of libslurmdb33.

Next up is scheduling a restart of slurmctld and slurmdbd.

Thanks,
Scott
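For anyone auditing the same kind of package skew, a sketch of a version check (the nodes.txt node list is hypothetical; libslurmdb33 is the Debian package name from this thread): collect one version string per node over ssh, then flag any node that disagrees with the majority.

```shell
# Collect "node version" pairs over ssh, then use awk to find the
# majority version and print every node that deviates from it.
while read -r node; do
    printf '%s %s\n' "$node" \
        "$(ssh "$node" dpkg-query -W -f '\${Version}' libslurmdb33)"
done < nodes.txt | awk '
    { count[$2]++; node[NR] = $1; ver[NR] = $2 }
    END {
        best = ""; bestc = 0
        for (v in count)
            if (count[v] > bestc) { best = v; bestc = count[v] }
        for (i = 1; i <= NR; i++)
            if (ver[i] != best) print "outlier:", node[i], ver[i]
    }'
```

An empty output means every node reports the same version; each "outlier:" line names a node that chef failed to update.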
I have restarted slurmctld and slurmdbd. I will continue to monitor the situation.

Scott
Hi Nate,

Unfortunately, re-munging, making sure all our nodes are running the most recent Slurm package, and restarting the slurmctld and slurmdbd services have not resolved the issue. Any idea what we should try next?

Thanks,
Scott
(In reply to Scott Sisco from comment #6)

> restarting the slurmctld and slurmdbd services has
> not resolved the issue. Any idea what we should try next?

Did you also restart slurmd on every node?

Can you please attach recent slurmctld logs and slurmd logs from one affected node?

--Nate
Hi Nate,

I have restarted slurmd across all 500 of our nodes. I will monitor the logs over the next few days to see if the error comes back.

Thanks,
Scott
(In reply to Scott Sisco from comment #8)

> Hi Nate,
>
> I have restarted slurmd across all 500 of our nodes. I will monitor the logs
> over the next few days to see if the error comes back.
>
> Thanks,
> Scott

If the issue comes back, please check the version on all the binaries being executed:

> $ srun -V
> $ sbatch -V
> $ salloc -V
> $ slurmctld -V

--Nate
Hi Nate,

Unfortunately, the error is still occurring. At your request, here are the versions.

All 500 nodes report:

srun -V = slurm-wlm 18.08.3
sbatch -V = slurm-wlm 18.08.3
salloc -V = slurm-wlm 18.08.3

The Slurm controller reports:

slurmctld -V = slurm-wlm 18.08.3
srun -V = slurm-wlm 18.08.3
sbatch -V = slurm-wlm 18.08.3
salloc -V = slurm-wlm 18.08.3

Thanks,
Scott
Scott,

Is it possible that a user is running slurmd instead of SlurmUser? The error in your logs shows a successful authentication, but as the wrong user to report privileged information to the controller (or vice versa).

Is it possible that your SlurmUser ("slurm" from the attached config) has a different uid on some of the nodes?

--Nate
Hi Nate,

This ticket can be marked as resolved.

One of the three servers that our scientists ssh into to launch jobs on our cluster had the wrong UID for the slurm user in /etc/passwd. The issue only occurred intermittently because the server would sometimes get the correct UID from LDAP and other times pull the incorrect UID from /etc/passwd. Once we fixed the UID in /etc/passwd on that server, the issue went away.

Thanks so much for your help!

-Scott
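For anyone hitting the same symptom, a minimal sketch of the check that would have caught this (the user name "slurm" is the SlurmUser from the attached config): compare the UID in the local /etc/passwd against the UID the name service switch resolves, which may come from LDAP instead.

```shell
# UID of the slurm user according to the local /etc/passwd file only.
local_uid=$(awk -F: '$1 == "slurm" {print $3}' /etc/passwd)

# UID of the slurm user as resolved through NSS (files, LDAP, etc.).
nss_uid=$(getent passwd slurm | awk -F: '{print $3}')

# Any disagreement between the two sources can surface as the
# intermittent "Security violation" errors seen in this ticket.
if [ "$local_uid" != "$nss_uid" ]; then
    echo "UID mismatch for slurm: /etc/passwd=$local_uid NSS=$nss_uid"
fi
```

Running this on every submit host and compute node would have pinpointed the one server whose /etc/passwd entry disagreed with LDAP.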
Scott,

Closing ticket per your response.

--Nate