One - and only one - user receives an error when they attempt to submit a job:

Batch job submission failed: Invalid account or account/partition combination specified

The logs report:

[2018-06-26T13:40:11.767] error: User 167096 not found
[2018-06-26T13:40:11.768] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-26T13:40:11.768] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
Hi Iain,

> [2018-06-26T13:40:11.767] error: User 167096 not found
> [2018-06-26T13:40:11.768] _job_create: invalid account or partition for user
> 167096, account '(null)', and partition 'batch'
> [2018-06-26T13:40:11.768] _slurm_rpc_submit_batch_job: Invalid account or
> account/partition combination specified

Please check these things:

1. Does user 167096 resolve correctly on your servers and nodes? Check it with
   "getent passwd 167096" on all of them.
2. Is the user shown in "sacctmgr show user"?

I think the problem must be that uid 167096 is found on your submission host but not on the node.
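A minimal sketch of that first check, to be run on each host (167096 is the uid from the logs; the helper function name is mine):

```shell
#!/bin/sh
# Report whether a numeric UID resolves on this host. A UID that resolves
# on the submit host but not on the controller would explain the
# "User ... not found" error from slurmctld.
check_uid() {
    if getent passwd "$1" >/dev/null 2>&1; then
        echo "uid $1: OK on $(uname -n)"
    else
        echo "uid $1: MISSING on $(uname -n)"
    fi
}

check_uid 167096
# Then confirm the accounting database knows the user by name:
# sacctmgr show user <username>
```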
Just in case my previous comment doesn't fix the issue, I suggest the following:

Usually a firewall like iptables is to blame, or different Slurm users set in the various .conf files. This problem should be fairly clearly marked in both the slurmctld and slurmdbd logs when it fails.

When you add a user with sacctmgr, slurmdbd will make an RPC to slurmctld on the registered clusters to inform them of the change. If slurmdbd can't talk to them, you should see an error logged in the slurmdbd logs, and consequently slurmctld won't realise the new user exists until it reloads its list of users from slurmdbd (say, on a restart).

Check your slurmdbd logs, and also check that:

sacctmgr list cluster format=cluster,controlhost

reports an IP address that slurmdbd can talk to for each cluster.

Finally, these values are important if your slurmdbd and slurmctld are on the same host:

DbdHost=<hostname -s value>
DbdAddr=<host or fqdn>
ControlMachine=<hostname -s value>
ControlAddr=<host or fqdn>

Can you take a look at which addresses are listening on the Slurm ports?

lsof -n -i -P | grep 6817
lsof -n -i -P | grep 6819

To summarize:
----------------
1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?
2. Are there any related errors in the logs?
3. Does a slurmctld restart fix the issue?
4. Check sacctmgr for a report of the IP addresses that slurmdbd can talk to.
5. Check the conf files for correct *Addr and *Host parameters.
6. Check address/port binding.
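Point 1 of the summary can be scripted. A sketch, assuming the usual config file locations (adjust the paths for your site):

```shell
#!/bin/sh
# Compare the SlurmUser setting between two Slurm config files and
# report whether they agree.
check_slurmuser() {
    u1=$(grep -i '^SlurmUser' "$1" | cut -d= -f2)
    u2=$(grep -i '^SlurmUser' "$2" | cut -d= -f2)
    if [ "$u1" = "$u2" ]; then
        echo "OK: SlurmUser=$u1 in both files"
    else
        echo "MISMATCH: $1 has '$u1', $2 has '$u2'"
    fi
}

# Typical invocation on a Slurm host:
# check_slurmuser /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf
```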
On 27/06/18 18:27, bugs@schedmd.com wrote:

Felip Moll changed bug 5355 (https://bugs.schedmd.com/show_bug.cgi?id=5355):
Assignee: support@schedmd.com -> felip.moll@schedmd.com
CC: +felip.moll@schedmd.com

> Comment # 1 (https://bugs.schedmd.com/show_bug.cgi?id=5355#c1) from Felip Moll:
>
> Hi Iain,
>
> [...]
>
> Please check these things:
>
> 1. Is the user 167096 resolved correctly on your servers and nodes? Check it
> with "getent passwd 167096" on all servers.

Yes it is. Tested on the submission host, the control node, and sundry other places.

> 2. Is the user shown in sacctmgr show user?

Looking up the user by name:

# sacctmgr show user zhant0e
      User   Def Acct     Admin
---------- ---------- ---------
   zhant0e    default      None

> I think the problem must be that uid 167096 is found on your submission host
> but not on the node.

________________________________
This message and its contents including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email.
On 27/06/18 19:07, bugs@schedmd.com wrote:

> Comment # 2 (https://bugs.schedmd.com/show_bug.cgi?id=5355#c2) from Felip Moll:
>
> Just in case my previous comment doesn't fix the issue, I suggest the
> following: [...]
>
> When you add a user with sacctmgr, slurmdbd will do an RPC to slurmctld on
> the registered clusters to inform them of this change. [...]

We don't explicitly add new users to slurm.

> To summarize:
>
> 1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?

[root@dm308-17 log]# grep SlurmUser /etc/slurm/slurmdbd.conf
SlurmUser=slurm
[root@dm308-17 log]# grep SlurmUser /etc/slurm/slurm.conf
SlurmUser=root

> 2. Are there any related errors in the logs?
>
> 3. Does a slurmctld restart fix the issue?

I have reloaded it (kill -HUP) and it hasn't.

> 4. Check sacctmgr for a report of IP addresses that slurmdbd can talk to.

This looks fine.

> 5. Check conf files for correct *Addr and *Host parameters.

These look fine.

> 6. Check address/port binding.

[root@dm308-17 log]# lsof -n -i -P | grep 6817
[root@dm308-17 log]# lsof -n -i -P | grep 6819
slurmctld  1262  root  6u IPv4   8386581 0t0 TCP 10.109.164.97:39946->10.109.164.97:6819 (ESTABLISHED)
slurmdbd  20811 slurm  9u IPv4    352852 0t0 TCP *:6819 (LISTEN)
slurmdbd  20811 slurm 10u IPv4 180552961 0t0 TCP 10.109.36.97:6819->10.109.0.1:43450 (ESTABLISHED)
slurmdbd  20811 slurm 13u IPv4   8379709 0t0 TCP 10.109.164.97:6819->10.109.164.97:39946 (ESTABLISHED)

Iain.
> We don't explicitly add new users to slurm.

How do you add new users? I can also reproduce the problem when adding a user to Slurm before it exists in the system, but then an "scontrol reconfig" fixes it.

> 1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?
>
> [root@dm308-17 log]# grep SlurmUser /etc/slurm/slurmdbd.conf
> SlurmUser=slurm
> [root@dm308-17 log]# grep SlurmUser /etc/slurm/slurm.conf
> SlurmUser=root

Why is your slurmctld running as root? This can be a security concern.

> 2. Are there any related errors in the logs?

Please send me the slurmctld and slurmdbd logs.

> 3. Does a slurmctld restart fix the issue?
>
> I have reloaded it (kill -HUP) and it hasn't.

That's odd; this has to be some other issue then.
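A hedged sketch of an add-user step along these lines: refuse to add a name the OS cannot resolve yet, and reconfigure afterwards so slurmctld picks up the change. The function name is mine, and the account name "default" is only taken from the sacctmgr output earlier in this thread:

```shell
#!/bin/sh
# Only add a user to the Slurm accounting database once the local
# nameservice can resolve the name, then ask slurmctld to reload so it
# learns about the new user.
safe_add_user() {
    user=$1
    acct=$2
    if ! getent passwd "$user" >/dev/null 2>&1; then
        echo "skipping $user: not resolvable on this host yet" >&2
        return 1
    fi
    sacctmgr -i add user "$user" account="$acct" &&
    scontrol reconfig
}

# Example invocation:
# safe_add_user zhant0e default
```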
Created attachment 7200 [details] slurmdbd.log-extract
Created attachment 7201 [details] slurmctld.log-extract
Your slurmdbd log is not useful to me; you should increase the log level of slurmdbd.

In the slurmctld log I see the reported error surrounded, every time, by:

[2018-06-10T22:27:00.429] error: slurm_receive_msgs: Socket timed out on send/recv operation

or by:

[2018-06-11T15:15:05.939] slurmctld: agent retry_list size is 102

or by a lot of backfill operations. This means your system is highly loaded and is unable to communicate with the nodes. I suspect you are hitting the limit on open files or max connections. The agent retry_list queue indicates that there is a network problem, since there are a lot of messages that must be resent.

It would be useful for me to see your "sdiag" output at a moment when the error is happening.

Please ensure you have followed this guide and that your system (slurmctld + nodes) is tuned properly:

https://slurm.schedmd.com/high_throughput.html

You should also take a look at the logs of your LDAP server, or whatever you use for user resolution; that server may need some tuning too. This may fix the problem.

e.g.:
[2018-06-10T22:26:03.483] backfill: Started JobID=11106585_1312(11108000) in batch on dbn404-06-r
[2018-06-10T22:26:03.490] backfill: Started JobID=11106585_1313(11108001) in batch on dbn404-19-r
[2018-06-10T22:27:00.429] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-10T22:27:01.649] error: User 167096 not found
[2018-06-10T22:27:01.650] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-10T22:27:01.651] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-10T22:28:12.710] backfill: Started JobID=11107962 in batch on dbn711-09-l
[2018-06-10T22:28:12.716] backfill: Started JobID=11107963 in batch on dbn711-09-r
[2018-06-11T15:15:05.923] backfill: Started JobID=11112419_188(11112609) in batch on dbn303-33-l
[2018-06-11T15:15:05.927] backfill: Started JobID=11112419_189(11112610) in batch on dbn303-33-l
[2018-06-11T15:15:05.932] backfill: Started JobID=11112419_190(11112611) in batch on dbn303-33-l
[2018-06-11T15:15:05.936] backfill: Started JobID=11112419_191(11112612) in batch on dbn303-33-l
[2018-06-11T15:15:05.939] slurmctld: agent retry_list size is 102
[2018-06-11T15:15:05.940] retry_list msg_type=6017,4005,6017,4005,6017
[2018-06-11T15:15:06.442] backfill: Started JobID=11112419_192(11112613) in batch on dbn303-33-l
[2018-06-11T15:16:09.590] error: User 167096 not found
[2018-06-11T15:16:09.591] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-11T15:16:09.592] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-11T15:17:18.169] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-11T15:17:34.137] backfill: Started JobID=11112419_193(11112614) in batch on dbn303-33-l
[2018-06-11T15:17:34.144] backfill: Started JobID=11112419_194(11112615) in batch on dbn303-33-l
[2018-06-11T15:17:34.150] backfill: Started JobID=11112419_195(11112616) in batch on dbn303-33-l
[2018-06-11T18:37:28.067] backfill: Started JobID=11112419_3142(11115940) in batch on kccn708-28-16
[2018-06-11T18:37:28.073] backfill: Started JobID=11112419_3143(11115941) in batch on kccn708-28-16
[2018-06-11T18:37:28.593] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-11T18:38:24.695] error: User 167096 not found
[2018-06-11T18:38:24.696] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-11T18:38:24.697] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-11T18:38:51.263] _slurm_rpc_submit_batch_job: JobId=11115942 InitPrio=786 usec=7426
[2018-06-11T18:39:32.681] node dbn303-19-l returned to service
[2018-06-11T18:40:10.476] backfill: Started JobID=11112419_3144(11115943) in batch on dbn302-06-r
[2018-06-11T18:40:10.482] backfill: Started JobID=11112419_3145(11115944) in batch on dbn303-03-r
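Some quick checks along the lines suggested above. The slurmctld PID lookup via pgrep is an assumption about the controller host:

```shell
#!/bin/sh
# Quick load/limit checks suggested by the high-throughput guide.
# Scheduler statistics (agent queue size, RPC backlog) on the controller:
# sdiag
# Open-file limit of the running slurmctld process:
# grep 'open files' /proc/"$(pgrep -o slurmctld)"/limits
# Limits visible from the current shell, and the kernel's listen backlog:
echo "open files (this shell): $(ulimit -n)"
sysctl net.core.somaxconn 2>/dev/null || true
```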
I've managed to get some more information. We have a cron script which looks for new users and uses sacctmgr to add them to Slurm. Users come from Active Directory. The script runs on a different node from the control master, and due to sssd caching effects there's a possibility that a user will be added to Slurm before slurmctld can resolve it. I shall try to fix that.

We've resolved the immediate problem by restarting slurmctld again. Folk belief here is that slurmctld needs restarting a couple of times in this situation.

Thank you for your assistance with this.

Iain.
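This is not the site's actual script, but one way to make such a cron job safer is to filter the candidate names through getent before calling sacctmgr, so that names not yet in the sssd cache are simply retried on the next run. The list_new_ad_users helper is hypothetical:

```shell
#!/bin/sh
# Keep only the usernames from stdin that the local nameservice can
# resolve; unresolvable names (e.g. not yet in the sssd cache) are
# dropped for this run and picked up again on the next cron pass.
resolvable_users() {
    while read -r u; do
        getent passwd "$u" >/dev/null 2>&1 && echo "$u"
    done
}

# Hypothetical cron usage:
# list_new_ad_users | resolvable_users | while read -r u; do
#     sacctmgr -i add user "$u" account=default
# done
# scontrol reconfig
```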
> We've resolved the immediate problem by restarting slurmctld again. Folk
> belief here is that slurmctld needs restarting a couple of times in this
> situation.

OK Iain, I am closing the bug then, but please also take my comment 8 into consideration; it can help you avoid future issues.

Best regards,
Felip M