One - and only one - user receives an error when they attempt to submit a job:

Batch job submission failed: Invalid account or account/partition combination specified

The logs report:

[2018-06-26T13:40:11.767] error: User 167096 not found
[2018-06-26T13:40:11.768] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-26T13:40:11.768] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
Hi Iain,

> [2018-06-26T13:40:11.767] error: User 167096 not found
> [2018-06-26T13:40:11.768] _job_create: invalid account or partition for user
> 167096, account '(null)', and partition 'batch'
> [2018-06-26T13:40:11.768] _slurm_rpc_submit_batch_job: Invalid account or
> account/partition combination specified

Please check these things:

1. Does user 167096 resolve correctly on your servers and nodes? Check it with
   "getent passwd 167096" on all of them.
2. Is the user shown in "sacctmgr show user"?

I think the problem must be that uid 167096 is found on your submission host but not on the node.
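A minimal sketch of that first check, to be run on each host (167096 is the uid from the logs; the helper function name is mine):

```shell
#!/bin/sh
# Report whether a numeric UID resolves on this host. A UID that resolves
# on the submit host but not on the controller would explain the
# "User ... not found" error from slurmctld.
check_uid() {
    if getent passwd "$1" >/dev/null 2>&1; then
        echo "uid $1: OK on $(uname -n)"
    else
        echo "uid $1: MISSING on $(uname -n)"
    fi
}

check_uid 167096
# Then confirm the accounting database knows the user by name:
# sacctmgr show user <username>
```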
Just in case my previous comment doesn't fix the issue, I suggest the following:

Usually a firewall like iptables is to blame, or different Slurm users set in the various .conf files. This problem should be fairly clearly marked in both the slurmctld and slurmdbd logs when it fails.

When you add a user with sacctmgr, slurmdbd will make an RPC to slurmctld on the registered clusters to inform them of the change. If slurmdbd can't talk to them, you should see an error logged in the slurmdbd logs, and consequently slurmctld won't realise the new user exists until it reloads its list of users from slurmdbd (say, on a restart).

Check your slurmdbd logs, and also check that:

sacctmgr list cluster format=cluster,controlhost

reports an IP address that slurmdbd can talk to for each cluster.

Finally, these values are important if your slurmdbd and slurmctld are on the same host:

DbdHost=<hostname -s value>
DbdAddr=<host or fqdn>
ControlMachine=<hostname -s value>
ControlAddr=<host or fqdn>

Can you take a look at which addresses are listening on the Slurm ports?

lsof -n -i -P | grep 6817
lsof -n -i -P | grep 6819

To summarize:
----------------
1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?
2. Are there any related errors in the logs?
3. Does a slurmctld restart fix the issue?
4. Check sacctmgr for a report of the IP addresses that slurmdbd can talk to.
5. Check the conf files for correct *Addr and *Host parameters.
6. Check address/port binding.
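Point 1 of the summary can be scripted. A sketch, assuming the usual config file locations (adjust the paths for your site):

```shell
#!/bin/sh
# Compare the SlurmUser setting between two Slurm config files and
# report whether they agree.
check_slurmuser() {
    u1=$(grep -i '^SlurmUser' "$1" | cut -d= -f2)
    u2=$(grep -i '^SlurmUser' "$2" | cut -d= -f2)
    if [ "$u1" = "$u2" ]; then
        echo "OK: SlurmUser=$u1 in both files"
    else
        echo "MISMATCH: $1 has '$u1', $2 has '$u2'"
    fi
}

# Typical invocation on a Slurm host:
# check_slurmuser /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf
```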
On 27/06/18 18:27, bugs@schedmd.com wrote:

Felip Moll changed bug 5355 (https://bugs.schedmd.com/show_bug.cgi?id=5355):
Assignee: support@schedmd.com -> felip.moll@schedmd.com
CC: +felip.moll@schedmd.com

> Comment # 1 (https://bugs.schedmd.com/show_bug.cgi?id=5355#c1) from Felip Moll:
>
> Hi Iain,
>
> [...]
>
> Please check these things:
>
> 1. Is the user 167096 resolved correctly on your servers and nodes? Check it
> with "getent passwd 167096" on all servers.

Yes it is. Tested on the submission host, the control node, and sundry other places.

> 2. Is the user shown in sacctmgr show user?

Looking up the user by name:

# sacctmgr show user zhant0e
      User   Def Acct     Admin
---------- ---------- ---------
   zhant0e    default      None

> I think the problem must be that uid 167096 is found on your submission host
> but not on the node.

________________________________
This message and its contents including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email.
On 27/06/18 19:07, bugs@schedmd.com wrote:

> Comment # 2 (https://bugs.schedmd.com/show_bug.cgi?id=5355#c2) from Felip Moll:
>
> Just in case my previous comment doesn't fix the issue, I suggest the
> following: [...]
>
> When you add a user with sacctmgr, slurmdbd will do an RPC to slurmctld on
> the registered clusters to inform them of this change. [...]

We don't explicitly add new users to slurm.

> To summarize:
>
> 1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?

[root@dm308-17 log]# grep SlurmUser /etc/slurm/slurmdbd.conf
SlurmUser=slurm
[root@dm308-17 log]# grep SlurmUser /etc/slurm/slurm.conf
SlurmUser=root

> 2. Are there any related errors in the logs?
>
> 3. Does a slurmctld restart fix the issue?

I have reloaded it (kill -HUP) and it hasn't.

> 4. Check sacctmgr for a report of IP addresses that slurmdbd can talk to.

This looks fine.

> 5. Check conf files for correct *Addr and *Host parameters.

These look fine.

> 6. Check address/port binding.

[root@dm308-17 log]# lsof -n -i -P | grep 6817
[root@dm308-17 log]# lsof -n -i -P | grep 6819
slurmctld  1262  root  6u IPv4   8386581 0t0 TCP 10.109.164.97:39946->10.109.164.97:6819 (ESTABLISHED)
slurmdbd  20811 slurm  9u IPv4    352852 0t0 TCP *:6819 (LISTEN)
slurmdbd  20811 slurm 10u IPv4 180552961 0t0 TCP 10.109.36.97:6819->10.109.0.1:43450 (ESTABLISHED)
slurmdbd  20811 slurm 13u IPv4   8379709 0t0 TCP 10.109.164.97:6819->10.109.164.97:39946 (ESTABLISHED)

Iain.
> We don't explicitly add new users to slurm.

How do you add new users? I can also reproduce the problem when adding a user to Slurm before it exists in the system, but then an "scontrol reconfig" fixes it.

> 1. Is SlurmUser the same in slurm.conf and slurmdbd.conf?
>
> [root@dm308-17 log]# grep SlurmUser /etc/slurm/slurmdbd.conf
> SlurmUser=slurm
> [root@dm308-17 log]# grep SlurmUser /etc/slurm/slurm.conf
> SlurmUser=root

Why is your slurmctld running as root? This can be a security concern.

> 2. Are there any related errors in the logs?

Please send me the slurmctld and slurmdbd logs.

> 3. Does a slurmctld restart fix the issue?
>
> I have reloaded it (kill -HUP) and it hasn't.

That's odd; this has to be some other issue then.
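A hedged sketch of an add-user step along these lines: refuse to add a name the OS cannot resolve yet, and reconfigure afterwards so slurmctld picks up the change. The function name is mine, and the account name "default" is only taken from the sacctmgr output earlier in this thread:

```shell
#!/bin/sh
# Only add a user to the Slurm accounting database once the local
# nameservice can resolve the name, then ask slurmctld to reload so it
# learns about the new user.
safe_add_user() {
    user=$1
    acct=$2
    if ! getent passwd "$user" >/dev/null 2>&1; then
        echo "skipping $user: not resolvable on this host yet" >&2
        return 1
    fi
    sacctmgr -i add user "$user" account="$acct" &&
    scontrol reconfig
}

# Example invocation:
# safe_add_user zhant0e default
```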
Created attachment 7200 [details] slurmdbd.log-extract
Created attachment 7201 [details] slurmctld.log-extract
Your slurmdbd log is not useful to me; you should increase the log level of slurmdbd.

In the slurmctld log I see the reported error surrounded, every time, by:

[2018-06-10T22:27:00.429] error: slurm_receive_msgs: Socket timed out on send/recv operation

or by:

[2018-06-11T15:15:05.939] slurmctld: agent retry_list size is 102

or by a lot of backfill operations. This means your system is highly loaded and is unable to communicate with the nodes. I suspect you are hitting the limit on open files or max connections. The agent retry_list queue indicates that there is a network problem, since there are a lot of messages that must be resent.

It would be useful for me to see your "sdiag" output at a moment when the error is happening.

Please ensure you have followed this guide and that your system (slurmctld + nodes) is tuned properly:

https://slurm.schedmd.com/high_throughput.html

You should also take a look at the logs of your LDAP server, or whatever you use for user resolution; that server may need some tuning too. This may fix the problem.

e.g.:
[2018-06-10T22:26:03.483] backfill: Started JobID=11106585_1312(11108000) in batch on dbn404-06-r
[2018-06-10T22:26:03.490] backfill: Started JobID=11106585_1313(11108001) in batch on dbn404-19-r
[2018-06-10T22:27:00.429] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-10T22:27:01.649] error: User 167096 not found
[2018-06-10T22:27:01.650] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-10T22:27:01.651] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-10T22:28:12.710] backfill: Started JobID=11107962 in batch on dbn711-09-l
[2018-06-10T22:28:12.716] backfill: Started JobID=11107963 in batch on dbn711-09-r
[2018-06-11T15:15:05.923] backfill: Started JobID=11112419_188(11112609) in batch on dbn303-33-l
[2018-06-11T15:15:05.927] backfill: Started JobID=11112419_189(11112610) in batch on dbn303-33-l
[2018-06-11T15:15:05.932] backfill: Started JobID=11112419_190(11112611) in batch on dbn303-33-l
[2018-06-11T15:15:05.936] backfill: Started JobID=11112419_191(11112612) in batch on dbn303-33-l
[2018-06-11T15:15:05.939] slurmctld: agent retry_list size is 102
[2018-06-11T15:15:05.940] retry_list msg_type=6017,4005,6017,4005,6017
[2018-06-11T15:15:06.442] backfill: Started JobID=11112419_192(11112613) in batch on dbn303-33-l
[2018-06-11T15:16:09.590] error: User 167096 not found
[2018-06-11T15:16:09.591] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-11T15:16:09.592] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-11T15:17:18.169] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-11T15:17:34.137] backfill: Started JobID=11112419_193(11112614) in batch on dbn303-33-l
[2018-06-11T15:17:34.144] backfill: Started JobID=11112419_194(11112615) in batch on dbn303-33-l
[2018-06-11T15:17:34.150] backfill: Started JobID=11112419_195(11112616) in batch on dbn303-33-l
[2018-06-11T18:37:28.067] backfill: Started JobID=11112419_3142(11115940) in batch on kccn708-28-16
[2018-06-11T18:37:28.073] backfill: Started JobID=11112419_3143(11115941) in batch on kccn708-28-16
[2018-06-11T18:37:28.593] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2018-06-11T18:38:24.695] error: User 167096 not found
[2018-06-11T18:38:24.696] _job_create: invalid account or partition for user 167096, account '(null)', and partition 'batch'
[2018-06-11T18:38:24.697] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
[2018-06-11T18:38:51.263] _slurm_rpc_submit_batch_job: JobId=11115942 InitPrio=786 usec=7426
[2018-06-11T18:39:32.681] node dbn303-19-l returned to service
[2018-06-11T18:40:10.476] backfill: Started JobID=11112419_3144(11115943) in batch on dbn302-06-r
[2018-06-11T18:40:10.482] backfill: Started JobID=11112419_3145(11115944) in batch on dbn303-03-r
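Some quick checks along the lines suggested above. The slurmctld PID lookup via pgrep is an assumption about the controller host:

```shell
#!/bin/sh
# Quick load/limit checks suggested by the high-throughput guide.
# Scheduler statistics (agent queue size, RPC backlog) on the controller:
# sdiag
# Open-file limit of the running slurmctld process:
# grep 'open files' /proc/"$(pgrep -o slurmctld)"/limits
# Limits visible from the current shell, and the kernel's listen backlog:
echo "open files (this shell): $(ulimit -n)"
sysctl net.core.somaxconn 2>/dev/null || true
```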
I've managed to get some more information. We have a cron script which looks for new users and uses sacctmgr to add them to Slurm. Users come from Active Directory. The script runs on a different node from the control master, and due to sssd caching effects there's a possibility that a user will be added to Slurm before slurmctld can resolve it. I shall try to fix that.

We've resolved the immediate problem by restarting slurmctld again. Folk belief here is that slurmctld needs restarting a couple of times in this situation.

Thank you for your assistance with this.

Iain.
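This is not the site's actual script, but one way to make such a cron job safer is to filter the candidate names through getent before calling sacctmgr, so that names not yet in the sssd cache are simply retried on the next run. The list_new_ad_users helper is hypothetical:

```shell
#!/bin/sh
# Keep only the usernames from stdin that the local nameservice can
# resolve; unresolvable names (e.g. not yet in the sssd cache) are
# dropped for this run and picked up again on the next cron pass.
resolvable_users() {
    while read -r u; do
        getent passwd "$u" >/dev/null 2>&1 && echo "$u"
    done
}

# Hypothetical cron usage:
# list_new_ad_users | resolvable_users | while read -r u; do
#     sacctmgr -i add user "$u" account=default
# done
# scontrol reconfig
```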
> We've resolved the immediate problem by restarting slurmctld again. Folk
> belief here is that slurmctld needs restarting a couple of times in this
> situation.

OK Iain, I am closing the bug then, but please also take my comment 8 into consideration; it can help you avoid future issues.

Best regards,
Felip M