4804 – wrong response querying users of a specified cluster

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Database
Version:	17.11.3
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Felip Moll
Reported:	2018-02-19 06:54 MST by Cineca HPC Systems
Modified:	2018-03-02 10:42 MST
Site:	Cineca
Version Fixed:	18.08

Description Cineca HPC Systems 2018-02-19 06:54:46 MST

Hi support,
we recently installed Slurm on another cluster, galileo:

[root@mgmt01 slurm]# sacctmgr show cluster format=Cluster
   Cluster 
---------- 
   galileo 
   marconi 

When we start or reconfigure slurmctld on galileo, we see very long processing times during the partition-to-UID mapping phase. Meanwhile, the SSSD service uses 100% CPU. Debugging SSSD, we see a large number of cache misses for users that are not defined in the galileo LDAP; these users belong to the marconi cluster instead.

Running some sacctmgr queries, we found that it does not correctly display the users belonging to the selected cluster.
For example:

[root@mgmt01 slurm]# sacctmgr show user withass where cluster=galileo format=user,cluster | head -5
      User    Cluster 
---------- ---------- 
  a06ccc06            
  a06ccc07    galileo 
  a06dlr00            

Checking any user with an empty Cluster column shows that it belongs to marconi:

[root@mgmt01 slurm]# sacctmgr show user withass a06ccc06 format=user,cluster
      User    Cluster 
---------- ---------- 
  a06ccc06    marconi 

Is this the expected behavior?
This is a very annoying issue because it causes very long processing times at every reconfigure or restart of the controller.

Thank you very much
ale

Comment 3 Felip Moll 2018-02-20 06:23:25 MST

Hi Ale,

Every time you reconfigure slurmctld, it tries to fill in missing UIDs for the users in the accounting database.

We could modify this slurmctld check to consider only users of the cluster defined in slurm.conf, but that would not solve your problem entirely, since slurmdbd performs the same check at startup and every 60 minutes during rollup.

Different clusters can register with slurmdbd, so it is agnostic about which cluster should be taken into account, and this check has to be performed.

My recommendation, if it is possible and there is no GID/UID overlap, is to allow queries to both LDAP directories from the machines where slurmctld and slurmdbd run.
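
One way to see which accounting users a host cannot currently resolve is to run each user name through the local name service. The following is a minimal sketch, not part of Slurm: the function name `unresolved_users` is hypothetical, and it assumes that `id -u` goes through the same NSS/SSSD lookup path as the daemons do.

```shell
#!/bin/sh
# Hypothetical helper: read user names on stdin (one per line) and print
# those that the local name service (files, SSSD/LDAP, ...) cannot resolve.
unresolved_users() {
    while IFS= read -r user; do
        # Skip empty lines.
        [ -n "$user" ] || continue
        # id -u exits non-zero when the name cannot be resolved.
        if ! id -u "$user" >/dev/null 2>&1; then
            printf '%s\n' "$user"
        fi
    done
}
```

Assuming sacctmgr's `-n` (no header) and `-P` (parsable) options, it could be fed with `sacctmgr -nP show user format=user | unresolved_users` on the slurmctld and slurmdbd hosts. An empty result would suggest that every accounting user resolves on that host.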

In any case, I will check internally for other solutions. On the other hand, an LDAP query shouldn't take that long; maybe you can tune LDAP in some way to minimize the impact of the queries?

Regarding the sacctmgr problem, I have already proposed a patch. Once it is reviewed and approved, I will commit it.

Thanks

Comment 7 Felip Moll 2018-02-26 11:40:23 MST

Ale, more definitive info about this:

Check https://slurm.schedmd.com/accounting.html, starting with "Whether you use any authentication".

Regarding multi-cluster operations, an admin user can administer clusters where that user doesn't have any association; e.g., your user 'ale' on cluster1 could be set with adminlevel=administrator and therefore manage jobs on any other registered cluster.

For this reason, each slurmctld server needs the full user list from the database so that the user (in this example, 'ale') can be authenticated.

To authenticate correctly across all clusters, every user needs a unique UID across all of them, as described in the documentation page referenced above.

In conclusion, you need to give all slurmctld and slurmdbd servers access to both LDAP directories.

Hope it helps.


I have a patch pending for the other 'sacctmgr' issue. It will come ASAP.

Comment 9 Cineca HPC Systems 2018-02-28 03:57:22 MST

Hi Felip,

thank you very much for the update and the very clear explanation :)

We understand your point, so we will set up the SSSD daemons
on the controllers and DBDs to cache all users/groups.

Do you think your patch for sacctmgr will be included in 17.11.4?

Thanks
Ale

Comment 14 Felip Moll 2018-03-01 12:19:52 MST

Hi Ale,

Sorry for not getting back to you sooner. The change didn't go into 17.11.4 because we were discussing it internally.

In the end, the conclusion is that 'sacctmgr show user' is working as designed. The purpose of this command is to show all the users in the database, regardless of which cluster they are associated with. By default, it shows each user's default account for the current cluster defined in slurm.conf. Blank fields mean that the user doesn't have an association for the current cluster or for the requested cluster, but all users are always shown.

So in your example:

[root@mgmt01 slurm]# sacctmgr show user withass where cluster=galileo format=user,cluster | head -5
      User    Cluster 
---------- ---------- 
  a06ccc06            
  a06ccc07    galileo 
  a06dlr00  

The output means that users with no "Cluster" value here don't have an association with cluster "galileo".

The WithAssoc flag is only intended to return all associations along with the database users, and the cluster= option only filters the associations returned with the users; it does not filter the users themselves.

I know it can sound a bit counter-intuitive at first, but the 'sacctmgr show users' command was originally meant only to print users, not their associations.

The easiest option in your case would be to grep/awk the output of 'sacctmgr show users withassoc where cluster=galileo' and filter by "galileo" if you want to see only the users of the "galileo" cluster.
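
This filtering could look like the following sketch. It assumes the parsable output produced by sacctmgr's `-n` (no header) and `-P` (pipe-delimited) options, so that each line has the form `user|cluster`; the function name `users_in_cluster` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "user|cluster" lines on stdin and print, once
# each, the users that have an association on the cluster given as $1.
users_in_cluster() {
    awk -F'|' -v cluster="$1" '$2 == cluster { print $1 }' | sort -u
}

# Sample input taken from the output shown in this ticket, rewritten in
# the assumed pipe-delimited form. Only a06ccc07 has a galileo association.
printf '%s\n' 'a06ccc06|' 'a06ccc07|galileo' 'a06dlr00|' |
    users_in_cluster galileo
```

Against a live database, the input would come from something like `sacctmgr -nP show user withassoc where cluster=galileo format=user,cluster | users_in_cluster galileo`. The `sort -u` is there because a user with several associations on the same cluster is listed once per association.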

I am sorry if it felt a bit like beating about the bush, but this really was a design decision we had to discuss.

I will also update the documentation to clarify these commands.

Comment 15 Cineca HPC Systems 2018-03-02 07:24:04 MST

Hi Felip

thank you very much again for the explanation. We will use grep ;)

You can close this bug whenever you like.

Thanks
Ale

Comment 17 Felip Moll 2018-03-02 10:42:44 MST

Thanks Ale,

consider also playing with 'sacctmgr show assoc'; it can be useful and is probably more appropriate, since you really want to query associations and not just users.
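
As a sketch of that approach: 'sacctmgr show assoc' returns one row per association, so a cluster filter restricts the rows themselves. The example assumes pipe-delimited `cluster|account|user` rows (as produced with `-n -P` and `format=cluster,account,user`) and that account-level associations have an empty user field. The function name `assoc_users` and the account names in the sample input are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "cluster|account|user" lines on stdin and
# print each user once, skipping rows whose user field is empty
# (assumed here to be account-level associations).
assoc_users() {
    awk -F'|' '$3 != "" { print $3 }' | sort -u
}

# Made-up sample rows for illustration only.
printf '%s\n' 'galileo|proj_a|' 'galileo|proj_a|a06ccc07' 'galileo|proj_b|a06ccc07' |
    assoc_users
```

With a real database, the input could come from `sacctmgr -nP show assoc cluster=galileo format=cluster,account,user | assoc_users`.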

The behavior has been made consistent between 'sacctmgr show accounts' and 'sacctmgr show users' in commit
da49b8d0d14f1e2def06f2c22a14acb22a733153, which will be available in the upcoming 18.08 release.

Closing bug now.

Regards,
Felip M