Hi support,

we recently installed SLURM on another cluster, galileo:

[root@mgmt01 slurm]# sacctmgr show cluster format=Cluster
   Cluster
----------
   galileo
   marconi

When we start or reconfigure slurmctld on galileo, we notice very long processing times during the partition-to-UID mapping phase. Meanwhile, the SSSD service uses 100% of CPU time. Debugging SSSD, we see that it hits a lot of cache misses for users that are not defined in the galileo LDAP. These users belong to the marconi cluster instead.

Running some sacctmgr queries, we found that it does not restrict the displayed users to those belonging to the selected cluster. For example:

[root@mgmt01 slurm]# sacctmgr show user withass where cluster=galileo format=user,cluster | head -5
      User    Cluster
---------- ----------
  a06ccc06
  a06ccc07    galileo
  a06dlr00

Checking any user with an empty Cluster column shows that it belongs to marconi:

[root@mgmt01 slurm]# sacctmgr show user withass a06ccc06 format=user,cluster
      User    Cluster
---------- ----------
  a06ccc06    marconi

Is this the expected behavior? This is a very annoying issue because it causes very long processing times at every reconfigure or restart of the controller.

Thank you very much
ale
Hi Ale,

Every time you reconfigure slurmctld, it tries to fill in missing UIDs for the users known to accounting. We could modify this check in slurmctld so that it only considers users of the cluster defined in slurm.conf, but this wouldn't solve your problem entirely, since slurmdbd performs the same checks when it starts and every 60 minutes during rollup. Different clusters can register against slurmdbd, so it is agnostic about which cluster has to be taken into account, and therefore this check has to be performed.

My recommendation, if it is possible and there is no GID/UID overlap, is to allow queries against both LDAP directories from the machines where slurmctld and slurmdbd are running. In any case, I will check internally for any other solution.

On the other hand, an LDAP query shouldn't take very long. Maybe you can tune LDAP in some way to minimize the impact of the queries?

Regarding the sacctmgr problem, I've already proposed a patch. When it's reviewed and approved I will commit it.

Thanks
Ale, more definitive info about this:

Check https://slurm.schedmd.com/accounting.html , starting with "Whether you use any authentication".

Regarding multi-cluster operations, an admin user can administer clusters where that user doesn't have any association. For example, your user 'ale' on cluster1 could be set with adminlevel=administrator and therefore manage jobs on any other registered cluster. For this reason, each slurmctld server needs the full user list from the database so that the user (in the example, 'ale') can be authenticated.

In order to authenticate correctly among all clusters, every user needs to have a unique UID across all of them, as described in the documentation page linked above.

In conclusion, you need to give access to both LDAPs from all slurmctld and slurmdbd servers.

Hope it helps. I have a patch pending for the other 'sacctmgr' issue. It will come ASAP.
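To see which accounting users the controller host currently cannot resolve, something like the following might help. It is only a sketch: unresolved_users is a hypothetical helper (not part of Slurm), and it assumes that `id -u` goes through the same name service (SSSD/LDAP) that slurmctld uses.

```shell
# Hypothetical helper, not part of Slurm: read user names from stdin,
# one per line, and print those that the local name service cannot resolve.
unresolved_users() {
    while IFS= read -r user; do
        # Skip empty lines.
        [ -n "$user" ] || continue
        # id -u fails when the name is unknown to passwd/SSSD/LDAP.
        id -u "$user" >/dev/null 2>&1 || printf '%s\n' "$user"
    done
}

# Intended use on the slurmctld/slurmdbd host
# (-n drops the header, -P prints '|'-separated fields):
#   sacctmgr -n -P show user format=user | unresolved_users
```

Once both LDAP directories are reachable from that host, the list should come back empty.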
Hi Felip,

thank you very much for the update and the very clear explanation :)

We understand your point, so we will set up the SSSD daemons on the controllers and DBDs to cache all users/groups.

Do you think your patch for sacctmgr will be included in 17.11.4?

Thanks
Ale
(In reply to Cineca HPC Systems from comment #9)

Hi Ale,

Sorry for not getting back to you sooner. The change didn't go into 17.11.4 because we were discussing it internally.

In the end, the conclusion is that 'sacctmgr show user' is working as designed. The purpose of this command is to show all the users in the database, regardless of which cluster they are associated with. By default, it shows each user's default account for the current cluster defined in slurm.conf. Blank fields mean that the user doesn't have an association for the current cluster or for the requested cluster, but the command always shows all the users anyway.

So in your example:

[root@mgmt01 slurm]# sacctmgr show user withass where cluster=galileo format=user,cluster | head -5
      User    Cluster
---------- ----------
  a06ccc06
  a06ccc07    galileo
  a06dlr00

The meaning of the output is that users that don't have "Cluster" set here don't have an association with cluster "galileo".

The WithAssocs flag is just intended to return all associations along with the database users, and the cluster= option only filters the associations returned with the users; it does not filter the users themselves.

I know it can initially sound a bit counter-intuitive, but the 'sacctmgr show users' command was originally just meant to print users and not their associations.

The easiest option in your case would be to grep/awk the output of 'sacctmgr show users withassoc where cluster=galileo' and filter by "galileo" if you want to see only the users of the "galileo" cluster.

I am sorry if it felt a little like beating about the bush, but this really was a design decision we had to discuss.

I will also update the documentation accordingly to clarify these commands.
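The grep/awk filtering could look roughly like this. It is a sketch, not an official recipe: users_in_cluster is a hypothetical helper name, and it assumes the query is run with -n (no header) and -P (parsable2, '|'-separated) so that each line has the form user|cluster.

```shell
# Hypothetical helper: read "user|cluster" lines from stdin and print the
# users that have at least one association in the cluster given as $1.
users_in_cluster() {
    # Keep rows whose second field matches the cluster; sort -u collapses
    # users that have several associations (one row per association).
    awk -F'|' -v cluster="$1" '$2 == cluster { print $1 }' | sort -u
}

# Intended use:
#   sacctmgr -n -P show user withassoc where cluster=galileo \
#       format=user,cluster | users_in_cluster galileo
```

Users with a blank Cluster field, such as a06ccc06 in the example above, are dropped by the comparison on the second field.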
Hi Felip,

thank you very much again for the explanation. We will use grep ;)

You can close this bug whenever you like.

Thanks
Ale
(In reply to Cineca HPC Systems from comment #15)

Thanks Ale,

consider also playing with 'sacctmgr show assoc'. It can be useful and is probably more appropriate, since you really want to query for associations and not just users.

The behavior of 'sacctmgr show accounts' and 'sacctmgr show users' has been made consistent in commit da49b8d0d14f1e2def06f2c22a14acb22a733153, which will be available in the future 18.08 release.

Closing bug now.

Regards,
Felip M
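As a rough illustration of the 'sacctmgr show assoc' route: the sketch below assumes parsable output (-n -P) with format=cluster,account,user, and assoc_users is a hypothetical helper name. Account-level associations have an empty User field, so they are skipped.

```shell
# Hypothetical helper: read "cluster|account|user" lines from stdin and
# print each distinct user once, ignoring account-level rows (empty user).
assoc_users() {
    awk -F'|' '$3 != "" { print $3 }' | sort -u
}

# Intended use:
#   sacctmgr -n -P show assoc where cluster=galileo \
#       format=cluster,account,user | assoc_users
```

Here the cluster= filter applies to the associations themselves, so only rows for the requested cluster are returned and no extra filtering on the cluster name is needed.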