Ticket 10985

Summary:	Can slurm accounts be used without slurm users
Product:	Slurm	Reporter:	Michael Schoenfelder <michael.schoenfelder>
Component:	Accounting	Assignee:	Ben Roberts <ben>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	marshall.adrian, vito.burggraf
Version:	20.02.4
Hardware:	Linux
OS:	Linux
Site:	SiFive	Slinky Site:	---

Description Michael Schoenfelder 2021-03-01 17:51:03 MST

We've been running slurm successfully without creating slurm users.  We don't "spend" money, so we don't need accounts to prevent users from running jobs.  However, we would like to track slurm usage by team and general job type.  Almost all of our slurm jobs are submitted by flows (scripts) that are launched by users or by Jenkins on behalf of users or teams.  The flows typically belong to the teams, so this is a convenient way of determining team and job type.  Our idea is to use `srun --account some_account` in our flows.  However, when a user submits a job, we see messages of the form:

_job_create: account 'some_account' has no association for user 0 using default account 'root'

We understand what the message means because we know that the submitting unix user does not have a slurm account.

The problem is that we saw somewhat of a correlation between these messages and loss of data in the sacct DB.  We knew that certain jobs ran because we have squeue evidence, but there was no record in the sacct DB (slurm version 16).  Once we stopped using accounts, we saw no more missing data in sacct DB.

The question: is using accounts without user association supported?  Are you aware of any problems with this approach?

We don't want to use wckey for this purpose since we already use wckey in another way.

We have since upgraded to slurm 20.02.  I am not trying to report a bug in particular, but am seeking guidance on using accounts without slurm user associations.

Comment 1 Ben Roberts 2021-03-02 11:13:21 MST

Hi Michael,

If you're trying to get accurate reporting on usage in different accounts then you would need to create associations for each user who is going to be submitting to a given account. I would also recommend using AccountingStorageEnforce to make sure users only submit to accounts they have access to so that usage doesn't get lost.

As you've experienced, you can have users and accounts created without limiting the users to the accounts they have access to. However, if the user has an association in an account and they request an account they don't have an association in, then the job will be recorded in the database as going to their default account. The log entry will look like this when that happens:
[2021-03-02T11:35:52.156] _job_create: account 'sub4' has no association for user 1002 using default account 'sub1'

If the user doesn't have an association in any account then it goes to the root account and doesn't get recorded in the database, which is the behavior you observed.

If you want to report on the account usage accurately then you do need to have user associations created for the accounts they are going to be using, though you don't necessarily have to use AccountingStorageEnforce to prevent them from requesting something else. If you primarily have scripts submitting workflows in certain ways then you should be pretty safe to just configure the associations to accommodate the scripted workflows. If you have a lot of users submitting jobs manually then the chances of someone requesting an account they don't have access to go up, and you might want to consider limiting users to accounts they have access to.

Let me know if you have any questions about this.

Thanks,
Ben
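
As a rough illustration of the associations discussed in this comment, the sketch below prints the sacctmgr commands that would create an account and one user association per submitting user. The function name gen_assoc_cmds and the account/user names are made up for the example, and the commands are only echoed, not executed, so they can be reviewed before a Slurm administrator runs them on a real cluster.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) sacctmgr commands that create an account
# and one association per user. All names here are hypothetical.
gen_assoc_cmds() {
  local account=$1
  shift
  # -i (--immediate) makes sacctmgr commit without an interactive prompt.
  echo "sacctmgr -i add account name=${account}"
  local user
  for user in "$@"; do
    echo "sacctmgr -i add user name=${user} account=${account}"
  done
}

# Example: one account with two users.
gen_assoc_cmds team_a_regress alice bob
```

If enforcement is also wanted, setting AccountingStorageEnforce=associations in slurm.conf is the option referred to above; without it, the associations are still used for recording but are not enforced at submit time.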

Comment 2 Michael Schoenfelder 2021-03-03 16:16:38 MST

Re "be pretty safe to just configure the associations to accommodate the scripted workflows. "
But the scripted workflows run as the individual user.  The idea is that the scripted workflow calls srun with "-A some_account" rather than depend on an association to match the unix user to a slurm user to an account.

We essentially will have two accounts per team, so the flow will decide which account is most appropriate.  Associations won't help us there.

I understand that having neither AccountingStorageEnforce nor associations allows some users to "sneak" in.  That will just be another reporting category, so we can see which users/flows are missing the account specification.

We could set up user associations, but are trying to avoid having yet another account tracking system and dealing with hiring and people leaving and switching orgs. Yes, I have seen the user contributions for syncing unix groups and slurm associations.  We can't use it as-is, but it would be a starting point.

My concern about going associationless is the loss of data issue.  I SWAGged that not having associations was causing slurm* to miss the association cache each time, and that was keeping the slurm* busy refreshing from the DB which caused loss of data.  If you guys think that using accounts w/o associations is technically bad, we'll bite the bullet.  We understand that it is philosophically unusual.
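
For context, the flow-side account selection described in this comment might look roughly like the sketch below. The naming scheme (<team>_regress and <team>_interactive), the job types, the function name pick_account, and the script name run_tests.sh are all assumptions made for illustration; the srun command is echoed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch only: a flow picks one of two per-team accounts based on job type.
# The account naming scheme and job types are hypothetical.
pick_account() {
  local team=$1 job_type=$2
  case "$job_type" in
    regress|nightly) echo "${team}_regress" ;;
    *)               echo "${team}_interactive" ;;
  esac
}

# Dry run: print the srun command the flow would issue; nothing is submitted.
acct=$(pick_account cpu_team nightly)
echo srun --account "$acct" ./run_tests.sh
```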

Comment 3 Ben Roberts 2021-03-04 09:55:43 MST

Hi Michael,

To have the account information accurately/reliably recorded you would need to have user associations created for each user in both accounts.  I'm trying to think of alternatives and the best alternative I can think of would be to use WCKeys, which you stated you're already using for something else.  

Another possibility that might work for you is to use the JobCompType plugin to record information about the job in another way when it finishes.  You can configure it to record the data to a flat file or a database (among other options).  This shifts things around so that even though the jobs may not be reported by sacct there will be a record stored with the account information in another location.  This would require you to use your own tools to process the data, but it may be a better option for you than managing users/accounts.

Here's a quick example where I submitted a job as a user who didn't have an association currently on the system.  

$ sbatch -N1 -Asub4 --wrap='srun sleep 30'
Submitted batch job 25787


When the job completed, there was no record of it in the normal job database.

$ sacct -j 25787
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 



But I configured my system to write completed job data to a flat file, so I can see the information about it there.

$ grep 25787 jobcomp_data.txt 
JobId=25787 UserId=user5(1005) GroupId=user5(1005) Name=wrap JobState=COMPLETED Partition=debug TimeLimit=3600 StartTime=2021-03-04T10:44:51 EndTime=2021-03-04T10:45:22 NodeList=node01 NodeCnt=1 ProcCnt=1 WorkDir=/home/user5 ReservationName= Tres=cpu=1,node=1,billing=1 Account=sub4 QOS=normal WcKey= Cluster=unknown SubmitTime=2021-03-04T10:44:51 EligibleTime=2021-03-04T10:44:51 DerivedExitCode=0:0 ExitCode=0:0 



These are the options I used to configure this.

$ scontrol show config | grep JobComp
JobCompHost             = localhost
JobCompLoc              = /home/ben/slurm/src/jobcomp_data.txt
JobCompPort             = 0
JobCompType             = jobcomp/filetxt
JobCompUser             = root


You can read more about this here:
https://slurm.schedmd.com/slurm.conf.html#OPT_JobCompType


Let me know if this sounds like a workable option.

Thanks,
Ben
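
The "own tools" mentioned in this comment could start as small as the sketch below, which counts completed jobs per Account= value in a jobcomp/filetxt file. It assumes the format shown in the grep output above (one job per line, space-separated Key=Value fields) and that no other value, such as a job name containing spaces, produces a field beginning with "Account=". The function name jobs_per_account is made up, and the sample record is an abbreviated copy of the one above.

```shell
#!/usr/bin/env bash
# Sketch only: count jobs per account in a jobcomp/filetxt file.
# Reads the files given as arguments, or stdin when none are given.
jobs_per_account() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      acct = ""
      for (i = 1; i <= NF; i++)
        if ($i ~ /^Account=/) { acct = substr($i, 9); break }
      if (acct == "") acct = "(none)"     # missing or empty Account= field
      count[acct]++
    }
    END { for (a in count) print a, count[a] }
  ' "$@" | sort
}

# Example with an abbreviated record piped on stdin.
printf '%s\n' \
  'JobId=25787 UserId=user5(1005) Name=wrap JobState=COMPLETED Account=sub4 QOS=normal' \
  | jobs_per_account
```

Against a real file this would be called as jobs_per_account followed by the JobCompLoc path. Jobs with an empty Account= value are grouped under "(none)", which matches the "another reporting category" idea raised in Comment 2.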

Comment 4 Michael Schoenfelder 2021-03-04 12:41:56 MST

Ben,

The JobCompType plugin is an interesting idea.  We've been considering an ELK stack for monitoring various signals in our infrastructure, so that would fit right in.

I realize that I left out an important point.  One reason we want to track account usage is so that we can assign fairshare weights whenever we implement our fairshare scheme.  We want to balance the fairshare based on accounts.  However, I can see that we may confuse the fairshare algorithms if we don't have slurm users.  Plus, missing job records would not be good for fairshare.

When I tried your "$ sbatch -N1 -Asub4 --wrap='srun sleep 30'" example, I did get a job record.  It does seem like this (ab)use of accounts is unreliable.

Thank you for your feedback.  You may close the ticket.

Comment 5 Ben Roberts 2021-03-04 13:09:49 MST

You're right, if you're using Fairshare then the unreliability of the jobs being recorded properly will have an effect.  I'll go ahead and close it, but let us know if there's anything else we can do to help.

Thanks,
Ben