| Summary: | Different sets of fairshare parameters on the same slurmdbd / mysql | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | Scheduling | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=12320 | ||
| Site: | Monash University | Slinky Site: | --- |
| Linux Distro: | CentOS | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
Description
Damien
2019-09-27 00:36:00 MDT
Hi Damien,

I'm not sure if I understood your issue correctly.

> These additional machines are for VIP users only, but because of their high usage and fairshare in the existing cluster, their jobs to these specialised machines are being delayed because of fairshare scheduling.

If only VIP users are allowed to use these specialized machines, the fairshare relation with non-VIP users shouldn't be a problem. If the specialized machines are *only* in the "spec" partition, then when these users submit to that partition the scheduler should run their jobs as soon as resources in that partition are available, regardless of fairshare. Is that not what you are seeing?

The only problem I can see is the inverse one: due to the high usage of the specialized machines, the VIP users have less priority when they want to use the non-specialized machines. Is this your problem? To solve it I would recommend either of these:

- Increase the Priority of the account to compensate.
- Have a separate account used only for the specialized machines.

> New PartitionName=spec Default=no Nodes=sp0[1-8] State=Up MaxTime=1-0 account=pa003 AllowQOS=pa003

Are you sure about using AllowQOS? Why not AllowAccounts?

Am I answering your question?

Albert

---

Hi Albert,

Thanks for your email. Yes, it should be:

--
PartitionName=spec Default=no Nodes=sp0[1-8] State=Up MaxTime=1-0 AllowAccounts=pa003 AllowQOS=pa003
--

> If only VIP users are allowed to use these specialized machines, the fairshare relation with non-VIP shouldn't be a problem. If the specialized machines are *only* in the "spec" partition, if these users submit to that partition, the scheduler should run the jobs as soon as there are resources in that partition available, regardless of the fairshare.

Your observation is correct, but our VIP users/accounts would still want to continue using the existing cluster as normal (using their fairshare) while being semi-exclusive on these specialised servers.
Basically, they don't want their fairshare parameters on the existing cluster to be affected, and vice versa when using the specialised servers. So I am hoping to split the fairshare values into two separate metrics to manage this, especially while we are still facing scheduling issues on our cluster: https://bugs.schedmd.com/show_bug.cgi?id=7686

We don't wish to increase the Priority of these VIP accounts to compensate, because they would then have an unfair advantage when submitting to the existing cluster, which could be misused or mismanaged.

Having separate accounts with different parent accounts for the VIPs to use the specialized machines might be the right approach, but it would mean creating the whole same set of accounts/users under another parent account.

Your thoughts?

Cheers
Damien

---

Hi Damien,

> Yes, it should be:
> --
> PartitionName=spec Default=no Nodes=sp0[1-8] State=Up MaxTime=1-0 AllowAccounts=pa003 AllowQOS=pa003
> --

I'm not sure if you need both Allow options, but as you also mentioned "semi-exclusive", maybe you do.

> Your observation is correct, but our VIP users/accounts would still want to continue using the existing cluster as normal (using their fairshare) while being semi-exclusive on these specialised servers. Basically, they don't want their fairshare parameters on the existing cluster to be affected, and vice versa when using the specialised servers.

OK, now I understand the problem better.

> So I am hoping to split the fairshare values into two separate metrics to manage this.

Yes, divide and conquer.

> We don't wish to increase the Priority of these VIP accounts to compensate, because they would then have an unfair advantage when submitting to the existing cluster, which could be misused or mismanaged.

I don't like it either.
> Having separate accounts with different parent accounts for the VIPs to use the specialized machines might be the right approach, but it would mean creating the whole same set of accounts/users under another parent account.
> Your thoughts?

Yes, it means creating new accounts/users, but it's the right way to do what you want.

Actually, although we think in terms of Users and Accounts, note that internally Slurm thinks more in terms of Associations and Hierarchies. Internally, Users and Accounts are almost only a pair of a name and an id (there is some more information, but it is not really important here). But when we create a user or an account, Slurm also creates an association in a hierarchical way, and that association contains the really important values for the account or user: limits, QOS, usage, parent... All the accounting, resource limits and fairshare are handled fully "per association", and one of the main reasons is to allow exactly the split that you want: independent associations for the same user. That is the way Slurm works.

Some scripting may help to create/replicate the "new" accounts and users (the new associations).

Regards,
Albert

---

Hi Damien,

I'm closing the ticket as infogiven for now, but please feel free to reopen it if you have further questions.

Regards,
Albert

---

Good afternoon Slurm Support,

How are you doing? I would like to re-open this ticket to review it from another angle. We are currently using Slurm v20.02.7.

Cheers
Damien

---

Hi Damien,

> How are you doing ?

Good! I hope you are too! :-)

> I would like to re-open this ticket to review it from another angle.
> We are currently using Slurm v20.02.7.

We are happy to help you review this topic again. But as this is a very old ticket (almost two years old and based on 18.08), I think it's better to create a new one that references this one, and to review the topic there, in a fresh ticket. Would you mind creating a new bug with your new thoughts on this topic?

Thanks!
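The scripting suggested above can be sketched as below. This is a minimal, hedged sketch: it only *prints* the `sacctmgr` commands (a dry run, nothing is executed against slurmdbd), and the new account name `pa003_spec` and the users `alice`/`bob` are invented for illustration. A real script would derive the user list from the existing associations, e.g. `sacctmgr -nP show assoc account=pa003 format=user`.

```shell
#!/bin/sh
# Dry-run sketch: generate (do not execute) the sacctmgr commands that would
# replicate an account's users under a new, independent account/parent.
# "pa003" is from the ticket; "pa003_spec", "alice" and "bob" are examples.

OLD_ACCOUNT="pa003"
NEW_ACCOUNT="pa003_spec"

# In a real run, derive this from the existing associations, e.g.:
#   USERS=$(sacctmgr -nP show assoc account="$OLD_ACCOUNT" format=user | sort -u)
USERS="alice bob"

CMDS="sacctmgr -i add account $NEW_ACCOUNT Description='VIP spec nodes'"
for u in $USERS; do
    # Each 'add user' creates a NEW association, so fairshare usage under
    # $NEW_ACCOUNT is tracked independently of $OLD_ACCOUNT.
    CMDS="$CMDS
sacctmgr -i add user $u account=$NEW_ACCOUNT"
done

printf '%s\n' "$CMDS"
```

Reviewing the printed commands before running them (or rerunning the script with the `echo`/`printf` removed and the real `sacctmgr` calls enabled) keeps the replication auditable.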
Albert

---

This is the situation: our production cluster has some additional specialised servers (for paid partners). These partner accounts have very high usage levels in the existing cluster, so their current fairshare metrics are fairly low. The additional machines are for paid users only, but because of their high usage and fairshare in the existing cluster, their jobs on these specialised machines are being delayed by fairshare scheduling. And because of the high usage of the specialized nodes, these paid users have less priority when they want to use the non-specialized nodes (the common nodes in the cluster).

Our current TRES billing rate is 1 CPU = 1 billing unit, for example:

CfgTRES=cpu=24,mem=257669M,billing=24

***Question***

We are thinking of putting all these specialised nodes in a 'special' partition for paid users only (using AllowAccounts=pa003 AllowQOS=pa003). Can we configure this partition with TRESBillingWeights=0, or <1, so that when users use this special partition it will NOT affect or contribute to their respective accounts' fairshare metrics? Is this a viable solution? Or can you propose a better option?

Kindly advise. Thanks
Damien

---

Yes, if opening a new ticket suits your workflow, please open a new ticket for this query. Thanks
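For reference, the TRESBillingWeights idea asked about above could be sketched in slurm.conf roughly as follows. This is a hedged sketch, not a confirmed recommendation: the node names, account and QOS come from the ticket, the zero weight values are only illustrative of the question, and whether near-zero billing weights fully decouple fairshare usage for this partition depends on the site's priority configuration and should be verified before deployment.

--
# Hypothetical slurm.conf fragment: bill spec-partition usage at (near) zero
# so jobs there contribute little or nothing to the account's fairshare usage.
PartitionName=spec Default=no Nodes=sp0[1-8] State=Up MaxTime=1-0 AllowAccounts=pa003 AllowQos=pa003 TRESBillingWeights="CPU=0.0,Mem=0.0"
--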