Summary: | Change in 25.05: If new QoSs are created with Flags=-1, then we are unable to submit jobs that use the QoS | |
---|---|---|---
Product: | Slurm | Reporter: | Omen Wild <omen>
Component: | slurmdbd | Assignee: | Benjamin Witham <benjamin.witham>
Status: | OPEN | QA Contact: |
Severity: | 4 - Minor Issue | |
Priority: | --- | CC: | benjamin.witham
Version: | 25.05.0 | |
Hardware: | Linux | |
OS: | Linux | |
Site: | UC Davis | |
Description
Omen Wild
2025-06-30 17:49:08 MDT

Benjamin Witham

Hello Omen,

I'm able to replicate this, and I'm working towards a solution. It looks like a code change led to this unintended regression. I'll send updates when I know more.

Benjamin Witham

Hello Omen,
After investigating further, I have to retract this statement.
> It looks like a code change led to this unintended regression.
There is a problem with the handling of the QOS flags, and the code change that I referenced above did not cause the issue.
For some context behind the issue: Flags=-1 is used to remove all flags from an existing QOS. It works by setting every flag and also setting a remove flag, which signals to the slurmctld and slurmdbd that those flags should be removed from the existing QOS. When Flags=-1 is used while creating a new QOS, all flags plus the remove flag are set. The slurmdbd recognizes the remove flag and adjusts appropriately, but the slurmctld has no such logic, so it creates a QOS that has ALL flags set instead of none of them. If the slurmctld is restarted, the correct QOS is pulled from the slurmdbd.
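A minimal sketch of those semantics, with made-up bit positions and handler logic (illustrative only, not the actual Slurm source):

```bash
# Illustration only: bit values and names are invented, not Slurm's real definitions.
ALL_FLAGS=$(( (1 << 8) - 1 ))   # pretend there are 8 functional QOS flags
REMOVE_FLAG=$(( 1 << 30 ))      # assumed control bit meaning "clear these flags"

# Flags=-1 effectively requests every functional flag plus the remove bit.
request=$(( ALL_FLAGS | REMOVE_FLAG ))

# slurmdbd-style handling: honor the remove bit, so the new QOS ends up with no flags.
if (( request & REMOVE_FLAG )); then
    dbd_qos_flags=0
else
    dbd_qos_flags=$(( request & ALL_FLAGS ))
fi

# slurmctld-style handling (the bug described above): the remove bit is not
# honored for a newly created QOS, so every functional flag stays set.
ctld_qos_flags=$(( request & ALL_FLAGS ))

echo "slurmdbd view:  $dbd_qos_flags"    # 0   -> no flags, as intended
echo "slurmctld view: $ctld_qos_flags"   # 255 -> all flags set
```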
When I was reproducing your issue, I must have been unintentionally restarting the slurmctld, because I can no longer reproduce what I saw without a slurmctld restart: jobs will be denied with an INVALID_QOS error until the restart.
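A reproduction along these lines might look like the sketch below; the QOS name, account name, and the assoc_mgr query are illustrative assumptions rather than values from this ticket:

```bash
# Create a new QOS with Flags=-1 and allow an account to use it
# (the names "flagtest" and "myaccount" are placeholders).
sacctmgr -i add qos flagtest Flags=-1
sacctmgr -i modify account myaccount set qos+=flagtest

# Before any slurmctld restart, submission against the new QOS is rejected
# (this ticket reports an INVALID_QOS denial):
srun --qos=flagtest --account=myaccount --time=1:00 hostname

# Compare the two daemons' views of the QOS flags:
sacctmgr show qos flagtest format=Name,Flags             # slurmdbd's stored record
scontrol show assoc_mgr flags=qos | grep -A 5 flagtest   # slurmctld's in-memory record

# After slurmctld is restarted it re-reads the QOS from slurmdbd,
# and the same srun is expected to succeed.
```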
Do you often restart your slurmctld? If not, are you able to upload your slurm.conf as well as a typical job command from a user? I'm confident that a patch I have will fix the issue, but I'd like to check to confirm we're hitting your issue and not a similar but unrelated one.
Omen Wild

Hi Benjamin,

An example srun that was failing is:

> srun --time=1:00:00 --account=bnbaileygrp --partition=gpu-6000_ada-h --gres=gpu:1 --cpus-per-task=8 --mem=64G --pty bash

This is the same cluster as another open ticket. The slurm.conf is here: https://support.schedmd.com/attachment.cgi?id=42344

We typically only restart slurmctld when we add new nodes, which could be a couple of times a month, but sometimes only once every 3-6 months.