| Summary: | Jobs Not Being Preempted in Lower Priority Partition | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | Scheduling | Assignee: | Will Shanks <will> |
| Status: | OPEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 24.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | | |
Attachments:

- Current slurm.conf
- Current topology.conf
- slurmd log for holygpu8a18104 for 03-24-25 9-10am
Description
Paul Edmon, 2025-03-24 07:49:36 MDT

Created attachment 41238 [details]: Current slurm.conf

Created attachment 41239 [details]: Current topology.conf
Will Shanks:

Hello Paul, I'm still digging through all of the info you have already given, but would it be possible to get the slurmd log for the "busy" node the seas_gpu job won't run on too? Looks like it is holygpu8a18104.

-- Will

Created attachment 41240 [details]: slurmd log for holygpu8a18104 for 03-24-25 9-10am
Paul Edmon:

Yup, I've attached it for the hour around that log snippet. Happy to pull more.

Will Shanks:

Thank you for the logs. On a first pass it looks like you have a lot of errors related to AcctGatherInterconnectType[1] in your slurm.conf, but I expect that is probably unrelated to the current issue.

```
$ cat holygpu8a18104-03-24-25.log | grep slurm | grep error | cut -d' ' -f7- \
    | sed 's/JOB [[:digit:]]\+ /JOB JOBID /' \
    | sed 's/09:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]/TIME/' \
    | sort | uniq -c | sort -n | tail -n5
  1 stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
 50 stepd_cleanup: done with step (rc[0xf]:Block device required, cleanup_rc[0x0]:No error)
111 debug levels are stderr='error', logfile='info', syslog='verbose'
111 error: *** JOB JOBID ON holygpu8a18104 CANCELLED AT 2025-03-24TTIME DUE TO PREEMPTION ***
222 error: TRES ic/sysfs not configured
```

I will let you know when I have a better idea of what is going on.

-- Will

[1]: https://slurm.schedmd.com/slurm.conf.html#OPT_acct_gather_interconnect/sysfs

Paul Edmon:

Thanks for the tip about sysfs. I totally missed that setting. I will get that fixed.
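[Editor's note] The normalization pipeline Will ran above can be exercised on a synthetic log to see how it collapses repeated errors into counts. This is a sketch: the node name, job IDs, timestamps, and log format below are made up for illustration (so the field offset for `cut` differs from the real slurmd log), but the `sed`/`uniq -c` technique is the same.

```shell
# Build a tiny synthetic slurmd-style log (hypothetical format and values).
cat > /tmp/sample-slurmd.log <<'EOF'
[2025-03-24T09:01:02.000] slurmstepd: error: *** JOB 123 ON nodeA CANCELLED AT 2025-03-24T09:01:02 DUE TO PREEMPTION ***
[2025-03-24T09:02:03.000] slurmstepd: error: *** JOB 456 ON nodeA CANCELLED AT 2025-03-24T09:02:03 DUE TO PREEMPTION ***
[2025-03-24T09:02:04.000] slurmstepd: error: TRES ic/sysfs not configured
EOF

# Drop the timestamp field, then normalize job IDs and times so that
# otherwise-identical errors collapse into a single counted line.
grep slurm /tmp/sample-slurmd.log | grep error \
  | cut -d' ' -f2- \
  | sed 's/JOB [[:digit:]]\+ /JOB JOBID /' \
  | sed 's/09:[[:digit:]][[:digit:]]:[[:digit:]][[:digit:]]/TIME/' \
  | sort | uniq -c | sort -n
# The two preemption lines collapse into one entry with count 2.
```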
-Paul Edmon-

Will Shanks:

Hello, I'm having trouble reproducing this, would it be possible to set DebugFlags=Backfill[1] and SlurmctldDebug=debug3[2] for at least the duration of a few of these cycles and share the slurmctld.log? I'm hoping this will help me narrow down the search space for reproducing this locally.

-- Will

[1]: https://slurm.schedmd.com/slurm.conf.html#OPT_Backfill
[2]: https://slurm.schedmd.com/slurm.conf.html#OPT_debug3

Paul Edmon:

Sure, I can set it next time this happens. The issue is definitely sensitive to cluster state, as it's not happening right now.
However, when I see a job in this state again I will hike the debugging and then dump the slurmctld log for you.

-Paul Edmon-
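[Editor's note] For anyone landing on this ticket: the debug settings Will requested amount to a small slurm.conf change. This is a sketch based only on the comment above:

```
# Hypothetical slurm.conf fragment with the requested debug settings.
DebugFlags=Backfill
SlurmctldDebug=debug3
```

Both settings can also be changed at runtime with `scontrol setdebugflags +backfill` and `scontrol setdebug debug3`, which avoids restarting slurmctld while the problematic cluster state is present; remember to revert them afterwards, since debug3 is very verbose.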