| Summary: | [825617] - Slurmd slow to CONFIRM ALPS reservation resulting in the Slurm Controller killing jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Cray ALPS | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian.gilmer, brian, da |
| Version: | 14.11.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.11.7, 15.08.0-0pre5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Jason Coverston
2015-04-27 05:31:28 MDT
Jason, commits 26624602504, 97aedaf890f, 01da71b836f, and 5ab69ccb6de should address this issue. Or you can just do a git pull on the 14.11 branch. I think this, along with the commits mentioned in bug 1623, will fix all the KAUST problems other than the speed of job launch, which is due to serializing the reservation requests to ALPS. Unless needed, that will most likely remain the way it is. Please report back on your findings.

I will mention that sched_min_interval=1000000 should be added to SchedulerParameters in your slurm.conf. The time is in microseconds, so the value listed here is 1 second. You can use larger numbers as well if you would like, but this is what I tested with and it appeared to work fine.

Hi Danny,

The patches look very good!

We tested with sched_min_interval=1000000 today on the system. Everything went well until we tried 3 x 10,000 array element jobs. The 3 x 10,000 array element job locked up the controller daemon with this setting, and we saw the "unbound" reservations appear in apstat. Jobs dropped to zero eventually.

We killed off everything and bumped it to sched_min_interval=2000000. The controller daemon eventually locked up again.

We had to end our test session after this, unfortunately, and could not test with a higher setting. If you have any input on what a good setting would be, please advise.

Thanks!!

Jason

Thanks for the feedback, Jason. If this happens again, could you please try to attach to the slurmctld and a slurmd with gdb and send the output of `thread apply all bt`?

Hi Danny,

Will do. What is the impact of setting this sched_min_interval parameter to 2000000, or perhaps to a higher value such as 3000000 or 4000000? Would a higher value be expected to avoid the problems seen with the 3 x 10,000 array element job and allow that test to complete? Also, could the job launch rate be affected the higher we go with the interval setting?

All that parameter means is that there is more time between scheduler runs. At worst it will only delay scheduling a chunk of jobs by that time period. I would have expected 1000000 to solve the problem; since this happens with both 1000000 and 2000000, it really points to something else unknown at the moment. Can you test on Crystal and see if you can reproduce? I am hoping that if we get a backtrace while it is happening, we will be able to see the bottleneck. Since scheduling jobs is linked to job launch rate, it would affect launch rate, but not dramatically unless the number was really high (in the tens of seconds).

Jason, would you be able to get some logs from the slurmctld and slurmd while this issue was happening?

Jason, could you also attach the slurm.conf from the system?

Looks like the current state is acceptable. Please reopen if needed.
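The sched_min_interval recommendation above would look roughly like this in slurm.conf. This is a sketch: sched_min_interval is the parameter named in the thread, while the surrounding lines (SchedulerType and other SchedulerParameters options) are illustrative assumptions, not taken from this site's configuration.

```
# slurm.conf (fragment) -- illustrative sketch
# sched_min_interval is in microseconds: 1000000 = 1 second minimum
# between main-scheduler runs, throttling how often slurmctld walks
# the job queue (useful when each job launch is slow, as with
# serialized ALPS reservation requests).
SchedulerType=sched/backfill
SchedulerParameters=sched_min_interval=1000000
```

Raising the value (e.g. 2000000 for 2 seconds) further spaces out scheduler passes; as noted above, this mainly delays scheduling a batch of jobs by that interval rather than changing per-job behavior.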
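The backtrace requested above can be captured non-interactively with gdb in batch mode. A sketch, assuming gdb is installed on the node and the daemons are running under their usual names (output file names are arbitrary):

```
# Attach to the running slurmctld, dump backtraces for every thread,
# then detach; -batch exits gdb automatically afterward.
gdb -p "$(pidof slurmctld)" -batch -ex 'thread apply all bt' > slurmctld-bt.txt

# Same for a slurmd on an affected node:
gdb -p "$(pidof slurmd)" -batch -ex 'thread apply all bt' > slurmd-bt.txt
```

Capturing this while the controller is locked up is what makes the bottleneck visible; the daemon is only paused for the moment gdb is attached.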