| Summary: | SLURM real_decay problem | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | da |
| Version: | 2.5.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | proposed patch | ||
Created attachment 249
proposed patch
The proposed patch will not work on all systems because DBL_MIN_10_EXP is not defined on all systems. The macro DBL_MIN seems to be more widely defined. This patch checks for the appropriate header files, includes one that should define DBL_MIN, and then uses DBL_MIN as the minimum value. This change will be in Slurm v2.5.7 when we release it.
I should note that the patch described in the original bug report should work fine on systems where DBL_MIN_10_EXP is defined. My variant is just usable on more system types. I do believe this is fixed in the current release of Slurm or in the patches. Please reopen this ticket or open a new ticket if not resolved.
The customer is having problems with SLURM v2.5.4 when using a decay halftime of 1 minute (60 sec).

In the file plugins/priority/multifactor/priority_multifactor.c, at line 1029, the real_decay variable is evaluated using the pow function:

    real_decay = pow(decay_factor, (double)run_delta);

The decay_factor is evaluated at line 980:

    decay_factor = 1 - (0.693 / decay_hl);

For a decay halftime of 1 minute (decay_hl=60), decay_factor = (double)1 - (double)(0.693/(double)60) = 0.98845. For large values of run_delta (i.e. run_delta > 61000), the pow function returns a numerical result out of range. Then we get the error "problem applying decay" in the slurmctld.log file. This means that NO decay will be applied EVER. The decay is VERY important in ensuring that proper Fairshare is applied to each job.

The customer proposes the following bug fix:

    /*
     * In order to prevent this, we can find the largest run_delta possible. In other words:
     * if run_delta * log(decay_factor) <= DBL_MIN_10_EXP * log(10), we have a problem because run_delta is too large
     * and causes an underflow of the double. However, since decay_factor is always in the range [0,1], the log of it
     * will always be negative, so we need to check if
     * run_delta is larger than (DBL_MIN_10_EXP/log(decay_factor)).
     * Reminder, DBL_MIN_10_EXP (-337 on some systems) from float.h provides the exponent for the smallest possible value
     * of a base 10 exponent to produce a valid double.
     */

Replace line 1029

    real_decay = pow(decay_factor, (double)run_delta);

by the following lines of code:

    if (run_delta >= (DBL_MIN_10_EXP/log(decay_factor))) {
        real_decay = pow(10, DBL_MIN_10_EXP); // This sets real_decay to the smallest representable double.
    } else {
        real_decay = pow(decay_factor, (double)run_delta);
    }