Ticket 287 - SLURM real_decay problem
Summary: SLURM real_decay problem
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.5.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-05-14 03:50 MDT by Jason Coverston
Modified: 2013-05-20 08:24 MDT

See Also:
Site: CRAY


Attachments
proposed patch (2.89 KB, patch)
2013-05-14 06:42 MDT, Moe Jette

Description Jason Coverston 2013-05-14 03:50:26 MDT
The customer is having problems with SLURM v2.5.4 when using a decay halftime of 1 minute (60 sec).

In the file plugins/priority/multifactor/priority_multifactor.c, at line 1029, the real_decay variable is computed with the pow function:

real_decay = pow(decay_factor, (double)run_delta);

When the decay_factor is evaluated at line 980:
decay_factor = 1 - (0.693 / decay_hl);

For a decay halftime of 1 minute (decay_hl=60), 
decay_factor = (double)1 - (double)(0.693/(double)60)= 0.98845

For large values of run_delta (i.e., run_delta > 61000), the pow function returns a numerical result that is out of range.

We then get the error "problem applying decay" in the slurmctld.log file.
This means that NO decay will be applied EVER.
The decay is VERY important in ensuring that proper Fairshare is applied to each job.
The customer proposes the following bug fix:
/* In order to prevent this, we can find the largest run_delta possible. In other words:
 * if run_delta * log(decay_factor) <= DBL_MIN_10_EXP * log(10), we have a problem because run_delta is too large
 * and causes an underflow of the double. However, since decay_factor is always in the range [0,1], the log of it
 * will always be negative, so we need to check if run_delta is larger than (DBL_MIN_10_EXP/log(decay_factor)).
 * Reminder: DBL_MIN_10_EXP (-307 for IEEE 754 doubles) from float.h is the smallest base-10 exponent
 * that still produces a normalized double. */

Replace line 1029
                                real_decay = pow(decay_factor, (double)run_delta);

with the following lines of code:

                if (run_delta >= (DBL_MIN_10_EXP / log(decay_factor)))
                {
                        /* Clamp real_decay to 10^DBL_MIN_10_EXP, a tiny positive value that is still a normalized double. */
                        real_decay = pow(10, DBL_MIN_10_EXP);
                }
                else
                {
                        real_decay = pow(decay_factor, (double)run_delta);
                }
Comment 1 Moe Jette 2013-05-14 06:42:44 MDT
Created attachment 249
proposed patch

The proposed patch will not work on all systems because DBL_MIN_10_EXP is not defined on all systems. The macro DBL_MIN seems to be more universal. This patch checks for the appropriate header files, includes one that should define DBL_MIN, and then uses that as a minimum value. This change will be in Slurm v2.5.7 when we release that.
Comment 3 Moe Jette 2013-05-14 06:44:38 MDT
I should note that the patch described in the original bug report should work fine on systems where DBL_MIN_10_EXP is defined. My variant is just usable on more system types.
Comment 4 Moe Jette 2013-05-20 08:24:21 MDT
I do believe this is fixed in the current release of Slurm or in the patches. Please reopen this ticket or open a new ticket if not resolved.