Ticket 3010 - possible to store requested/used node features in slurm accounting database?
Summary: possible to store requested/used node features in slurm accounting database?
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: KNL
Version: 16.05.4
Hardware: Cray XC Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
Reported: 2016-08-18 16:53 MDT by Doug Jacobsen
Modified: 2024-09-25 14:53 MDT
Site: NERSC

Description Doug Jacobsen 2016-08-18 16:53:16 MDT
Hello,

We're going to need to be able to determine which portions of our workload use which settings. Would it be possible to store node feature information in the database and expose it via sacct? Or is it already possible and I'm missing it?

Thanks,
Doug

ctlnet1:~ # sacct -j 690 --format=all -X -p
AllocCPUS|AllocGRES|AllocNodes|AllocTRES|Account|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqGRES|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|WCKey|WCKeyID|
272|craynetwork:4,7168616:0|1|cpu=272,mem=90G,node=1|mpccc|2115|||||||||gerty||||2-17:25:52|235552|0:0|00:14:26|2016-08-18T15:36:37|Unknown|0:0|46371|cookbg|690|690|test.sl||||||||||||||||||||272|1|nid00047||1|knl|normal|1|Unknown|Unknown|Unknown|Unknown|1|craynetwork:1|90Gn|1|cpu=1,node=1|||00:00:02|00:00:02|2|2016-08-18T15:36:39|RUNNING|2016-08-18T15:36:37|00:00:00||00:30:00|00:00:00|46371|cookbg|||0|
ctlnet1:~ # scontrol show job 690
JobId=690 JobName=test.sl
   UserId=cookbg(46371) GroupId=cookbg(46371) MCS_label=N/A
   Priority=1 Nice=0 Account=mpccc QOS=normal
   JobState=CONFIGURING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:14:41 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2016-08-18T15:36:37 EligibleTime=2016-08-18T15:36:37
   StartTime=2016-08-18T15:36:39 EndTime=2016-08-18T16:06:39 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=knl AllocNode:Sid=gert01:42928
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid00047
   BatchHost=nid00047
   NumNodes=1 NumCPUs=272 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=272,mem=90G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=90G MinTmpDiskNode=0
   Features=quad&cache Gres=craynetwork:1 Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/u1/c/cookbg/upc/test.sl
   WorkDir=/global/u1/c/cookbg/upc
   StdErr=/global/u1/c/cookbg/upc/slurm-690.out
   StdIn=/dev/null
   StdOut=/global/u1/c/cookbg/upc/slurm-690.out
   Power=


ctlnet1:~ #
Comment 1 Doug Jacobsen 2016-08-18 16:54:38 MDT
Sorry, this is related to KNL feature requests (such as quad, cache, hemi, etc.).

Also, it would be very valuable to know the time spent in the configuring state, and record that in a way we can get at it in the database.

Thanks,
Doug
Comment 2 Tim Wickberg 2016-08-19 08:37:58 MDT
Node feature requests aren't stored currently; storing them will need to wait until 17.02, when the RPC formats can change to support it.

I've seen some other sites overload the 'comment' field by packing additional info in through a job submit plugin; that might work as a stop-gap in the meantime.
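
As a rough illustration of that stop-gap, here is a minimal Python sketch of one possible packing convention and of reading it back from sacct output. Everything about the convention is an assumption made for illustration: the `features=` key, the `;` separator, and the helper names are hypothetical, not anything Slurm defines. The packing step would really live in a site's job submit plugin; it is written in Python here only to show the string format.

```python
# Hypothetical convention (not defined by Slurm): key=value pairs joined
# by ';' inside the job comment, with the key "features" holding the
# requested node feature expression.
FEATURE_KEY = "features"


def pack_comment(user_comment, features):
    """Append a features=... pair to whatever comment the user supplied."""
    tag = f"{FEATURE_KEY}={features}"
    return f"{user_comment};{tag}" if user_comment else tag


def unpack_features(comment):
    """Return the feature expression stored in a comment, or None if absent."""
    for part in comment.split(";"):
        key, sep, value = part.partition("=")
        if sep and key.strip() == FEATURE_KEY:
            return value
    return None


def features_by_job(sacct_text):
    """Map JobID to feature expression.

    Expects the text printed by something like
    `sacct -X -p --format=JobID,Comment` (header row, '|' after each field).
    Caveat: a feature expression containing '|' (an OR request) would
    collide with the field delimiter and is not handled here.
    """
    def fields(line):
        # Drop the single trailing '|' that parsable output adds.
        if line.endswith("|"):
            line = line[:-1]
        return line.split("|")

    lines = [ln for ln in sacct_text.splitlines() if ln.strip()]
    header = fields(lines[0])
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, fields(line)))
        result[row["JobID"]] = unpack_features(row.get("Comment", ""))
    return result
```

With this convention, a job submitted with `quad&cache` and no user comment would carry the comment `features=quad&cache`, and `features_by_job` would report that expression for its job ID. Jobs without the tag map to `None`.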

Time spent in the CF (configuring) state isn't captured either (aside from messages in slurmctld.log), although I can see how that may be more important with KNL taking a while to switch modes.

If it's alright with you, I'll reclassify this as an enhancement targeting the 17.02 release.

- Tim
Comment 4 Doug Jacobsen 2016-08-19 09:19:04 MDT
OK, I'll look at populating the comment field for now; that's a good idea.

Thanks,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
Comment 7 Chris Samuel (NERSC) 2024-09-25 14:53:42 MDT
Cori is gone now.