| Summary: | sacct -l output | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 2.4.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Patch to make sacct not print errors on systems like BGQ on sub node jobs | | |
For the record, I just noticed that the max_pages_node, max_rss_node, max_vsize_node, and min_cpu_node fields in the database for a job step are all '0' (not surprising, really). There is absolutely no node range to display!

None of this makes any sense unless you are actually using jobacct_gather. I would suggest altering sacct/process.c's find_hostname() to just return if using a front_end system. What do you think?

But this doesn't work on a multi-cluster system, since sacct doesn't get information about the cluster when it goes to get information about the job.

(In reply to comment #3)
> But this doesn't work on a multi-cluster system, since sacct doesn't get
> information about the cluster when it goes to get information about the job.

I suppose you could change find_hostname() to return NULL when the host is whatever it resolves to when the field from the db is '0'.

I don't understand what you are referring to as 0. I would expect 0 to be a valid value for at least some jobs in any field. What field are you referring to?

(In reply to comment #5)
> I don't understand what you are referring to as 0. I would expect 0 to be a
> valid value for at least some jobs in any field. What field are you
> referring to?

max_pages_node, max_rss_node, max_vsize_node, and min_cpu_node. I'm assuming that on a Linux cluster with jobacct_gather enabled, these fields are populated with a node name.

Your assumption would be wrong ;). They contain a node index, so 0 is quite valid :).

The only thing I can think of is altering sacct to look up the cluster beforehand so it knows what to do in this case.

(In reply to comment #8)
> The only thing I can think of is altering sacct to look up the cluster
> beforehand so it knows what to do in this case.

I suppose if there are no identifying characteristics in the contents of the four *_node fields when the node index is '0', then looking up the cluster sounds like a reasonable solution.

Created attachment 138 [details]
Patch to make sacct not print errors on systems like BGQ on sub node jobs
Here is a patch for 2.4 that fixes this. You will have to update the slurmdbd, slurmctld and sacct to make it work correctly.
Since this changes behaviour I am going to put it in 2.5 instead of 2.4.
You can update the database for older jobs as well by setting the respective node IDs to 4294967294 instead of 0 in the step_table of each cluster.
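The manual database update described above might look something like the following SQL, where `vulcan_step_table` is a hypothetical per-cluster table name and the column names are the four fields discussed earlier; this is a sketch only, so verify the actual schema and which rows should be touched before running anything like it:

```sql
-- Sketch: replace "vulcan" with your cluster name and confirm the
-- column names against your step_table schema before running.
UPDATE vulcan_step_table
   SET max_pages_node = 4294967294,
       max_rss_node   = 4294967294,
       max_vsize_node = 4294967294,
       min_cpu_node   = 4294967294
 WHERE max_pages_node = 0
   AND max_rss_node   = 0
   AND max_vsize_node = 0
   AND min_cpu_node   = 0;
```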
FYI, this has been fixed completely in 2.5. Now, if a cluster isn't running a real jobacct_gather plugin, all the statistics gathered by the plugin will be blank in the output of sacct.
When invoking the sacct -l option on BG/Q machines, all of the following fields display an error on certain job steps: MaxVMSizeNode, MaxRSSNode, MaxPagesNode, and MinCPUNode. Here's what the error looks like:

```
sacct: error: hostlist.c:1774 Invalid range: `10000x13331': Invalid argument
0 0 0
sacct: error: hostlist.c:1774 Invalid range: `10000x13331': Invalid argument
```

I suspected this was due to the dimension of the sub-block nodelist range being greater than what slurmdb_setup_cluster_dims() returns. However, the following output for job step zero displays fine:

```
sacct -a -o JobID,JobName,Partition,nodelist%30
       JobID    JobName  Partition                       NodeList
------------ ---------- ---------- ------------------------------
22229             runit     pdebug                     vulcan0020
22229.batch       batch                                vulcan0020
22229.0      /nfs/tmp2+                   vulcan0020[10000x13331]
22230             runit     pdebug                     vulcan0020
22230.batch       batch                                vulcan0020
22230.0      /nfs/tmp2+                   vulcan0020[10000x13331]
```

So I don't know what the optimal fix is: fix the display, or eliminate the problem fields from the sacct -l output on BG/Q systems.