For those looking to invoke the sacct -l option on BG/Q machines, all of the following fields display an error on certain job steps: MaxVMSizeNode, MaxRSSNode, MaxPagesNode, and MinCPUNode. Here's what the error looks like:

sacct: error: hostlist.c:1774 Invalid range: `10000x13331': Invalid argument
0 0 0
sacct: error: hostlist.c:1774 Invalid range: `10000x13331': Invalid argument

I suspected this was due to the dimension of the sub-block nodelist range being greater than what slurmdb_setup_cluster_dims() returns. However, the following job step zero output displays fine:

sacct -a -o JobID,JobName,Partition,nodelist%30
       JobID    JobName  Partition                       NodeList
------------ ---------- ---------- ------------------------------
22229             runit     pdebug                     vulcan0020
22229.batch       batch                               vulcan0020
22229.0      /nfs/tmp2+                   vulcan0020[10000x13331]
22230             runit     pdebug                     vulcan0020
22230.batch       batch                               vulcan0020
22230.0      /nfs/tmp2+                   vulcan0020[10000x13331]

So I don't know what the optimal fix is: fix the display, or eliminate the problem fields from the sacct -l output on BG/Q systems.
For the record, I just noticed that the max_pages_node, max_rss_node, max_vsize_node, and min_cpu_node fields in the database for a job step are all '0' (not surprising really). There is absolutely no node range to display!
None of this makes any sense unless you are actually using jobacct_gather. I would suggest altering find_hostname() in sacct/process.c to just return if using a front_end system. What do you think?
But this doesn't work on a multi cluster system since sacct doesn't get information about the cluster when it goes to get information about the job.
(In reply to comment #3)
> But this doesn't work on a multi cluster system since sacct doesn't get
> information about the cluster when it goes to get information about the job.

I suppose you could change find_hostname() to return NULL when the host is whatever it resolves to when the field from the db is '0'.
I don't understand what you are referring to as 0? I would expect 0 to be a valid number for at least some jobs for any field. What field are you referring to?
(In reply to comment #5)
> I don't understand what you are referring to as 0? I would expect 0 to be a
> valid number for at least some jobs for any field. What field are you
> referring to?

max_pages_node, max_rss_node, max_vsize_node, and min_cpu_node. I'm assuming that on a Linux cluster with jobacct_gather enabled, these fields are populated with a node name.
Your assumption would be wrong ;). They contain a node index, so 0 is quite valid :).
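To make that concrete, here is a toy sketch (names are mine, not SLURM's) of what "node index" means: the stored value selects a position in the step's node list, so 0 legitimately refers to the first node rather than signalling "empty":

```c
#include <stddef.h>
#include <stdint.h>

/* Toy illustration: the *_node columns hold an index into the step's
 * node list, not a hostname, so 0 is a perfectly valid value meaning
 * "the first node of the step". */
static const char *node_by_index(const char *const nodes[], size_t nnodes,
				 uint32_t inx)
{
	if (inx >= nnodes)
		return NULL;
	return nodes[inx];
}
```

This is why a stored '0' cannot by itself tell sacct "there is nothing to display" — it is indistinguishable from a genuine reference to the first node.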
The only thing I can think of is to alter sacct to look up the cluster beforehand so it knows what to do in this case.
(In reply to comment #8)
> The only thing I can think of is alter sacct to look up the cluster before
> hand so it knows what to do in this case.

I suppose if there are no identifying characteristics in the contents of the four *_node fields when the node index is '0', then looking up the cluster sounds like a reasonable solution.
Created attachment 138 [details]
Patch to make sacct not print errors on systems like BGQ on sub node jobs

Here is a patch for 2.4 that fixes this. You will have to update the slurmdbd, slurmctld, and sacct to make it work correctly. Since this changes behaviour, I am going to put it in 2.5 instead of 2.4. You can update the database for older jobs as well by setting the respective nodeids to 4294967294 instead of 0 in the step_table of each cluster.
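For what it's worth, 4294967294 is 0xfffffffe, which matches SLURM's NO_VAL sentinel for 32-bit fields. A read-side check could then look something like this sketch (the macro and function names are mine, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* 4294967294 == 0xfffffffe, i.e. SLURM's NO_VAL sentinel.  Storing it
 * instead of 0 lets the reader tell "no node index recorded" apart
 * from "node index 0".  Sketch only; names are illustrative. */
#define STEP_NODE_INX_NO_VAL 0xfffffffeU

static bool step_node_inx_valid(uint32_t inx)
{
	return inx != STEP_NODE_INX_NO_VAL;
}
```

With a distinct sentinel in the step_table, sacct can skip hostname resolution entirely for steps that never had per-node statistics, instead of misreading index 0 as a real node.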
FYI, this has been fixed completely in 2.5. Now if a cluster isn't running a real jobacct_gather plugin all the statistics gathered by the plugin will be blank in the output of sacct.