Created attachment 12982 [details] Patch which improves acct_gather_profile/influxdb data format While trying out acct_gather_profile/influxdb I was disappointed to see that its output puts each field on a line of its own, all with the same field name of "value". That isn't the usual InfluxDB way, which would be to put all the fields of one sample (8 fields in the case of the "task" profile) on the same line, since they all share a timestamp and come from the same source. This current format is inefficient and causes problems downstream. Lech Nieroda described the impact on InfluxDB on slurm-users (https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04487.html) and independently suggested the same change as I am describing here. I plan to use https://docs.timescale.com/latest/tutorials/telegraf-output-plugin instead of InfluxDB, , but the current format would cause similar problems there, creating a separate table for each field rather than a single multi-field table for each profile type. So, the attached patch produces normalised (in the database schema sense) output. And also changes a couple of things which flow on from that: - Parts of the output which stay constant over time are moved out of acct_gather_profile_p_add_sample_data and are instead stored in table->name, renamed to table->prefix. - Translate the negative step numbers into "batch" and "extern" like sacct does. - Since the best choice of InfluxDB "measurement" name was not obvious to me I made that customisable, with "%P" expanding to the profile name ("Task", "Energy" etc.) and "%p" to the lowercase equivalent. Defaulting to "%p_profile". That bit might be overkill.
Thanks for submitting this patch - I'm about to try and run this up on some systems at Pawsey (I'd seen Lech's comments on the mailing list and was about to try and turn it into a patch. Just one wishlist item - Do you know if $SLURM_CLUSTER_NAME is available to the plugin? reason being it'd be really nice to have an additional tag cluster=$SLURM_CLUSTER_NAME to be able to use as a variable Many thanks Andrew
Created attachment 13139 [details] Improved version of the patch, includes cluster tag. This second version of my patch leaves out some redundant code accidentally included in the first version, adds a cluster tag as suggested by Andrew Elwell, and stops the current behaviour where acct_gather_profile_p_task_end needlessly contacts influxdb to send it no data when profile=None.
(I suspect this should be spun off as a separate patch/bug rather than confusing this unit of work) One minor niggle with this plugin is that defining ProfileInfluxDBRTPolicy is compulsory > fatal("No ProfileInfluxDBRTPolicy in your acct_gather.conf file. This is required to use the %s plugin", I feel it should be optional, as if it's not definined in the API, it falls back to default see https://docs.influxdata.com/influxdb/v1.7/guides/writing_data/ > When writing points, you must specify an existing database in the db query parameter. > Points will be written to db’s default retention policy if you do not supply a retention policy via the rp query parameter. > See the InfluxDB API Reference documentation for a complete list of the available query parameters.
Gentle poke to Tim to see if this can be included in the 20.11 release (given that it's a breaking change to include in a point-relese mid 20.02 cycle) It'd be nice not to have to merge changes in locally at build time :-)
Another gentle poke, given that my attempts to re-write the patch for 20.11 "didn't quite work" (tm). (See https://bugs.schedmd.com/show_bug.cgi?id=10344#c2) In my defence, I merely assumed that it would be the changes SchedMD made to the g_job struct that would need need to be rolled forwards: clearly not!
Created attachment 16967 [details] Attempted rewrite of patch for 20.11.0 - does not work Attaching here after Jason Booth suggested that we do so. Initially applying the orignal patch saw the compilation fail, because of changes to the g_job struct. This patch merely alters the g_job struct members so that compilation does complete. When deployed within 20.11.0, we saw failures of jobs to launch, Removing the patch saw jobs launch. Suggestion is that more changes to the patch rae required for 20.11.0. HTH
Created attachment 37049 [details] Patch updated to 23.11.7 Patch updated to 23.11.7. Also makes ProfileInfluxDBRTPolicy optional. Also makes the default for ProfileInfluxDBDefault be None rather than All, so matching docs.
Created attachment 37050 [details] Patch further updated to 24.05.0
Hi, Thanks for the contribution proposal and sorry for the late reply on this. I agree it'd be better that fields of the same type of profiling be written all together to the same time series measurement point, instead of the current design approach, which looking back wasn't the best thought. If this change is done though, sites with written data older to the change willing to query data cross pre/post the change would have their queries and backward-compatibility disrupted. It is feasible to adapt queries to do cross-data matches with something like this: > select * from RSS,task_profile where (value=192 OR RSS=192); Because before the patch fields were written as measurements and the only field key was "value", and after the patch a single measurement of the same type would have as many fields as original measurements. Even if this change would be announced in our CHANGELOG, it is quiet disruptive and I'd need to discuss internally if we're willing to do that, specially due to other sites already using the plugin before this change. I like the idea of naming the measurements like: "task", "network", "energy", etc. rather than "task_profile", "network_profile", "energy_profile". If we do this, perhaps there's no need to rename the table_t struct "name" field to "prefix" either. Also can you elaborate on the value of the new ProfileInfluxDBMeasurementName option to acct_gather.conf? On my testing with your contribution, if set, then all fields from all types of profiling would end up inside that measurement, instead of being split into the different "task", "network", "energy", etc. measurements. I have the feeling that we either do the split always or put all fields of all kinds into a single measurement, but I think the split would be better, and would avoid potential field key naming conflicts. What would be the value of having the possibility of both? If configuration changes we would end up with data again landing in different layouts. I like the lowercase for the naming convention, as seems aligned with InfluxDB best practices. The change to make ProfileInfluxDBRTPolicy optional instead of required with a fatal() if not present looks reasonable, due to the write operation falling back to the databse default. I'm good with this bit. The change to make ProfileInfluxDBDefault=NONE be the default also looks reasonable, I guess it was motivated to be aligned with the acct_gather.conf documentation, where it says it is the default, but the code would default to ALL. There are also various changes to tag key sets, and tag values: 1. You made it so that "Task" kind of profile measurements will include a "task" tag key. But the other kind of profiles "network", "energy", etc. this "task" key is removed (all callers of acct_gather_profile_g_create_dataset() with NO_PARENT argument, which pass the name of the kind of profiling). The "Task" kind of profiling passes the taskid as name argument. This looks reasonable, but again affects potential sites relying on existing queries and workflows. 2. You changed the "step" tag value to use "batch" for batch steps, "extern" for extern steps, etc. Also looks reasonable, but again affects potential sites relying on existing queries and workflows. Also this is missing "interactive" steps and the case of SLURM_PENDING_STEP. We have functions like log_build_step_id_str() that automatically handle the cases formatting. In this case probably with the STEP_ID_FLAG_NO_PREFIX flag. 3. The cluster tag has been added. In ticket 23254 they also requested to add the step username as a tag for group by purposes... While again this looks reasonable, this changes the tag key set for existing data and affects existing queries and workflows. Also keep in mind that tags are indexed, and adding new tags might affect performance and memory consumption. Not sure if it'd be better to have a separate database for each cluster instead of adding a new cluster tag... username seems more reasonable... In summary, there are good ideas worth considering, but I feel they should be discussed further. Also, for ease of reviewing, and as documented in CONTRIBUTING, please break patches up into logically separate chunks, while ensuring that each patch can still be compiled. While not required, we encourage use of git format-patch to generate the patch. This ensures the relevant author line and commit message stay attached. Thansk.
>If this change is done though, sites with written data older to the >change willing to query data cross pre/post the change would have >their queries and backward-compatibility disrupted. Possibly (given that's been years since I could even recall that I had added to this conversation) overthinking this, but, might it be possible to support both the "old" and "new" formats, with the choice being made by using a config variable that would need to be added into Slurm, at least until the old format finally got deprecated? That mght afford sites, using the old format, the opportunity to contunue to do so, whilst they look to swap over to the new format, without any immediate "re-tooling" being required.
If ProfileInfluxDBMeasurementName is set to something which does not contain "%P" or "%p" then yes, fields from all types of profiling would end up inside that one measurement. That might be enough reason to scrap ProfileInfluxDBMeasurementName. I'm OK with just using the lowercase dataset name, equivalent to forcing ProfileInfluxDBMeasurementName="%p". I agree that it would be possible and desirable to support both formats, at least for a while. So in my very latest version of the patch I have code to produce either format (necessitating the split of my table_t struct "prefix" field into "name" and "tags" fields) and a boolean config option "ProfileInfluxDBNormalized". I'll attach that later as it is totally untested at present. The same goes for the other format changes which have accumulated here - they can be made optional so as to allow backward compatibility. An additional boolean config option could select between raw step numbers or passing them through log_build_step_id_str() to get step names. ProfileInfluxDBStepNames? Those two boolean option could initially default to false, then true in a later release, and maybe be phased out after a few years. Inclusion of the cluster name, user name, and any other additional tags anyone requests later (uid?, account?) could be selected via a more permanent option done in the style of a Slurm "Parameters" or "Flags" option: a comma delimited string which can contain "cluster" and/or "user". "ProfileInfluxDBExtraTags"?
(In reply to Peter Maxwell from comment #12) > If ProfileInfluxDBMeasurementName is set to something which does not contain > "%P" or "%p" then yes, fields from all types of profiling would end up > inside that one measurement. That might be enough reason to scrap > ProfileInfluxDBMeasurementName. I'm OK with just using the lowercase dataset > name, equivalent to forcing ProfileInfluxDBMeasurementName="%p". All right, thanks. > I agree that it would be possible and desirable to support both formats, at > least for a while. So in my very latest version of the patch I have code to > produce either format (necessitating the split of my table_t struct "prefix" > field into "name" and "tags" fields) and a boolean config option > "ProfileInfluxDBNormalized". I'll attach that later as it is totally > untested at present. Let me discuss how much we want to support multiple formats internally and we'll come back to you. > The same goes for the other format changes which have accumulated here - > they can be made optional so as to allow backward compatibility. > > An additional boolean config option could select between raw step numbers or > passing them through log_build_step_id_str() to get step names. > ProfileInfluxDBStepNames? And discuss this as well. > Those two boolean option could initially default to false, then true in a > later release, and maybe be phased out after a few years. Perhaps it'd be better to create a ProfileInfluxDBFlags or Params and add booleans governing behavior/formats there, if at all. > Inclusion of the cluster name, user name, and any other additional tags > anyone requests later (uid?, account?) could be selected via a more > permanent option done in the style of a Slurm "Parameters" or "Flags" > option: a comma delimited string which can contain "cluster" and/or "user". > "ProfileInfluxDBExtraTags"? That's something I need to discuss too.
Not a big deal, but here is one more minor thing to consider changing: the number formats. According to https://docs.influxdata.com/influxdb/v1/write_protocols/line_protocol_reference/ integers should be suffixed with "i" to be recognised as such, and floats don't need to contain a decimal point. So that probably explains why I only see (double precision) floats in my database. Which has been fine actually. For the floats, "%.15g" might be more appropriate than the current "%.2f", as that would avoid a loss of precision on any very small numbers and an excessive length of any very large numbers.
Created attachment 42963 [details] Patch updated to 24.11.6
Created attachment 42964 [details] Patch now with choice of format and choice of extra tags. This version follows from recent discussion on the ticket. It removes the option InfluxDBProfileName but adds options InfluxDBNewFormat (boolean) and InfluxDBExtraTags (string). The "new" format is normalised, uses the special step names (as before, but now via Slurm library code), and now also uses different numerical formatting of the values. The "old" format is intended to be the same as without the patch. InfluxDBExtraTags applies to either and can include any combination of "cluster", "uid", and "user".
Created attachment 44693 [details] Updated to 25.11.4 Equivalent to previous versions of this patch, plus using "u" instead of "i" so as to let InfluxDB know that the ints are unsigned ones, and a couple of minor fixes in acct_gather_profile_p_conf_values.
Created attachment 44694 [details] Make ProfileInfluxDBRTPolicy optional.
Created attachment 44695 [details] ProfileInfluxDBDefault default of None rather than All, so matching docs.
Created attachment 44696 [details] Make ProfileInfluxDBRTPolicy optional.
Created attachment 44697 [details] No empty message to InfluxDB at end of an un-profiled job.
Created attachment 44698 [details] Optional new normalised format for data sent to InfluxDB.
Created attachment 44699 [details] Use names for special steps such as batch and extern.
Created attachment 44700 [details] Represent InfluxDB uints and floats as such.
Created attachment 44701 [details] Allow cluster, user, or uid as extra InfluxDB tags.
In addition to updating my patch for 25.11 I have also updated it for the current master and uploaded that version here as a sequence of 7 smaller patches. Of those 7, the first 3 are small and independent fixes which were discovered along the way. The 4th introduces the new format which is the main point of this ticket, and then the 5th and 6th make further improvements to that new format. The final one (extra tags) applies to both formats.