Ticket 16919 - Failure in sending InfluxDB profiling data
Summary: Failure in sending InfluxDB profiling data
Status: RESOLVED DUPLICATE of ticket 16941
Alias: None
Product: Slurm
Classification: Unclassified
Component: Profiling
Version: 23.02.2
Hardware: All All
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
Reported: 2023-06-07 13:18 MDT by schedmail
Modified: 2023-06-12 11:08 MDT

See Also:
Site: -Other-


Description schedmail 2023-06-07 13:18:29 MDT
I strongly suspect there is a bug in the acct_gather code that prevents profiling data from being sent to InfluxDB.
(It dates from Jul 29, 2022 and is in the following commit:
https://github.com/SchedMD/slurm/commit/05e64074205a76c25e6a89a0ac33a951a5e8d8df)
The bug is as follows (for line references, see the commit mentioned above):

In line 685, GPUUtilization is declared with a double field type:
		{ "GPUUtilization", PROFILE_FIELD_DOUBLE },
But in line 785 it is assigned a u64 value (if the node has a GPU in its TRES config):
			data[FIELD_GPUUTIL].u64 =
				jobacct->tres_usage_in_tot[gpuutil_pos];

If we now look at the acct_gather_profile_p_add_sample_data function in https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_profile/influxdb/acct_gather_profile_influxdb.c, we see the following:
for(; i < table->size; i++) {
		switch (table->types[i]) {
		case PROFILE_FIELD_UINT64:
			xstrfmtcat(str, "%s,job=%d,step=%d,task=%s,"
				   "host=%s value=%"PRIu64" "
				   "%"PRIu64"\n", table->names[i],
				   g_job->step_id.job_id,
				   g_job->step_id.step_id,
				   table->name, g_job->node_name,
				   ((union data_t*)data)[i].u,
				   (uint64_t)sample_time);
			break;
		case PROFILE_FIELD_DOUBLE:
			xstrfmtcat(str, "%s,job=%d,step=%d,task=%s,"
				   "host=%s value=%.2f %"PRIu64""
				   "\n", table->names[i],
				   g_job->step_id.job_id,
				   g_job->step_id.step_id,
				   table->name, g_job->node_name,
				   ((union data_t*)data)[i].d,
				   (uint64_t)sample_time);
			break;

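So the value that was stored as a u64 is read back through the double member of the union and formatted with %.2f, as if it had been a double all along.
Depending on the stored bit pattern, this sometimes produces a number that InfluxDB cannot parse. InfluxDB then rejects the request and no data is sent.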
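
The mismatch can be reproduced outside Slurm with a minimal sketch. Everything here is illustrative: `union sample` is a hypothetical stand-in for the plugin's `union data_t`, and the three helper functions are made up for this example. The sketch only shows what happens when an integer is stored through the `uint64_t` member and then read through the `double` member. It does not claim which values the plugin actually saw.

```c
#include <assert.h>
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the plugin's union data_t. */
union sample {
	uint64_t u;
	double d;
};

static_assert(sizeof(uint64_t) == sizeof(double),
	      "both members must occupy the same 8 bytes");

/* Mirrors the PROFILE_FIELD_DOUBLE branch: read .d, print with %.2f. */
int format_as_double(union sample s, char *buf, size_t len)
{
	return snprintf(buf, len, "value=%.2f", s.d);
}

/* Mirrors the PROFILE_FIELD_UINT64 branch: read .u, print as an integer. */
int format_as_uint64(union sample s, char *buf, size_t len)
{
	return snprintf(buf, len, "value=%" PRIu64, s.u);
}

/* Store an integer, then report whether the same bits are a finite double. */
int reads_back_finite(uint64_t stored)
{
	union sample s;

	s.u = stored;
	return isfinite(s.d) != 0;
}
```

With `s.u = 87`, `format_as_uint64` yields `value=87`, while `format_as_double` yields `value=0.00`, because small integers decode to subnormal doubles. With `s.u = UINT64_MAX` the bits decode to NaN, so `reads_back_finite` returns 0 and `%.2f` prints a non-numeric token. That would be consistent with the rejected writes described above. Either storing the value through the double member or declaring the field as PROFILE_FIELD_UINT64 would keep the write and the read consistent.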