Ticket 16919

Summary: Failure in sending InfluxDB profiling data
Product: Slurm Reporter: schedmail
Component: Profiling Assignee: Alejandro Sanchez <alex>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: tim
Version: 23.02.2   
Hardware: All   
OS: All   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description schedmail 2023-06-07 13:18:29 MDT
I strongly suspect there is a bug in the acct gather code that prevents profiling data from being sent to InfluxDB.
(The bug dates from Jul 29, 2022 and was introduced in the following commit:
https://github.com/SchedMD/slurm/commit/05e64074205a76c25e6a89a0ac33a951a5e8d8df)
The bug is as follows (for line references, please see the commit mentioned above):

On line 685, GPUUtilization is declared as a double field:
		{ "GPUUtilization", PROFILE_FIELD_DOUBLE },
But on line 785 its value is stored through the u64 member (if the node has a GPU in its TRES configuration):
			data[FIELD_GPUUTIL].u64 =
				jobacct->tres_usage_in_tot[gpuutil_pos];

The acct_gather_profile_p_add_sample_data function in https://github.com/SchedMD/slurm/blob/master/src/plugins/acct_gather_profile/influxdb/acct_gather_profile_influxdb.c then contains the following:
for(; i < table->size; i++) {
		switch (table->types[i]) {
		case PROFILE_FIELD_UINT64:
			xstrfmtcat(str, "%s,job=%d,step=%d,task=%s,"
				   "host=%s value=%"PRIu64" "
				   "%"PRIu64"\n", table->names[i],
				   g_job->step_id.job_id,
				   g_job->step_id.step_id,
				   table->name, g_job->node_name,
				   ((union data_t*)data)[i].u,
				   (uint64_t)sample_time);
			break;
		case PROFILE_FIELD_DOUBLE:
			xstrfmtcat(str, "%s,job=%d,step=%d,task=%s,"
				   "host=%s value=%.2f %"PRIu64""
				   "\n", table->names[i],
				   g_job->step_id.job_id,
				   g_job->step_id.step_id,
				   table->name, g_job->node_name,
				   ((union data_t*)data)[i].d,
				   (uint64_t)sample_time);
			break;

So the value that was stored as a u64 is read back through the double member and formatted with %.2f, i.e. its bits are interpreted as a double.
This sometimes produces a number that InfluxDB cannot parse, in which case InfluxDB rejects the request and no data gets through.
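
To illustrate the mismatch, here is a minimal standalone sketch (not Slurm code). It assumes IEEE 754 doubles. The union only mirrors the .u and .d members seen in the excerpt above, and format_as_double / format_as_uint64 are hypothetical helpers that reproduce just the "value=..." part of the two xstrfmtcat branches.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the sample union; member names follow the excerpt above. */
union data_t {
	uint64_t u;
	double d;
};

/* Hypothetical helper: what the PROFILE_FIELD_DOUBLE branch does with a
 * sample, i.e. read the .d member and print it with %.2f. */
static int format_as_double(union data_t sample, char *buf, size_t len)
{
	return snprintf(buf, len, "value=%.2f", sample.d);
}

/* Hypothetical helper: what the PROFILE_FIELD_UINT64 branch does with a
 * sample, i.e. read the .u member and print it with PRIu64. */
static int format_as_uint64(union data_t sample, char *buf, size_t len)
{
	return snprintf(buf, len, "value=%" PRIu64, sample.u);
}
```

If a sample is written with sample.u = 42 and then passed to format_as_double, the integer's bits are read as a tiny subnormal double and the result is "value=0.00" instead of "value=42". With an all-ones bit pattern (UINT64_MAX, picked here only for illustration) the .d member is a NaN, and as far as I know InfluxDB's line protocol does not accept NaN as a field value. Either storing the sample through the double member or declaring the field as PROFILE_FIELD_UINT64 would make the two sides agree.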