Summary: | Kafka JobComp Plugin erroneously reports multiple partitions | ||
---|---|---|---|
Product: | Slurm | Reporter: | Thomas Langford <thomas.langford> |
Component: | Accounting | Assignee: | Alejandro Sanchez <alex> |
Status: | OPEN --- | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | tim |
Version: | 23.11.10 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Yale | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: | |
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
Thomas Langford
2025-03-17 13:25:03 MDT
Hi everyone, it's been a week since I submitted this ticket and I haven't heard anything back yet. Is there any additional information I could provide to help out? These are my JobComp config settings:

rdkafka.conf
--------------
bootstrap.servers=XXX.XXX.XXX.XXX:9092
debug=broker,topic,msg
linger.ms=400
log_level=7

slurm.conf
--------------
JobCompType=jobcomp/kafka
JobCompLoc=/opt/slurm/current/etc/rdkafka.conf
JobCompParams=flush_timeout=200,poll_interval=3,requeue_on_msg_timeout,topic=slurm_accounting

Happy to provide anything else that's useful.

Hi,

Sorry I didn't get back to this earlier. Historically, all jobcomp plugins (including kafka and elasticsearch, which share a common serialization code path) have sent the partition field straight from job_record_t->partition, as-is. That field is defined as:

char *partition; /* name of job partition(s) */

Changing this to suit your expectations would be a change in behavior, potentially disrupting other sites' expectations. We can discuss this internally and come back to you.

Thanks for the clarification. I'm surprised that there isn't a field for "partition where this job ran". That's what I expected the "partition" value to be; perhaps "submitted partition list" could be separated from "partition"? The corresponding field from sacct gets updated to be the "partition where the job ran", hence my expectation that this would be the same information.

Do other sites not use the JobComp plugins for long-term accounting? I don't really see the value of the "submitted partition", since that list could contain all partitions. We really care about the breakdown of jobs that ran in privately owned partitions vs. commons partitions, as that factors into how we report usage to our various research departments.

Is there another field that I'm missing in the JobComp datastream that indicates which partition a job actually ran in?

Thanks so much,
-t

Hi Thomas,

Just to give you an update: after internal discussion, we'll work on changing the "partition" field in jobcomp plugins for 25.05 to reflect the partition the job ran on, instead of the current comma-separated list of all partitions the job was submitted to. We'll make sure to communicate the change at release time for sites used to the previous behavior. I'll come back to you when it's ready.

Thanks.

Fantastic, thanks! Looking forward to the implementation. I've been really happy with the Kafka jobcomp plugin, thanks for all the hard work!
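
For anyone consuming the current (pre-25.05) datastream, the behavior discussed above can be illustrated with a minimal Python sketch. It assumes each message is a JSON object whose "partition" value is the comma-separated submission list, as described in this ticket. The "jobid" field, the partition names, and the sample payload are hypothetical and are not taken from real plugin output. The sketch only detects records where the partition the job ran in cannot be determined from this field alone; it does not recover that partition.

```python
import json


def submitted_partitions(record: dict) -> list[str]:
    """Split the comma-separated "partition" value into individual names."""
    raw = record.get("partition") or ""
    return [name.strip() for name in raw.split(",") if name.strip()]


def is_ambiguous(record: dict) -> bool:
    """True when the job was submitted to more than one partition, so the
    field alone does not say where the job actually ran."""
    return len(submitted_partitions(record)) > 1


# Hypothetical payload; real messages carry many more fields.
message = '{"jobid": 12345, "partition": "private_a,commons"}'
record = json.loads(message)

print(submitted_partitions(record))
print(is_ambiguous(record))
```

Records flagged this way would need the ran-in partition looked up elsewhere (for example from sacct, which this ticket notes does report it) until the 25.05 change lands.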