| Summary: | jobcomp/elasticsearch reaching limit of 1000000 enqueued jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | Other | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nate |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmctld.log, slurmctld.log with elasticsearch debug flag enabled | ||
Description

ARC Admins 2021-09-22 09:21:21 MDT

Nate Rini:

Please attach the current slurm.conf (& friends) and the slurmctld.log.

(In reply to ARCTS Admins from comment #0)
> How can we make sure all slurm job data captured by this plugin will make it
> to Elasticsearch and not be discarded?
>
> Separately, but relatedly, is the elasticsearch_state file transient (i.e.
> is it a cache location that is regularly flushed), or does it keep a full
> history of all the jobs that complete? We want to try to understand the
> nature of this file a bit more, so if you could provide any additional
> information that would be very helpful.

The Elasticsearch (ES) updates are not making it successfully to the ES servers. Slurm caches the entries only until the ES server has accepted them, then purges them, since the cache can be space-intensive.

ARC Admins:

Created attachment 21390 [details]
slurm.conf

Created attachment 21391 [details]
slurmctld.log

Hi Nate,

Okay great, that makes sense. Thank you for clarifying! I've attached our slurm.conf and slurmctld.log. Though, I did not know what you meant by "and friends". Were you asking for slurmdbd.conf?

Best,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #4)
> I've attached our slurm.conf and slurmctld.log. Though, I did not know what
> you meant by "and friends". Were you asking for slurmdbd.conf?

The actual error is not getting logged. Please add this to slurm.conf:
> debugflags=Elasticsearch
and restart slurmctld. It should log the actual issue within a few minutes. Please upload the log.

ARC Admins:

Created attachment 21415 [details]
slurmctld.log with elasticsearch debug flag enabled

I've attached the slurmctld.log that had the Elasticsearch debug flag enabled.

Thanks,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #7)
> I've attached the slurmctld.log that had the Elasticsearch debug flag
> enabled.

I was hoping for a slightly more verbose error:
> HTTP status code 405 received from http://10.242.11.36:9200
which is
> 405 Method Not Allowed

Looks like the elasticsearch server is unhappy. Usually at this point, I will activate the debug logging in ES:
> curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{"transient":{"logger._root":"DEBUG"}}'

Please attach the ES logs. Please also call (and correct the URL as needed):
> curl "http://localhost:9200/_cluster/health"

ARC Admins:

Hi Nate,

Apologies for the delay. We found the source of the issue while investigating our network and our proxy server. We thought our proxy might have been interfering, but we determined that was not the case. We then did another review of our slurm.conf settings for jobcomp/elasticsearch and found that we had not specified the index and type in JobCompLoc. We had overlooked the JobCompLoc URL endpoint change in the 20.11.8 release notes, so our ES cluster did not know which index to put the jobcomp data into. Once we appended "slurm/jobcomp" to our existing <host>:<port> value in JobCompLoc, our ES cluster began to index the jobcomp data correctly, and we now see that data in the slurm index in ES. The jobcomp/elasticsearch errors from our slurmctld server have also ceased.

I think this bug can be closed. Thank you for your time!

Best,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #9)
> I think this bug can be closed. Thank you for your time!

Closing per the last comment.
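For reference, the fix described above amounts to a slurm.conf change along these lines. This is a sketch: the host and port are taken from the log excerpt in this thread and stand in for the site's actual Elasticsearch endpoint, and "slurm/jobcomp" is the index/type path the reporter appended.

```
# jobcomp/elasticsearch configuration sketch (illustrative host:port)
JobCompType=jobcomp/elasticsearch

# Before (no index in the URL -- ES rejected the requests, and Slurm's
# send queue grew toward its 1000000-entry limit):
#JobCompLoc=http://10.242.11.36:9200

# After (index and type appended, matching the 20.11 URL endpoint change):
JobCompLoc=http://10.242.11.36:9200/slurm/jobcomp
```

After restarting slurmctld, one way to confirm documents are arriving is to query the index document count, e.g. `curl "http://10.242.11.36:9200/slurm/_count"` (again with the URL corrected for the actual ES host).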