Hello,

We're seeing these errors on our slurmctld server, originating from the jobcomp/elasticsearch plugin:

> error: jobcomp/elasticsearch: Limit of 1000000 enqueued jobs in memory waiting to be indexed reached. Job 25688571 discarded

We looked at the plugin's source code and noticed a few things:

- MAX_JOBS is set to 1,000,000
- We think the plugin reads the elasticsearch_state file as data.
- We think the plugin compares the elasticsearch_state data against MAX_JOBS.

We checked the line count of elasticsearch_state (using the Linux command wc) and found the count was 1,000,001. If our observations are correct, it would make sense that we get this error when the job count in elasticsearch_state exceeds 1,000,000.

How can we make sure all Slurm job data captured by this plugin will make it to Elasticsearch and not be discarded?

Separately, but relatedly, is the elasticsearch_state file transient (i.e. is it a cache location that is regularly flushed), or does it keep a full history of all the jobs that complete? We want to understand the nature of this file a bit better, so if you could provide any additional information that would be very helpful.

Let me know if you need any other information from me.

Thank you,
Caleb
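For reference, here is a minimal offline sketch of the check we believe the plugin performs (comparing the pending-record count against MAX_JOBS). The file path is a stand-in created for the demo, not our real StateSaveLocation:

```shell
# Sketch of the check we believe jobcomp/elasticsearch performs: compare
# the number of pending records against MAX_JOBS. The state file here is
# a tiny stand-in (one record per line), not our real elasticsearch_state.
MAX_JOBS=1000000

state_file=$(mktemp)
printf 'job1\njob2\njob3\n' > "$state_file"

pending=$(wc -l < "$state_file")
if [ "$pending" -ge "$MAX_JOBS" ]; then
  echo "queue full: discarding new jobs ($pending pending)"
else
  echo "ok: $pending pending"
fi
rm -f "$state_file"
```

On our real system the equivalent count came back as 1,000,001, which matches the "Limit ... reached" error.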
Please attach the current slurm.conf (& friends) and the slurmctld.log.

(In reply to ARCTS Admins from comment #0)
> How can we make sure all slurm job data captured by this plugin will make it
> to Elasticsearch and not be discarded?
>
> Separately, but relatedly, is the elasticsearch_state file transient (i.e.
> is it a cache location that is regularly flushed), or does it keep a full
> history of all the jobs that complete? We want to try to understand the
> nature of this file a bit more, so if you could provide any additional
> information that would be very helpful.

The Elasticsearch (ES) updates are not making it successfully to the ES servers. Slurm only caches the entries until the ES server has accepted them, then purges them, since the cache can be space-intensive.
Created attachment 21390 [details] slurm.conf
Created attachment 21391 [details] slurmctld.log
Hi Nate,

Okay great, that makes sense. Thank you for clarifying!

I've attached our slurm.conf and slurmctld.log. Though, I did not know what you meant by "and friends". Were you asking for slurmdbd.conf?

Best,
Caleb
(In reply to ARCTS Admins from comment #4)
> I've attached our slurm.conf and slurmctld.log. Though, I did not know what
> you meant by "and friends". Were you asking for slurmdbd.conf?

The actual error is not getting logged. Please add this to slurm.conf:

> debugflags=Elasticsearch

and restart slurmctld. It should log the actual issue within a few minutes. Please upload the log.
Created attachment 21415 [details] slurmctld.log with elasticsearch debug flag enabled
I've attached the slurmctld.log that had the Elasticsearch debug flag enabled.

Thanks,
Caleb
(In reply to ARCTS Admins from comment #7)
> I've attached the slurmctld.log that had the Elasticsearch debug flag
> enabled.

I was hoping for a slightly more verbose error:

> HTTP status code 405 received from http://10.242.11.36:9200 which is
> 405 Method Not Allowed

Looks like the Elasticsearch server is unhappy. Usually at this point, I will activate debug logging in ES:

> curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{"transient":{"logger._root":"DEBUG"}}'

Please attach the ES logs. Please also call (correcting the URL as needed):

> curl "http://localhost:9200/_cluster/health"
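As an offline illustration of what to look for in the health output, here is a sample response body (illustrative only, not from this cluster) and a quick way to pull out the status field:

```shell
# Illustrative _cluster/health response; a real one comes from
#   curl -s "http://localhost:9200/_cluster/health"
# (substitute your ES endpoint for localhost:9200)
health='{"cluster_name":"demo","status":"green","number_of_nodes":3}'

# Extract the status field; anything other than "green" (or "yellow")
# warrants a closer look at the ES side.
echo "$health" | grep -o '"status":"[a-z]*"'
```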
Hi Nate,

Apologies for the delay. We found the source of the issue while investigating our network and our proxy server. We thought our proxy might have been interfering, but we determined that was not the case.

We then did another review of our slurm.conf settings for jobcomp/elasticsearch and found that we had not specified the index and type in JobCompLoc. We had overlooked the JobCompLoc URL endpoint change from the 20.11.8 release notes, so our ES cluster did not know which index to put the jobcomp data into. Once we appended "slurm/jobcomp" to our existing <host>:<port> value in JobCompLoc, our ES cluster began indexing the jobcomp data correctly, and we now see that data in the slurm index in ES. The jobcomp/elasticsearch errors on our ctld server have also ceased.

I think this bug can be closed. Thank you for your time!

Best,
Caleb
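For anyone hitting the same error, the resulting configuration looks roughly like this (host and port are placeholders for the ES endpoint; the trailing index/type path is what we added per the 20.11.8 release notes):

```
# slurm.conf (excerpt; <host>:<port> are placeholders)
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://<host>:<port>/slurm/jobcomp
```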
(In reply to ARCTS Admins from comment #9)
> I think this bug can be closed. Thank you for your time!

Closing per the last comment.