| Summary: | jobcomp/elasticsearch reaching limit of 1000000 enqueued jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | Other | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nate |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmctld.log, slurmctld.log with elasticsearch debug flag enabled | ||
Description

ARC Admins 2021-09-22 09:21:21 MDT

Nate Rini:

Please attach the current slurm.conf (& friends) and the slurmctld.log.

(In reply to ARCTS Admins from comment #0)
> How can we make sure all slurm job data captured by this plugin will make it
> to Elasticsearch and not be discarded?
>
> Separately, but relatedly, is the elasticsearch_state file transient (i.e.
> is it a cache location that is regularly flushed), or does it keep a full
> history of all the jobs that complete? We want to try to understand the
> nature of this file a bit more, so if you could provide any additional
> information that would be very helpful.

The Elasticsearch (ES) updates are not making it successfully to the ES servers. Slurm caches the entries only until the ES server has accepted them, then purges them, since the cache can be space-intensive.

ARC Admins:

Created attachment 21390 [details]
slurm.conf

Created attachment 21391 [details]
slurmctld.log

Hi Nate,

Okay great, that makes sense. Thank you for clarifying! I've attached our slurm.conf and slurmctld.log. Though, I did not know what you meant by "and friends". Were you asking for slurmdbd.conf?

Best,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #4)
> I've attached our slurm.conf and slurmctld.log. Though, I did not know what
> you meant by "and friends". Were you asking for slurmdbd.conf?

The actual error is not getting logged. Please add this to slurm.conf:
> debugflags=Elasticsearch
and restart slurmctld. It should log the actual issue within a few minutes. Please upload the log.

ARC Admins:

Created attachment 21415 [details]
slurmctld.log with elasticsearch debug flag enabled

I've attached the slurmctld.log that had the Elasticsearch debug flag enabled.

Thanks,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #7)
> I've attached the slurmctld.log that had the Elasticsearch debug flag
> enabled.

I was hoping for a slightly more verbose error:
> HTTP status code 405 received from http://10.242.11.36:9200
which is
> 405 Method Not Allowed

Looks like the elasticsearch server is unhappy. Usually at this point, I will activate the debug logging in ES:
> curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{"transient":{"logger._root":"DEBUG"}}'

Please attach the ES logs. Please also call (and correct the URL as needed):
> curl "http://localhost:9200/_cluster/health"

ARC Admins:

Hi Nate,

Apologies for the delay. We found the source of the issue while investigating our network and our proxy server. We thought our proxy might have been interfering, but we determined that was not the case. We then did another review of our slurm.conf settings for jobcomp/elasticsearch and found that we had not specified the index and type in JobCompLoc. We had overlooked the JobCompLoc URL endpoint change in the 20.11.8 release notes, so our ES cluster did not know which index to put the jobcomp data into. Once we appended "slurm/jobcomp" to our existing <host>:<port> value in JobCompLoc, our ES cluster began to index the jobcomp data correctly, and we now see that data in the slurm index in ES. The jobcomp/elasticsearch errors from our slurmctld server have also ceased.

I think this bug can be closed. Thank you for your time!

Best,
Caleb

Nate Rini:

(In reply to ARCTS Admins from comment #9)
> I think this bug can be closed. Thank you for your time!

Closing per the last comment.
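For reference, the fix described above amounts to a slurm.conf change along these lines. This is a sketch: the host and port are taken from the log excerpt in this thread and stand in for the site's actual Elasticsearch endpoint, and "slurm/jobcomp" is the index/type path the reporter appended.

```
# jobcomp/elasticsearch configuration sketch (illustrative host:port)
JobCompType=jobcomp/elasticsearch

# Before (no index in the URL -- ES rejected the requests, and Slurm's
# send queue grew toward its 1000000-entry limit):
#JobCompLoc=http://10.242.11.36:9200

# After (index and type appended, matching the 20.11 URL endpoint change):
JobCompLoc=http://10.242.11.36:9200/slurm/jobcomp
```

After restarting slurmctld, one way to confirm documents are arriving is to query the index document count, e.g. `curl "http://10.242.11.36:9200/slurm/_count"` (again with the URL corrected for the actual ES host).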