Ticket 7153

Summary: Elasticsearch plugin not working
Product: Slurm
Reporter: Josko Plazonic <plazonic>
Component: slurmctld
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 18.08.7
Hardware: Linux
OS: Linux
Site: Princeton (PICSciE)
Version Fixed: 19.05.1 20.02.0pre1

Description Josko Plazonic 2019-05-30 13:47:39 MDT
Hullo,

The elasticsearch plugin does not work for us at this time. It worked in 17.11, but with 18.08 we get this in the logs:

[2019-05-30T15:40:21.414] error: jobcomp/elasticsearch: JobCompLoc parameter not configured

though this is defined:

[root@tiger2-slurm slurm]# scontrol show config | grep -i JobComp
JobCompHost             = localhost
JobCompLoc              = http://hood.sn2907:9200
JobCompPort             = 0
JobCompType             = jobcomp/elasticsearch
JobCompUser             = root

[root@tiger2-slurm slurm]# grep JobComp /etc/slurm/slurm.conf 
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://hood.sn2907:9200

Any hints as to why this is happening? I looked at the plugin and it is not clear to me that slurm_jobcomp_set_location is ever called to actually set log_url (compare with jobcomp_mysql.c).

Thanks!

Josko
Comment 1 Alejandro Sanchez 2019-05-31 05:36:41 MDT
Hi Josko,

I've just tried locally with Slurm 18.08 HEAD and the plugin worked. 

Have you changed JobCompType from jobcomp/none (or any other plugin) to jobcomp/elasticsearch and then just run a reconfigure? Note that changing the job completion plugin requires a restart.
Comment 2 Josko Plazonic 2019-05-31 09:52:53 MDT
These configs have been like that for months now, through numerous restarts. After all, they show up in scontrol show config, so they are clearly there...
Comment 3 Alejandro Sanchez 2019-05-31 10:01:00 MDT
Is the error logged only once at ctld startup, or once for every finished job?
Comment 4 Josko Plazonic 2019-05-31 10:13:04 MDT
This is from the last restart, from startup until it stops complaining:

[2019-05-30T15:40:18.557] error: chdir(/var/log/slurm): Permission denied
[2019-05-30T15:40:21.144] error: jobcomp/elasticsearch: JobCompLoc parameter not configured
[2019-05-30T15:40:21.144] error: jobcomp/elasticsearch: JobCompLoc parameter not configured
......
[2019-05-30T15:40:21.414] error: jobcomp/elasticsearch: JobCompLoc parameter not configured
[2019-05-30T15:40:21.414] error: jobcomp/elasticsearch: JobCompLoc parameter not configured
[2019-05-30T15:40:21.414] error: jobcomp/elasticsearch: JobCompLoc parameter not configured

(There are 46182 lines like the above.) After that it never complains again until the next restart. Not sure why 46k lines; it's not like we allow that many jobs in the queue (the max is 10k).
Comment 5 Alejandro Sanchez 2019-05-31 10:24:04 MDT
And after it stops complaining, are the jobs indexed to the server?

The error can be logged from within two spots in the plugin:

1. From the _index_job function, which is called for every job in the plugin's internal list, in a separate thread created from the init function.

2. From slurm_jobcomp_set_location (responsible for setting log_url as specified in the API). This is called from g_slurm_jobcomp_init when reading the configuration at ctld start, on reconfig, or when running the backup. It's also called when using 'sacct -c', which isn't intended for use with this plugin.

If the jobs are properly indexed after the errors stop, what I suspect is happening is that the separate thread tries to index jobs loaded from the saved state before slurm_jobcomp_set_location has set log_url.
Comment 6 Josko Plazonic 2019-05-31 12:08:28 MDT
Aha, you are right: it is attempting to index, but there is a problem on the other side (access denied to the relevant index) which prevented new data from being logged. So, a partial apology: we saw 40k lines of errors and assumed this was the reason. It would still be good to get that fixed too...
Thanks!
Comment 7 Alejandro Sanchez 2019-05-31 12:39:25 MDT
If access to the server were denied I'd expect a different error. I'm not sure I understand what was wrong on your side. Are jobs logged now, with no more instances of the error even after a restart?
Comment 8 Alejandro Sanchez 2019-05-31 12:41:06 MDT
(Note there are additional plugin-specific logs if you temporarily enable the Elasticsearch debug flag.)
Comment 9 Josko Plazonic 2019-05-31 13:08:30 MDT
I still see errors at restart (the same errors, though this time 101k lines) even though I can see communication going to Elasticsearch. It seems to be due to cached jobs that could not be sent earlier (/var/spool/slurmctld/elasticsearch is over a gig in size). Now that I have fixed the Elasticsearch side, it is sending data that it could not before, and it sure seems like some of this is attempted before everything is properly set up, hence the errors.
Comment 10 Alejandro Sanchez 2019-06-03 08:32:09 MDT
Hi Josko,

I was able to reproduce the one-time error spike at ctld startup by placing a sleep call at a specific point (between the init() and *set_location() functions).

Besides the errors being logged once at startup:

1. Are jobs now indexed properly after you fixed the server side of the problem?

2. Is the size of elasticsearch_state reduced?

I'm going to trigger the review process for a patch to fix this minor harmless race.
Comment 12 Josko Plazonic 2019-06-06 08:54:58 MDT
Hullo,

it's definitely much better: elasticsearch_state decreased in size (after the next restart). I don't know if this is the right spot to remark on this, but it looks like some of the fields need extra escaping/data sanitization. We have a few more fields that keep failing. This is from tcpdump, so excuse the slightly messy output:

{"jobid":1909896,"username":"shuow","user_id":118666,"groupname":"molbio","group_id":30054,"@start":"2018-12-12T04:14:31","@end":"2018-12-12T04:14:31","elapsed":0,"partition":"donia","alloc_node":".","nodes":"(null)","total_cpus":0,"total_nodes":0,"derived_ec":"0:0","exit_code":"0:0","state":"CANCELLED","cpu_hours":0.000000,"pack_job_id":0,"pack_job_offset":0,"@submit":"2018-12-12T03:39:38","@eligible":"2018-12-12T03:39:38","queue_wait":2093,"work_dir":"/tigress/DONIA/data/HMP_bowtie/quantification/oral/SRS011247.fasta","std_in":"/dev/null","cluster":"tiger2","qos":"tiger-donia","ntasks":0,"ntasks_per_node":0,"cpus_per_task":4,"excluded_nodes":"tiger-h19d1","time_limit":86400,"job_name":"bowtie2_SRS019327.cmd","account":"molbio","script":"#!\/bin\/bash\n\n#SBATCH -p donia\n#SBATCH --exclude=tiger-h19d1\n#SBATCH -N 1\n#SBATCH -c 4\n#SBATCH -t 23:59:59\n#SBATCH --mem=32000\n#SBATCH -D \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\n\nbowtie2 --end-to-end --sensitive -p 4 -x \/tigress\/DONIA\/data\/HMP_bowtie\/oral\/SRS011247.fasta -1 \/tigress\/DONIA\/data\/HMP_DATA\/healthy\/published\/tongue_dorsum\/SRS019327\/SRS019327-filtered-reads-fastq\/SRS019327-good_1.fastq -2 \/tigress\/DONIA\/data\/HMP_DATA\/healthy\/published\/tongue_dorsum\/SRS019327\/SRS019327-filtered-reads-fastq\/SRS019327-good_2.fastq -U \/tigress\/DONIA\/data\/HMP_DATA\/healthy\/published\/tongue_dorsum\/SRS019327\/SRS019327-filtered-reads-fastq\/SRS019327-good_1_singletons.fastq,\/tigress\/DONIA\/data\/HMP_DATA\/healthy\/published\/tongue_dorsum\/SRS019327\/SRS019327-filtered-reads-fastq\/SRS019327-good_2_singletons.fastq 2> \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\/SRS019327.err| samtools view -u -S -@ 3 | samtools sort -l 0 -@ 3 > \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\/SRS019327.bam\nsamtools index -b \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\/SRS019327.bam\nsamtools 
idxstats \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\/SRS019327.bam > \/tigress\/DONIA\/data\/HMP_bowtie\/quantification\/oral\/SRS011247.fasta\/SRS019327.tsv\n","parent_accounts":"/root/molbio/molbio"}

HTTP/1.1 400 Bad Request
content-type: application/json; charset=UTF-8
content-length: 485

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse [alloc_node]"}],"type":"mapper_parsing_exception","reason":"failed to parse [alloc_node]","caused_by":{"type":"json_parse_exception","reason":"Illegal unquoted character ((CTRL-CHAR, code 11)): has to be escaped using backslash to be included in string value\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@16755510; line: 1, column: 199]"}},"status":400}

It looks like alloc_node is "." for cancelled jobs that never ran, and Elasticsearch hates that.

Thanks again.

Josko
Comment 13 Alejandro Sanchez 2019-06-06 10:39:11 MDT
Yes, what you mention is already discussed in ticket 6859. I provided a workaround patch there that is only suited to that specific site and state combination, and should only fix their specific state.

For the master branch (future 20.02) I want to refactor part of the plugin to use the json-c library, offloading the [de]serialization of all the fields (not just the script) to that library, plus other improvements I have in mind. In the meantime, I don't plan to create more site-specific patches to work around states with bad or unescaped characters for every site. Removing the bad elasticsearch_state and restarting is probably the best short-term option if the info about those jobs isn't that relevant.

There's also a separate older known issue which is bug 2265 (sanitization of user-provided strings).

None of these issues have to do with the minor race setting JobCompLoc with this bug though, whose patch is still pending review process.
Comment 15 Alejandro Sanchez 2019-06-20 03:44:50 MDT
Josko,

The patch has been pushed and will be available starting with 19.05.1:

https://github.com/SchedMD/slurm/commit/8105f5aefb09

Thanks for reporting.