Hi Team,

Here at DKRZ we have noticed issues with the elasticsearch job completion plugin: elasticsearch is complaining about the size of the messages. Our logs are being spammed with:

[2019-04-15T15:55:26,928][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][1] failed to execute bulk item (index) BulkShardRequest [[slurm][1]] containing [index {[slurm][jobcomp][AWohR_xPSKL63ncfLTBZ], source[n/a, actual length: [10.8kb], max length: 2kb]}]
[2019-04-15T15:55:26,942][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][3] failed to execute bulk item (index) BulkShardRequest [[slurm][3]] containing [index {[slurm][jobcomp][AWohR_xes-p2C7Ge78QZ], source[n/a, actual length: [10.8kb], max length: 2kb]}]
[2019-04-15T15:55:26,949][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][1] failed to execute bulk item (index) BulkShardRequest [[slurm][1]] containing [index {[slurm][jobcomp][AWohR_xkAlBr5lb5kHIQ], source[n/a, actual length: [10.8kb], max length: 2kb]}]

We're not sure when this started, since all the logs we have show these errors.

Configuration:

[root@ma1 slurm]# scontrol show config | grep -iE 'comp'
CompleteWait            = 40 sec
JobCompHost             = localhost
JobCompLoc              = http://10.50.64.90:9201
JobCompPort             = 0
JobCompType             = jobcomp/elasticsearch
JobCompUser             = root

Could you please advise how we would go about resolving this issue?

Regards,
Andrea
Hi,

Could you temporarily increase the debug level to at least 'debug' and enable the Elasticsearch debug flag, let some jobs finish, then reset the debug level to its original value, remove the flag and attach slurmctld.log?

$ scontrol setdebug debug
$ scontrol setdebugflags +elasticsearch

(let some jobs finish)

$ scontrol setdebugflags -elasticsearch
$ scontrol setdebug info   (or whatever you had originally)

I'd like to know whether the error is logged due to a single job index retry or for every job, and why. Thanks.
Created attachment 9917 [details] Slurmctld logs
Hi Alejandro, Thank you for your answer. Kindly find attached the slurmctld logs. Regards, Andrea
Could you stop slurmctld, copy the StateSaveLocation/elasticsearch_state file to a temp directory, then start slurmctld again and attach the file here? What version of elasticsearch are you using? I'm wondering whether there's a problem with one of the fields being malformed, or an issue with the expected elasticsearch mappings, such that the server status code response is always 400. So I'm interested in loading those jobs locally to see whether I can index them or find any common pattern among the failing ones. The logs show there are jobs still being indexed after the failure. Thanks.
Hi Alejandro, This is a production system and we can't afford to stop slurmctld for this - is there an alternative? Regards, Andrea
Regarding the elasticsearch version: elasticsearch-5.5.1-1.noarch
Then just attach a tar file with the two elasticsearch_state[.old] files. I only suggested the restart to minimize the likelihood of copying the files while slurmctld is manipulating them.
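Something along these lines should work (the path below is just an example; substitute your actual StateSaveLocation as reported by 'scontrol show config'):

$ cd /var/spool/slurmctld   # example StateSaveLocation; adjust to yours
$ tar -czf elasticsearch_state.tar.gz elasticsearch_state elasticsearch_state.old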
Created attachment 9929 [details] elasticsearch state

Hi Alejandro, Kindly find attached the required files. Regards, Andrea
Hi Alejandro, Have you had a chance to look into this? Regards, Andrea
I can reproduce the issue with the provided state. I'm looking into it.
Do you happen to have handy any job script associated with any of the jobs that failed to be indexed? Some of the jobids are 6980257 or 7000918, but they can all be found by inspecting the attached state file.
Hi Alejandro,

Unfortunately no. I would have to ask around.

Regards,
Andrea
It seems there are invalid UTF-8 characters being sent as part of the serialized JSON representation of jobs:

alex@polaris:~$ sudo grep "Invalid UTF" /var/log/elasticsearch/elasticsearch.log | sort -u
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x63
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x6e
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x74
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xa1
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xb0
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
alex@polaris:~$

Either that, or there's a bug in the FasterXML Jackson JSON parser[1]. I'm thinking about how best to proceed at this point.

[1] https://github.com/FasterXML/jackson
Hi,

Your attached elasticsearch_state contains 337 jobs that could not be indexed. I've managed to break down the reasons:

- 329 due to invalid UTF-8.

- 6 due to the script string containing the control character ACK:
  failed to parse field [script] of type [text] ... Illegal unquoted character ((CTRL-CHAR, code 6)): has to be escaped using backslash.

- 1 due to a job name containing a quotation mark:
  Unexpected character ('"' (code 34)): was expecting comma to separate Object entries.
  ... "time_limit":1800,"job_name":"icon"","account":"bm0974", ...
                                        ^

- 1 due to alloc_node containing a vertical tab:
  failed to parse field [alloc_node] of type [text]. Illegal unquoted character ((CTRL-CHAR, code 11)): has to be escaped using backslash.
  ... "partition":"prepost","alloc_node":"^K", ...
                                          ^^

That said, I have good news and bad news.

The good news is that I've created a very specific patch that solves all of these problems for your site's attached state, meaning that with the patch applied all the jobs loaded from the attached state are successfully indexed. To do so, I've written a function to convert from ISO-8859-1 character encoding to UTF-8 (I suspected some strings contained German-specific characters) using the iconv() family of functions. I also created another function to strip out any control characters. Finally, I used a string substitution for the job named 'icon"'.

The bad news is that the scope of the problem isn't just this plugin; the whole Slurm code base is affected today. Bug 2265 is a known issue: Slurm currently doesn't perform aggressive sanitization of user-provided strings, which means problems can happen in many places. I admit the plugin could be improved in several ways (e.g. using the json-c library instead of manually building a JSON object for each finished job record), but if a job record reaches the plugin with unsanitized strings in the job name, the job script, and so on, edge cases can still be triggered.

On Monday I will do some more tests on the workaround patch for this specific attached state. If you want, I will attach it so that you still have the chance to index these finished jobs. But if users continue submitting scripts with non-UTF-8 characters, control characters and the like, I foresee more problems. I'll give you more details on the patch next week.

Longer term (not earlier than the future 20.02 release) we would like to tackle the unsanitized-strings problem, as well as improve how the plugin is implemented today. Until then, this is a known unresolved issue, and the most I can offer is this short-term, site-specific workaround for your state.
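For illustration only, the kind of conversion and control-character stripping described above could look roughly like the sketch below. The helper names and buffer sizing are mine for this example and are not the actual patch code:

#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert a (suspected) ISO-8859-1 string to UTF-8; caller frees the result.
 * Every Latin-1 byte maps to at most two UTF-8 bytes, hence the 2x sizing. */
static char *latin1_to_utf8(const char *in)
{
	size_t inleft = strlen(in), outleft = inleft * 2 + 1;
	char *out = calloc(outleft, 1), *inp = (char *) in, *outp = out;
	iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");

	if (!out || (cd == (iconv_t) -1)) {
		if (cd != (iconv_t) -1)
			iconv_close(cd);
		free(out);
		return NULL;
	}
	if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
		free(out);
		out = NULL;
	}
	iconv_close(cd);
	return out;
}

/* Drop ASCII control characters (keeping tab and newline) in place. */
static void strip_ctrl_chars(char *s)
{
	char *w = s;
	for (; *s; s++)
		if ((unsigned char) *s >= 0x20 || *s == '\t' || *s == '\n')
			*w++ = *s;
	*w = '\0';
}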
Created attachment 10124 [details] 6859_1711_v1

Hi, please see attached a patch to sanitize your 337 unindexed serialized jobs. In the thread processing the list of jobs pending to be indexed, once a job has failed, the patch calls a function _soft_sanitize() to address the problems specific to the 337 jobs in the attached state. Once the serialized strings are amended, the thread retries indexing them after INDEX_RETRY_INTERVAL (30 seconds), which should then succeed (it does for me).

After that, I would remove the patch. Otherwise, the code will try to _soft_sanitize() any future jobs that fail to be indexed (including those that couldn't be indexed for other reasons, e.g. the elasticsearch server being down), and we don't want that. So it is intended as a one-time thing to index this state.

The patch only changes this file:

src/plugins/jobcomp/elasticsearch/jobcomp_elasticsearch.c

so applying/removing the patch will require recompiling with 'make' && 'make install' against the Makefile in /path/to/slurm/build/src/plugins/jobcomp/elasticsearch/Makefile.

Please let me know how it goes. Thanks.
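On the json-c idea mentioned in my previous comment: a library-based serializer escapes embedded quotes and control characters automatically (it would not, however, repair invalid UTF-8 byte sequences). A minimal illustrative sketch, not part of the attached patch:

#include <json-c/json.h>
#include <stdio.h>

int main(void)
{
	json_object *job = json_object_new_object();

	/* Two of the problematic values from your state: an embedded quote
	 * in the job name and a vertical tab (CTRL-CHAR, code 11). */
	json_object_object_add(job, "job_name", json_object_new_string("icon\""));
	json_object_object_add(job, "alloc_node", json_object_new_string("\x0b"));

	/* Both values come out properly escaped, e.g. "icon\"" and "\u000b". */
	printf("%s\n", json_object_to_json_string(job));

	json_object_put(job);
	return 0;
}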
(In reply to Alejandro Sanchez from comment #18) > so applying/removing the patch will require recompiling ... (and restart slurmctld after doing so).
Hi. Any updates on this? Thanks.
Hi Alejandro, We have recompiled the Slurm packages with your patch and will be installing them during a downtime window happening in June. Thank you for your help with this case. Will this patch be included in the master branch? Regards, Andrea
(In reply to Andrea from comment #21)
> Hi Alejandro,
>
> We have recompiled the Slurm packages with your patch and will be installing
> them during a downtime window happening in June.
>
> Thank you for your help with this case.
> Will this patch be included in the master branch?

Not at all. At most I'd refactor the plugin in the master branch, but this patch is very specific to your site/state and wouldn't make sense for other sites. It doesn't even make sense for future finished jobs at your site. It should be a one-time work-around, applied to get this state indexed and then removed.

> Regards,
> Andrea
Hi Alejandro,

Thank you for your patience. Last week we performed the upgrade of the Slurm packages, but we keep seeing the following messages being spammed:

[2019-06-17T11:23:34,025][DEBUG][o.e.a.b.TransportShardBulkAction] [ma3] [slurm][3] failed to execute bulk item (index) BulkShardRequest [[slurm][3]] containing [index {[slurm][jobcomp][AWtkv7YJs-p2C7GevghS], source[{"jobid":60526,"username":"k202068","user_id":20586,"groupname":"k20200","group_id":1567,"@start":"2016-06-16T08:19:38","@end":"2016-06-16T08:19:38","elapsed":0,"partition":"compute","alloc_node":"btlogin1","nodes":"(null)","total_cpus":0,"total_nodes":0,"derived_exitcode":0,"exitcode":0,"state":"CANCELLED","cpu_hours":0.000000,"@submit":"2016-06-16T08:19:09","eligible_time":18446744073709551345,"work_dir":"/mnt/lustre/pf/k/k202068/mpi_programme","std_err":"/mnt/lustre/scratch/k/k202068/dev-shm.e%j","std_in":"/dev/null","std_out":"/mnt/lustre/scratch/k/k202068/dev-shm.o%j","cluster":"bullp","qos":"normal","ntasks":48,"ntasks_per_node":48,"cpus_per_task":1,"time_limit":3600,"account":"k20200","script":"#!/bin/bash\n#SBATCH --job-name=dev-shm \n#SBATCH --partition=temp\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=48\n#SBATCH --time=00:60:00\n#SBATCH --mail-type=FAIL\n#SBATCH --mail-user=beyer@dkrz.de\n#SBATCH --account=k20200\n#SBATCH --output=/mnt/lustre/scratch/k/k202068/dev-shm.o%j\n#SBATCH --error=/mnt/lustre/scratch/k/k202068/dev-shm.e%j\n#SBATCH --begin=now+300\n####SBATCH --reservation=k202068_59\n####SBATCH --constraint=haswell\n\nexport SESSION_TMP_DIR=\"/dev/shm/${SLURM_JOB_USER}_${SLURM_JOB_ID}\"\n\nenv | grep -i SESSION_TMP_DIR\n\nfor i in `seq 0 9`; do\n dd if=/dev/zero of=${SESSION_TMP_DIR}/dd_testfile_100M_${i} bs=1M count=100\ndone\n\nsleep 1\n\nls -la ${SESSION_TMP_DIR}\n\nsleep 1\n\nmodule load bullxmpi/bullxmpi-1.2.8.4\nmodule load intel/16.0.3\n\nsrun -l --propagate=STACK --cpu_bind=threads --distribution=block:cyclic ./heat-mpi-slow-big_5120-intel-16.03-bullxmpi-1.2.8.4\n\n\n","parent_accounts":"/root/dkrz/k20200"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to

Any idea?
Hi Andrea, can you attach your current elasticsearch_state[.old] and elasticsearch.log? Thanks
Created attachment 10618 [details] Slurm's elasticsearch state file
Created attachment 10619 [details] elasticsearch logs
Hi, Kindly find attached the required files. Regards, Andrea
Hi Andrea,

Looking at the logs:

alex@polaris:~/Downloads/new$ grep "Caused by" elasticsearch.log | sort -u
Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (18446744073709548016) out of range of long (-9223372036854775808 - 9223372036854775807)
Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (18446744073709551345) out of range of long (-9223372036854775808 - 9223372036854775807)
Caused by: java.lang.NullPointerException
alex@polaris:~/Downloads/new$

Elasticsearch specifies[1] for the long number type:

long - A signed 64-bit integer with a minimum value of -2^63 and a maximum value of 2^63-1.

This range matches the one mentioned in the logs, where there are jobs containing:

... "eligible_time":18446744073709551345 ...

These jobs were probably serialized before this commit, i.e. using a version prior to slurm-16-05-10-1:

https://github.com/SchedMD/slurm/commit/0a4d5770e9e

That commit changed eligible_time from being formatted as an unsigned long to being formatted as @eligible with a date type, and added queue_wait as a long (signed 64-bit integer).

With this command you can inspect the type of each field (mappings) for your slurm index:

alex@polaris:~/t$ curl -XGET http://localhost:9200/slurm/_mapping?pretty
{
  "slurm" : {
    "mappings" : {
      "jobcomp" : {
        "properties" : {
          "@eligible" : {
            "type" : "date"
          },
...
          "queue_wait" : {
            "type" : "long"
          },
...

I noted something strange though. The jobs that the elastic logs complain about, like

source[{"jobid":60526,"username":"k202068",..."eligible_time":18446744073709551345,..."cluster":"bullp"...

don't appear in the elasticsearch_state you attached. The elasticsearch_state you provided seems to contain jobs from cluster mistral, running a newer version of Slurm, whereas the logs show cluster bullp with the old eligible_time formatting explained above. This reasoning is also backed up by the fact that the state you attached is properly indexed with the patch I provided in this bug.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html
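Incidentally, the two out-of-range values in the log are exactly what small negative 64-bit values look like when printed as unsigned: 2^64 - 271 = 18446744073709551345 and 2^64 - 3600 = 18446744073709548016. A tiny demonstration (the -271/-3600 offsets are simply read back from the log values; I'm not asserting which code path produced them):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t a = -271, b = -3600;

	/* A negative value formatted as an unsigned 64-bit integer wraps
	 * around and produces the huge numbers seen in elasticsearch.log. */
	printf("%llu\n", (unsigned long long) a);	/* 18446744073709551345 */
	printf("%llu\n", (unsigned long long) b);	/* 18446744073709548016 */
	return 0;
}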
Hi Alejandro,

Thank you for the analysis, but then how do we move on from here? It's weird that you see a bullp cluster as well as mistral - it has always been mistral, afaik.

Kind regards,
Andrea
The problem is that the last state you attached doesn't contain the serialized jobs the logs complain about. For example, the log complains about job 60526, but the state doesn't contain information about this job. Do you have (or did you have) more than one cluster trying to index to the same elastic server? Can you attach the state containing the jobs that, according to the logs, can't be indexed? With that state I can prepare a modification of the previous patch to amend the out-of-range long issue.
Hi Alejandro,

I have attached the files you asked for, so I am not sure what else I should attach… Are there any other files that I might attach?

Kind regards,
Andrea
The plugin has an internal list (pool) of finished jobs to be indexed, of size MAX_JOBS (1000000). The ones that can't be indexed for any reason are retried every INDEX_RETRY_INTERVAL (30 seconds). When the plugin finishes execution due to slurmctld termination, that list is saved to the elasticsearch_state file.

Complaints about jobs like 60526 appear in elasticsearch.log, but no such job is in the elasticsearch_state you attached:

alex@polaris:~/Downloads/enew$ grep -c 60526 elasticsearch.log
737
alex@polaris:~/Downloads/enew$ grep -c 60526 elasticsearch_state
0
alex@polaris:~/Downloads/enew$

This means that either:

a) Job 60526 was already indexed (unlikely, but could you double-check?)

b) You have more than one cluster and attached an elasticsearch_state or elasticsearch.log from different clusters; or the ClusterName for this system changed at some point in time:

alex@polaris:~/Downloads/enew$ grep -c bullp elasticsearch.log
1494
alex@polaris:~/Downloads/enew$ grep -c mistral elasticsearch.log
0
alex@polaris:~/Downloads/enew$ grep -c bullp elasticsearch_state
0
alex@polaris:~/Downloads/enew$ grep -c mistral elasticsearch_state
337
alex@polaris:~/Downloads/enew$

c) Job 60526 wasn't saved in the elasticsearch_state for some reason (unlikely).
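For (a), one quick way to double-check is to query the index directly (assuming the index name 'slurm' and the server from your JobCompLoc; adjust host/port as needed):

$ curl -s 'http://10.50.64.90:9201/slurm/_search?q=jobid:60526&pretty'

A non-zero hits.total in the response would mean a document for that job is already there.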
Hi, is there anything else you need on this? Thanks.
Hi, I'm going to close this out, given the standalone workaround patch provided. I'm planning to refactor the plugin to address a couple of issues I've identified over time. This won't happen until 20.02 at the soonest.