Ticket 9122

Summary: slurmrestd sends back invalid json payload
Product: Slurm Reporter: Andrew Bruno <aebruno2>
Component: slurmrestd Assignee: Nate Rini <nate>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: nate, tim
Version: 20.02.2   
Hardware: Linux   
OS: Linux   
Site: University of Buffalo (SUNY)
Version Fixed: 20.02.4, 20.11
Attachments: system info for openstack instance
system info for bare metal machine
slurm.conf
Patch to fix corrupt json

Description Andrew Bruno 2020-05-28 14:10:13 MDT
We're interested in consuming jobs data using the slurmrestd /slurm/v0.0.35/jobs endpoint. We're finding that this endpoint consistently returns invalid JSON data. Here's an example:

     "cpu_frequency_maximum": null,
     "cpu_frequency_governor": null,
     "cpus_per_tres": "",
     "deadline": 0,
     "delay_boot": 0,
     "dependency": "",
     "derived_exit_code": 0,
   439,
     "user_name": "",
     "wckey": "",


The corruption appears at a random spot in the output each time. For example:

     "burst_buffer": "",
     "burst_buffer_state": "",
     "cluster": "ub-hpc",
     "cluster_features": "",
     "command": "\/projects\/academic\/user\/script",
     "commealse,
     "resize_time": 0,
     "restart_cnt": 0,
     "resv_name": "",
     "shared": null,
     "show_flags": [
       "SHOW_ALL",
       "SHOW_LOCAL"
     ],
     "sockets_per_board": 0,
     "sockets_per_node": null,

We will also sometimes (though not as frequently) get a valid JSON response. We have about 1.5K jobs returned from the endpoint, and the typical response body size is:

< Content-Length: 4078929

Did some testing and it doesn't appear to be related to JSON serialization, but perhaps to conmgr.c _handle_write (buffer/memory issue?).

We compiled slurm with the following:

./configure --enable-slurmrestd

And we're running using a unix socket:

slurmrestd unix:/path/to/slurm.sock

Then we test with curl and python to validate that correct json is returned:

curl -vvv -H 'Accept: application/json' --unix-socket /path/to/slurm.sock http:/slurm/v0.0.35/jobs | python -mjson.tool

Invalid control character at: line 7897 column 17 (char 219192)

Let us know if we can provide any other information.
Comment 1 Nate Rini 2020-05-28 14:32:10 MDT
(In reply to Andrew Bruno from comment #0)
> We're interested in consuming jobs data using the slurmrestd
> /slurm/v0.0.35/jobs endpoint operation. We're finding this endpoint will
> consistently return invalid json data. Here's an example:
Can you please attach the raw response from slurmrestd that contains the invalid JSON? Can you also verify how slurmrestd is being called?
Comment 2 Nate Rini 2020-05-28 14:33:49 MDT
(In reply to Nate Rini from comment #1)
> (In reply to Andrew Bruno from comment #0)
> Can you also verify how slurmrestd is being called.

Just to be more precise: is slurmrestd being called by root or a slurm user?
Comment 3 Andrew Bruno 2020-05-28 15:21:00 MDT
> Just to be more precise: is slurmrestd being called by root or a slurm user?

slurmrestd is being run by a normal user account.

$ id
uid=1000(centos) gid=1000(centos) groups=1000(centos),4(adm),10(wheel),190(systemd-journal)

$ slurmrestd unix:/home/centos/slurm.sock
lt-slurmrestd: _add_connection: [unix:/home/centos/slurm.sock] new connection input_fd=8 output_fd=8
lt-slurmrestd: _signal_change: sending
lt-slurmrestd: _watch: starting connections=0 listen=1
lt-slurmrestd: _watch: detected 1 events from event fd
lt-slurmrestd: _handle_connection: [unix:/home/centos/slurm.sock] waiting to read
lt-slurmrestd: _watch: queuing up listen
lt-slurmrestd: _listen: listeners=1
lt-slurmrestd: _listen: [unix:/home/centos/slurm.sock] listening
lt-slurmrestd: _listen: polling 3/3 file descriptors

Then from another terminal we run:

curl -vvv -H 'Accept: application/json' --unix-socket /home/centos/slurm.sock http:/slurm/v0.0.35/jobs
Comment 4 Nate Rini 2020-05-28 15:25:31 MDT
I'm going to try to replicate this locally first based on your description.

--Nate
Comment 5 Andrew Bruno 2020-05-28 15:41:03 MDT
(In reply to Nate Rini from comment #4)
> I'm going to try to replicate this locally first based on your description.
> 

Sounds good. It appears it requires a large enough job list to replicate. We tried in our dev environment with only a few jobs and couldn't reproduce.  

I'll work on getting you some sanitized raw response output. But in the meantime here's some more details.

We captured a good valid json response (good.json) and one that is invalid json (bad.json):

$ wc -l good.json bad.json 
 128863 good.json
 128153 bad.json

$ cat bad.json | grep '"job_id"' | wc -l
1043

$ cat bad.json | grep '"job_id"' | sort -u | wc -l
128


$ cat good.json | grep '"job_id"' | wc -l
1049

$ cat good.json | grep '"job_id"' | sort -u | wc -l
1049
Comment 6 Nate Rini 2020-05-28 15:57:38 MDT
(In reply to Andrew Bruno from comment #5)
> (In reply to Nate Rini from comment #4)
> > I'm going to try to replicate this locally first based on your description.
> Sounds good. It appears it requires a large enough job list to replicate. We
> tried in our dev environment with only a few jobs and couldn't reproduce.  
So far the naive replication with 1.5k jobs pending doesn't show it either.

> I'll work on getting you some sanitized raw response output. But in the
> meantime here's some more details.
Can you please provide an updated slurm.conf & friends from the cluster?
Comment 8 Nate Rini 2020-05-28 16:03:17 MDT
Can you please also provide your version of json-c? Version '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.
Comment 9 Andrew Bruno 2020-05-28 16:13:38 MDT
(In reply to Nate Rini from comment #8)
> Can you please also provide your version of json-c. Version
> '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.

We're running on centos 7.8:

Name        : json-c
Arch        : x86_64
Version     : 0.11
Release     : 4.el7_0
Size        : 64 k

However, I thought this might have been related to json-c and compiled slurm against the latest release of json-c (json-c-0.14-20200419) and still saw the same invalid json responses.
Comment 10 Nate Rini 2020-05-28 16:14:53 MDT
(In reply to Andrew Bruno from comment #0)
>    439,
>      "commealse,
should be
>      "comment": "",

Is the corruption always 3-4 characters long?
Comment 13 Andrew Bruno 2020-05-28 17:30:19 MDT
(In reply to Nate Rini from comment #10)
> (In reply to Andrew Bruno from comment #0)
> >    439,
> >      "commealse,
> should be
> >      "comment": "",
> 
> Is the corruption always 3-4 characters long?

Here's a few more examples of the corruption:

.....
     "core_spec": null,
     "thread_spec": null,
     "cores_per_socket":ray_task_id": null,
     "array_max_tasks": 0,
     "array_task_string": "",
     "association_id": 735,

.....
     "contiguous": false,
  spend_time": 0,
     "system_comment": "",
     "time_limit": 1440,


....
     "wckey": "",
     "current_working_directory": "\/gpfs\/scratch\/hpcc"
  "",
     "tres_req_str": "cpu=1,mem=2800M,node=1,billing=1",
Comment 14 Nate Rini 2020-05-28 17:37:27 MDT
Can you please provide your slurm.conf & friends and the following output:
> lscpu
> lsmem
> cat /proc/meminfo
> numactl -a -s

I'm wondering if your cpu configuration is more likely to cause a race condition. Have you seen this corruption on any other machines?
Comment 15 Andrew Bruno 2020-05-28 17:59:57 MDT
(In reply to Nate Rini from comment #14)
> Can you please provide your slurm.conf & friends and the following output:
> > lscpu
> > lsmem
> > cat /proc/meminfo
> > numactl -a -s
> 
> I'm wondering if your cpu configuration is more likely to cause a race
> condition. Have you seen this corruption on any other machines?

Yes, I can reproduce on both a cloud instance (running in OpenStack) and a bare metal machine. Attached are the system info files (sysinfo-openstack.txt and sysinfo-baremetal.txt) along with slurm.conf.

Let us know if you need anything else.
Comment 16 Andrew Bruno 2020-05-28 18:00:44 MDT
Created attachment 14434 [details]
system info for openstack instance
Comment 17 Andrew Bruno 2020-05-28 18:01:08 MDT
Created attachment 14435 [details]
system info for bare metal machine
Comment 18 Andrew Bruno 2020-05-28 18:01:39 MDT
Created attachment 14436 [details]
slurm.conf
Comment 19 Andrew Bruno 2020-05-28 20:09:18 MDT
Created attachment 14437 [details]
Patch to fix corrupt json
Comment 20 Andrew Bruno 2020-05-28 20:14:50 MDT
I was able to fix the issue (patch is attached). Narrowed it down to this line:

        memcpy(get_buf_data(con->out), (get_buf_data(con->out) + wrote),
               (get_buf_offset(con->out) - wrote));

The memory areas must not overlap when using memcpy. There may be a better approach, but switching to memmove fixed all the corrupted data I was seeing.
Comment 24 Nate Rini 2020-06-01 10:44:19 MDT
Andrew

This patch has been sent for review and inclusion upstream.

Thanks,
--Nate
Comment 26 Nate Rini 2020-06-05 13:41:09 MDT
This patch is upstream: https://github.com/SchedMD/slurm/commit/37aba351deb13f436665ca7e4cfeb80211d1fbb6

Thanks,
--Nate