| Summary: | slurmrestd sends back invalid json payload | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Andrew Bruno <aebruno2> |
| Component: | slurmrestd | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | nate, tim |
| Version: | 20.02.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Buffalo (SUNY) | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.02.4, 20.11 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | system info for openstack instance, system info for bare metal machine, slurm.conf, Patch to fix corrupt json | | |
(In reply to Andrew Bruno from comment #0)
> We're interested in consuming jobs data using the slurmrestd
> /slurm/v0.0.35/jobs endpoint operation. We're finding this endpoint will
> consistently return invalid json data. Here's an example:

Can you please attach the raw response from slurmrestd with the invalid response. Can you also verify how slurmrestd is being called.

(In reply to Nate Rini from comment #1)
> (In reply to Andrew Bruno from comment #0)
> Can you also verify how slurmrestd is being called.

Just to be more precise: is slurmrestd being called by root or a slurm user?

> Just to be more precise: is slurmrestd being called by root or a slurm user?

slurmrestd is being run by a normal user account:

```
$ id
uid=1000(centos) gid=1000(centos) groups=1000(centos),4(adm),10(wheel),190(systemd-journal)
$ slurmrestd unix:/home/centos/slurm.sock
lt-slurmrestd: _add_connection: [unix:/home/centos/slurm.sock] new connection input_fd=8 output_fd=8
lt-slurmrestd: _signal_change: sending
lt-slurmrestd: _watch: starting connections=0 listen=1
lt-slurmrestd: _watch: detected 1 events from event fd
lt-slurmrestd: _handle_connection: [unix:/home/centos/slurm.sock] waiting to read
lt-slurmrestd: _watch: queuing up listen
lt-slurmrestd: _listen: listeners=1
lt-slurmrestd: _listen: [unix:/home/centos/slurm.sock] listening
lt-slurmrestd: _listen: polling 3/3 file descriptors
```

Then from another terminal we run:

```
curl -vvv -H 'Accept: application/json' --unix-socket /home/centos/slurm.sock http:/slurm/v0.0.35/jobs
```

I'm going to try to replicate this locally first based on your description.

--Nate

(In reply to Nate Rini from comment #4)
> I'm going to try to replicate this locally first based on your description.

Sounds good. It appears to require a large enough job list to replicate. We tried in our dev environment with only a few jobs and couldn't reproduce. I'll work on getting you some sanitized raw response output, but in the meantime here are some more details.
We captured a valid json response (good.json) and an invalid one (bad.json):

```
$ wc -l good.json bad.json
  128863 good.json
  128153 bad.json
$ cat bad.json | grep '"job_id"' | wc -l
1043
$ cat bad.json | grep '"job_id"' | sort -u | wc -l
128
$ cat good.json | grep '"job_id"' | wc -l
1049
$ cat good.json | grep '"job_id"' | sort -u | wc -l
1049
```

(In reply to Andrew Bruno from comment #5)
> Sounds good. It appears it requires a large enough job list to replicate. We
> tried in our dev environment with only a few jobs and couldn't reproduce.

So far the naive replication with 1.5k jobs pending doesn't show it either.

> I'll work on getting you some sanitized raw response output. But in the
> meantime here's some more details.

Can you please provide an updated slurm.conf & friends from the cluster.

Can you please also provide your version of json-c. Version '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.

(In reply to Nate Rini from comment #8)
> Can you please also provide your version of json-c. Version
> '0.13.1+dfsg-4ubuntu0.1' has known data corruption issues.

We're running on CentOS 7.8:

```
Name    : json-c
Arch    : x86_64
Version : 0.11
Release : 4.el7_0
Size    : 64 k
```

However, I thought this might have been related to json-c, so I compiled slurm against the latest release of json-c (json-c-0.14-20200419) and still saw the same invalid json responses.

(In reply to Andrew Bruno from comment #0)
> 439,
> "commealse,

should be

> "comment": "",

Is the corruption always 3-4 characters long?

(In reply to Nate Rini from comment #10)
> Is the corruption always 3-4 characters long?

Here are a few more examples of the corruption:

```
.....
"core_spec": null,
"thread_spec": null,
"cores_per_socket":ray_task_id": null,
"array_max_tasks": 0,
"array_task_string": "",
"association_id": 735,
.....
"contiguous": false,
spend_time": 0,
"system_comment": "",
"time_limit": 1440,
....
"wckey": "",
"current_working_directory": "\/gpfs\/scratch\/hpcc" "",
"tres_req_str": "cpu=1,mem=2800M,node=1,billing=1",
```

Can you please provide your slurm.conf & friends and the following output:
> lscpu
> lsmem
> cat /proc/meminfo
> numactl -a -s
I'm wondering if your cpu configuration is more likely to cause a race condition. Have you seen this corruption on any other machines?
(In reply to Nate Rini from comment #14)
> Can you please provide your slurm.conf & friends and the following output:
> I'm wondering if your cpu configuration is more likely to cause a race
> condition. Have you seen this corruption on any other machines?

Yes, I can reproduce on both a cloud instance (running in OpenStack) and a bare metal machine. Attached are sysinfo-openstack.txt and sysinfo-baremetal.txt, along with slurm.conf. Let us know if you need anything else.

Created attachment 14434 [details]
system info for openstack instance
Created attachment 14435 [details]
system info for bare metal machine
Created attachment 14436 [details]
slurm.conf
Created attachment 14437 [details]
Patch to fix corrupt json
I was able to fix the issue (patch is attached). I narrowed it down to this line:

```c
memcpy(get_buf_data(con->out), (get_buf_data(con->out) + wrote),
       (get_buf_offset(con->out) - wrote));
```

The memory areas must not overlap when using memcpy, and here the source and destination regions can overlap whenever more than half of the buffered data is still unsent. There may be a better approach, but switching to memmove fixed all the corrupted data I was seeing.
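To illustrate the fix: the C standard makes `memcpy` undefined when source and destination overlap, because implementations are free to copy in any order or word width, whereas `memmove` behaves as if through an intermediate buffer. Below is a minimal sketch of the buffer-compaction step, assuming a plain `char *` buffer and byte counts; the names (`compact_buffer`, `buf`, `offset`, `wrote`) are illustrative, not Slurm's actual conmgr API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shift the unsent tail of an output buffer back to its start.
 * `offset` is the number of bytes currently buffered, `wrote` is
 * how many of them were just flushed to the socket.
 * Returns the new offset (bytes still pending). */
static size_t compact_buffer(char *buf, size_t offset, size_t wrote)
{
    /* The regions [buf + wrote, buf + offset) and [buf, buf + offset - wrote)
     * overlap whenever wrote < offset - wrote, so memmove is required here;
     * memcpy would be undefined behavior, which is what produced the
     * duplicated and garbled JSON fragments described in this report. */
    memmove(buf, buf + wrote, offset - wrote);
    return offset - wrote;
}
```

For example, after flushing 4 of 10 buffered bytes of `"0123456789"`, the buffer must begin with the 6 pending bytes `"456789"`; with an overlap-unsafe copy those bytes can come out duplicated or interleaved.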
Andrew

This patch has been sent for review and inclusion upstream.

Thanks,
--Nate

This patch is upstream: https://github.com/SchedMD/slurm/commit/37aba351deb13f436665ca7e4cfeb80211d1fbb6

Thanks,
--Nate
We're interested in consuming jobs data using the slurmrestd /slurm/v0.0.35/jobs endpoint operation. We're finding this endpoint will consistently return invalid json data. Here's an example:

```
"cpu_frequency_maximum": null,
"cpu_frequency_governor": null,
"cpus_per_tres": "",
"deadline": 0,
"delay_boot": 0,
"dependency": "",
"derived_exit_code": 0, 439,
"user_name": "",
"wckey": "",
```

The error will randomly appear in a different spot in the output. For example:

```
"burst_buffer": "",
"burst_buffer_state": "",
"cluster": "ub-hpc",
"cluster_features": "",
"command": "\/projects\/academic\/user\/script",
"commealse,
"resize_time": 0,
"restart_cnt": 0,
"resv_name": "",
"shared": null,
"show_flags": [ "SHOW_ALL", "SHOW_LOCAL" ],
"sockets_per_board": 0,
"sockets_per_node": null,
```

We also sometimes (though not as frequently) get a valid json response. We have about 1.5K jobs returned from the endpoint, and the typical response body size is:

```
< Content-Length: 4078929
```

Did some testing; it doesn't appear to be related to json serialization, but perhaps to conmgr.c _handle_write (a buffer/memory issue?). We compiled slurm with the following:

```
./configure --enable-slurmrestd
```

And we're running using a unix socket:

```
slurmrestd unix:/path/to/slurm.sock
```

Then we test with curl and python to validate that correct json is returned:

```
$ curl -vvv -H 'Accept: application/json' --unix-socket /path/to/slurm.sock http:/slurm/v0.0.35/jobs | python -mjson.tool
Invalid control character at: line 7897 column 17 (char 219192)
```

Let us know if we can provide any other information.