Ticket 24052

Summary: Missing headers and descriptive message in error handling
Product: Slurm Reporter: Rémi Palancher <remi+schedmd>
Component: slurmrestd Assignee: Nate Rini <nate>
Status: OPEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 25.11.x   
Hardware: Linux   
OS: Linux   
Version Fixed: 25.11.1

Description Rémi Palancher 2025-11-05 08:35:43 MST
Dear Slurm devs,

I'm trying out Slurm 25.11.0rc1 and noticed this change in slurmrestd error handling:

With previous versions of Slurm:

$ slurmrestd -V
slurm 24.05.3
$ curl -v --header X-SLURM-USER-TOKEN:$SLURM_JWT http://localhost:6820/slurm/v0.0.41/fail
*   Trying 127.0.0.1:6820...
* Connected to localhost (127.0.0.1) port 6820 (#0)
> GET /slurm/v0.0.41/fail HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/7.88.1
> Accept: */*
> X-SLURM-USER-TOKEN:<redacted>
> 
< HTTP/1.1 404 NOT FOUND
< Connection: Close
< Content-Length: 69
< Content-Type: text/plain
< 
* Closing connection 0
Unable find requested URL. Please view /openapi/v3 for API reference.

With Slurm 25.11.0 rc1:

$ slurmrestd -V
slurm 25.11.0-0rc1
$ curl -v --header X-SLURM-USER-TOKEN:$SLURM_JWT http://localhost:6820/slurm/v0.0.41/fail
*   Trying ::1:6820...
* connect to ::1 port 6820 failed: Connection refused
*   Trying 127.0.0.1:6820...
* Connected to localhost (127.0.0.1) port 6820 (#0)
> GET /slurm/v0.0.41/fail HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/7.76.1
> Accept: */*
> X-SLURM-USER-TOKEN:<redacted>
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 NOT FOUND
< Connection: Close
< 
HTTP/1.1 404 NOT FOUND

* Closing connection 0

The response is now missing the Content-Type and Content-Length headers, and the descriptive error message is no longer part of the response. I suspect this is a bug introduced by this recent change mentioned in the changelog: https://github.com/SchedMD/slurm/commit/c4965e98d8b553258b168ed9fa87ffc361bea1e9

Is this really the intended new slurmrestd error handling behavior? I would appreciate more details, as this has a significant impact on the development of Slurm-web.
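For a consumer such as Slurm-web, the regression above can be detected programmatically. The following is a minimal sketch (a hypothetical helper, not Slurm-web or Slurm code) that lists which of the headers present in the 24.05 error replies are absent from a response; the stub server mimics the 25.11.0rc1 behavior of a bare 404:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub mimicking the 25.11.0rc1 behavior: a 404 reply with no
# Content-Type or Content-Length header and an empty body.
class Stub(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(404, "NOT FOUND")
        self.send_header("Connection", "Close")
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def check_error_response(host, port, path):
    """Return (status, headers missing from the reply) for the headers
    a 24.05-style error response carried."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("GET", path)
    resp = conn.getresponse()
    expected = ("Content-Type", "Content-Length")
    missing = [h for h in expected if resp.getheader(h) is None]
    conn.close()
    return resp.status, missing

server = HTTPServer(("127.0.0.1", 0), Stub)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, missing = check_error_response("127.0.0.1", server.server_port,
                                       "/slurm/v0.0.41/fail")
print(status, missing)  # 404 ['Content-Type', 'Content-Length']
server.shutdown()
```

Against a 24.05 slurmrestd, `missing` would be empty; against the rc1 behavior shown above, both headers are reported absent.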
Comment 2 Nate Rini 2025-11-05 10:39:17 MST
(In reply to Rémi Palancher from comment #0)
> The response is now missing the Content-Type and Content-Length headers, and
> the descriptive error message is no longer part of the response.

Thank you for reporting this bug (functional regression).
Comment 4 Nate Rini 2025-11-05 14:03:25 MST
This regression has now been fixed:
> https://github.com/SchedMD/slurm/commit/fd96208e9fe34ce722770562bdaf3afd480bcb70

This fix also causes a few more errors to be included on rejected requests.
Comment 5 Rémi Palancher 2025-11-06 04:05:33 MST
Thank you Nate! I tested your patch successfully in my environment.

I also discovered a behavioral difference in Slurm 25.11 when the JWT is missing from the request headers:

With previous versions of Slurm:

$ slurmrestd -V
slurm 24.05.8
$ curl -v http://localhost:6820/slurm/v0.0.41/ping
* Host localhost:6820 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:6820...
* connect to ::1 port 6820 from ::1 port 42362 failed: Connection refused
*   Trying 127.0.0.1:6820...
* Connected to localhost (127.0.0.1) port 6820
* using HTTP/1.x
> GET /slurm/v0.0.41/ping HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/8.14.1
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 401 UNAUTHORIZED
< Connection: Close
< Content-Length: 22
< Content-Type: text/plain
< 
* shutting down connection #0
Authentication failure

With Slurm 25.11.0 rc1 (patched):

$ slurmrestd -V
slurm 25.11.0-0rc1
$ curl -v http://localhost:6820/slurm/v0.0.41/ping
* Host localhost:6820 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:6820...
* connect to ::1 port 6820 from ::1 port 32800 failed: Connection refused
*   Trying 127.0.0.1:6820...
* Established connection to localhost (127.0.0.1 port 6820) from 127.0.0.1 port 50946 
* using HTTP/1.x
> GET /slurm/v0.0.41/ping HTTP/1.1
> Host: localhost:6820
> User-Agent: curl/8.17.0-rc3
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 500 INTERNAL ERROR
< Connection: Close
< Content-Length: 40
< Content-Type: text/plain
< 
* Excess found writing body: excess = 117, size = 40, maxdownload = 40, bytecount = 40
* shutting down connection #0
Authentication does not apply to request

Beyond the error message, the HTTP status code changed from 401 to 500. I cannot find any mention of this change in the changelog; is this expected?
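The status code change matters because HTTP clients normally dispatch on the status class: 401 signals that credentials are needed, while 5xx signals a server fault. A hypothetical client-side dispatch (illustrative only, not Slurm-web code) makes the difference concrete:

```python
# Illustrative mapping from HTTP status code to a recovery action,
# showing why a 401 -> 500 change alters client behavior: the same
# "missing JWT" request would now trigger an operator alert instead
# of a token refresh.
def classify(status: int) -> str:
    if status == 401:
        return "reauthenticate"  # obtain or refresh the JWT, then retry
    if status == 404:
        return "bad-endpoint"    # wrong URL or unsupported API version
    if 500 <= status < 600:
        return "server-error"    # log and surface an operator alert
    return "ok" if 200 <= status < 300 else "unexpected"

print(classify(401))  # reauthenticate
print(classify(500))  # server-error
```

With the 25.11.0rc1 behavior, an unauthenticated ping would fall into the "server-error" branch rather than prompting the client to authenticate.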
Comment 10 Nate Rini 2025-11-06 08:58:17 MST
(In reply to Rémi Palancher from comment #5)
> Beyond the error message, the HTTP status code changed from 401 to 500. I
> cannot find any mention of this change in the changelog; is this expected?

I'm not able to replicate this. How is slurmrestd being run? Is it possible to get this output?
> ps -ef|grep slurmrestd
> systemctl status slurmrestd
Comment 11 Rémi Palancher 2025-11-06 09:39:03 MST
(In reply to Nate Rini from comment #10)
> I'm not able to replicate this. How is slurmrestd being run? Is it possible
> to get this output?
> > ps -ef|grep slurmrestd
> > systemctl status slurmrestd

Of course!

root@admin:~# slurmrestd -V
slurm 25.11.0-0rc1

root@admin:~# ps -ef | grep slurmrestd
slurmre+   13169       1  0 16:59 ?        00:00:00 /usr/sbin/slurmrestd -a rest_auth/jwt [::]:6820
root       17792   17694  0 17:36 pts/1    00:00:00 grep slurmrestd

root@admin:~# systemctl status slurmrestd.service 
● slurmrestd.service - Slurm REST daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmrestd.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/slurmrestd.service.d
             └─firehpc.conf
     Active: active (running) since Thu 2025-11-06 16:59:33 CET; 37min ago
 Invocation: b5d4a140821d479fb1e5159ba9665268
   Main PID: 13169 (slurmrestd)
      Tasks: 33 (limit: 19070)
     Memory: 13.8M (max: 3G, available: 2.9G, peak: 14.8M)
        CPU: 639ms
     CGroup: /system.slice/slurmrestd.service
             └─13169 /usr/sbin/slurmrestd -a rest_auth/jwt "[::]:6820"

Nov 06 17:35:43 admin.nova slurmrestd[13169]: operations_router: [localhost:6820(fd:25)] GET /slurm/v0.0.41/jobs
Nov 06 17:35:43 admin.nova slurmrestd[13169]: rest_auth/jwt: slurm_rest_auth_p_authenticate: [localhost:6820(fd:25)] attempting user_name slurm token authentication pass through
Nov 06 17:36:43 admin.nova slurmrestd[13169]: [2025-11-06T17:36:43.244] operations_router: [localhost:6820(fd:29)] GET /slurm/v0.0.41/nodes
Nov 06 17:36:43 admin.nova slurmrestd[13169]: [2025-11-06T17:36:43.244] rest_auth/jwt: slurm_rest_auth_p_authenticate: [localhost:6820(fd:29)] attempting user_name slurm token authentication pass through
Nov 06 17:36:43 admin.nova slurmrestd[13169]: operations_router: [localhost:6820(fd:29)] GET /slurm/v0.0.41/nodes
Nov 06 17:36:43 admin.nova slurmrestd[13169]: rest_auth/jwt: slurm_rest_auth_p_authenticate: [localhost:6820(fd:29)] attempting user_name slurm token authentication pass through
Nov 06 17:36:43 admin.nova slurmrestd[13169]: [2025-11-06T17:36:43.275] operations_router: [localhost:6820(fd:29)] GET /slurm/v0.0.41/jobs
Nov 06 17:36:43 admin.nova slurmrestd[13169]: [2025-11-06T17:36:43.275] rest_auth/jwt: slurm_rest_auth_p_authenticate: [localhost:6820(fd:29)] attempting user_name slurm token authentication pass through
Nov 06 17:36:43 admin.nova slurmrestd[13169]: operations_router: [localhost:6820(fd:29)] GET /slurm/v0.0.41/jobs
Nov 06 17:36:43 admin.nova slurmrestd[13169]: rest_auth/jwt: slurm_rest_auth_p_authenticate: [localhost:6820(fd:29)] attempting user_name slurm token authentication pass through

root@admin:~# systemctl cat slurmrestd.service 
# /usr/lib/systemd/system/slurmrestd.service
[Unit]
Description=Slurm REST daemon
After=network-online.target remote-fs.target slurmctld.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
EnvironmentFile=-/etc/default/slurmrestd
# slurmrestd should never run as root or the slurm user.
# Use a drop-in to change the default User and Group to site specific IDs.
User=slurmrestd
Group=slurmrestd
ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS
# Enable auth/jwt be default, comment out the line to disable it for slurmrestd
Environment=SLURM_JWT=daemon
# Listen on TCP socket by default.
Environment=SLURMRESTD_LISTEN=:6820
ExecReload=/bin/kill -HUP $MAINPID
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/slurmrestd.service.d/firehpc.conf
[Service]
# Unset vendor unit ExecStart and Environment to avoid cumulative definition
ExecStart=
Environment=
Environment="SLURM_JWT=daemon"
ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS -a rest_auth/jwt [::]:6820
DynamicUser=yes
User=slurmrestd
Group=slurmrestd
MemoryMax=3G
Restart=always
Comment 13 Nate Rini 2025-11-07 10:19:37 MST
The issue has been replicated, and I will update once it is corrected.
Comment 14 Rémi Palancher 2025-11-07 14:28:11 MST
(In reply to Nate Rini from comment #13)
> The issue has been replicated, and I will update once it is corrected.

Thank you for looking at this so carefully despite the absence of a support contract :)