Ticket 11152

Summary: JWT token questions
Product: Slurm
Reporter: Ward Poelmans <ward.poelmans>
Component: Documentation
Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
QA Contact: Ben Roberts <ben>
Severity: 4 - Minor Issue
Priority: ---
CC: cinek, nate
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: VUB

Description Ward Poelmans 2021-03-19 10:16:25 MDT
I have some questions about JWT tokens in Slurm:

- `scontrol token` has a lifespan parameter but the man page doesn't mention any details. What is the default lifespan of a token and in what units should a lifespan be given (minutes?). Can we restrict the maximum lifespan of a token?

- What does slurm expect in the JWT key? Just random data (of any length?) or something specific? Is `dd if=/dev/random of=jwt.key bs=256 count=1` good enough?

- Where should we put the key? Is it sufficient to place it on the hosts running slurmctld?

- Can we see or log how many tokens are created and by whom?

- Are JWT tokens used for anything else besides slurmrestd? I noticed some strange errors when testing jobs while the key was not present on a worker node.
Comment 1 Nate Rini 2021-03-19 10:40:40 MDT
(In reply to Ward Poelmans from comment #0)
> - `scontrol token` has a lifespan parameter but the man page doesn't mention
> any details.
I'll look at updating the docs for that.

> What is the default lifespan of a token and 
Default lifespan is 1800 seconds.

> in what units
> should a lifespan be given (minutes?).
The unit is in seconds.

> Can we restrict the maximum lifespan of a token?
Not currently. We have AuthAltParameters=disable_token_creation as an option to allow admins to provide controlled access to JWT if desired.

In bug#10634, we also provide an example of how to generate tokens outside of Slurm. It should be merged upstream shortly. I'll update once it is.

> - What does slurm expect in the JWT key? Just random data (of any length?)
> or something specific? Is `dd if=/dev/random of=jwt.key bs=256 count=1` good
> enough?

The value should be cryptographically random:
> https://www.redhat.com/en/blog/understanding-random-number-generators-and-their-limitations-linux

For most sites, I would expect /dev/random to be sufficient but this is very much a site policy decision as to the needed level of entropy.

> - Where should we put the key? Is it sufficient to place it on the hosts
> running slurmctld?

It needs to be visible to the slurmdbd and slurmctld daemons. It should not be visible to anyone or anything else (other than root).
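
One way to sanity-check this is sketched below, assuming that "not visible to anyone else" translates to no group or other permission bits on the key file. The helper name `key_is_private` is hypothetical, and ownership (which account the daemons run as) is deliberately left out because it is site-specific.

```python
import os
import stat
import tempfile


def key_is_private(path):
    """Return True if `path` grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real key.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"not a real key")
    os.chmod(tmp.name, 0o644)
    print(key_is_private(tmp.name))  # world-readable file
    os.chmod(tmp.name, 0o600)
    print(key_is_private(tmp.name))  # owner-only file
    os.remove(tmp.name)
```

A check like this could run on each host that holds the key, as part of whatever configuration management the site already uses.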
 
> - Can we see or log how many tokens are created and by whom?

slurmctld will log the creation of any token (since slurm-20.11.5):
> https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L459
 
> - Is the JWT tokens used for anything else besides slurmrestd? I noticed
> some strange errors on testing jobs when the key was not present on a worker
> node.

JWT (client) tokens are accepted for any RPC that talks to slurmdbd or slurmctld; this includes job submission and querying. Currently, anything that needs to talk to slurmstepd or slurmd (for example, srun in step mode) will reject JWT auth.
Comment 2 Ward Poelmans 2021-03-19 11:23:22 MDT
Thanks for the input.

Last question: what size should/could we use for the key? 256 bytes? 1024?

I would also add to the documentation that the key needs to be on the slurmdbd host. I don't think I saw that mentioned anywhere.
Comment 3 Nate Rini 2021-03-19 11:33:31 MDT
(In reply to Ward Poelmans from comment #2)
> Last question: what size should/could we use for the key? 256 bytes? 1024?
The key is used for HS256. Per RFC 7518:
> A key of the same size as the hash output (for instance, 256 bits for "HS256") or larger MUST be used with this algorithm. 

libjwt (and Slurm by extension) will accept any non-zero size, but I suggest following the RFC and using at least 256 bits (32 bytes). We suggest a 2048-bit key (generated with openssl) in our JWT instructions to be extra paranoid. As stated in comment #1, this really depends on your site security policies; I would expect a university to have different requirements than a national lab.
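
A minimal sketch of generating a key along these lines, assuming Python 3 is available on the host. The function name `write_jwt_key`, the file name `jwt_hs256.key`, and the default of 256 bytes (2048 bits) are illustrative choices; the only hard floor enforced is the 32-byte minimum quoted from the RFC above. Adjust size and path to site policy.

```python
import os
import secrets
import tempfile

MIN_KEY_BYTES = 32  # 256 bits, the RFC 7518 minimum for HS256


def write_jwt_key(path, size=256):
    """Write `size` cryptographically random bytes to `path`, owner-only."""
    if size < MIN_KEY_BYTES:
        raise ValueError(f"key must be at least {MIN_KEY_BYTES} bytes")
    key = secrets.token_bytes(size)
    # O_EXCL refuses to overwrite an existing key; 0o600 keeps it private.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key)
    return path


if __name__ == "__main__":
    # Write into a scratch directory so the demo never touches a real key.
    demo_dir = tempfile.mkdtemp()
    key_path = write_jwt_key(os.path.join(demo_dir, "jwt_hs256.key"))
    print(key_path, os.path.getsize(key_path), "bytes")
```

Refusing to overwrite an existing file is a deliberate choice here: replacing the key invalidates every token signed with the old one.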

> I would also add to the documentation that the key needs to be on slurmdbd
> host. I didn't see it anywhere I think?
Yes, that needs to be updated too. The documentation is an ongoing project, driven by the questions we expect and the ones we receive via bugs.
Comment 4 Nate Rini 2021-03-19 12:39:35 MDT
(In reply to Nate Rini from comment #1)
> > Can we restrict the maximum lifespan of a token?
> Not currently. We have AuthAltParameters=disable_token_creation as an option
> to allow admins to provide controlled access to JWT if desired.
> 
> In bug#10634, we also provide an example of how to generate tokens outside
> of Slurm. It should be merged upstream shortly. I'll update once it is.
This patch is now upstream:
> https://github.com/SchedMD/slurm/commit/c9e5ed775c2b5c1428f51844583fe77bd7aae3e7

It will appear in our web docs at the next point release, but you can view it directly at the link above.
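
For readers who want a feel for what generating a token outside of Slurm involves, here is a rough sketch using only the Python standard library. It is not the example from the upstream patch. The claim names (`exp`, `iat`, and `sun` for the Slurm user name) are an assumption that should be checked against the JWT documentation for your Slurm version, and `make_token` and `b64url` are hypothetical helper names. What it does show is how the HS256 signature and an explicit lifespan in seconds fit together.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data):
    """Base64url-encode bytes without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(key, username, lifespan=1800, now=None):
    """Build an HS256-signed JWT that expires `lifespan` seconds from `now`."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    # Claim names are an assumption: "sun" is taken to be the Slurm user name.
    claims = {"exp": issued + lifespan, "iat": issued, "sun": username}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


if __name__ == "__main__":
    demo_key = b"0" * 32  # placeholder only; use the real site key in practice
    print("SLURM_JWT=" + make_token(demo_key, "someuser", lifespan=600))
```

Because the site controls the `lifespan` argument, a wrapper like this is one way to enforce a maximum token lifetime when token creation inside Slurm is disabled.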
Comment 6 Ward Poelmans 2021-03-22 03:38:11 MDT
I've got another question on slurmdbd and JWT tokens.

We have several clusters, each with its own slurmctld but one common slurmdbd instance. I'm using the same JWT key everywhere. I'm having an issue with sacct when a token is defined.

$ slurmrestd localhost:12345 &
$ export $(scontrol token)
$ curl -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT 'http://localhost:12345/slurm'
...

This works but when I now run:

$ sacct
sacct: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to myslurmdb01.mydomain.os:6819: Failed to unpack SLURM_PERSIST_INIT message
sacct: error: Sending PersistInit msg: No error
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
sacct: error: g_slurm_auth_pack: protocol_version 6500 not supported
sacct: error: slurm_send_node_msg: g_slurm_auth_pack: REQUEST_PERSIST_INIT has  authentication error: Operation now in progress
sacct: error: slurm_persist_conn_open: failed to send persistent connection init message to myslurmdb01.mydomain.os:6819
sacct: error: Sending PersistInit msg: Protocol authentication error
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error


After `unset SLURM_JWT` it works again.
Comment 7 Ward Poelmans 2021-03-22 03:59:32 MDT
Already found it: `AuthAltTypes` was missing from slurmdbd.conf.

Now everything works again.
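
A small check for this kind of omission could look like the sketch below. It assumes the JWT plugin is enabled with the value `auth/jwt` and that the file uses the usual `Name=Value` lines with `#` comments; the function name `has_jwt_alt_auth` is made up for illustration, and the configuration text is passed in as a string so that no file path is assumed.

```python
def has_jwt_alt_auth(conf_text):
    """Return True if an uncommented AuthAltTypes line lists auth/jwt."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip().lower() == "authalttypes":
            types = [item.strip().lower() for item in value.split(",")]
            return "auth/jwt" in types
    return False


if __name__ == "__main__":
    sample = "AuthType=auth/munge\n# AuthAltTypes=auth/jwt\nDbdHost=localhost\n"
    print(has_jwt_alt_auth(sample))  # only a commented-out entry
    print(has_jwt_alt_auth(sample + "AuthAltTypes=auth/jwt\n"))
```

Running the same check against both slurm.conf and slurmdbd.conf would have flagged the mismatch described in comment 6 before sacct was ever involved.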
Comment 8 Nate Rini 2021-03-22 13:44:54 MDT
(In reply to Ward Poelmans from comment #6)
> sacct: error: Sending PersistInit msg: Protocol authentication error

Security errors are intentionally vague on clients. Always check the server logs to help debug them.
Comment 10 Nate Rini 2021-03-24 11:14:49 MDT
Ward,

Several of the questions in this ticket have now been documented, and the updates will be included in the next release (20.11.6):
> https://github.com/SchedMD/slurm/compare/dbcb8518a09875f32cd8881b8c737081b3dd796b...ba76c72b77ec69605d1147d9cd8ea2a1dcc023bd

I'm going to close this ticket but please respond if you have any more questions.

Thanks,
--Nate