| Summary: | uid=4294967294 when scontrol update | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Akmal Madzlan <akmalm> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | ||

The UID value of 4294967294 (or -1) reported by Slurm comes from the request's credential, which is generated by Munge. Munge initializes the UID and GID in its credentials with a value of -1. That gets changed once the credential is decoded. Munge should log an error otherwise. If you look at the Munge logs on both the client and server you should find some indication of why there was a failure. There may also be a Munge error in slurmctld's log file before the lines you included in the first message.

My best guess is that the user does not have an account on the node where slurmctld runs. They don't need login access, but the account should exist.

The default Munge log file location is "/var/log/munge/munged.log".

Were you able to determine the source of the bad Munge credential?

I'm unable to trace the cause, and it seems like this has not happened again.

Thanks Moe,
Akmal

What was the real id of that user?

David

His real id is 1260.

Akmal
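
As a side note on the number itself, the mapping between small negative sentinel values and large unsigned 32-bit UIDs can be checked with plain shell arithmetic. The sketch below is only an arithmetic aid; the function names `to_unsigned32` and `to_signed32` are made up for this example, and it says nothing about where Slurm or Munge actually set the value. It shows that -1 prints as 4294967295, while the value seen in the log, 4294967294, corresponds to -2 (0xFFFFFFFE).

```shell
# Hypothetical helpers for this thread, not part of Slurm or Munge.

# Print a (possibly negative) integer as an unsigned 32-bit value.
to_unsigned32() {
    echo $(( $1 & 0xFFFFFFFF ))
}

# Print an unsigned 32-bit value as a signed 32-bit integer.
to_signed32() {
    v=$(( $1 & 0xFFFFFFFF ))
    if [ "$v" -ge 2147483648 ]; then
        v=$(( v - 4294967296 ))
    fi
    echo "$v"
}

to_unsigned32 -1          # 4294967295
to_unsigned32 -2          # 4294967294
to_signed32 4294967294    # -2
to_signed32 1260          # 1260 (an ordinary UID is unchanged)
```

Either way, a UID this close to 2^32 in the log is a sentinel rather than a real account, which matches the explanation that the credential was never filled in with the user's actual UID.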

    [2015-07-13T13:32:17.017] _slurm_rpc_update_job complete JobId=4275962 uid=1260 usec=754
    [2015-07-13T13:32:17.062] _part_access_check: uid 4294967294 access to partition teamoxford denied, bad group
    [2015-07-13T13:32:17.063] _part_access_check: uid 4294967294 access to partition idle denied, bad group
    [2015-07-13T13:32:17.063] _part_access_check: uid 4294967294 access to partition desktopBigMem denied, bad group
    [2015-07-13T13:32:17.063] update_job: setting partition to lud54 for job_id 4275963

Any idea how Slurm got that uid?

One of our users tried to update his jobs using his own script/bash function:

    qu() { ### Queue Update
        if [[ -n "$1" ]]; then
            grep -v JOBID | awk -v pp=$1 '{ print( "scontrol update job="substr($9,1,7)" priority="pp" partition=teamoxford,idle,desktopBigMem,lud54 " ) ; system ( "scontrol update job="substr($9,1,7)" priority="pp" partition=teamoxford,idle,desktopBigMem,lud54 " ) }'
        else
            grep -v JOBID | awk '{ print( "scontrol update job="substr($9,1,7)" priority=500 partition=teamoxford,idle,desktopBigMem,lud54 " ) ; system( "scontrol update job="substr($9,1,7)" priority=500 partition=teamoxford,idle,desktopBigMem,lud54 " ) }'
        fi
        else
            echo "qu Error"
        fi
    }

Some of the scontrol calls spat out those access denied errors. I'm unable to reproduce his issue. Maybe the issue went away after I restarted slurmctld.
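
For anyone trying to reproduce this, the helper above can be sketched in a form that is easier to run one job at a time. This is a hypothetical rewrite, not the user's script: it keeps the same `scontrol update` arguments, the same default priority of 500, and the same `substr($9,1,7)` job ID extraction, but drops the unmatched trailing `else`/`fi`. The `QU_DRY_RUN` variable is an assumption added for this sketch so that the commands can be printed without being executed.

```shell
# Hypothetical rewrite of the qu() helper quoted above.
# Reads job listing lines on stdin, skips the header line containing JOBID,
# and takes the first 7 characters of field 9 as the job ID, as the original did.
qu() {
    priority="${1:-500}"    # same default priority as the original
    partitions="teamoxford,idle,desktopBigMem,lud54"

    grep -v JOBID | awk '{ print substr($9, 1, 7) }' |
    while read -r jobid; do
        # Skip lines that had no ninth field.
        [ -n "$jobid" ] || continue

        cmd="scontrol update job=$jobid priority=$priority partition=$partitions"
        echo "$cmd"

        # QU_DRY_RUN is an assumed switch for this sketch: when set,
        # the command is only printed, never executed.
        if [ -z "${QU_DRY_RUN:-}" ]; then
            $cmd </dev/null || echo "qu: update failed for job $jobid" >&2
        fi
    done
}
```

Running it with `QU_DRY_RUN=1` first shows exactly which `scontrol update` calls would be issued, and the per-job failure message makes it easier to match a denied update to the corresponding slurmctld log lines.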