Ticket 1803

Summary: uid=4294967294 when scontrol update
Product: Slurm Reporter: Akmal Madzlan <akmalm>
Component: slurmctld    Assignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: brian, da
Version: 14.11.8   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions

Description Akmal Madzlan 2015-07-13 15:10:17 MDT
[2015-07-13T13:32:17.017] _slurm_rpc_update_job complete JobId=4275962 uid=1260 usec=754
[2015-07-13T13:32:17.062] _part_access_check: uid 4294967294 access to partition teamoxford denied, bad group
[2015-07-13T13:32:17.063] _part_access_check: uid 4294967294 access to partition idle denied, bad group
[2015-07-13T13:32:17.063] _part_access_check: uid 4294967294 access to partition desktopBigMem denied, bad group
[2015-07-13T13:32:17.063] update_job: setting partition to lud54 for job_id 4275963


Any idea how Slurm got those UIDs?
One of our users tried to update his jobs using his own script/bash function.

qu() {
### Queue Update
# Reads a job listing on stdin; the job ID is expected in field 9.

if [[ -n "$1" ]]; then
    grep -v JOBID | awk -v pp="$1" '{ cmd = "scontrol update job=" substr($9,1,7) " priority=" pp " partition=teamoxford,idle,desktopBigMem,lud54"; print cmd; system(cmd) }'
elif [[ $# -eq 0 ]]; then
    grep -v JOBID | awk '{ cmd = "scontrol update job=" substr($9,1,7) " priority=500 partition=teamoxford,idle,desktopBigMem,lud54"; print cmd; system(cmd) }'
else
    echo "qu Error"
fi

}
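For context, the function only builds and runs `scontrol update` command strings from whatever listing is piped in; a minimal sketch of the command-building step, using a dummy input line (the nine-field layout is an assumption, since the ticket never shows the squeue format the user piped in) and printing instead of calling system():

```shell
# Demonstrate the command string qu's awk builds from field 9 (print only).
# The input line is a made-up example; field 9 carries the job ID here.
echo "f1 f2 f3 f4 f5 f6 f7 f8 4275963.0" |
  awk '{ print "scontrol update job=" substr($9,1,7) " priority=500" }'
```

Note that substr($9,1,7) keeps only the first seven characters, so job IDs longer than seven digits would be silently truncated.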

And some of those scontrol calls spat out the access-denied errors above. I'm unable to reproduce his issue; maybe it went away after I restarted slurmctld.
Comment 1 Moe Jette 2015-07-14 03:08:53 MDT
The UID value of 4294967294 reported by Slurm (a negative sentinel wrapped into an unsigned 32-bit uid_t) comes from the request's credential, which is generated by Munge. Munge initializes the UID and GID in its credentials with a sentinel value, which gets replaced once the credential is successfully decoded. If the decode fails, Munge should log an error.
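As an arithmetic aside, these huge UID values are just small negative numbers wrapped into an unsigned 32-bit uid_t; shell arithmetic shows the mapping:

```shell
# Negative sentinels wrapped into an unsigned 32-bit uid_t:
printf '%s\n' "$(( (1 << 32) - 1 ))"   # -1 wraps to 4294967295
printf '%s\n' "$(( (1 << 32) - 2 ))"   # -2 wraps to 4294967294, the value in the log
```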

If you look at the Munge logs on both the client and the server, you should find some indication of why there was a failure. There may also be a Munge error in slurmctld's log file before the lines you included in your first message. My best guess is that the user does not have an account on the node where slurmctld runs. They don't need login access, but the account should exist.

The default Munge log file location is "/var/log/munge/munged.log"
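A quick way to follow up on both suggestions, run on the slurmctld node (a sketch assuming the default log path; uid 1260 is the affected user's uid from the log excerpt):

```shell
# Does the affected user's account exist on this node?
uid_to_check=1260
if getent passwd "$uid_to_check" >/dev/null; then
    echo "uid $uid_to_check exists on this node"
else
    echo "uid $uid_to_check missing on this node"
fi

# Scan the default Munge log for decode errors, if it is readable here.
if [ -r /var/log/munge/munged.log ]; then
    grep -i error /var/log/munge/munged.log | tail -n 5
fi
```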
Comment 2 Moe Jette 2015-07-20 10:26:30 MDT
Were you able to determine the source of the bad Munge credential?
Comment 3 Akmal Madzlan 2015-07-20 14:13:32 MDT
I'm unable to trace the cause, and it seems this has not happened again.

Thanks Moe,
Akmal
Comment 4 David Bigagli 2015-07-20 20:46:04 MDT
What was the real id of that user?

David
Comment 5 Akmal Madzlan 2015-07-20 21:14:11 MDT
His real id is 1260

Akmal