Ticket 2180 - slurm_script won't run due to execve Permission denied
Summary: slurm_script won't run due to execve Permission denied
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 15.08.3
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
Reported: 2015-11-23 00:47 MST by Josko Plazonic
Modified: 2016-01-27 10:59 MST
Site: Princeton (PICSciE)


Description Josko Plazonic 2015-11-23 00:47:35 MST
We have been having a frequent problem where job/slurm_script won't start due to error:

execve(): /var/spool/slurmd/job1231290/slurm_script: Permission denied

It seems to come down to ypbind and the order in which slurmd and ypbind are started. If ypbind starts first (or, I think, if it gets restarted after slurmd has been running), then slurmd has the above problem.

Our fix has been to restart slurmd when that happens, but as we cannot detect it until jobs start failing, this is rather upsetting for our users.

I have straces of slurmd and slurmd debug logs that I can supply. I'll take another look at them, but at first glance the job directory and slurm_script seem to have the correct permissions.
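
For anyone chasing a similar execve() "Permission denied", one thing worth ruling out is a missing search (execute) bit on a parent directory rather than on the script itself. The following is only a rough sketch of that check: the function names are made up for illustration, and it looks at classic mode bits only, so ACLs, noexec mounts, and security modules such as SELinux are not covered.

```python
import os
import stat


def has_execute_bit(st, uid, gids):
    """Approximate the classic owner/group/other check for the x bit."""
    if uid == 0:
        # root can traverse any directory, but a regular file still needs
        # at least one execute bit set before it can be executed.
        return stat.S_ISDIR(st.st_mode) or bool(st.st_mode & 0o111)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IXGRP)
    return bool(st.st_mode & stat.S_IXOTH)


def blocked_components(path, uid, gids):
    """List every component of path that lacks the x bit for uid/gids.

    Directories need the bit for traversal, the final file for execution.
    """
    blocked = []
    current = os.sep
    components = [current]
    for part in os.path.abspath(path).split(os.sep):
        if part:
            current = os.path.join(current, part)
            components.append(current)
    for component in components:
        if not has_execute_bit(os.stat(component), uid, gids):
            blocked.append(component)
    return blocked
```

Run as root on the compute node with the job owner's uid and group ids, something like blocked_components("/var/spool/slurmd/job1231290/slurm_script", uid, gids) would list the spool directory if it were the component denying access.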
Comment 1 David Bigagli 2015-11-23 00:51:41 MST
Yes, please send the debug data along. Why should ypbind matter when it comes
to permissions? Because it does not know who the user is?

David
Comment 2 Josko Plazonic 2015-11-23 01:33:56 MST
I have it at

http://tigress-web.princeton.edu/~plazonic/execve_trouble.tar.bz2 

There is a strace (abcd.PID), a.log showing ls -laR at the time the job started, and the rest should be obvious.
Comment 3 Josko Plazonic 2015-11-23 01:56:45 MST
Hm, hold on - don't look at it just yet - might be a local misconfig
Comment 4 David Bigagli 2015-11-23 01:58:51 MST
Ok

Comment 5 Josko Plazonic 2015-11-23 02:06:07 MST
Indeed, it looks like it might be us: /var/spool/slurmd had the wrong permissions, which a slurmd restart fixed, though it is still puzzling that it needed two restarts to get it fixed. I.e., after the update the permissions were wrong, then slurmd is supposed to get restarted (and it did) but

Sorry, sometimes one has to describe a problem to someone else to see the obvious :(
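
Since the failure only showed up once jobs started dying, a periodic check of the spool directory could catch this kind of drift earlier. A minimal sketch follows, assuming the directory should be mode 0755 and owned by root. Both values are assumptions for illustration, not something stated in this ticket, and the function name is hypothetical.

```python
import os
import stat


def spool_dir_problems(path, expected_mode=0o755, expected_uid=0):
    """Return a list of problems found with the directory (empty if none).

    expected_mode and expected_uid are assumed defaults; adjust per site.
    """
    problems = []
    st = os.stat(path)
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path} is not a directory")
    mode = stat.S_IMODE(st.st_mode)
    if mode != expected_mode:
        problems.append(f"{path} has mode {mode:o}, expected {expected_mode:o}")
    if st.st_uid != expected_uid:
        problems.append(f"{path} is owned by uid {st.st_uid}, expected {expected_uid}")
    return problems
```

Calling spool_dir_problems("/var/spool/slurmd") from a node health check and alerting on a non-empty result would flag the node before any job hits the execve() error.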
Comment 6 David Bigagli 2015-11-23 02:08:32 MST
Of course... and the more one talks, the more one thinks... oops :-)

David
Comment 7 David Bigagli 2015-11-23 19:12:35 MST
Local configuration issue.

David