We have been having a frequent problem where job/slurm_script won't start due to the error: execve(): /var/spool/slurmd/job1231290/slurm_script: Permission denied. It seems to come down to issues with ypbind and the order in which slurmd and ypbind have started. If ypbind starts first (or, I think, if it gets restarted after slurmd has been running) then slurmd has the above problem. Our fix has been to restart slurmd when that happens, but as we cannot detect it until jobs start failing, this is rather upsetting for our users. I have straces of slurmd and slurmd debug logs that I can supply. I'll take another look at them, but at first glance the job directory and slurm_script seem to have correct permissions.
Yes, please send the debug data along. Why should ypbind matter when it comes to permissions? Because it does not know who the user is? David
I have it at http://tigress-web.princeton.edu/~plazonic/execve_trouble.tar.bz2 There is a strace (abcd.PID) and a.log, which shows ls -laR at the time the job started; the rest should be obvious.
Hm, hold on - don't look at it just yet - might be a local misconfig
Ok
Indeed, it looks like it might be us: /var/spool/slurmd had the wrong permissions, which a slurmd restart fixed, though it is still puzzling that it needed two restarts to get it fixed. I.e. after the update the permissions were wrong, then slurmd was supposed to get restarted (and it did) but... Sorry, sometimes one has to describe a problem to someone else to see the obvious :(
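For anyone hitting the same symptom (execve() returns "Permission denied" even though slurm_script itself looks fine), one thing worth checking is whether every directory above the script can be traversed by the job's user. Below is a minimal sketch of such a check. The function names are hypothetical, and it only looks at the classic owner/group/other mode bits (no ACLs, no root override), so treat it as a rough diagnostic rather than a definitive test.

```python
import os
import stat


def can_search(st, uid, gids):
    """Classic owner/group/other check for the directory search (x) bit.

    Assumption: plain mode bits only; ACLs and root privileges are ignored.
    """
    if uid == st.st_uid:
        return bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IXGRP)
    return bool(st.st_mode & stat.S_IXOTH)


def blocked_ancestors(path, uid, gids):
    """List (directory, mode string) for ancestors of `path` that `uid` cannot traverse."""
    blocked = []
    current = os.path.dirname(os.path.abspath(path))
    while True:
        st = os.stat(current)
        if not can_search(st, uid, gids):
            blocked.append((current, stat.filemode(st.st_mode)))
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return blocked
```

Calling blocked_ancestors() on the failing slurm_script path, with the job owner's uid and a set of their group IDs, returns the directories that would block the exec. An empty list suggests the cause lies elsewhere.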
Of course... and the more one talks, the more one thinks... oops :-) David
Local configuration issue. David