Ticket 3977 - Array tasks never start
Summary: Array tasks never start
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 3978 3998
Depends on:
Blocks:
 
Reported: 2017-07-10 13:12 MDT by Kilian Cavalotti
Modified: 2017-07-20 03:55 MDT

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmd log (518.31 KB, text/x-log)
2017-07-10 13:12 MDT, Kilian Cavalotti
Details
Full slurmd log (1.26 MB, application/x-bzip)
2017-07-12 09:25 MDT, Kilian Cavalotti
Details
Possible fix for bug 3977. (1.42 KB, patch)
2017-07-13 14:47 MDT, Tim Wickberg
Details | Diff
slurmd log with patch (17.58 KB, application/x-bzip)
2017-07-13 15:45 MDT, Kilian Cavalotti
Details

Description Kilian Cavalotti 2017-07-10 13:12:41 MDT
Created attachment 4897 [details]
slurmd log

Hi!

We're facing a weird issue with array jobs in 17.02.6. The symptom is that some tasks of an array never actually start any of the job's processes, and just linger until they hit the time limit.

That is, the array task shows up in squeue, but slurmd never starts any process on the node, so nothing exits and the task never ends.

This seems to happen only when multiple tasks of the same array are started at the same time on the same node. It is quite reproducible on {CentOS,RHEL} 7 with cgroups enabled (I "feel" like this may be a race condition with cgroups, but I don't have any solid evidence).

Simple reproducer:
-- 8< -------------------------------------------------------------------------------
$ sbatch -w sh-102-44 --array=1-8 --wrap='echo $SLURM_ARRAY_TASK_ID'
-- 8< -------------------------------------------------------------------------------

Resulting squeue and sacct output when most tasks have normally exited:
-- 8< -------------------------------------------------------------------------------
$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          459887_2    owners     wrap   kilian  R       1:22      1 sh-102-44

$ sacct -j 459887 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
459887_8           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_1           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_2           wrap     owners      ruthm          1    RUNNING      0:0 
459887_3           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_4           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_5           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_6           wrap     owners      ruthm          1  COMPLETED      0:0 
459887_7           wrap     owners      ruthm          1  COMPLETED      0:0 
-- 8< -------------------------------------------------------------------------------
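Ghost tasks in output like the above can also be picked out mechanically. A minimal sketch, assuming pipe-delimited output from `sacct -j <jobid> -X -n -P -o JobID,State` (the helper name and the sample data are hypothetical, not from the original report):

```shell
# Hypothetical helper: print array tasks still in RUNNING state, given
# pipe-delimited "JobID|State" lines on stdin.
list_running_tasks() {
    awk -F'|' '$2 == "RUNNING" { print $1 }'
}

# Usage with sample data modeled on the sacct output above:
printf '459887_1|COMPLETED\n459887_2|RUNNING\n459887_3|COMPLETED\n' \
    | list_running_tasks    # prints 459887_2
```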

The debug3 slurmd log from the node is attached, covering the period when the job was launched.
I noticed the following messages, but I don't know whether they're related:

-- 8< -------------------------------------------------------------------------------
slurmstepd[180578]: debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,2,8,10' for '/sys/fs/cgroup/cpuset/slurm/uid_215845/job_459889'
slurmstepd[180599]: debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
slurmstepd[180599]: debug2: xcgroup_get_param: unable to get parameter 'cpus' for '/sys/fs/cgroup/cpuset/slurm'
[...]
slurmstepd[180661]: error: User requested launch of zero tasks!
[...]
slurmd[178796]: error: _remove_starting_step: step 459887.4294967295 not found
slurmd[178796]: error: Error cleaning up starting_step list
-- 8< -------------------------------------------------------------------------------
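One aside on the last two log lines above: the step id 4294967295 is not a step that was ever allocated; it is the maximum unsigned 32-bit value, which Slurm uses as a sentinel step id (my reading of the log, not something confirmed in this ticket):

```shell
# 4294967295 is 2^32 - 1, i.e. 0xFFFFFFFF: a sentinel step id rather
# than a real step index.
printf '%u\n' $((0xFFFFFFFF))    # prints 4294967295
```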

Anyway, if you have a way to reproduce this and can get an idea of what's going on, it would be much appreciated.

Thanks!
-- 
Kilian
Comment 1 Dominik Bartkiewicz 2017-07-12 04:38:36 MDT
Hi

Could you attach full slurmd.log?
Did you notice any unusual messages in dmesg/syslog?
Which kernel/systemd version do you use?

Dominik
Comment 3 Kilian Cavalotti 2017-07-12 09:25:29 MDT
Created attachment 4906 [details]
Full slurmd log
Comment 4 Kilian Cavalotti 2017-07-12 09:25:58 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #1)
> Could you attach full slurmd.log?

It's huge but sure, here it is.

> Did you notice any unusual messages in dmesg/syslog?

No, syslog and dmesg are completely devoid of any warning/error.

> Which kernel/systemd version do you use?

Latest CentOS 7.3, so that's:

[root@sh-102-44 ~]# uname -r
3.10.0-514.26.2.el7.x86_64
[root@sh-102-44 ~]# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Thanks!
-- 
Kilian
Comment 6 Dominik Bartkiewicz 2017-07-13 08:22:38 MDT
Hi

This may not be related, but in some respects it looks similar to bug 3466.
Which sssd version are you using?


Dominik
Comment 7 Kilian Cavalotti 2017-07-13 09:16:03 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #6)
> Maybe this is not related but in some points this looks similar to bug 3466.
> Which sssd version are you using?

That's interesting, but I don't think it's related either: we don't see any resource-exhaustion message in any log, and the issue can be replicated with just a simple 4-task "sleep" job array on an otherwise idle 20-core machine.

We actually don't use sssd. That's by design: our management tools need full table enumeration, which sssd no longer supports. We use nslcd instead.

Cheers,
-- 
Kilian
Comment 8 Tim Wickberg 2017-07-13 13:54:54 MDT
*** Ticket 3978 has been marked as a duplicate of this ticket. ***
Comment 9 Tim Wickberg 2017-07-13 14:47:53 MDT
Created attachment 4915 [details]
Possible fix for bug 3977.

Hey guys -

I've merged bug 3978 into this one; I suspect both of you are seeing the same issue.

Can you try out the attached patch against 17.02.6 and see if it improves things? I believe this is a regression from 17.02.5; the patch restores the prior behavior and may mitigate this launch issue.

cheers,
- Tim
Comment 11 Kilian Cavalotti 2017-07-13 15:44:56 MDT
Hi Tim, 

(In reply to Tim Wickberg from comment #9)
> Can you try out the attached patch to 17.02.6 and see if this improves
> things? I believe this is a regression from 17.02.5, and this restores the
> prior behavior and I believe may mitigate this launch issue.

Thanks for the patch, but unfortunately, it doesn't seem to fix the problem: I still see "ghost" jobs with a simple:

$ sbatch -p test -w sh-101-59 --array=1-8 --wrap='echo $SLURM_ARRAY_TASK_ID'

Full log from the patched slurmd is attached, with SlurmdDebug=debug3, on a fresh node start with no other jobs running. There are 2 ghost jobs out of 8:

$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          485277_6      test     wrap   kilian  R       2:34      1 sh-101-59
          485277_8      test     wrap   kilian  R       2:34      1 sh-101-59
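For reference, the debug level used for the attached log is controlled in slurm.conf; a minimal fragment (the log file path is a common default, not taken from this ticket):

```
# slurm.conf excerpt
SlurmdDebug=debug3                   # verbose slurmd logging
SlurmdLogFile=/var/log/slurmd.log    # path varies by site
```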
Comment 12 Kilian Cavalotti 2017-07-13 15:45:21 MDT
Created attachment 4916 [details]
slurmd log with patch
Comment 13 Paul Edmon 2017-07-17 05:42:59 MDT
Just an update on this from our end: we downgraded our production cluster to 17.02.5 to minimize the impact on users. Our test cluster is still on 17.02.6 and can reproduce the problem on a single node. Let me know if you need any other info from us.
Comment 14 Dominik Bartkiewicz 2017-07-17 06:15:39 MDT
Hi

I can finally recreate this.
I will try to provide a patch as soon as possible.

Dominik
Comment 24 Dominik Bartkiewicz 2017-07-19 03:43:32 MDT
Hi

Commit https://github.com/SchedMD/slurm/commit/b40bd8d35ef85 should fix this issue.
It will be included in the next 17.02 release.
Could you confirm that this solves the problem in your environment?

Dominik
Comment 25 Kilian Cavalotti 2017-07-19 09:07:41 MDT
Hi Dominik, 

> Commit https://github.com/SchedMD/slurm/commit/b40bd8d35ef85 should fix this
> issue.
> It will be included in next 17.02 release.
> Could you confirm if this solved problem on your environment?

Just tested the patch, and I can confirm that it fixes the issue: all the jobs start and terminate correctly, and there are no ghost jobs anymore.

Great work, and thank you!

Cheers,
-- 
Kilian
Comment 26 Paul Edmon 2017-07-19 11:06:40 MDT
I can verify that it fixed the problem on our end as well. Well done!

-Paul Edmon-


Comment 27 Dominik Bartkiewicz 2017-07-20 03:28:37 MDT
Hi

I am glad to hear that.
I am closing this ticket.

Dominik
Comment 28 Dominik Bartkiewicz 2017-07-20 03:55:05 MDT
*** Ticket 3998 has been marked as a duplicate of this ticket. ***