Ticket 10344

Summary: First play with 20.11.0: Jobs hanging with reason:Prolog state
Product: Slurm Reporter: Kevin Buckley <kevin.buckley>
Component: SchedulingAssignee: Director of Support <support>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: darran.carey, felip.moll
Version: 20.11.0   
Hardware: Cray XC   
OS: Linux   
Site: Pawsey Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: SUSE Machine Name: chaos
CLE Version: 6.0 UP07 Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: slurmd log
slurm.conf
Patch from schedmd bug 8473 as applied to 20.11.0

Description Kevin Buckley 2020-12-02 20:47:15 MST
Created attachment 16938 [details]
slurmd log

It's probably something missed in our configuration for 20.11.0,
but there are a couple of oddities that are worth sending
your way.

Launching a job sees it start to run but then do nothing,
being reported as

JOBID        USER ACCOUNT                   NAME EXEC_HOST ST     REASON START_TIME       END_TIME  TIME_LEFT NODES   PRIORITY
10779    kbuckley pawsey0001      boinc-kbuckley  nid00013  R     Prolog 10:58:06         11:58:06      59:41     1      10056


After some time, the job enters a CG state, which it never leaves:

JOBID        USER ACCOUNT                   NAME EXEC_HOST ST     REASON START_TIME       END_TIME  TIME_LEFT NODES   PRIORITY
10779    kbuckley pawsey0001      boinc-kbuckley  nid00013 CG     Prolog 11:27:39         12:27:39    1:00:00     1      10076




One of the oddities seen in the log (we have it running at debug5) is

[2020-12-03T10:58:06.401] [10779.extern] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-39' for '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern'
[2020-12-03T10:58:06.401] [10779.extern] error: _file_write_content: unable to open '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern/expected_usage_in_bytes' for writing: No such file or directory
[2020-12-03T10:58:06.401] [10779.extern] debug2: xcgroup_set_param: unable to set parameter 'expected_usage_in_bytes' to '62914560000' for '/sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern'

although, while the job hangs around, a directory listing shows that the file does exist:


nid00013:~ # ls -o /sys/fs/cgroup/cpuset/slurm/uid_20480/job_10779/step_extern/
total 0
-rw-r--r-- 1 root 0 Dec  3 11:04 cgroup.clone_children
-rw-r--r-- 1 root 0 Dec  3 10:58 cgroup.procs
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.cpu_exclusive
-rw-r--r-- 1 root 0 Dec  3 10:58 cpuset.cpus
-r--r--r-- 1 root 0 Dec  3 11:04 cpuset.effective_cpus
-r--r--r-- 1 root 0 Dec  3 11:04 cpuset.effective_mems
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.expected_usage_in_bytes
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.mem_exclusive
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.mem_hardwall
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.memory_migrate
-r--r--r-- 1 root 0 Dec  3 11:04 cpuset.memory_pressure
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.memory_spread_page
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.memory_spread_slab
-rw-r--r-- 1 root 0 Dec  3 10:58 cpuset.mems
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.sched_load_balance
-rw-r--r-- 1 root 0 Dec  3 11:04 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root 0 Dec  3 10:58 notify_on_release
-rw-r--r-- 1 root 0 Dec  3 11:04 tasks

Full log from the node and config file attached.

FWIW, this was our 20.02.5 config with two changes

 -- Remove SallocDefaultCommand option.

 -- The acct_gather_energy/cray_aries plugin has been renamed to
    acct_gather_energy/pm_counters.

made based on the content of the 20.11.0 RELEASE_NOTES.
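Illustratively, the two edits amount to something like the following in slurm.conf terms (a sketch only; the removed SallocDefaultCommand value is site-specific and elided here):

```diff
 # Changes from our 20.02.5 slurm.conf, per the 20.11.0 RELEASE_NOTES
-SallocDefaultCommand=...
-AcctGatherEnergyType=acct_gather_energy/cray_aries
+AcctGatherEnergyType=acct_gather_energy/pm_counters
```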


I'm sure I must have missed something: it would be good to know what!

Kevin
Comment 1 Kevin Buckley 2020-12-02 20:47:45 MST
Created attachment 16939 [details]
slurm.conf
Comment 2 Kevin Buckley 2020-12-02 21:15:14 MST
Created attachment 16940 [details]
Patch from schedmd bug 8473 as applied to 20.11.0

I nearly forgot to say that on our TestDevSystem we have
been applying the patch associated with schedmd bug 8473;
however, because of the change in the g_job struct, we
modified that patch to take account of the changes.

The patch as applied is therefore attached.

FYI, the AE in the filename is our very own
Andrew Elwell, who wanted to try stuff out that
requires the patch.
Comment 3 Kevin Buckley 2020-12-02 22:40:23 MST
There's a suggestion that we might be seeing the same
issue as described in #10275.

I will deploy that patch and see what happens.
Comment 4 Kevin Buckley 2020-12-03 00:07:33 MST
Still seeing jobs enter a Reason:Prolog state.
Comment 5 Kevin Buckley 2020-12-03 01:25:29 MST
Backed out the 8473 patch for the InfluxDB stuff and
applied just the patch from 10275.

Have also altered


AcctGatherProfileType=acct_gather_profile/influx

back to 

AcctGatherProfileType=acct_gather_profile/none

This now appears to be working.

The suggestion is that the mods I made to the 8473 patch for the
InfluxDB stuff were correct enough to allow compilation,
but may have buggered something else.

I'd be interested to hear your thoughts, especially as to the "correctness"
of the patch attached earlier.

There have been a few too many variables in all of this to say that that's
where the blame lies, but at least we have a 20.11.0 that other
folk here can have a play with.
Comment 7 Jason Booth 2020-12-03 10:21:16 MST
Hi Kevin - I am marking this as a duplicate of bug#10275.

In regard to your patch, please keep that conversation going through bug#8473, by attaching your patch there or opening a new bug with it, so that we can keep contributions separate from support issues.


In regard to your comments about the cgroups:

This xcgroup_set_param call is incorrect on more recent cgroup versions, where the file is "cpuset.expected_usage_in_bytes".
It seems the file was previously named without the "cpuset." prefix; see https://bugs.schedmd.com/show_bug.cgi?id=3154#c15

task_cgroup_cpuset.c

#ifdef HAVE_NATIVE_CRAY
	/*
	 * On Cray systems, set the expected usage in bytes.
	 * This is used by the Cray OOM killer.
	 */
	snprintf(expected_usage, sizeof(expected_usage), "%"PRIu64,
		 (uint64_t)job->step_mem * 1024 * 1024);
	xcgroup_set_param(&step_cpuset_cg, "expected_usage_in_bytes",
			  expected_usage);
#endif

We could do a couple of tries: first with cpuset.expected_usage_in_bytes and, if that doesn't work, with the shorter form. But that too would need its own bug report, so that we do not muddle this bug up with too many disjointed issues.

*** This ticket has been marked as a duplicate of ticket 10275 ***