Ticket 13670

Summary: scheduler service lockup after hetjob cancels
Product: Slurm Reporter: Adam <adam.munro>
Component: slurmctld Assignee: Chad Vizino <chad>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: chad, eric.peskin
Version: 21.08.2   
Hardware: Linux   
OS: Linux   
Site: Yale Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: logfile
Full slurmctld log file covering March 21st
slurm.conf
sacct -a -j <jid> for failed hets followed by slurmctld unresponsiveness
slurmctld log for March 22nd (gzipped)
scheduler syslog covering March 21-22nd (gzipped)

Description Adam 2022-03-21 15:57:17 MDT
We've observed a relationship between the slurmctld service becoming unresponsive and a "Cancelling aborted hetjob submit:" message. The slurmctld service does not fail, so our existing alerting methods don't notify us; the daemon simply stops responding to all requests.

example:

Mar 21 16:47:03 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _job_complete: JobId=23925049_185(23925323) done
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _pick_best_nodes: JobId=23925554 never runnable in partition pi_tomography
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
Mar 21 16:50:01 p2r1u25.farnam.hpc.yale.internal systemd: Started Session 235 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal systemd-logind: New session 236 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal systemd: Started Session 236 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal dbus[1473]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal dbus[1473]: [system] Successfully activated service 'org.freedesktop.problems'
Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Terminate signal (SIGINT or SIGTERM) received
Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal systemd: Stopping Slurm controller daemon...

The events after 16:50 are a root user logging in to the node to restart the service.

This has happened 3 times today, and happened once over this last weekend (>12 hour outage).
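Since the systemd unit stays "active" while the daemon is wedged, we are considering a timeout-based liveness probe for alerting. A rough sketch (the probe command and the 10-second threshold here are placeholders, not something we run today):

```shell
#!/bin/sh
# Sketch: a hung slurmctld keeps its systemd unit "active", so probe
# liveness with a client RPC under a timeout instead of unit state.
probe() {
    # "$@" is the client command to try (e.g. sinfo or sdiag)
    timeout 10 "$@" > /dev/null 2>&1
}

if probe sinfo; then
    echo "slurmctld responsive"
else
    echo "slurmctld unresponsive (probe failed or timed out)"
fi
```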

Thank you,
Adam
Comment 1 Jason Booth 2022-03-21 16:15:16 MDT
Adam, if the service is currently stuck, would you please grab a backtrace and attach it to this issue?

> gdb attach <slurmctld_pid>
> (gdb) set print pretty
> (gdb) t a a bt full
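If catching it interactively is awkward, the same trace can be collected in one shot with gdb's batch mode. A sketch (the pgrep-based PID lookup and output path are assumptions about your setup):

```shell
# Capture a full slurmctld backtrace unattended: -batch runs the -ex
# commands and exits, so this can be triggered from a watchdog script.
PID=$(pgrep -o slurmctld)
if [ -n "$PID" ]; then
    gdb -p "$PID" -batch \
        -ex 'set print pretty on' \
        -ex 'thread apply all bt full' > "/tmp/slurmctld-bt-$PID.txt" 2>&1
else
    echo "slurmctld not running" >&2
fi
```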


If client commands still work, the output of sdiag would also be helpful to attach.
Comment 2 Adam 2022-03-21 16:31:20 MDT
Not stuck at the moment: we might be able to recreate the problem on our test cluster (and if so I'll send you the gdb output). Since the impact is so large, we try to recover the service (systemctl stop slurmctld, systemctl start slurmctld) as soon as we notice, but I'll try to get more data next time. I don't know about sdiag, but a command like sinfo will time out when the system gets into this state.

Here is the current sdiag output (the service is running normally):

[ACTIVE@p2r1u25:~] sdiag
*******************************************************
sdiag output at Mon Mar 21 18:31:06 2022 (1647901866)
Data since      Mon Mar 21 17:45:05 2022 (1647899105)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 23
Jobs started:   451
Jobs completed: 428
Jobs canceled:  3
Jobs failed:    0

Job states ts:  Mon Mar 21 18:31:05 2022 (1647901865)
Jobs pending:   7082
Jobs running:   866

Main schedule statistics (microseconds):
	Last cycle:   29946
	Max cycle:    259350
	Total cycles: 70
	Mean cycle:   24696
	Mean depth cycle:  48
	Cycles per minute: 1
	Last queue length: 42

Backfilling stats
	Total backfilled jobs (since last slurm start): 39
	Total backfilled jobs (since last stats cycle start): 39
	Total backfilled heterogeneous job components: 0
	Total cycles: 18
	Last cycle when: Mon Mar 21 18:30:05 2022 (1647901805)
	Last cycle: 32836
	Max cycle:  57222
	Mean cycle: 38540
	Last depth cycle: 55
	Last depth cycle (try sched): 54
	Depth Mean: 58
	Depth Mean (try depth): 57
	Last queue length: 42
	Queue length mean: 43
	Last table size: 9
	Mean table size: 9

Latency for 1000 calls to gettimeofday(): 16 microseconds

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO                  ( 2009) count:6482   ave_time:227    total_time:1473720
	REQUEST_FED_INFO                        ( 2049) count:4444   ave_time:157    total_time:701616
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:4364   ave_time:244    total_time:1068571
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:2430   ave_time:967    total_time:2350552
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:817    ave_time:277    total_time:226596
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:687    ave_time:371    total_time:255370
	REQUEST_NODE_INFO                       ( 2007) count:475    ave_time:1078   total_time:512328
	REQUEST_STEP_COMPLETE                   ( 5016) count:475    ave_time:415    total_time:197463
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:467    ave_time:144    total_time:67271
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:459    ave_time:402    total_time:184930
	REQUEST_BATCH_SCRIPT                    ( 2051) count:443    ave_time:236    total_time:104973
	REQUEST_KILL_JOB                        ( 5032) count:371    ave_time:263    total_time:97930
	REQUEST_JOB_USER_INFO                   ( 2039) count:91     ave_time:357    total_time:32574
	REQUEST_JOB_READY                       ( 4019) count:18     ave_time:2128   total_time:38304
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:15     ave_time:3311   total_time:49667
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:12     ave_time:1432   total_time:17194
	REQUEST_JOB_INFO                        ( 2003) count:12     ave_time:5701   total_time:68417
	REQUEST_BUILD_INFO                      ( 2001) count:10     ave_time:230    total_time:2309
	REQUEST_HET_JOB_ALLOC_INFO              ( 4027) count:7      ave_time:154    total_time:1083
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:7      ave_time:282    total_time:1974
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:6      ave_time:3081   total_time:18490
	REQUEST_UPDATE_NODE                     ( 3002) count:3      ave_time:128    total_time:384
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:2      ave_time:5728   total_time:11457
	REQUEST_PING                            ( 1008) count:2      ave_time:118    total_time:236

Remote Procedure Call statistics by user
	root            (       0) count:9947   ave_time:458    total_time:4558741
	sw2247          (   16774) count:1538   ave_time:228    total_time:351044
	dz287           (   15166) count:1380   ave_time:224    total_time:310490
	yh462           (   13841) count:1065   ave_time:233    total_time:248221
	ah2428          (   18322) count:876    ave_time:221    total_time:193731
	jc3893          (   18408) count:753    ave_time:233    total_time:175860
	ngd22           (   19289) count:752    ave_time:220    total_time:166016
	ijc7            (   18826) count:740    ave_time:216    total_time:160091
	mem268          (   18956) count:698    ave_time:216    total_time:151105
	yc753           (   16259) count:591    ave_time:222    total_time:131433
	eaz29           (   19040) count:549    ave_time:209    total_time:114849
	yx44            (   10855) count:449    ave_time:236    total_time:106246
	chs69           (   19356) count:440    ave_time:200    total_time:88361
	sw926           (   15794) count:423    ave_time:239    total_time:101176
	st757           (   19342) count:412    ave_time:252    total_time:103872
	tc732           (   17107) count:309    ave_time:226    total_time:69951
	crm54           (   20003) count:278    ave_time:212    total_time:59068
	vs428           (   16267) count:260    ave_time:225    total_time:58600
	je45            (   12015) count:218    ave_time:204    total_time:44557
	taw46           (   18820) count:114    ave_time:235    total_time:26802
	ms725           (   12732) count:107    ave_time:227    total_time:24360
	tl397           (   11917) count:40     ave_time:1862   total_time:74518
	mrb93           (   13687) count:18     ave_time:766    total_time:13797
	ls2286          (   16099) count:16     ave_time:979    total_time:15677
	bc758           (   18905) count:15     ave_time:206    total_time:3095
	hl796           (   16326) count:13     ave_time:380    total_time:4950
	ap2279          (   15164) count:12     ave_time:1326   total_time:15920
	ktt8            (   14015) count:10     ave_time:399    total_time:3996
	bb926           (   19980) count:9      ave_time:3158   total_time:28423
	zl436           (   15226) count:9      ave_time:318    total_time:2869
	jg2447          (   14101) count:9      ave_time:272    total_time:2454
	yy496           (   14135) count:9      ave_time:1069   total_time:9625
	slurm           (     666) count:6      ave_time:183    total_time:1103
	ep542           (   14978) count:5      ave_time:1473   total_time:7367
	mwr8            (   13872) count:5      ave_time:5412   total_time:27060
	cz378           (   18324) count:4      ave_time:2855   total_time:11423
	cl2384          (   17600) count:4      ave_time:1231   total_time:4926
	yw599           (   15755) count:3      ave_time:266    total_time:799
	yg247           (   12914) count:3      ave_time:217    total_time:653
	cz354           (   16968) count:3      ave_time:255    total_time:765
	lut2            (   14998) count:3      ave_time:198    total_time:594
	xw379           (   14526) count:2      ave_time:4174   total_time:8349
	gpr24           (   17481) count:2      ave_time:236    total_time:472

Pending RPC statistics
	No pending RPCs
Comment 3 Adam 2022-03-22 10:23:42 MDT
We downgraded our test cluster to 21.08.2 and recreated the hetjob messages on the test cluster by submitting a hetjob requesting resources unavailable to the test environment: this test did *not* lock up the scheduler service.
Comment 4 Chad Vizino 2022-03-22 10:56:42 MDT
(In reply to Adam from comment #3)
> We downgraded our test cluster to 21.08.2 and recreated the hetjob messages
> on the test cluster by submitting a hetjob requesting resources unavailable
> to the test environment: this test did *not* lock up the scheduler service.
Could you send a full slurmctld.log for Mar 21 from your main system where this has happened? I'd like to see how the scheduler was handling the affected het jobs.
Comment 5 Adam 2022-03-22 11:37:33 MDT
I was unable to recreate the problem on our test cluster, but the problem happened again today on the production system. sdiag was responsive and returned:

[ACTIVE@p2r1u25:~] sdiag
*******************************************************
sdiag output at Tue Mar 22 13:24:10 2022 (1647969850)
Data since      Mon Mar 21 20:00:00 2022 (1647907200)
*******************************************************
Server thread count:  188
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 11768
Jobs started:   16372
Jobs completed: 14740
Jobs canceled:  65
Jobs failed:    0

Job states ts:  Tue Mar 22 13:20:44 2022 (1647969644)
Jobs pending:   1512
Jobs running:   738

Main schedule statistics (microseconds):
        Last cycle:   16512
        Max cycle:    518002
        Total cycles: 1986
        Mean cycle:   20866
        Mean depth cycle:  39
        Cycles per minute: 1
        Last queue length: 56

Backfilling stats
        Total backfilled jobs (since last slurm start): 1578
        Total backfilled jobs (since last stats cycle start): 1397
        Total backfilled heterogeneous job components: 8
        Total cycles: 417
        Last cycle when: Tue Mar 22 13:20:19 2022 (1647969619)
        Last cycle: 58523
        Max cycle:  240477
        Mean cycle: 30661
        Last depth cycle: 64
        Last depth cycle (try sched): 64
        Depth Mean: 40
        Depth Mean (try depth): 39
        Last queue length: 55
        Queue length mean: 31
        Last table size: 15
        Mean table size: 16

Latency for 1000 calls to gettimeofday(): 16 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:138027 ave_time:223    total_time:30802000
        REQUEST_FED_INFO                        ( 2049) count:91892  ave_time:155    total_time:14304400
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:90428  ave_time:304    total_time:27500792
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:63404  ave_time:521    total_time:33076247
        REQUEST_COMPLETE_PROLOG                 ( 6018) count:27486  ave_time:215    total_time:5929203
        REQUEST_STEP_COMPLETE                   ( 5016) count:18387  ave_time:443    total_time:8147613
        REQUEST_NODE_INFO                       ( 2007) count:18288  ave_time:1092   total_time:19979793
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:18210  ave_time:243    total_time:4435893
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:18044  ave_time:534    total_time:9641511
        REQUEST_BATCH_SCRIPT                    ( 2051) count:17870  ave_time:1621   total_time:28983610
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:10119  ave_time:303    total_time:3069775
        REQUEST_KILL_JOB                        ( 5032) count:9488   ave_time:273    total_time:2596344
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:1782   ave_time:2524   total_time:4498315
        REQUEST_JOB_USER_INFO                   ( 2039) count:1365   ave_time:456    total_time:622736
        REQUEST_JOB_INFO                        ( 2003) count:387    ave_time:6210   total_time:2403624
        REQUEST_JOB_READY                       ( 4019) count:382    ave_time:967    total_time:369394
        REQUEST_BUILD_INFO                      ( 2001) count:291    ave_time:229    total_time:66693
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:244    ave_time:3283   total_time:801239
        REQUEST_HET_JOB_ALLOC_INFO              ( 4027) count:155    ave_time:166    total_time:25831
        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:87     ave_time:4036   total_time:351188
        REQUEST_UPDATE_NODE                     ( 3002) count:85     ave_time:185    total_time:15748
        REQUEST_PING                            ( 1008) count:73     ave_time:140    total_time:10286
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:72     ave_time:979    total_time:70501
        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:70     ave_time:424    total_time:29732
        REQUEST_SUBMIT_BATCH_HET_JOB            ( 4028) count:5      ave_time:4470   total_time:22353
        REQUEST_UPDATE_JOB                      ( 3001) count:5      ave_time:1633   total_time:8169
        REQUEST_AUTH_TOKEN                      ( 5039) count:3      ave_time:291    total_time:874
        REQUEST_STATS_INFO                      ( 2035) count:1      ave_time:150    total_time:150
        REQUEST_RESERVATION_INFO                ( 2024) count:1      ave_time:173    total_time:173

Remote Procedure Call statistics by user
        root            (       0) count:300247 ave_time:442    total_time:132863662
        yl2428          (   18649) count:51416  ave_time:237    total_time:12235703
        dz287           (   15166) count:18729  ave_time:240    total_time:4507587
        sw2247          (   16774) count:15081  ave_time:252    total_time:3801279
        ijc7            (   18826) count:14078  ave_time:250    total_time:3524550
        nkv6            (   19314) count:9997   ave_time:279    total_time:2790950
        tl688           (   19130) count:8449   ave_time:244    total_time:2063804
        sw926           (   15794) count:6031   ave_time:251    total_time:1514632
        ah2428          (   18322) count:5229   ave_time:224    total_time:1172223
...
        sl2373          (   12560) count:3      ave_time:253    total_time:760
        aj646           (   19804) count:3      ave_time:318    total_time:956
        crw42           (   11467) count:3      ave_time:226    total_time:680
        tw626           (   19760) count:3      ave_time:456    total_time:1369
        pag4            (   10001) count:2      ave_time:2213   total_time:4427
        pl543           (   16023) count:1      ave_time:1113   total_time:1113

Pending RPC statistics
        No pending RPCs

and the GDB output:
[ACTIVE@p2r1u25:~] gdb attach 211610
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 211610
Reading symbols from /opt/slurm/21.08.2/sbin/slurmctld...done.
Reading symbols from /opt/slurm/21.08.2/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm/21.08.2/lib/slurm/libslurmfull.so
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
0x00007fb08673dccd in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64 sssd-client-1.16.5-10.el7_9.10.x86_64
(gdb) set print pretty
(gdb) t a a bt full

Thread 1 (Thread 0x7fb087939740 (LWP 211610)):
#0  0x00007fb08673dccd in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fb0873d6024 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x22d4440) at eio.c:352
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x22cadc0) at eio.c:315
        retval = 0
        pollfds = 0x22d4440
        map = 0x22b31a0
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x00000000004b720a in _slurmscriptd_mainloop () at slurmscriptd.c:837
No locals.
#4  slurmscriptd_init (argc=argc@entry=3, argv=argv@entry=0x7ffca84339a8) at slurmscriptd.c:1107
        i = <optimized out>
        ack = 0
        rc = 0
        proc_name = 0x4e86ba "slurmscriptd"
        log_prefix = 0x0
        to_slurmscriptd = {6, 7}
        to_slurmctld = {8, 9}
        __func__ = "slurmscriptd_init"
#5  0x000000000042dd6b in main (argc=3, argv=0x7ffca84339a8) at controller.c:433
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = {
          tv_sec = 1647899099, 
          tv_usec = 27688
        }
        now = {
          tv_sec = 1647899099, 
          tv_usec = 27704
        }
        stat_buf = {
          st_dev = 64768, 
          st_ino = 100851449, 
          st_nlink = 1, 
          st_mode = 33261, 
          st_uid = 0, 
          st_gid = 0, 
          __pad0 = 0, 
          st_rdev = 0, 
          st_size = 1828, 
          st_blksize = 4096, 
          st_blocks = 8, 
          st_atim = {
            tv_sec = 1647892855, 
            tv_nsec = 296568077
          }, 
          st_mtim = {
            tv_sec = 1596213326, 
            tv_nsec = 0
          }, 
          st_ctim = {
            tv_sec = 1598892426, 
            tv_nsec = 683905770
          }, 
          __unused = {0, 0, 0}
        }
        rlim = {
          rlim_cur = 18446744073709551615, 
          rlim_max = 18446744073709551615
        }
        config_write_lock = {
          conf = WRITE_LOCK, 
---Type <return> to continue, or q <return> to quit---
          job = WRITE_LOCK, 
          node = WRITE_LOCK, 
          part = WRITE_LOCK, 
          fed = NO_LOCK
        }
        prep_callbacks = {
          prolog_slurmctld = 0x490f60 <prep_prolog_slurmctld_callback>, 
          epilog_slurmctld = 0x491199 <prep_epilog_slurmctld_callback>
        }
        create_clustername_file = false
        conf_file = <optimized out>
        __func__ = "main"
(gdb) 

...I'll upload the log files.
Comment 6 Adam 2022-03-22 11:56:15 MDT
Created attachment 24006 [details]
logfile

logfile covering several lockups
Comment 7 Adam 2022-03-22 12:14:50 MDT
Hi Chad, 

For the log file:

13:21:07 - along with the hetjob message the first lock-up event occurs (the sdiag and gdb output was collected after this point and before restarting the program)
13:29:52 - slurmctld is manually stopped 
13:32:28 - slurmctld is manually started

13:33:18 - another hetjob message and another lockup
13:36:59 - slurmctld is manually stopped 
13:39:52 - slurmctld is manually started

13:40:57 - another hetjob message and another lockup
13:48:14 - slurmctld is manually stopped 
13:50:07 - slurmctld is manually started

More generally, and for every one of these:

[ACTIVE@p2r1u25:/var/log] egrep hetjob /var/log/messages
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)
Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167)

the scheduler stopped responding to commands like 'sinfo' and stopped scheduling jobs. 

Thanks,
Adam
Comment 8 Adam 2022-03-22 13:12:58 MDT
Created attachment 24007 [details]
Full slurmctld log file covering March 21st

Full slurmctld log file covering March 21st
Comment 9 Chad Vizino 2022-03-22 15:02:28 MDT
(In reply to Adam from comment #8)
> Created attachment 24007 [details]
> Full slurmctld log file covering March 21st
Thanks for what you've sent today. I'm looking it over.
Comment 10 Chad Vizino 2022-03-22 17:58:00 MDT
The "Cancelling aborted" messages come from _het_job_cancel() which is called by _slurm_rpc_submit_batch_het_job() which is called for the sbatch command (as opposed to salloc or srun (needing an allocation)). We can see the error messages that are causing the batch jobs to fail. Mostly they are "Requested node configuration is not available" but there is one from today that is "Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)" so the problem doesn't seem to be limited to one kind of failure.

>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: Cancelling aborted hetjob submit: JobId=23924851+0(23924851)
>--
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
>--
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)
>--
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
>--
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
>--
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: _slurm_rpc_submit_batch_het_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167)
What's odd about each of these is that only the het leader is listed (+0) and not the other component jobs (+1, +2, etc.). I'm not sure what to make of that yet.

I can contrive the failure by modifying the condition in my slurmctld (I'm using 21.08.2 as well) to trigger a similar failure, but it doesn't hang slurmctld:

>$ sbatch : --wrap="sleep 300"
>sbatch: error: Batch job submission failed: Requested node configuration is not available
Then in my slurmctld.log:

>[2022-03-22T17:23:35.595] _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>[2022-03-22T17:23:35.595] Cancelling aborted hetjob submit: JobId=164062+0(164062)
>[2022-03-22T17:23:35.595] Cancelling aborted hetjob submit: JobId=164062+1(164063)
As expected, I see two component het jobs cancelled. But in your log we only see one (the +0), which is odd. I wonder about some kind of corruption of the job.

Do you have any details about the failed jobs listed in your log (e.g. job script, same user)? How were you trying to reproduce the problem on your test cluster (job submission type, etc.)?

I'm not sure if these are related but there are a lot of them for the same two job ids in the log you provided (every ~15sec):

>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: prolog_complete: invalid JobId=18748205
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_complete_prolog JobId=18748205: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: prolog_complete: invalid JobId=18756999
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_complete_prolog JobId=18756999: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=18756999 uid 0
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=18748205 uid 0
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: job_str_signal(3): invalid JobId=18756999
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: job_str_signal() uid=0 JobId=18756999 sig=9 returned: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: job_str_signal(3): invalid JobId=18748205
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: job_str_signal() uid=0 JobId=18748205 sig=9 returned: Invalid job id specified
Given that the earliest job id allocated in the log is this one:

>Mar 21 00:00:02 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: sched: Allocate JobId=21938154 NodeList=c17n09 #CPUs=1 Partition=admintest
It would appear from comparing the two invalid job id numbers to this one that they are older. Do you know anything about them?

Also, would you supply your slurm.conf and prolog? Thanks.
Comment 11 Adam 2022-03-23 07:36:41 MDT
Created attachment 24020 [details]
slurm.conf

slurm.conf
Comment 12 Adam 2022-03-23 07:38:39 MDT
Slurm.conf attached. Our epilog is very short:

#!/bin/bash
# If there are no more of this user's processes on this node, clean up the directories

/usr/bin/perl /etc/slurm/slurm.epilog.supremm;

if ! pgrep -u $SLURM_JOB_UID > /dev/null
then
  # for $XDG_RUNTIME_DIR osTicket #010500 Remove /run/user/$SLURM_JOB_UID directory if no
  # processes on node
  rm -rf /run/user/$SLURM_JOB_UID
  find /dev/shm -uid $SLURM_JOB_UID -type f -delete
  find /tmp -uid $SLURM_JOB_UID -type f -delete
fi

exit 0

..and slurm.epilog.supremm is:

#!/usr/bin/perl

# Added for OS-Ticket #017703

use strict;
use Sys::Hostname;
use Date::Simple;
use POSIX qw(strftime);

my $jobid = $ENV{SLURM_JOB_ID};
my $today = Date::Simple->new;

my $logyear = $today->format("%Y");
my $logmonth = $today->format("%m");
my $logday = $today->format("%d");

my $jobdatelong = strftime "%Y%m%d.%H.%M.%S", localtime;
my $fullhost = hostname();

my $logdir = "/gpfs/ysm/SUPReMM_farnam/pmlogger/$logyear/$logmonth/$fullhost/$logyear-$logmonth-$logday";
system("env PMLOGGER_EPILOG=yes pmlogger -U pcp -c /etc/pcp/pmlogger/pmlogger-supremm.config -s 1 -l /tmp/job-$jobid-end-$jobdatelong.log $logdir/job-$jobid-end-$jobdatelong &> /dev/null");
system("pkill -u pcp -f job-$jobid-begin");


exit(0);
Comment 13 Adam 2022-03-23 07:58:10 MDT
Most of the affected jobs are definitely hetjobs, and most were submitted by one particular user. A second user submitted a (canceled) hetjob last Saturday, but curiously, two of the canceled "hetjob" jobs did not appear to be hetjobs at all.

Here are a few examples:

(clearly a hetjob case, and from the user submitting most of the jobs):

#!/bin/bash
#SBATCH --job-name=3DRefine_3M_1pt1
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=8000M
#SBATCH --time=4-00:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<REDACTED>
#SBATCH hetjob
#SBATCH --account=tomography
#SBATCH --partition=pi_tomography
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --time=4-00:00:00
#SBATCH --gres=gpu:rtx5000:4
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem-per-cpu=22G
module load RELION/3.0.8-fosscuda-2018b
#mkdir Refine3D/cluster4_7_protein
srun --pack-group=0,1 $(which relion_refine_mpi) \
		--continue Refine3D/cluster4_7_protein/run_it023_optimiser.star \
		--o Refine3D/cluster4_7_protein/run_ct23 \
		--dont_combine_weights_via_disc \
		--scratch_dir /tmp/<REDACTED> \
		--pool 50 \
		--pad 2 \
		--particle_diameter 576 \
		--solvent_mask MaskCreate/cluster4_2_protein/mask.mrc \
		--solvent_correct_fsc \
		--j 2 \
		--gpu

..and 2 cases of "how is this a hetjob?" like this one:

#!/bin/bash
#SBATCH --array 0-2733
#SBATCH --output dsq-tarv_block_anxiety_ICD_rank-%A_%4a-%N.out
#SBATCH --job-name dsq-tarv_block_anxiety_ICD_rank
#SBATCH --mem-per-cpu=35g --time=5:00:00 -p scavenge

/ysm-gpfs/apps/software/dSQ/1.05/dSQBatch.py --job-file /gpfs/ysm/scratch60/<REDACTED>/<REDACTED>/IDP/realdata/anxiety/tmp/tarv_block_anxiety_ICD_rank.txt --status-dir /gpfs/ysm/scratch60/<REDACTED>/<REDACTED>/IDP/realdata/anxiety/tmp

We did make one change since opening the ticket and the last blitz of lockups: we added whole_hetjob to SchedulerParameters. This was because it appeared as if parts of the canceled hetjobs still got started (confirmed to be running after recovering the scheduler).

On our test cluster, running the same version of Slurm and (as far as possible) the same configuration, we, like you, have not yet been able to recreate the "unresponsive but running slurmctld" condition, but we can easily recreate the same cancellations and error messages.


Please let me know if you need any additional information.
Comment 14 Chad Vizino 2022-03-23 17:50:21 MDT
I'm still going through what you supplied and do not yet have a clear understanding of why this is happening. Since you've told me that most of the het jobs are expected to have 2 components (+0 and +1) with successive job ids, I was able to pull a bit more out of the logs about some of the +1 jobs.

This helped show what you wondered about in comment 13, where a couple of the "het jobs" appeared to not actually be het jobs. It appears to be due to job id reuse, since the cancelled het job ids appear to have been reused by a batch job and, in one case, an array job (see below). This is odd and I am not sure it is expected--on my own system I can contrive the het job failure and then submit another job, but it gets the next job id (it does not reuse one of the failed ones). I want to investigate this some more, but I think it may be because slurmctld is getting killed hard (with SIGKILL). Here's one example (where a SIGKILL is sent about 90 sec after a SIGTERM):

>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
>Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Terminate signal (SIGINT or SIGTERM) received
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal systemd: slurmctld.service stop-sigterm timed out. Killing.
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal systemd: slurmctld.service: main process exited, code=killed, status=9/KILL
> ...
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: slurmctld version 21.08.2 started on cluster farnam
I wonder if slurmctld is not being given enough time to shut down so systemd is killing it which may be preventing slurmctld from writing its complete state (or sending stuff to slurmdbd). What does your slurmctld systemd unit file look like? Does it have a 90 sec timeout listed? You may want to try increasing it to let slurmctld more gracefully shut down. Maybe that will at least prevent the job id reuse issue we're seeing.
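For reference, one way to raise that timeout without editing the packaged unit file would be a systemd drop-in along these lines (the 300 sec value is only an illustrative starting point, not a tuned recommendation):

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
# (hypothetical drop-in; run "systemctl daemon-reload" after creating it)
[Service]
TimeoutStopSec=300
```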

The last log segment below lists a bogus user id. That is also very suspicious, and I want to investigate it as well since it shouldn't be there.

-----

Only leader listed as aborted:

>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _pick_best_nodes: JobId=23924852 never runnable in partition pi_tomography
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: Cancelling aborted hetjob submit: JobId=23924851+0(23924851)
>Mar 21 16:00:20 p2r1u27.farnam.hpc.yale.internal slurmctld[127413]: Recovered JobId=23924851+0(23924851) Assoc=2642
>Mar 21 16:00:20 p2r1u27.farnam.hpc.yale.internal slurmctld[127413]: Recovered JobId=23924852 Assoc=3953
>Mar 21 16:00:54 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: Recovered JobId=23924851+0(23924851) Assoc=2642
>Mar 21 16:00:54 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: Recovered JobId=23924852 Assoc=3953

Het job ids reused by array job elements:

>Mar 21 16:08:09 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: _slurm_rpc_submit_batch_job: JobId=23925049 InitPrio=5169 usec=1337
> 
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _pick_best_nodes: JobId=23925554 never runnable in partition pi_tomography
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
> 
>Mar 21 16:52:36 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Purged files for defunct batch JobId=23925554
>Mar 21 16:52:36 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Purged files for defunct batch JobId=23925553
> 
>Mar 21 17:08:15 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_314(23925554) WEXITSTATUS 0
>Mar 21 17:08:15 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_314(23925554) done
>Mar 21 17:21:58 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_313(23925553) WEXITSTATUS 0
>Mar 21 17:21:58 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_313(23925553) done

Only leader listed as aborted:

>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _pick_best_nodes: JobId=23926300 never runnable in partition pi_tomography
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)

Het job ids reused by regular batch job and array job element:

>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _pick_best_nodes: JobId=23944094 never runnable in partition pi_tomography
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
>Mar 22 13:32:34 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Purged files for defunct batch JobId=23944094
>Mar 22 13:32:34 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Purged files for defunct batch JobId=23944093
>Mar 22 13:32:35 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_job: JobId=23944093 InitPrio=11948 usec=538
>Mar 22 13:32:38 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: sched: Allocate JobId=23922086_5281(23944094) NodeList=c22n04 #CPUs=1 Partition=scavenge
>Mar 22 13:32:40 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: sched: Allocate JobId=23944093 NodeList=bigmem02 #CPUs=10 Partition=bigmem
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) StepId=extern
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) StepId=batch
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) Assoc=2562
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 StepId=extern
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 StepId=batch
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 Assoc=4297
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) StepId=extern
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) StepId=batch
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) Assoc=2562
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 StepId=extern
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 StepId=batch
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 Assoc=4297

Only leader listed as aborted:

>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _pick_best_nodes: JobId=23944153 never runnable in partition scavenge_gpu
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944152+0(23944152) Assoc=2642
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944153 Assoc=2642

Only leader listed as aborted (but then 2 components recovered--suspicious user id == NO_VAL):

>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Hetjob JobId=23944167 exceeded association/QOS limit for user 4294967294
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: _slurm_rpc_submit_batch_het_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167)
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944167+0(23944167) Assoc=2642
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944167+1(23944168) Assoc=2642
Comment 15 Chad Vizino 2022-03-23 18:15:59 MDT
(In reply to Chad Vizino from comment #14)
> I wonder if slurmctld is not being given enough time to shut down so systemd
> is killing it which may be preventing slurmctld from writing its complete
> state (or sending stuff to slurmdbd). What does your slurmctld systemd unit
> file look like? Does it have a 90 sec timeout listed? You may want to try
> increasing it to let slurmctld more gracefully shut down. Maybe that will at
> least prevent the job id reuse issue we're seeing.
I ran a test on my system: after the contrived failure of a het job, I sent a SIGKILL to slurmctld, restarted it, and saw job ids being reused, so I think that explains the reuse problem. What does your slurmctld unit file look like?

Some more questions for you:

What does "sacct -a -j <jid>" show for some of the failed het jobs?

When did this problem start happening?

Have you upgraded recently or has the het job user just started submitting these sorts of jobs?

Can you supply your slurmdbd.log from the 21st and/or 22nd?

Can you supply your prolog script?

Do you have an understanding about the "Invalid job ids" I wondered about in comment 10? Do you know what is trying to kill those every 15sec?

Are you using triggers ("strigger --get" will show any you have)?

Could you supply your syslog from the 21st and/or 22nd?
Comment 17 Adam 2022-03-24 07:13:17 MDT
Hi Chad, I'll upload the other logs/files that you've requested using other comments. 

A) For the service script the timeout value we use is the default:
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/opt/slurm/current/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=/opt/slurm/current/etc/environment
ExecStart=/opt/slurm/current/sbin/slurmctld -f /opt/slurm/current/etc/slurm.conf
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmctld.pid
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

I'd be very worried if 90 seconds wasn't a reasonable enough amount of time for slurmctld to shut down on this particular platform (Intel(R) Xeon(R) Gold 5222 CPU @ 3.80GHz, 192-GB DDR4, 80Gbps connect to service network, database resides on local nvme disk).

B) That is strange about the bogus user id. For JobId=23944167, slurmctld may have reported user 4294967294, but slurmdbd correctly recorded it as uid 12914 in the job accounting database.

C) Excellent work determining why we were seeing this problem for non-het jobs. It seems like job ids should never be reused and to do so could cause many problems. Is there a way to determine if the other hetjobs (real ones or otherwise) shown as canceled in the log files had also been cases of reused jobIDs?

D) Last Saturday at 6:07-PM is the first time that we were impacted by this problem. According to our log files hetjobs had gotten cancelled a few times before last Saturday, ..however in those cases slurmctld appears to have continued normally. We upgraded from slurm 21.08.1 to 21.08.2 on October 7th, 2021. We plan to upgrade to 21.08.6 in a few weeks.

E) I don't have the exact metrics but users have been submitting hetjobs since at least July of last year. 

F) The invalid job ID messages could relate to another less urgent problem. 

G) We are not using triggers.

Thanks,
Adam
Comment 18 Adam 2022-03-24 07:20:15 MDT
Created attachment 24041 [details]
sacct -a -j <jid> for failed hets followed by slurmctld unresponsiveness

As requested by Chad.
Comment 19 Adam 2022-03-24 07:22:34 MDT
*** prolog script ***

#!/bin/bash

# for $XDG_RUNTIME_DIR osTicket #010500
mkdir -p "/run/user/${SLURM_JOB_UID}"
chmod 0700 "/run/user/${SLURM_JOB_UID}"
chown "$SLURM_JOB_UID" "/run/user/${SLURM_JOB_UID}"

# Save job scripts in case support needs them later
scriptdir="/gpfs/ysm/slurm/job_scripts/${HOSTNAME}/$(date "+%Y%m%d")"
if [ ! -e "${scriptdir}/${SLURM_JOB_ID}" ]; then
    mkdir -p "$scriptdir"
    scontrol write batch_script "$SLURM_JOB_ID" "${scriptdir}/${SLURM_JOB_ID}" > /dev/null
    scontrol show JobId="${SLURM_JOB_ID}" > "${scriptdir}/${SLURM_JOB_ID}.info"
fi

/usr/bin/perl /etc/slurm/slurm.prolog.supremm

exit 0



*** /etc/slurm/slurm.prolog.supremm ***
#!/usr/bin/perl

use strict;
use Sys::Hostname;
use Date::Simple;
use POSIX qw(strftime setsid);

my $today = Date::Simple->new;

my $logyear = $today->format("%Y");
my $logmonth = $today->format("%m");
my $logday = $today->format("%d");

my $jobdatelong = strftime "%Y%m%d.%H.%M.%S", localtime;
my $host = ( split('\.', hostname()) )[0];
my $jobid = $ENV{SLURM_JOB_ID};

my $fullhost = hostname();

$SIG{CHLD} = 'IGNORE';        # Configure to autoreap zombies
exit(0) unless defined ( my $child = fork );  # fork, or just exit if failed
exit(0) if $child;      # Main prolog exits with success

setsid();                     # Become session leader
open( STDIN,  "</dev/null" ); # Detach STDIN from shell
open( STDOUT, ">/dev/null" ); # Detach STDOUT from shell
open( STDERR, ">&STDOUT"   ); # Detach STDERR from shell
chdir '/tmp';                 # Change working directory
umask(0);                     # Reset umask

my $logpath = "/gpfs/ysm/SUPReMM_farnam/pmlogger";
my @subdirs = ("$logyear", "$logmonth", "$fullhost", "$logyear-$logmonth-$logday" );
foreach ( @subdirs ) {
  $logpath = $logpath."/".$_;
  unless ( -d $logpath ) {
    `mkdir -p $logpath;chown pcp:pcp $logpath;`;
  }
}
my $logdir = $logpath;


system("env PMLOGGER_PROLOG=yes pmlogger -U pcp -c /etc/pcp/pmlogger/pmlogger-supremm.config -s 420M -l /tmp/job-$jobid-begin-$jobdatelong.log $logdir/job-$jobid-begin-$jobdatelong &> /dev/null");

exit(0)
Comment 20 Adam 2022-03-24 07:26:01 MDT
Created attachment 24042 [details]
slurmctld log for March 22nd (gzipped)

As requested by Chad. Please note that the March 21st log has already been uploaded (attachment 24007 [details]).
Comment 21 Adam 2022-03-24 07:28:59 MDT
Created attachment 24043 [details]
scheduler syslog covering March 21-22nd (gzipped)

As requested by Chad.

Please note that this also includes all of the log data from slurmctld over the same time period.
Comment 22 Chad Vizino 2022-03-24 09:03:56 MDT
(In reply to Adam from comment #17)
> A) For the service script the timeout value we use is the default:
> 
> I'd be very worried if 90 seconds wasn't a reasonable enough amount of time
> for slurmctld to shut down on this particular platform (Intel(R) Xeon(R)
> Gold 5222 CPU @ 3.80GHz, 192-GB DDR4, 80Gbps connect to service network,
> database resides on local nvme disk).
Indeed, but the SIGKILL on timeout looks like it is preventing slurmctld from saving its state, which is probably leading to other problems like job id reuse, since the last assigned job id never gets recorded. I don't see increasing the timeout as a permanent fix, but it may help until we get a solution, since you need to restart slurmctld so many times. I'd be interested to know if this works the next time you must restart slurmctld:

Manually SIGTERM it and wait for it to exit gracefully (if that can even happen--I want to see how long it takes); we know from your past logs that this will take more than 90 sec. You should be able to run "sdiag" periodically while waiting and watch the server thread count go down. With logging at the info level, as you have it set, slurmctld.log should show these lines on a normal shutdown (the key is seeing the saving-state message):

>Terminate signal (SIGINT or SIGTERM) received
>Saving all slurm state
You might look for the "saving state" message in a previous log of a normal shutdown to see how long it took between the SIGTERM and the state save (and whether any other stray messages appeared before it actually went down). That would be an interesting baseline metric. It would also be interesting to know how many normal and how many hard (SIGKILL) shutdowns have happened, both before and after the het job problem started. Could you check into this?
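While waiting, the sdiag polling could be scripted along these lines (an untested sketch; it assumes sdiag's output contains a "Server thread count: N" line and that slurmctld is visible to pgrep):

```shell
#!/bin/bash
# Hypothetical sketch: log the server thread count every few seconds
# until slurmctld exits. Does nothing if slurmctld is not running.

# Extract the thread count from sdiag-style output on stdin
poll_threads() {
  awk -F': *' '/^Server thread count/ {print $2}'
}

while pgrep -x slurmctld > /dev/null; do
  count=$(sdiag | poll_threads)
  echo "$(date +%T) server threads: ${count:-unknown}"
  sleep 5
done
```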

> C) Excellent work determining why we were seeing this problem for non-het
> jobs. It seems like job ids should never be reused and to do so could cause
> many problems. Is there a way to determine if the other hetjobs (real ones
> or otherwise) shown as canceled in the log files had also been cases of
> reused jobIDs?
Right--job ids should not be reused. Avoiding hard shutdowns should hopefully eliminate the reuse.

> D) Last Saturday at 6:07-PM is the first time that we were impacted by this
> problem. According to our log files hetjobs had gotten cancelled a few times
> before last Saturday, ..however in those cases slurmctld appears to have
> continued normally. We upgraded from slurm 21.08.1 to 21.08.2 on October
> 7th, 2021. We plan to upgrade to 21.08.6 in a few weeks.
Very interesting. So it would seem that something about the job is triggering the problem since your system software versioning seems to be stable.

Thanks for the other answers to my questions--very helpful. I want to spend some time looking at your prolog and make some suggestions for a couple changes that might help. More on this later.
Comment 23 Chad Vizino 2022-03-24 09:56:30 MDT
(In reply to Adam from comment #19)
> *** prolog script ***
> 
> # Save job scripts in case support needs them later
> scriptdir="/gpfs/ysm/slurm/job_scripts/${HOSTNAME}/$(date "+%Y%m%d")"
> if [ ! -e "${scriptdir}/${SLURM_JOB_ID}" ]; then
>     mkdir -p "$scriptdir"
>     scontrol write batch_script "$SLURM_JOB_ID"
> "${scriptdir}/${SLURM_JOB_ID}" > /dev/null
>     scontrol show JobId="${SLURM_JOB_ID}" >
> "${scriptdir}/${SLURM_JOB_ID}.info"
> fi
We recommend not running scontrol in prologs since it can bog down slurmctld with RPCs (see https://slurm.schedmd.com/prolog_epilog.html).

For batch script storage, this can be written to the database (see bug 7609). Is there some reason why you are not using this method? It's commented out in your slurm.conf:

>#AccountingStoreFlags=job_script,job_env
More info is here: https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStoreFlags.
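For reference, the flow would look roughly like this once enabled (21.08 behavior; the script is only stored for jobs submitted after the flag is set):

```
# slurm.conf (restart slurmctld after the change)
AccountingStoreFlags=job_script

# later, to retrieve the stored script for a job:
sacct -j <jid> --batch-script
```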

For the scontrol show, what is it that you need from it? Perhaps we can suggest an alternative.
Comment 24 Chad Vizino 2022-03-24 10:54:02 MDT
(In reply to Adam from comment #17)
> F) The invalid job ID messages could relate to another less urgent problem. 
Maybe but I would like to understand them. There are over 5600 of each of the lines below in the Mar 22 log (every 15sec). Can you look up more info from past logs, squeue and sacct about them?

>Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18748205
> 
>Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18756999
Comment 25 Chad Vizino 2022-03-24 11:28:29 MDT
(In reply to Chad Vizino from comment #24)
> (In reply to Adam from comment #17)
> > F) The invalid job ID messages could relate to another less urgent problem. 
> Maybe but I would like to understand them. There are over 5600 of each of
> the lines below in the Mar 22 log (every 15sec). Can you look up more info
> from past logs, squeue and sacct about them?
> 
> >Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18748205
> > 
> >Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18756999
Also check for records of these jobs in your StateSaveLocation (from slurm.conf) under the hash* directories. There may be a stray directory like job.<jid> for each.
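A sketch of that check (the path and helper name are mine; substitute your actual StateSaveLocation):

```shell
#!/bin/bash
# Hypothetical helper: list leftover job.<jid> directories under the
# StateSaveLocation hash.* directories for a set of suspect job ids.
find_stray_job_dirs() {
  local statesave=$1; shift
  local jid
  for jid in "$@"; do
    find "$statesave" -maxdepth 2 -type d -name "job.$jid" 2>/dev/null
  done
}

# example (replace the path with your StateSaveLocation):
# find_stray_job_dirs /var/spool/slurmctld 18748205 18756999
```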
Comment 26 Adam 2022-03-24 11:45:50 MDT
Hi Chad,

I'd just like to reiterate that our primary and overwhelming concern in this case is that user space activity is capable of zombifying slurmctld, ..and that we seem to have no way to prevent this. 

Thank you,
Adam
Comment 27 Chad Vizino 2022-03-24 14:01:59 MDT
We'd like you to remove the scontrol lines from your prolog and see if that improves things (see comment 23).

Also, if you are able to grab another sdiag and backtrace when slurmctld is "hung" that would be good (like in comment 5). You may be able to put it into that condition by doing an "scontrol reconfigure" which rereads state files.
Comment 29 Adam 2022-03-25 09:09:17 MDT
Hello,

I'm afraid I'm a bit confused about the focus on the 5600 jobs. It seems like they came about as a result of this problem, and so it's really not clear to me how they could be related to the cause. If you would like more information about them, would you please be more specific about what it is that we're looking for?

I'll check the hash directories for any job.# folders that use jobIDs that are not in the queue. We would like to know how and why they could (or are) related to the problem.

In spite of the RPC hit we will not be removing the scontrol lines from our prolog scripts, unless you can explain why and/or how use of scontrol in this manner is causing the problem that we only see after canceled hetjobs. The information collected from recording job scripts enables us to provide better support for the researchers and we had problems with adding job scripts to the database as you suggested. 

If the system locks up again we can grab another sdiag and debugger dumps but we're not going to intentionally try to cause the scheduler to lock up while it's in production use. The best option is to recreate the problem in our test environment so that we can test any potential solution without negatively impacting any ongoing research work at Yale.

Thank you,
Adam
Comment 30 Jason Booth 2022-03-25 10:09:45 MDT
Hi Adam - Chad brought this issue to my attention. I can understand your confusion here. Let me first start off by clarifying a few points.

We discourage scontrol commands in prologs/epilogs, as Chad mentioned in comment #23. Let me elaborate on why.

Running "scontrol show" inside a prolog/epilog can trigger a deadlock, even if it "usually works". It can seem harmless, since 'scontrol show' only needs read locks, but if another internal thread is pending on a write lock you can run into a race condition and deadlock. We know avoiding this is not ideal and would require modifying your workflow. That said, once we have some way to reproduce your specific lockup, we can confirm for certain whether scontrol is to blame. Until then, we can only assume it is the cause, given previous issues we have seen from other sites with a similar workflow.



With regard to the "5600 jobs" and comment #22: it is highly abnormal to have to kill slurmctld, and a shutdown lasting as long as yours does concerns us. Here is why.

If slurmctld is killed before it can fully write out its state save, you can run into problems where the jobs, job ids, and overall state of the cluster do not match what is found on the nodes and controller. slurmctld could end up canceling jobs on nodes because their state was not properly saved on shutdown and it was not "aware" of them on startup. Though not directly related to the hetjob lockup, this is still concerning. I recommend you split that issue out into another bug when you are ready for support to look into it for your site.
Comment 31 Adam 2022-03-25 11:08:28 MDT
Thank you for clarifying Jason: we're going to try enabling job script saving in the database again during our next maintenance (within 2 weeks from now) and we will take the scontrols out of our prologs at that time. 

We understand and share your concerns about hard-stopping slurmctld, which is part of why we consider this problem to be serious. Normally it takes the scheduler 1-3 seconds to cleanly shutdown, and so if it doesn't do it after 90 seconds I would be inclined to agree with systemd that it never will. We've not had this problem before, and only now have it in conjunction with the hetjob cancel messages: perhaps it is part of the same problem?

We plan to upgrade to 21.08.6 at the time of our maintenance: it would be ideal if we could recreate the problem in a test environment and either verify a fix or confirm that 21.08.6 isn't impacted before then.

Thank you,
Adam
Comment 32 Chad Vizino 2022-03-28 10:50:49 MDT
(In reply to Adam from comment #31)
> Thank you for clarifying Jason: we're going to try enabling job script
> saving in the database again during our next maintenance (within 2 weeks
> from now) and we will take the scontrols out of our prologs at that time. 
> 
> We understand and share your concerns about hard-stopping slurmctld, which
> is part of why we consider this problem to be serious. Normally it takes the
> scheduler 1-3 seconds to cleanly shutdown, and so if it doesn't do it after
> 90 seconds I would be inclined to agree with systemd that it never will.
> We've not had this problem before, and only now have it in conjunction with
> the hetjob cancel messages: perhaps it is part of the same problem?
> 
> We plan to upgrade to 21.08.6 at the time of our maintenance: it would be
> ideal if we could recreate the problem in a test environment and either
> verify a fix or confirm that 21.08.6 isn't impacted before then.
Hi Adam. Reproducing the het job problem in a test environment would be ideal--agreed--and the shutdown delay may be related if locking problems are involved. However, since we (and you on your test system) have not been able to reproduce the problem so far, an answer to whether or not the suggested steps will help may need to wait until you are able to implement them on your production system. In the meantime we would like to downgrade this ticket to level 3. If, after the changes, things are still problematic we can upgrade again. Are you ok with this?
Comment 33 Adam 2022-03-28 14:37:13 MDT
Yes, that's fine and I have no objection. 

Our user outreach has mitigated the problem and we otherwise do not seem to have any way to progress without being able to recreate the problem in a test environment (or by risking a production service outage).

Thank you,
Adam
Comment 34 Chad Vizino 2022-04-18 09:27:39 MDT
(In reply to Adam from comment #31)
> Thank you for clarifying Jason: we're going to try enabling job script
> saving in the database again during our next maintenance (within 2 weeks
> from now) and we will take the scontrols out of our prologs at that time. 
> 
> We understand and share your concerns about hard-stopping slurmctld, which
> is part of why we consider this problem to be serious. Normally it takes the
> scheduler 1-3 seconds to cleanly shutdown, and so if it doesn't do it after
> 90 seconds I would be inclined to agree with systemd that it never will.
> We've not had this problem before, and only now have it in conjunction with
> the hetjob cancel messages: perhaps it is part of the same problem?
> 
> We plan to upgrade to 21.08.6 at the time of our maintenance: it would be
> ideal if we could recreate the problem in a test environment and either
> verify a fix or confirm that 21.08.6 isn't impacted before then.
Hi. I just wanted to check in and see how things are going for you. Were you able to make the changes you mentioned and did they help?
Comment 35 Adam 2022-04-18 10:01:11 MDT
Hello,

We were not able to recreate the problem at the eve of our downtime and so we were not able to verify that any of the changes that we made helped or not. The assumptions are that the bug has not been fixed, that it is load related, and that it will eventually be encountered again.

Thanks,
Adam

Comment 36 Chad Vizino 2022-04-18 13:02:35 MDT
(In reply to Adam from comment #35)
> We were not able to recreate the problem at the eve of our downtime and so
> we were not able to verify that any of the changes that we made helped or
> not. The assumptions are that the bug has not been fixed, that it is load
> related, and that it will eventually be encountered again.
Those are valid assumptions, and I'm sorry the outcome was not clearer. May we close this ticket for now (as "cannot reproduce"), with the understanding that you can reopen it at any time if the problem occurs again?
Comment 37 Adam 2022-04-18 13:34:10 MDT
Have you exhausted every possible explanation for how this problem could have occurred? You mentioned that the use of scontrol and failed shutdowns of the scheduler service could potentially confuse hetjob numbering (and while we have our doubts, we have made some adjustments on our side based on this): is there anything else that aligns with what we have described?

I think if we have the best possible understanding of how this could have happened (however difficult to recreate) we can at least know what to watch out for.

Thanks,
Adam
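(Since slurmctld stays registered as running while unresponsive, which is why the existing alerting misses this, one thing to watch out for is captured by a liveness probe that exercises a real RPC under a deadline. A minimal sketch; the timeout value and the decision of what to do on failure are site-specific assumptions:)

```shell
#!/bin/sh
# Liveness probe for slurmctld. The daemon can remain "active" to systemd
# while no longer answering RPCs, so probe an actual request under a
# deadline instead of checking the unit state.
# PROBE_TIMEOUT (seconds) is a site-specific assumption.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-30}

probe() {
    # Run the given command with a deadline; exit status 124 means it hung.
    timeout "$PROBE_TIMEOUT" "$@" >/dev/null 2>&1
}

check_slurmctld() {
    probe scontrol ping
    rc=$?
    case $rc in
        0)   echo "slurmctld responsive" ;;
        124) echo "slurmctld HUNG: no reply within ${PROBE_TIMEOUT}s" >&2 ;;
        *)   echo "slurmctld DOWN: scontrol ping exited $rc" >&2 ;;
    esac
    return $rc
}

# Only probe when the Slurm client tools are actually installed;
# the non-zero exit is the hook for whatever alerting is in place.
if command -v scontrol >/dev/null 2>&1; then
    check_slurmctld || exit 1
fi
```

(Run from cron or a timer, this distinguishes "hung" from "down", which is exactly the gap described in the original report.)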

Comment 38 Chad Vizino 2022-04-19 14:37:42 MDT
(In reply to Adam from comment #37)
> Have you exhausted exploring and communicating every possible way that this
> problem could have occurred?
Without a way to reproduce the issue, finding the problem is significantly more difficult. We have a number of sites that use het jobs and have not run into the problem you have reported here, so we have some reason to believe this may be site-specific.

> You mentioned the usage of scontrol and failed
> shutdowns of the scheduler service potentially confusing hetjob numbering
> (and while we have our doubts we have made some adjustments on our side
> based on this): is there anything else (and aligns with what we have
> described)?
We are not convinced this has anything to do with the initial problem you reported. We were concerned that there may be some type of issue with the state files, but that can be investigated in another bug report if you'd like to file one.

> I think if we have the best possible understanding of how this could have
> happened (however difficult to recreate) we can at least know what to watch
> out for.
What we do have is a set of knowns regarding your setup and the potential for a "scheduler service lockup," as mentioned in comment 30. It is possible that your cluster operates normally until the perfect set of circumstances arises, such as an increase in jobs submitted and starting in close succession.

It's good that you were able to make the changes to your system mentioned in comment 31, and hopefully they will help. It is best at this time to wait and see. If the problem happens again, we'll have more information to go on (at the least, we'll know a couple of things that the problem is not).
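(For any future reproduction attempt, the "jobs submitted in close succession" scenario could be exercised with a sketch along these lines: submit many heterogeneous jobs whose second component requests an unsatisfiable configuration, which is the path that logs "Cancelling aborted hetjob submit". The constraint name and job count are hypothetical, and submission is guarded so the generator runs even without Slurm installed:)

```shell
#!/bin/sh
# Sketch of a load generator for the aborted-hetjob path: each submission
# is a two-component het job whose second component can never be
# scheduled, so slurmctld should reject and cancel it at submit time.
# N_JOBS and the bogus --constraint value are assumptions.
N_JOBS=${N_JOBS:-50}
BATCH=$(mktemp /tmp/hetjob.XXXXXX)

cat > "$BATCH" <<'EOF'
#!/bin/sh
#SBATCH --job-name=het-repro
#SBATCH --ntasks=1
#SBATCH hetjob
#SBATCH --ntasks=1
#SBATCH --constraint=no_such_feature
srun --het-group=0,1 hostname
EOF

# Submit in close succession only when sbatch is available; otherwise
# just print the generated script for inspection.
if command -v sbatch >/dev/null 2>&1; then
    i=0
    while [ "$i" -lt "$N_JOBS" ]; do
        sbatch "$BATCH"
        i=$((i + 1))
    done
else
    cat "$BATCH"
fi
```

(The tight submit loop approximates the burst of closely spaced submissions suspected above; whether it actually triggers the lockup on a test cluster is exactly what would need to be observed.)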
Comment 39 Chad Vizino 2022-05-03 11:56:56 MDT
Hi. I'm going to close this issue for now. However, feel free to reopen if you encounter this problem again, and we'll pursue it.