We've observed a relationship between the slurmctld service becoming unresponsive and a "Cancelling aborted hetjob submit:" message. The slurmctld service does not fail, so our existing alerting methods don't notify us; it just stops responding to all requests. Example:

Mar 21 16:47:03 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _job_complete: JobId=23925049_185(23925323) done
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _pick_best_nodes: JobId=23925554 never runnable in partition pi_tomography
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
Mar 21 16:50:01 p2r1u25.farnam.hpc.yale.internal systemd: Started Session 235 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal systemd-logind: New session 236 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal systemd: Started Session 236 of user root.
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal dbus[1473]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar 21 16:50:08 p2r1u25.farnam.hpc.yale.internal dbus[1473]: [system] Successfully activated service 'org.freedesktop.problems'
Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Terminate signal (SIGINT or SIGTERM) received
Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal systemd: Stopping Slurm controller daemon...

...with the events after 16:50 being a root user getting on the node to restart the service. This has happened 3 times today, and happened once over this last weekend (a >12 hour outage).

Thank you,
Adam
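P.S. Because systemd keeps reporting the unit as active during these events, we're experimenting with a client-side responsiveness probe for alerting. A minimal sketch (the probe command, deadline, and how we wire it into alerting are our own choices, nothing Slurm-specific):

```shell
# check_slurmctld: returns 0 if a Slurm client command answers within a
# deadline, non-zero otherwise. timeout(1) exits 124 when it has to kill
# the probe, which is the "daemon alive but unresponsive" signature.
check_slurmctld() {
    local deadline="${1:-30}"
    timeout "$deadline" sinfo -h > /dev/null 2>&1
}
```

Run from cron, a non-zero exit would page us even while `systemctl status slurmctld` still says active.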
Adam, if the service is currently stuck, would you please grab a backtrace and attach it to this issue?

> gdb attach <slurmctld_pid>
> (gdb) set print pretty
> (gdb) t a a bt full

If client commands still work, the output of sdiag would also be helpful to attach.
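If attaching interactively is awkward during an outage, the same capture can be scripted; a sketch (the PID lookup and output path are placeholders):

```shell
# capture_bt: dump a pretty-printed, full backtrace of every thread of the
# given PID to a file, non-interactively. Returns 1 if no PID was supplied.
capture_bt() {
    local pid="$1" out="${2:-slurmctld-bt.txt}"
    [ -n "$pid" ] || return 1
    gdb -p "$pid" -batch \
        -ex 'set print pretty on' \
        -ex 'thread apply all bt full' > "$out" 2>&1
}
```

e.g. run `capture_bt "$(pidof slurmctld)"` as root and attach the resulting file.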
Not stuck at the moment: we might be able to recreate the problem on our test cluster (and if so I'll send you the gdb output). Since the impact is so large, we try to recover the service (systemctl stop slurmctld; systemctl start slurmctld) as soon as we notice, but I'll try to get more data next time. I don't know whether sdiag responds in that state, but a command like sinfo will time out when the system gets into it. Here is the current sdiag output (the service is running normally):

[ACTIVE@p2r1u25:~] sdiag
*******************************************************
sdiag output at Mon Mar 21 18:31:06 2022 (1647901866)
Data since      Mon Mar 21 17:45:05 2022 (1647899105)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 23
Jobs started:   451
Jobs completed: 428
Jobs canceled:  3
Jobs failed:    0

Job states ts:  Mon Mar 21 18:31:05 2022 (1647901865)
Jobs pending:   7082
Jobs running:   866

Main schedule statistics (microseconds):
        Last cycle:        29946
        Max cycle:         259350
        Total cycles:      70
        Mean cycle:        24696
        Mean depth cycle:  48
        Cycles per minute: 1
        Last queue length: 42

Backfilling stats
        Total backfilled jobs (since last slurm start): 39
        Total backfilled jobs (since last stats cycle start): 39
        Total backfilled heterogeneous job components: 0
        Total cycles: 18
        Last cycle when: Mon Mar 21 18:30:05 2022 (1647901805)
        Last cycle: 32836
        Max cycle:  57222
        Mean cycle: 38540
        Last depth cycle: 55
        Last depth cycle (try sched): 54
        Depth Mean: 58
        Depth Mean (try depth): 57
        Last queue length: 42
        Queue length mean: 43
        Last table size: 9
        Mean table size: 9

Latency for 1000 calls to gettimeofday(): 16 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:6482 ave_time:227  total_time:1473720
        REQUEST_FED_INFO                 ( 2049) count:4444 ave_time:157  total_time:701616
        REQUEST_JOB_INFO_SINGLE          ( 2021) count:4364 ave_time:244  total_time:1068571
        REQUEST_NODE_INFO_SINGLE         ( 2040) count:2430 ave_time:967  total_time:2350552
        REQUEST_COMPLETE_PROLOG          ( 6018) count:817  ave_time:277  total_time:226596
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:687  ave_time:371  total_time:255370
        REQUEST_NODE_INFO                ( 2007) count:475  ave_time:1078 total_time:512328
        REQUEST_STEP_COMPLETE            ( 5016) count:475  ave_time:415  total_time:197463
        MESSAGE_EPILOG_COMPLETE          ( 6012) count:467  ave_time:144  total_time:67271
        REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:459  ave_time:402  total_time:184930
        REQUEST_BATCH_SCRIPT             ( 2051) count:443  ave_time:236  total_time:104973
        REQUEST_KILL_JOB                 ( 5032) count:371  ave_time:263  total_time:97930
        REQUEST_JOB_USER_INFO            ( 2039) count:91   ave_time:357  total_time:32574
        REQUEST_JOB_READY                ( 4019) count:18   ave_time:2128 total_time:38304
        REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:15   ave_time:3311 total_time:49667
        REQUEST_JOB_STEP_CREATE          ( 5001) count:12   ave_time:1432 total_time:17194
        REQUEST_JOB_INFO                 ( 2003) count:12   ave_time:5701 total_time:68417
        REQUEST_BUILD_INFO               ( 2001) count:10   ave_time:230  total_time:2309
        REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:7    ave_time:154  total_time:1083
        REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:7    ave_time:282  total_time:1974
        REQUEST_RESOURCE_ALLOCATION      ( 4001) count:6    ave_time:3081 total_time:18490
        REQUEST_UPDATE_NODE              ( 3002) count:3    ave_time:128  total_time:384
        REQUEST_CANCEL_JOB_STEP          ( 5005) count:2    ave_time:5728 total_time:11457
        REQUEST_PING                     ( 1008) count:2    ave_time:118  total_time:236

Remote Procedure Call statistics by user
        root    (     0) count:9947 ave_time:458  total_time:4558741
        sw2247  ( 16774) count:1538 ave_time:228  total_time:351044
        dz287   ( 15166) count:1380 ave_time:224  total_time:310490
        yh462   ( 13841) count:1065 ave_time:233  total_time:248221
        ah2428  ( 18322) count:876  ave_time:221  total_time:193731
        jc3893  ( 18408) count:753  ave_time:233  total_time:175860
        ngd22   ( 19289) count:752  ave_time:220  total_time:166016
        ijc7    ( 18826) count:740  ave_time:216  total_time:160091
        mem268  ( 18956) count:698  ave_time:216  total_time:151105
        yc753   ( 16259) count:591  ave_time:222  total_time:131433
        eaz29   ( 19040) count:549  ave_time:209  total_time:114849
        yx44    ( 10855) count:449  ave_time:236  total_time:106246
        chs69   ( 19356) count:440  ave_time:200  total_time:88361
        sw926   ( 15794) count:423  ave_time:239  total_time:101176
        st757   ( 19342) count:412  ave_time:252  total_time:103872
        tc732   ( 17107) count:309  ave_time:226  total_time:69951
        crm54   ( 20003) count:278  ave_time:212  total_time:59068
        vs428   ( 16267) count:260  ave_time:225  total_time:58600
        je45    ( 12015) count:218  ave_time:204  total_time:44557
        taw46   ( 18820) count:114  ave_time:235  total_time:26802
        ms725   ( 12732) count:107  ave_time:227  total_time:24360
        tl397   ( 11917) count:40   ave_time:1862 total_time:74518
        mrb93   ( 13687) count:18   ave_time:766  total_time:13797
        ls2286  ( 16099) count:16   ave_time:979  total_time:15677
        bc758   ( 18905) count:15   ave_time:206  total_time:3095
        hl796   ( 16326) count:13   ave_time:380  total_time:4950
        ap2279  ( 15164) count:12   ave_time:1326 total_time:15920
        ktt8    ( 14015) count:10   ave_time:399  total_time:3996
        bb926   ( 19980) count:9    ave_time:3158 total_time:28423
        zl436   ( 15226) count:9    ave_time:318  total_time:2869
        jg2447  ( 14101) count:9    ave_time:272  total_time:2454
        yy496   ( 14135) count:9    ave_time:1069 total_time:9625
        slurm   (   666) count:6    ave_time:183  total_time:1103
        ep542   ( 14978) count:5    ave_time:1473 total_time:7367
        mwr8    ( 13872) count:5    ave_time:5412 total_time:27060
        cz378   ( 18324) count:4    ave_time:2855 total_time:11423
        cl2384  ( 17600) count:4    ave_time:1231 total_time:4926
        yw599   ( 15755) count:3    ave_time:266  total_time:799
        yg247   ( 12914) count:3    ave_time:217  total_time:653
        cz354   ( 16968) count:3    ave_time:255  total_time:765
        lut2    ( 14998) count:3    ave_time:198  total_time:594
        xw379   ( 14526) count:2    ave_time:4174 total_time:8349
        gpr24   ( 17481) count:2    ave_time:236  total_time:472

Pending RPC statistics
        No pending RPCs
We downgraded our test cluster to 21.08.2 and reproduced the hetjob messages there by submitting a hetjob requesting resources unavailable in the test environment; this test did *not* lock up the scheduler service.
(In reply to Adam from comment #3)
> We downgraded our test cluster to 21.08.2 and recreated the hetjob messages
> on the test cluster by submitting a hetjob requesting resources unavailable
> to the test environment: this test did *not* lock up the scheduler service.

Could you send a full slurmctld.log for Mar 21 from your main system where this has happened? I'd like to see how the scheduler was handling the affected het jobs.
I was unable to recreate the problem on our test cluster, but the problem happened again today on the production system. sdiag was responsive and returned:

[ACTIVE@p2r1u25:~] sdiag
*******************************************************
sdiag output at Tue Mar 22 13:24:10 2022 (1647969850)
Data since      Mon Mar 21 20:00:00 2022 (1647907200)
*******************************************************
Server thread count:  188
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 11768
Jobs started:   16372
Jobs completed: 14740
Jobs canceled:  65
Jobs failed:    0

Job states ts:  Tue Mar 22 13:20:44 2022 (1647969644)
Jobs pending:   1512
Jobs running:   738

Main schedule statistics (microseconds):
        Last cycle:        16512
        Max cycle:         518002
        Total cycles:      1986
        Mean cycle:        20866
        Mean depth cycle:  39
        Cycles per minute: 1
        Last queue length: 56

Backfilling stats
        Total backfilled jobs (since last slurm start): 1578
        Total backfilled jobs (since last stats cycle start): 1397
        Total backfilled heterogeneous job components: 8
        Total cycles: 417
        Last cycle when: Tue Mar 22 13:20:19 2022 (1647969619)
        Last cycle: 58523
        Max cycle:  240477
        Mean cycle: 30661
        Last depth cycle: 64
        Last depth cycle (try sched): 64
        Depth Mean: 40
        Depth Mean (try depth): 39
        Last queue length: 55
        Queue length mean: 31
        Last table size: 15
        Mean table size: 16

Latency for 1000 calls to gettimeofday(): 16 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:138027 ave_time:223  total_time:30802000
        REQUEST_FED_INFO                 ( 2049) count:91892  ave_time:155  total_time:14304400
        REQUEST_JOB_INFO_SINGLE          ( 2021) count:90428  ave_time:304  total_time:27500792
        REQUEST_NODE_INFO_SINGLE         ( 2040) count:63404  ave_time:521  total_time:33076247
        REQUEST_COMPLETE_PROLOG          ( 6018) count:27486  ave_time:215  total_time:5929203
        REQUEST_STEP_COMPLETE            ( 5016) count:18387  ave_time:443  total_time:8147613
        REQUEST_NODE_INFO                ( 2007) count:18288  ave_time:1092 total_time:19979793
        MESSAGE_EPILOG_COMPLETE          ( 6012) count:18210  ave_time:243  total_time:4435893
        REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:18044  ave_time:534  total_time:9641511
        REQUEST_BATCH_SCRIPT             ( 2051) count:17870  ave_time:1621 total_time:28983610
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:10119  ave_time:303  total_time:3069775
        REQUEST_KILL_JOB                 ( 5032) count:9488   ave_time:273  total_time:2596344
        REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:1782   ave_time:2524 total_time:4498315
        REQUEST_JOB_USER_INFO            ( 2039) count:1365   ave_time:456  total_time:622736
        REQUEST_JOB_INFO                 ( 2003) count:387    ave_time:6210 total_time:2403624
        REQUEST_JOB_READY                ( 4019) count:382    ave_time:967  total_time:369394
        REQUEST_BUILD_INFO               ( 2001) count:291    ave_time:229  total_time:66693
        REQUEST_JOB_STEP_CREATE          ( 5001) count:244    ave_time:3283 total_time:801239
        REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:155    ave_time:166  total_time:25831
        REQUEST_RESOURCE_ALLOCATION      ( 4001) count:87     ave_time:4036 total_time:351188
        REQUEST_UPDATE_NODE              ( 3002) count:85     ave_time:185  total_time:15748
        REQUEST_PING                     ( 1008) count:73     ave_time:140  total_time:10286
        REQUEST_CANCEL_JOB_STEP          ( 5005) count:72     ave_time:979  total_time:70501
        REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:70     ave_time:424  total_time:29732
        REQUEST_SUBMIT_BATCH_HET_JOB     ( 4028) count:5      ave_time:4470 total_time:22353
        REQUEST_UPDATE_JOB               ( 3001) count:5      ave_time:1633 total_time:8169
        REQUEST_AUTH_TOKEN               ( 5039) count:3      ave_time:291  total_time:874
        REQUEST_STATS_INFO               ( 2035) count:1      ave_time:150  total_time:150
        REQUEST_RESERVATION_INFO         ( 2024) count:1      ave_time:173  total_time:173

Remote Procedure Call statistics by user
        root    (     0) count:300247 ave_time:442  total_time:132863662
        yl2428  ( 18649) count:51416  ave_time:237  total_time:12235703
        dz287   ( 15166) count:18729  ave_time:240  total_time:4507587
        sw2247  ( 16774) count:15081  ave_time:252  total_time:3801279
        ijc7    ( 18826) count:14078  ave_time:250  total_time:3524550
        nkv6    ( 19314) count:9997   ave_time:279  total_time:2790950
        tl688   ( 19130) count:8449   ave_time:244  total_time:2063804
        sw926   ( 15794) count:6031   ave_time:251  total_time:1514632
        ah2428  ( 18322) count:5229   ave_time:224  total_time:1172223
        ...
        sl2373  ( 12560) count:3      ave_time:253  total_time:760
        aj646   ( 19804) count:3      ave_time:318  total_time:956
        crw42   ( 11467) count:3      ave_time:226  total_time:680
        tw626   ( 19760) count:3      ave_time:456  total_time:1369
        pag4    ( 10001) count:2      ave_time:2213 total_time:4427
        pl543   ( 16023) count:1      ave_time:1113 total_time:1113

Pending RPC statistics
        No pending RPCs

...and the GDB output:

[ACTIVE@p2r1u25:~] gdb attach 211610
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 211610
Reading symbols from /opt/slurm/21.08.2/sbin/slurmctld...done.
Reading symbols from /opt/slurm/21.08.2/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/slurm/21.08.2/lib/slurm/libslurmfull.so
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
0x00007fb08673dccd in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64 sssd-client-1.16.5-10.el7_9.10.x86_64
(gdb) set print pretty
(gdb) t a a bt full

Thread 1 (Thread 0x7fb087939740 (LWP 211610)):
#0  0x00007fb08673dccd in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fb0873d6024 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x22d4440) at eio.c:352
        n = <optimized out>
        timeout = -1
#2  eio_handle_mainloop (eio=0x22cadc0) at eio.c:315
        retval = 0
        pollfds = 0x22d4440
        map = 0x22b31a0
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#3  0x00000000004b720a in _slurmscriptd_mainloop () at slurmscriptd.c:837
No locals.
#4  slurmscriptd_init (argc=argc@entry=3, argv=argv@entry=0x7ffca84339a8) at slurmscriptd.c:1107
        i = <optimized out>
        ack = 0
        rc = 0
        proc_name = 0x4e86ba "slurmscriptd"
        log_prefix = 0x0
        to_slurmscriptd = {6, 7}
        to_slurmctld = {8, 9}
        __func__ = "slurmscriptd_init"
#5  0x000000000042dd6b in main (argc=3, argv=0x7ffca84339a8) at controller.c:433
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = { tv_sec = 1647899099, tv_usec = 27688 }
        now = { tv_sec = 1647899099, tv_usec = 27704 }
        stat_buf = { st_dev = 64768, st_ino = 100851449, st_nlink = 1, st_mode = 33261,
                     st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 1828,
                     st_blksize = 4096, st_blocks = 8,
                     st_atim = { tv_sec = 1647892855, tv_nsec = 296568077 },
                     st_mtim = { tv_sec = 1596213326, tv_nsec = 0 },
                     st_ctim = { tv_sec = 1598892426, tv_nsec = 683905770 },
                     __unused = {0, 0, 0} }
        rlim = { rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615 }
        config_write_lock = { conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK }
        prep_callbacks = { prolog_slurmctld = 0x490f60 <prep_prolog_slurmctld_callback>,
                           epilog_slurmctld = 0x491199 <prep_epilog_slurmctld_callback> }
        create_clustername_file = false
        conf_file = <optimized out>
        __func__ = "main"
(gdb)

...I'll upload the log files.
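One thing that stands out comparing the two sdiag captures is the server thread count: 3 when healthy versus 188 while locked up. Since slurmctld stops accepting new connections once its RPC thread count reaches an internal limit (256 in the versions we've looked at), we may start watching that value; a sketch (the threshold is our own arbitrary choice):

```shell
# thread_count_ok: reads `sdiag` output on stdin and exits non-zero when
# the controller's RPC thread count exceeds the threshold in $1.
thread_count_ok() {
    awk -v max="${1:-150}" \
        '/^Server thread count:/ { if ($4 + 0 > max) bad = 1 }
         END { exit bad }'
}
```

Usage would be something like `sdiag | thread_count_ok 150 || echo "slurmctld thread pile-up"`.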
Created attachment 24006 [details]
logfile

Logfile covering several lockups.
Hi Chad,

For the log file:

13:21:07 - along with the hetjob message, the first lock-up event occurs (the sdiag and gdb output was collected after this point and before restarting the program)
13:29:52 - slurmctld is manually stopped
13:32:28 - slurmctld is manually started
13:33:18 - another hetjob message and another lockup
13:36:59 - slurmctld is manually stopped
13:39:52 - slurmctld is manually started
13:40:57 - another hetjob message and another lockup
13:48:14 - slurmctld is manually stopped
13:50:07 - slurmctld is manually started

More generally, each lockup coincided with one of these messages:

[ACTIVE@p2r1u25:/var/log] egrep hetjob /var/log/messages
Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)
Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167)

...and in every one of these events the scheduler stopped responding to commands like 'sinfo' and stopped scheduling jobs.

Thanks,
Adam
Created attachment 24007 [details]
Full slurmctld log file covering March 21st
(In reply to Adam from comment #8)
> Created attachment 24007 [details]
> Full slurmctld log file covering March 21st

Thanks for what you've sent today. I'm looking it over.
The "Cancelling aborted" messages come from _het_job_cancel(), which is called by _slurm_rpc_submit_batch_het_job(), which handles het jobs submitted with the sbatch command (as opposed to salloc or srun, which need an allocation). We can see the error messages that are causing the batch jobs to fail. Mostly they are "Requested node configuration is not available", but there is one from today that is "Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)", so the problem doesn't seem to be limited to one kind of failure.

>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: Cancelling aborted hetjob submit: JobId=23924851+0(23924851)
>--
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
>--
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)
>--
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
>--
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
>--
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: _slurm_rpc_submit_batch_het_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
>Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167)

What's odd about each of these is that only the het job leader is listed (+0) and not the other component jobs (+1, +2, etc.). I'm not sure what to make of that yet.

I can contrive the failure by modifying the condition in my slurmctld (I am also using 21.08.2) to trigger a similar failure, but it doesn't hang slurmctld:

>$ sbatch : --wrap="sleep 300"
>sbatch: error: Batch job submission failed: Requested node configuration is not available

Then in my slurmctld.log:

>[2022-03-22T17:23:35.595] _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>[2022-03-22T17:23:35.595] Cancelling aborted hetjob submit: JobId=164062+0(164062)
>[2022-03-22T17:23:35.595] Cancelling aborted hetjob submit: JobId=164062+1(164063)

As expected, I see two component het jobs cancelled. But in your log we are only seeing one (the +0), which is odd... I wonder about some kind of corruption of the job. Do you have any details about the failed jobs listed in your log (e.g. job script, same user)? How were you trying to reproduce the problem on your test cluster (job submission type, etc.)?
I'm not sure if these are related, but there are a lot of them for the same two job ids in the log you provided (every ~15 sec):

>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: prolog_complete: invalid JobId=18748205
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_complete_prolog JobId=18748205: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: prolog_complete: invalid JobId=18756999
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_complete_prolog JobId=18756999: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=18756999 uid 0
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=18748205 uid 0
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: job_str_signal(3): invalid JobId=18756999
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: job_str_signal() uid=0 JobId=18756999 sig=9 returned: Invalid job id specified
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: job_str_signal(3): invalid JobId=18748205
>Mar 21 00:00:08 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_kill_job: job_str_signal() uid=0 JobId=18748205 sig=9 returned: Invalid job id specified

Given that the earliest job id allocated in the log is this one:

>Mar 21 00:00:02 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: sched: Allocate JobId=21938154 NodeList=c17n09 #CPUs=1 Partition=admintest

...comparing the two invalid job id numbers to it suggests they are older. Do you know anything about them?

Also, would you supply your slurm.conf and prolog? Thanks.
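For reference, a quick tally of how often those invalid-id retries occur can be pulled straight from syslog; a sketch (the log path is an assumption from your earlier egrep):

```shell
# tally_invalid_jobs: read syslog lines on stdin and print how many times
# each invalid JobId appears, most frequent first.
tally_invalid_jobs() {
    grep -o 'invalid JobId=[0-9]*' | sort | uniq -c | sort -rn
}
```

Something like `tally_invalid_jobs < /var/log/messages` would show whether those two ids dominate and roughly how periodic the retries are.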
Created attachment 24020 [details]
slurm.conf
slurm.conf attached. Our epilog is very short:

#!/bin/bash
# If there are no more of this user's processes on this node, clean up directories
/usr/bin/perl /etc/slurm/slurm.epilog.supremm
if ! pgrep -u $SLURM_JOB_UID > /dev/null
then
    # for $XDG_RUNTIME_DIR, osTicket #010500: remove the /run/user/$SLURM_JOB_UID
    # directory if the user has no processes on the node
    rm -rf /run/user/$SLURM_JOB_UID
    find /dev/shm -uid $SLURM_JOB_UID -type f -delete
    find /tmp -uid $SLURM_JOB_UID -type f -delete
fi
exit 0

...and slurm.epilog.supremm is:

#!/usr/bin/perl
# Added for OS-Ticket #017703
use strict;
use Sys::Hostname;
use Date::Simple;
use POSIX qw(strftime);

my $jobid = $ENV{SLURM_JOB_ID};
my $today = Date::Simple->new;
my $logyear  = $today->format("%Y");
my $logmonth = $today->format("%m");
my $logday   = $today->format("%d");
my $jobdatelong = strftime "%Y%m%d.%H.%M.%S", localtime;
my $fullhost = hostname();
my $logdir = "/gpfs/ysm/SUPReMM_farnam/pmlogger/$logyear/$logmonth/$fullhost/$logyear-$logmonth-$logday";

system("env PMLOGGER_EPILOG=yes pmlogger -U pcp -c /etc/pcp/pmlogger/pmlogger-supremm.config -s 1 -l /tmp/job-$jobid-end-$jobdatelong.log $logdir/job-$jobid-end-$jobdatelong &> /dev/null");
system("pkill -u pcp -f job-$jobid-begin");
exit(0);
In terms of the jobs, most are definitely hetjobs, and most were submitted by one particular user. A second user submitted a (cancelled) hetjob last Saturday, but the curious thing in a few other cases is that two of the cancelled "hetjob" jobs did not appear to be hetjobs at all. Here are a few examples.

A clear hetjob case, from the user submitting most of the jobs:

#!/bin/bash
#SBATCH --job-name=3DRefine_3M_1pt1
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=8000M
#SBATCH --time=4-00:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<REDACTED>
#SBATCH hetjob
#SBATCH --account=tomography
#SBATCH --partition=pi_tomography
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --time=4-00:00:00
#SBATCH --gres=gpu:rtx5000:4
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem-per-cpu=22G

module load RELION/3.0.8-fosscuda-2018b
#mkdir Refine3D/cluster4_7_protein

srun --pack-group=0,1 $(which relion_refine_mpi) \
    --continue Refine3D/cluster4_7_protein/run_it023_optimiser.star \
    --o Refine3D/cluster4_7_protein/run_ct23 \
    --dont_combine_weights_via_disc \
    --scratch_dir /tmp/<REDACTED> \
    --pool 50 \
    --pad 2 \
    --particle_diameter 576 \
    --solvent_mask MaskCreate/cluster4_2_protein/mask.mrc \
    --solvent_correct_fsc \
    --j 2 \
    --gpu \

...and 2 cases of "how is this a hetjob?", like this one:

#!/bin/bash
#SBATCH --array 0-2733
#SBATCH --output dsq-tarv_block_anxiety_ICD_rank-%A_%4a-%N.out
#SBATCH --job-name dsq-tarv_block_anxiety_ICD_rank
#SBATCH --mem-per-cpu=35g --time=5:00:00 -p scavenge

/ysm-gpfs/apps/software/dSQ/1.05/dSQBatch.py --job-file /gpfs/ysm/scratch60/<REDACTED>/<REDACTED>/IDP/realdata/anxiety/tmp/tarv_block_anxiety_ICD_rank.txt --status-dir /gpfs/ysm/scratch60/<REDACTED>/<REDACTED>/IDP/realdata/anxiety/tmp

We did make one change since opening the ticket and the last blitz of lockups: we added whole_hetjob to SchedulerParameters.
This was because it appeared as if parts of the cancelled hetjobs still got started (confirmed to be running after recovering the scheduler). On our test cluster, running the same version of Slurm and (as far as possible) the same configuration, we, like you, have not yet been able to recreate the "unresponsive but running slurmctld" condition, but we can easily recreate the same cancellations and error messages. Please let me know if you need any additional information.
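For reference, the kind of submission we use on the test cluster to trigger the cancellation messages is just a two-component job whose second component asks for more than the partition can ever supply. A sketch (partition name, node count, and srun option spelling are placeholders; our real production scripts are above):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=5:00
#SBATCH hetjob
#SBATCH --partition=general
#SBATCH --nodes=9999      # more nodes than the partition can ever supply
srun --het-group=0,1 hostname
```

Submitting this fails at sbatch time with "Requested node configuration is not available" and produces the same "Cancelling aborted hetjob submit" line in slurmctld.log, without hanging the test controller.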
I'm still going through what you supplied and do not yet have a clear understanding of why this is happening.

Since you've told me that most of the het jobs are expected to have 2 components (+0 and +1, with successive job ids), I was able to pull a bit more out of the logs about some of the +1 jobs. This helped show what you wondered about in comment 13, where a couple of the "het jobs" appeared not to actually be het jobs. It appears to be due to job id reuse, since the cancelled het job ids appear to be reused by a batch job and an array job in one case (see below). This is odd and I am not sure it is expected--on my own system I can contrive the het job failure and then submit another job, but it gets the next job id (it does not reuse one of the failed ones).

So I want to investigate this some more, but I think it may be because slurmctld is getting killed hard (with SIGKILL). Here's one example (where a SIGKILL is sent about 90 sec after a SIGTERM):

>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
>Mar 21 16:51:01 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Terminate signal (SIGINT or SIGTERM) received
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal systemd: slurmctld.service stop-sigterm timed out. Killing.
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal systemd: slurmctld.service: main process exited, code=killed, status=9/KILL
> ...
>Mar 21 16:52:31 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: slurmctld version 21.08.2 started on cluster farnam

I wonder if slurmctld is not being given enough time to shut down, so systemd is killing it, which may be preventing slurmctld from writing its complete state (or sending stuff to slurmdbd). What does your slurmctld systemd unit file look like? Does it have a 90 sec timeout listed? You may want to try increasing it to let slurmctld shut down more gracefully.
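One way to raise that timeout without editing the packaged unit file is a systemd drop-in; a sketch (the directory is the standard drop-in location, the 600-second value is just an illustration):

```shell
# write_stop_timeout: create a systemd drop-in raising the stop timeout.
#   $1 = drop-in directory (normally /etc/systemd/system/slurmctld.service.d)
#   $2 = TimeoutStopSec value in seconds
write_stop_timeout() {
    mkdir -p "$1"
    printf '[Service]\nTimeoutStopSec=%s\n' "$2" > "$1/timeout.conf"
}
```

After writing the drop-in (e.g. `write_stop_timeout /etc/systemd/system/slurmctld.service.d 600`), run `systemctl daemon-reload` so systemd picks it up.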
Maybe that will at least prevent the job id reuse issue we're seeing.

The last log segment below lists a bogus user id. That is also very suspicious and I want to investigate it as well, since it shouldn't be there.

-----

Only leader listed as aborted:

>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _pick_best_nodes: JobId=23924852 never runnable in partition pi_tomography
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 14:53:45 p2r1u27.farnam.hpc.yale.internal slurmctld[5182]: Cancelling aborted hetjob submit: JobId=23924851+0(23924851)
>Mar 21 16:00:20 p2r1u27.farnam.hpc.yale.internal slurmctld[127413]: Recovered JobId=23924851+0(23924851) Assoc=2642
>Mar 21 16:00:20 p2r1u27.farnam.hpc.yale.internal slurmctld[127413]: Recovered JobId=23924852 Assoc=3953
>Mar 21 16:00:54 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: Recovered JobId=23924851+0(23924851) Assoc=2642
>Mar 21 16:00:54 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: Recovered JobId=23924852 Assoc=3953

Het job ids reused by array job elements:

>Mar 21 16:08:09 p2r1u25.farnam.hpc.yale.internal slurmctld[103651]: _slurm_rpc_submit_batch_job: JobId=23925049 InitPrio=5169 usec=1337
>
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _pick_best_nodes: JobId=23925554 never runnable in partition pi_tomography
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 16:47:07 p2r1u25.farnam.hpc.yale.internal slurmctld[140280]: Cancelling aborted hetjob submit: JobId=23925553+0(23925553)
>
>Mar 21 16:52:36 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Purged files for defunct batch JobId=23925554
>Mar 21 16:52:36 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Purged files for defunct batch JobId=23925553
>
>Mar 21 17:08:15 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_314(23925554) WEXITSTATUS 0
>Mar 21 17:08:15 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_314(23925554) done
>Mar 21 17:21:58 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_313(23925553) WEXITSTATUS 0
>Mar 21 17:21:58 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _job_complete: JobId=23925049_313(23925553) done

Only leader listed as aborted:

>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _pick_best_nodes: JobId=23926300 never runnable in partition pi_tomography
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 21 17:39:49 p2r1u25.farnam.hpc.yale.internal slurmctld[156121]: Cancelling aborted hetjob submit: JobId=23926299+0(23926299)

Het job ids reused by regular batch job and array job element:

>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _pick_best_nodes: JobId=23944094 never runnable in partition pi_tomography
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:21:07 p2r1u25.farnam.hpc.yale.internal slurmctld[211608]: Cancelling aborted hetjob submit: JobId=23944093+0(23944093)
>Mar 22 13:32:34 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Purged files for defunct batch JobId=23944094
>Mar 22 13:32:34 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Purged files for defunct batch JobId=23944093
>Mar 22 13:32:35 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_job: JobId=23944093 InitPrio=11948 usec=538
>Mar 22 13:32:38 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: sched: Allocate JobId=23922086_5281(23944094) NodeList=c22n04 #CPUs=1 Partition=scavenge
>Mar 22 13:32:40 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: sched: Allocate JobId=23944093 NodeList=bigmem02 #CPUs=10 Partition=bigmem
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) StepId=extern
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) StepId=batch
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23922086_5281(23944094) Assoc=2562
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 StepId=extern
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 StepId=batch
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944093 Assoc=4297
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) StepId=extern
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) StepId=batch
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23922086_5281(23944094) Assoc=2562
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 StepId=extern
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 StepId=batch
>Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944093 Assoc=4297

Only leader listed as aborted:

>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _pick_best_nodes: JobId=23944153 never runnable in partition scavenge_gpu
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: _slurm_rpc_submit_batch_het_job: Requested node configuration is not available
>Mar 22 13:33:18 p2r1u25.farnam.hpc.yale.internal slurmctld[97808]: Cancelling aborted hetjob submit: JobId=23944152+0(23944152)
>Mar 22 13:39:54 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944152+0(23944152) Assoc=2642
>Mar 22 13:39:54
p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Recovered JobId=23944153 Assoc=2642 Only leader listed as aborted (but then 2 components recovered--suspicious user id == NO_VAL): >Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Hetjob JobId=23944167 exceeded association/QOS limit for user 4294967294 >Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: _slurm_rpc_submit_batch_het_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits) >Mar 22 13:40:57 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: Cancelling aborted hetjob submit: JobId=23944167+0(23944167) >Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944167+0(23944167) Assoc=2642 >Mar 22 13:50:13 p2r1u25.farnam.hpc.yale.internal slurmctld[104885]: Recovered JobId=23944167+1(23944168) Assoc=2642
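Aside: the "user 4294967294" in that last segment is consistent with the NO_VAL note in the log heading above. 4294967294 is 0xfffffffe, i.e. -2 wrapped into an unsigned 32-bit integer, which Slurm uses as its NO_VAL sentinel, so the uid in the log line is an unset value rather than a real user. A quick shell sketch of the arithmetic (nothing Slurm-specific assumed):

```shell
# 4294967294 == 0xfffffffe == (uint32_t)(-2): an "unset" sentinel value,
# not a real uid, so the accounting records can still carry the real user.
printf '%u\n' $(( 0xfffffffe ))   # prints 4294967294
printf '0x%x\n' 4294967294        # prints 0xfffffffe
```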
(In reply to Chad Vizino from comment #14)
> I wonder if slurmctld is not being given enough time to shut down so systemd
> is killing it which may be preventing slurmctld from writing its complete
> state (or sending stuff to slurmdbd). What does your slurmctld systemd unit
> file look like? Does it have a 90 sec timeout listed? You may want to try
> increasing it to let slurmctld more gracefully shut down. Maybe that will at
> least prevent the job id reuse issue we're seeing.

I ran a test on my system: after the contrived failure of a het job, I sent a SIGKILL to slurmctld, restarted it, and can see job ids being reused, so I think that explains the reuse problem. What does your slurmctld unit file look like?

Some more questions for you:
- What does "sacct -a -j <jid>" show for some of the failed het jobs?
- When did this problem start happening? Have you upgraded recently, or has the het job user just started submitting these sorts of jobs?
- Can you supply your slurmdbd.log from the 21st and/or 22nd?
- Can you supply your prolog script?
- Do you have an understanding about the "Invalid job ids" I wondered about in comment 10? Do you know what is trying to kill those every 15 sec?
- Are you using triggers ("strigger --get" will show any you have)?
- Could you supply your syslog from the 21st and/or 22nd?
Hi Chad,

I'll upload the other logs/files that you've requested in other comments.

A) For the service script, the timeout value we use is the default:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/opt/slurm/current/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=/opt/slurm/current/etc/environment
ExecStart=/opt/slurm/current/sbin/slurmctld -f /opt/slurm/current/etc/slurm.conf
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmctld.pid
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

I'd be very worried if 90 seconds wasn't a reasonable amount of time for slurmctld to shut down on this particular platform (Intel(R) Xeon(R) Gold 5222 CPU @ 3.80GHz, 192-GB DDR4, 80Gbps connection to the service network, database resides on local nvme disk).

B) That is strange about the bogus user id. For JobId=23944167, slurmctld may have reported it as user 4294967294, but slurmdbd correctly recorded it as uid 12914 in the job accounting database.

C) Excellent work determining why we were seeing this problem for non-het jobs. It seems like job ids should never be reused, and to do so could cause many problems. Is there a way to determine if the other hetjobs (real ones or otherwise) shown as canceled in the log files had also been cases of reused job ids?

D) Last Saturday at 6:07 PM is the first time that we were impacted by this problem. According to our log files, hetjobs had been cancelled a few times before last Saturday; in those cases, however, slurmctld appears to have continued normally. We upgraded from Slurm 21.08.1 to 21.08.2 on October 7th, 2021. We plan to upgrade to 21.08.6 in a few weeks.

E) I don't have exact metrics, but users have been submitting hetjobs since at least July of last year.

F) The invalid job ID messages could relate to another, less urgent problem.

G) We are not using triggers.

Thanks,
Adam
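For reference, if we did decide to raise the stop timeout as an interim measure, the change would be a small systemd drop-in along these lines (the path and the 300-second value are illustrative assumptions, not a tested recommendation):

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/timeout.conf
[Service]
# Give slurmctld longer than the 90 s systemd default before SIGKILL is
# sent, so a slow shutdown can still finish writing state.
TimeoutStopSec=300
```

After adding a drop-in like this, `systemctl daemon-reload` is needed for systemd to pick it up.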
Created attachment 24041 [details]
sacct -a -j <jid> for failed hets followed by slurmctld unresponsiveness

As requested by Chad.
*** prolog script ***

#!/bin/bash

# for $XDG_RUNTIME_DIR osTicket #010500
mkdir -p "/run/user/${SLURM_JOB_UID}"
chmod 0700 "/run/user/${SLURM_JOB_UID}"
chown "$SLURM_JOB_UID" "/run/user/${SLURM_JOB_UID}"

# Save job scripts in case support needs them later
scriptdir="/gpfs/ysm/slurm/job_scripts/${HOSTNAME}/$(date "+%Y%m%d")"
if [ ! -e "${scriptdir}/${SLURM_JOB_ID}" ]; then
    mkdir -p "$scriptdir"
    scontrol write batch_script "$SLURM_JOB_ID" "${scriptdir}/${SLURM_JOB_ID}" > /dev/null
    scontrol show JobId="${SLURM_JOB_ID}" > "${scriptdir}/${SLURM_JOB_ID}.info"
fi

/usr/bin/perl /etc/slurm/slurm.prolog.supremm

exit 0

*** /etc/slurm/slurm.prolog.supremm ***

#!/usr/bin/perl
use strict;
use Sys::Hostname;
use Date::Simple;
use POSIX qw(strftime setsid);

my $today = Date::Simple->new;
my $logyear = $today->format("%Y");
my $logmonth = $today->format("%m");
my $logday = $today->format("%d");
my $jobdatelong = strftime "%Y%m%d.%H.%M.%S", localtime;
my $host = ( split('\.', hostname()) )[0];
my $jobid = $ENV{SLURM_JOB_ID};
my $fullhost = hostname();

$SIG{CHLD} = 'IGNORE';                        # Configure to autoreap zombies
exit(0) unless defined ( my $child = fork );  # fork, or just exit if failed
exit(0) if $child;                            # Main prolog exits with success
setsid();                                     # Become session leader
open( STDIN, "</dev/null" );                  # Detach STDIN from shell
open( STDOUT, ">/dev/null" );                 # Detach STDOUT from shell
open( STDERR, ">&STDOUT" );                   # Detach STDERR from shell
chdir '/tmp';                                 # Change working directory
umask(0);                                     # Reset umask

my $logpath = "/gpfs/ysm/SUPReMM_farnam/pmlogger";
my @subdirs = ("$logyear", "$logmonth", "$fullhost", "$logyear-$logmonth-$logday");
foreach ( @subdirs ) {
    $logpath = $logpath."/".$_;
    unless ( -d $logpath ) {
        `mkdir -p $logpath;chown pcp:pcp $logpath;`;
    }
}
my $logdir = $logpath;

system("env PMLOGGER_PROLOG=yes pmlogger -U pcp -c /etc/pcp/pmlogger/pmlogger-supremm.config -s 420M -l /tmp/job-$jobid-begin-$jobdatelong.log $logdir/job-$jobid-begin-$jobdatelong &> /dev/null");

exit(0)
Created attachment 24042 [details]
slurmctld log for March 22nd (gzipped)

As requested by Chad. Please note that the March 21st log has already been uploaded (attachment 24007 [details]).
Created attachment 24043 [details]
scheduler syslog covering March 21-22nd (gzipped)

As requested by Chad. Please note that this also includes all of the log data from slurmctld over the same time period.
(In reply to Adam from comment #17)
> A) For the service script the timeout value we use is the default:
>
> I'd be very worried if 90 seconds wasn't a reasonable enough amount of time
> for slurmctld to shut down on this particular platform (Intel(R) Xeon(R)
> Gold 5222 CPU @ 3.80GHz, 192-GB DDR4, 80Gbps connect to service network,
> database resides on local nvme disk).

Indeed, but a SIGKILL on timeout looks like it is preventing slurmctld from saving its state, which is probably leading to other problems like job id reuse, since it doesn't record the last job id assigned. I don't see increasing the timeout as a permanent fix, but it may help until we get a solution, since you need to restart slurmctld so many times.

I'd be interested to know if this works the next time you must restart slurmctld: manually SIGTERM it and wait for it to gracefully exit (if this can even happen; I want to see how long it takes). We know from your past logs that this will take more than 90 sec. You should be able to periodically run "sdiag" while waiting and watch the server thread count go down. With logging at the info level, as you have it set, slurmctld.log should show these lines on a normal shutdown (the key is seeing the saving state message):

>Terminate signal (SIGINT or SIGTERM) received
>Saving all slurm state

You might look for the "saving state" message in a previous log from a normal shutdown to see how long it took between the SIGTERM and the saving state (and any other stray messages before actually going down). This would be an interesting baseline metric.

It might also be interesting to see how many normal and how many hard (SIGKILL) shutdowns have happened both before and after the het job problem started. Could you check into this?

> C) Excellent work determining why we were seeing this problem for non-het
> jobs. It seems like job ids should never be reused and to do so could cause
> many problems. Is there a way to determine if the other hetjobs (real ones
> or otherwise) shown as canceled in the log files had also been cases of
> reused jobIDs?

Right--job ids should not be reused. Avoiding hard shutdowns should hopefully eliminate the reuse.

> D) Last Saturday at 6:07-PM is the first time that we were impacted by this
> problem. According to our log files hetjobs had gotten cancelled a few times
> before last Saturday, ..however in those cases slurmctld appears to have
> continued normally. We upgraded from slurm 21.08.1 to 21.08.2 on October
> 7th, 2021. We plan to upgrade to 21.08.6 in a few weeks.

Very interesting. So it would seem that something about the job is triggering the problem, since your system software versioning seems to be stable.

Thanks for the other answers to my questions--very helpful. I want to spend some time looking at your prolog and make some suggestions for a couple of changes that might help. More on this later.
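The "periodically run sdiag" step above can be scripted; here is a minimal sketch that extracts the server thread count from sdiag-style output. The canned sample below stands in for a live `sdiag` call (on a real controller the loop body would be `sdiag | awk ...` on a sleep interval), and its exact formatting should be checked against your installed version:

```shell
# Stand-in for a live "sdiag" call so the sketch is self-contained.
sample_sdiag() {
    cat <<'EOF'
*******************************************************
sdiag output at Tue Mar 22 13:40:00 2022 (1647970800)
Server thread count: 23
Agent queue size:    0
EOF
}

# Pull the thread count; during a graceful shutdown this number should
# trend toward zero on successive polls.
threads=$(sample_sdiag | awk -F': *' '/Server thread count/ {print $2}')
echo "server threads: $threads"
```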
(In reply to Adam from comment #19)
> *** prolog script ***
>
> # Save job scripts in case support needs them later
> scriptdir="/gpfs/ysm/slurm/job_scripts/${HOSTNAME}/$(date "+%Y%m%d")"
> if [ ! -e "${scriptdir}/${SLURM_JOB_ID}" ]; then
>     mkdir -p "$scriptdir"
>     scontrol write batch_script "$SLURM_JOB_ID" "${scriptdir}/${SLURM_JOB_ID}" > /dev/null
>     scontrol show JobId="${SLURM_JOB_ID}" > "${scriptdir}/${SLURM_JOB_ID}.info"
> fi

We recommend not running scontrol in prologs since it can bog down slurmctld with RPCs (see https://slurm.schedmd.com/prolog_epilog.html).

For batch script storage, the script can be written to the database (see bug 7609). Is there some reason why you are not using this method? It's commented out in your slurm.conf:

>#AccountingStoreFlags=job_script,job_env

More info is here: https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStoreFlags.

For the scontrol show, what is it that you need from it? Perhaps we can suggest an alternative.
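For reference, the database-storage alternative boils down to uncommenting one slurm.conf line; a sketch:

```
# slurm.conf: store batch scripts and job environments in slurmdbd
AccountingStoreFlags=job_script,job_env
```

Once enabled (and the daemons restarted), a stored script can be retrieved with `sacct -j <jid> --batch-script` and the environment with `sacct -j <jid> --env-vars`; those sacct flags are as documented for 21.08 and worth verifying against the installed version.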
(In reply to Adam from comment #17)
> F) The invalid job ID messages could relate to another less urgent problem.

Maybe, but I would like to understand them. There are over 5600 of each of the lines below in the Mar 22 log (one every 15 sec). Can you look up more info from past logs, squeue and sacct about them?

>Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18748205
>
>Mar 22 13:40:09 p2r1u25.farnam.hpc.yale.internal slurmctld[100548]: prolog_complete: invalid JobId=18756999
(In reply to Chad Vizino from comment #24)
> (In reply to Adam from comment #17)
> > F) The invalid job ID messages could relate to another less urgent problem.
> Maybe but I would like to understand them. There are over 5600 of each of
> the lines below in the Mar 22 log (every 15sec). Can you look up more info
> from past logs, squeue and sacct about them?

Also check for records of these jobs in your StateSaveLocation (from slurm.conf) under the hash* directories. There may be a stray directory like job.<jid> for each.
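A minimal sketch of that check, assuming a StateSaveLocation layout of hash.<N>/job.<jid> directories. The directory below is a throwaway demo path with a fabricated layout so the sketch is self-contained; on a real controller, substitute the StateSaveLocation value from slurm.conf and skip the mkdir step:

```shell
# Throwaway demo directory standing in for the real StateSaveLocation.
STATESAVE=$(mktemp -d)

# Fabricate the layout for the demo; on a live system the hash.* trees
# and any stray job.<jid> entries already exist.
mkdir -p "$STATESAVE/hash.5/job.18748205" "$STATESAVE/hash.9/job.18756999"

# List any stray job directories left behind for the "invalid" job ids.
find "$STATESAVE" -maxdepth 2 -type d -name 'job.*' | sort
```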
Hi Chad, I'd just like to reiterate that our primary and overwhelming concern in this case is that user space activity is capable of zombifying slurmctld, ..and that we seem to have no way to prevent this. Thank you, Adam
We'd like you to remove the scontrol lines from your prolog and see if that improves things (see comment 23). Also, if you are able to grab another sdiag and backtrace when slurmctld is "hung" that would be good (like in comment 5). You may be able to put it into that condition by doing an "scontrol reconfigure" which rereads state files.
Hello,

I'm afraid I'm a bit confused about the focus on the 5600 messages. They appear to have come about as a result of this problem, so it's really not clear to me how they could be related to its cause. If you would like more information about them, would you please be more specific about what it is that we're looking for? I'll check the hash directories for any job.# folders whose job IDs are not in the queue. We would like to know how and why they could be (or are) related to the problem.

In spite of the RPC hit, we will not be removing the scontrol lines from our prolog scripts unless you can explain why and/or how use of scontrol in this manner is causing the problem that we only see after canceled hetjobs. The information collected from recording job scripts enables us to provide better support for the researchers, and we had problems with adding job scripts to the database as you suggested.

If the system locks up again we can grab another sdiag and debugger dumps, but we're not going to intentionally try to cause the scheduler to lock up while it's in production use. The best option is to recreate the problem in our test environment so that we can test any potential solution without negatively impacting any ongoing research work at Yale.

Thank you,
Adam
Hi Adam - Chad brought this issue to my attention. I can understand your confusion here. Let me first clarify a few points.

We discourage scontrol commands in prologs/epilogs, as Chad mentioned in comment #23. Let me elaborate on why. Running "scontrol show" inside a prolog or epilog can trigger a deadlock, even if it "usually works". It can seem harmless, since 'scontrol show' only needs read locks, but if another internal thread is pending on a write lock, you can run into a race condition and deadlock. We know this is not ideal and would require modifying your workflow to avoid such a situation. That said, once we have some way to reproduce your specific lockup, we can confirm for certain whether scontrol is to blame. Until then, we can only assume this is the cause, given previous issues we have seen from other sites with a similar workflow.

With regard to the "5600 jobs" and comment #22: it is highly abnormal to kill slurmctld, and a shutdown lasting as long as yours does concerns us. Here is why: if slurmctld is killed before it can fully write out its state save, you can run into problems where the jobs, job IDs, and overall state of the cluster do not match what is found on the nodes and controller. Slurmctld could end up canceling jobs on nodes because it was not "aware" of state that was not properly saved on shutdown. Though not directly related to the hetjob lockup, it is still concerning. I recommend you split that issue out into another bug when you are ready for support to look into it for your site.
Thank you for clarifying, Jason: we're going to try enabling job script saving in the database again during our next maintenance (within 2 weeks from now), and we will take the scontrols out of our prologs at that time.

We understand and share your concerns about hard-stopping slurmctld, which is part of why we consider this problem to be serious. Normally it takes the scheduler 1-3 seconds to cleanly shut down, so if it hasn't done so after 90 seconds I would be inclined to agree with systemd that it never will. We've not had this problem before, and only now have it in conjunction with the hetjob cancel messages: perhaps it is part of the same problem?

We plan to upgrade to 21.08.6 at the time of our maintenance: it would be ideal if we could recreate the problem in a test environment and either verify a fix or confirm that 21.08.6 isn't impacted before then.

Thank you,
Adam
(In reply to Adam from comment #31)
> Thank you for clarifying Jason: we're going to try enabling job script
> saving in the database again during our next maintenance (within 2 weeks
> from now) and we will take the scontrols out of our prologs at that time.
>
> We plan to upgrade to 21.08.6 at the time of our maintenance: it would be
> ideal if we could recreate the problem in a test environment and either
> verify a fix or confirm that 21.08.6 isn't impacted before then.

Hi Adam. Reproducing the het job problem in a test environment would be ideal--agreed--and the shutdown delay may be related if locking problems are involved. However, since we (and you on your test system) have not been able to reproduce the problem so far, an answer to whether or not the suggested steps will help may need to wait until you are able to implement them on your production system.

In the meantime we would like to downgrade this ticket to level 3. If, after the changes, things are still problematic we can upgrade again. Are you ok with this?
Yes, that's fine and I have no objection. Our user outreach has mitigated the problem and we otherwise do not seem to have any way to progress without being able to recreate the problem in a test environment (or by risking a production service outage). Thank you, Adam
(In reply to Adam from comment #31)
> Thank you for clarifying Jason: we're going to try enabling job script
> saving in the database again during our next maintenance (within 2 weeks
> from now) and we will take the scontrols out of our prologs at that time.
>
> We plan to upgrade to 21.08.6 at the time of our maintenance: it would be
> ideal if we could recreate the problem in a test environment and either
> verify a fix or confirm that 21.08.6 isn't impacted before then.

Hi. I just wanted to check in and see how things are going for you. Were you able to make the changes you mentioned and did they help?
Hello,

We were not able to recreate the problem on the eve of our downtime, and so we were not able to verify whether any of the changes that we made helped. The assumptions are that the bug has not been fixed, that it is load related, and that it will eventually be encountered again.

Thanks,
Adam
(In reply to Adam from comment #35)
> We were not able to recreate the problem at the eve of our downtime and so
> we were not able to verify that any of the changes that we made helped or
> not. The assumptions are that the bug has not been fixed, that it is load
> related, and that it will eventually be encountered again.

That is a valid assumption, and I'm sorry to hear that the outcome was not more clear. May we close this ticket for now (with "can't reproduce") with the understanding that you can reopen at any time in the future when it occurs again?
Have you exhausted exploring and communicating every possible way that this problem could have occurred? You mentioned the usage of scontrol and failed shutdowns of the scheduler service potentially confusing hetjob numbering (and while we have our doubts, we have made some adjustments on our side based on this): is there anything else that aligns with what we have described?

I think if we have the best possible understanding of how this could have happened (however difficult to recreate), we can at least know what to watch out for.

Thanks,
Adam
(In reply to Adam from comment #37)
> Have you exhausted exploring and communicating every possible way that this
> problem could have occurred?

Without a way to reproduce the issue, finding the problem is significantly more difficult in some cases. We have a number of sites that use het jobs and have not run into the problem you have reported here, so we have some reason to believe this may be site-specific.

> You mentioned the usage of scontrol and failed
> shutdowns of the scheduler service potentially confusing hetjob numbering
> (and while we have our doubts we have made some adjustments on our side
> based on this): is there anything else (and aligns with what we have
> described)?

We are not convinced this has anything to do with the initial problem you reported. We were concerned that there may be some type of issue with the state files, but that can be investigated in another bug report if you'd like to file one for it.

> I think if we have the best possible understanding of how this could have
> happened (however difficult to recreate) we can at least know what to watch
> out for.

What we do have are a set of knowns regarding your setup and the potential for "scheduler service lockup" as mentioned in comment 30. It is possible that your cluster operates normally until the perfect set of circumstances happens. This could be an increase of jobs submitted and starting in close succession. It's good that you were able to make the changes to your system mentioned in comment 31, and hopefully they will help. It will be best at this time to wait and see. If the problem happens again, then we'll have more information to go on (we'll at least know a couple of things that the problem is not).
Hi. I'm going to close this issue for now. However, feel free to reopen if you encounter this problem again and we'll pursue it.