Ticket 22951

Summary: slurmdbd crashed with fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmdbd    Assignee: Oscar Hernández <oscar.hernandez>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: oscar.hernandez
Version: 24.11.5   
Hardware: Linux   
OS: Linux   
Site: DTU Physics
Version Fixed: 25.05.1, 25.11.0rc1

Description Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 02:20:16 MDT
We had a slurmdbd incident today, the first of its kind, with an error in slurmdbd.log:

[2025-06-09T08:15:02.075] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

Interestingly, on the slurmctld server I discovered a RunawayJob, also a first of its kind, for a job which I had cancelled the day before:

$ /usr/bin/sacctmgr show RunawayJobs
NOTE: Runaway jobs are jobs that don't exist in the controller but have a start time and no end time in the database
ID                 Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
8739696      bader_ser+     xeon56   niflheim  CANCELLED 2025-06-08T11:04:29 2025-06-08T11:04:29             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the earliest submit time of all the runaway jobs.)

Strangely, on the slurmctld server there was an sacctmgr process using 100% CPU that had been running for some hours and had unfortunately filled up the / filesystem to 100% :-(  We run this crontab job every day at 07:30: "/usr/bin/sacctmgr show RunawayJobs", and I wonder if this could cause sacctmgr to enter an infinite loop because there is no input?

Question: Can you suggest how we may prevent this issue from reappearing?

Thanks a lot,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 02:20:29 MDT
Additional information:

The database log mariadb.log:

2025-06-09  8:15:02 2532 [Warning] Aborted connection 2532 to db: 'slurm_acct_db' user: 'slurm' host: 'localhost' (Got an error reading communication packets)
2025-06-09  8:15:02 2524 [Warning] Aborted connection 2524 to db: 'slurm_acct_db' user: 'slurm' host: 'localhost' (Got an error reading communication packets)
2025-06-09  8:15:02 11 [Warning] Aborted connection 11 to db: 'slurm_acct_db' user: 'slurm' host: 'localhost' (Got an error reading communication packets)
2025-06-09  8:15:02 9 [Warning] Aborted connection 9 to db: 'slurm_acct_db' user: 'slurm' host: 'localhost' (Got an error reading communication packets)

The syslog reported the same:

Jun  9 08:15:02 slurmdb2 slurmdbd[1996]: slurmdbd: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Jun  9 08:15:02 slurmdb2 systemd[1]: slurmdbd.service: Main process exited, code=exited, status=1/FAILURE
Jun  9 08:15:02 slurmdb2 systemd[1]: slurmdbd.service: Failed with result 'exit-code'.

I manually restarted the slurmdbd service, which has subsequently been running without errors.  I noted a similar but not identical issue in bug 20838.

The slurmdbd server is a dedicated system with fast hardware and 256 GB RAM running AlmaLinux 8.10 with the in-distro database mariadb-server-10.3.39-1.module_el8.8.0+3609+204d4ab0.x86_64 rpm.  We have a single slurmctld on another dedicated server.

We use InnoDB with these settings in /etc/my.cnf.d/innodb.cnf:

[mysqld]
innodb_buffer_pool_size=32768M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

We have used these database purge parameters for many years without issues:

PurgeEventAfter=180days
PurgeResvAfter=60days
PurgeStepAfter=120days
PurgeSuspendAfter=30days
PurgeTXNAfter=180days

The daily database dump completed at 03:42, about 4 hours before the crash, so it should have no influence here.

The error may be generated at this line in the source: 
https://github.com/SchedMD/slurm/blob/c38fdb92550fd8f7cccd577a2f3af8fe87b44c9d/src/database/mysql_common.c#L185
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 03:05:37 MDT
Note added: The cron daemon on the slurmctld server has sent a warning message:

Subject: Cron <root@que2> /usr/bin/sacctmgr show RunawayJobs | tail -n +2

*** Mail of at least 59593572352 bytes could not be accepted
*** at que2.fysik.dtu.dk due to lack of disk space for temp file.
*** Currently, 58196840 kilobytes are available for mail temp files.

This strongly points to sacctmgr entering an infinite loop when it receives no input from the crontab job that launched it, thereby filling up the system disk on the slurmctld server.

Can you confirm this behavior of sacctmgr?

If this is correct, I think sacctmgr needs a fix that checks for a non-existent stdin and takes appropriate action.  This might mean assuming either "yes" or "no" as the default answer?  Avoiding the infinite loop that generates output seems to be critical.
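As a rough sketch of what I have in mind (my own illustration, not actual Slurm code; the helper name and the isatty() check are just one way this could be done), the prompt could decline to ask at all when stdin is not a terminal and assume "no", leaving the database untouched when the command runs unattended:

/* sketch only: guard a yes/no prompt when stdin is not interactive */
#include <stdio.h>
#include <unistd.h>

static int ask_yes_no(const char *question)
{
	if (!isatty(STDIN_FILENO)) {
		/* cron, pipes, /dev/null: don't prompt, default to 'N' */
		fprintf(stderr, "%s: stdin is not a terminal, assuming 'N'\n",
			question);
		return 0;
	}
	printf("%s (N/y): ", question);
	int c = getchar();
	return (c == 'Y' || c == 'y');
}

int main(void)
{
	if (ask_yes_no("Would you like to fix these runaway jobs?"))
		puts("committing changes");
	else
		puts("discarding changes");
	return 0;
}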
Comment 3 Oscar Hernández 2025-06-09 05:11:47 MDT
Hi Ole,

Will take a look at this.

Would you be able to share the slurmdbd.log starting at midnight for that same day?

>Strangely, on the slurmctld server there was an sacctmgr process using 100% CPU 
>and running for some hours, and which had unfortunately filled up the / 
>filesystem to 100% :-(  We do run this crontab job every day at 07:30: 
>"/usr/bin/sacctmgr show RunawayJobs", and I wonder if this could cause sacctmgr 
>to enter an infinite loop because there is no input ?
To briefly describe the "show runaway" behavior: it queries the DB for running jobs and compares that list with the running jobs registered in the controller at that time.

Only when runaway jobs are present does it then proceed with the "fix" in the background, afterwards asking you to confirm the fix with the message you mentioned. Answering yes/no will commit/discard the changes.

The fact that it was using 100% CPU for such a long time is unexpected.

>This strongly points to sacctmgr entering an infinite loop when it receives no 
>input from the crontab job that launched it, thereby filling up the system disk 
>on the slurmctld server.
The "Would you like to fix these runaway jobs?" message is issued by our "commit_check()" function. This function should automatically terminate 30s after showing the message if no input is received. 

However, the fact that it filled the FS makes me suspect that cron could be feeding some kind of invalid input to commit_check(). I will check this commit check in detail, because there is a chance it could, as you mentioned, be stuck processing invalid input, generating endless output like this (for example, feeding the prompt "b" instead of N/Y):

> (You have 30 seconds to decide)
>(N/y): b
>Y or N please
>(N/y): b
>Y or N please
>(N/y): b
This would fill the FS and, additionally, would keep the changes in the DB pending commit/discard for a long time, which could explain why, some minutes later, commands timed out trying to push changes to the DB.

>[2025-06-09T08:15:02.075] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. 
>The only way to fix this is restart the calling program
Will keep you updated.

Kind regards,
Oscar
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 05:23:07 MDT
Hi Oscar,

Thanks a lot for your fast response!

(In reply to Oscar Hernández from comment #3)
> Hi Ole,
> 
> Will take a look at this.
> 
> Would you be able to share the slurmdbd.log starting at midnight for that
> same day?

This is today's log:

[2025-06-09T00:00:07.083] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_event_table" where time_end <= 1733871599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:07.676] accounting_storage/as_mysql: _archive_purge_table: DB_ARCHIVE: Purging niflheim_event_table before 1733871599
[2025-06-09T00:00:07.676] accounting_storage/as_mysql: _purge_mark: DB_ARCHIVE: 0(as_mysql_archive.c:5024) update
update "niflheim_event_table" set deleted = 1 where time_end <= 1733871599 && time_end != 0 order by time_end asc LIMIT 50000
[2025-06-09T00:00:07.690] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_event_table" where time_end <= 1733871599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:07.691] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_suspend_table" where time_end <= 1746827999 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:07.691] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_step_table" where time_end <= 1739055599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:07.691] accounting_storage/as_mysql: _archive_purge_table: DB_ARCHIVE: Purging niflheim_step_table before 1739055599
[2025-06-09T00:00:07.691] accounting_storage/as_mysql: _purge_mark: DB_ARCHIVE: 0(as_mysql_archive.c:5024) update
update "niflheim_step_table" set deleted = 1 where time_end <= 1739055599 && time_end != 0 order by time_end asc LIMIT 50000
[2025-06-09T00:00:08.978] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_step_table" where time_end <= 1739055599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:08.978] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_resv_table" where time_end <= 1744235999 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:08.978] accounting_storage/as_mysql: _archive_purge_table: DB_ARCHIVE: Purging niflheim_resv_table before 1744235999
[2025-06-09T00:00:08.978] accounting_storage/as_mysql: _purge_mark: DB_ARCHIVE: 0(as_mysql_archive.c:5024) update
update "niflheim_resv_table" set deleted = 1 where time_end <= 1744235999 && time_end != 0 order by time_end asc LIMIT 50000
[2025-06-09T00:00:08.979] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_resv_table" where time_end <= 1744235999 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T00:00:08.979] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select timestamp from "txn_table" where timestamp <= 1733871599 && cluster='niflheim' order by timestamp asc LIMIT 1
[2025-06-09T00:00:08.982] accounting_storage/as_mysql: _archive_purge_table: DB_ARCHIVE: Purging niflheim_txn_table before 1733871599
[2025-06-09T00:00:08.982] accounting_storage/as_mysql: _purge_mark: DB_ARCHIVE: 0(as_mysql_archive.c:5024) update
update "txn_table" set deleted = 1 where timestamp <= 1733871599 && cluster='niflheim' order by timestamp asc LIMIT 50000
[2025-06-09T00:00:08.988] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select timestamp from "txn_table" where timestamp <= 1733871599 && cluster='niflheim' order by timestamp asc LIMIT 1
[2025-06-09T00:00:08.991] Warning: Note very large processing time from daily_rollup for niflheim: usec=7656630 began=00:00:01.334
[2025-06-09T08:15:02.075] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
[2025-06-09T09:06:04.782] error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
[2025-06-09T09:06:04.784] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.39-MariaDB
[2025-06-09T09:06:04.802] accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
[2025-06-09T09:06:05.115] slurmdbd version 24.11.5 started
[2025-06-09T11:00:16.199] Warning: Note very large processing time from hourly_rollup for niflheim: usec=15256732 began=11:00:00.942
[2025-06-09T11:00:21.928] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_event_table" where time_end <= 1733871599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T11:00:22.525] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_suspend_table" where time_end <= 1746827999 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T11:00:22.525] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_step_table" where time_end <= 1739055599 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T11:00:22.525] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select time_end from "niflheim_resv_table" where time_end <= 1744235999 && time_end != 0 order by time_end asc LIMIT 1
[2025-06-09T11:00:22.525] accounting_storage/as_mysql: _get_oldest_record: DB_ARCHIVE: 0(as_mysql_archive.c:5202) query
select timestamp from "txn_table" where timestamp <= 1733871599 && cluster='niflheim' order by timestamp asc LIMIT 1
[2025-06-09T11:00:22.528] Warning: Note very large processing time from daily_rollup for niflheim: usec=6328824 began=11:00:16.199


Best regards,
Ole
Comment 5 Oscar Hernández 2025-06-09 05:55:28 MDT
Thanks for the logs. Nothing unexpected there. The first relevant error is the "ER_LOCK_WAIT_TIMEOUT" message.

I have been running some tests. I have seen that the 30s timer gets restarted every time input other than "Y", "N" or "\n" is provided.

>	while(c != 'Y' && c != 'y'
>	      && c != 'N' && c != 'n'
>	      && c != '\n') {
>		if (c) {
>			printf("Y or N please\n");
>		}
I believe what happened is that the command was somehow being fed invalid input in a tight loop, leading to the 100% CPU usage and the long log, which could have kept growing indefinitely.
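For reference, here is a standalone reduction of that pattern (my own reconstruction for illustration; I am assuming the prompt reads characters with getchar(), and the real commit_check() also has the 30s timer around it). With stdin coming from a closed pipe or /dev/null, as under cron, getchar() returns EOF immediately on every call, EOF never matches Y/N/newline, and the loop prints "Y or N please" without ever blocking:

/* hypothetical reduction, not the actual sacctmgr code
 * try: ./reduction < /dev/null */
#include <stdio.h>

int main(void)
{
	int c = 0;
	while (c != 'Y' && c != 'y' &&
	       c != 'N' && c != 'n' && c != '\n') {
		if (c)
			printf("Y or N please\n");
		c = getchar();	/* returns EOF (-1) forever once stdin is exhausted */
	}
	return 0;
}

Run that way it spins at 100% CPU and produces unbounded output, which matches what you observed.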

To confirm this: do you by any chance still have part of the long log that filled the FS?

I will see what I can do to prevent that from happening again.

Kind regards,
Oscar
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 06:42:27 MDT
Hi Oscar,

(In reply to Oscar Hernández from comment #5)
> Thanks for the logs. Nothing unexpected there. The first relevant error is
> the "ER_LOCK_WAIT_TIMEOUT" message.
> 
> I have been running some tests. I have seen that the 30s timer gets
> restarted every time some input that is not "Y","N" or "/n" is provided. 
> 
> >	while(c != 'Y' && c != 'y'
> >	      && c != 'N' && c != 'n'
> >	      && c != '\n') {
> >		if (c) {
> >			printf("Y or N please\n");
> >		}
> I believe that what happened, is that somehow the command was providing
> invalid input in a tight loop. Leading to the 100% cpu usage and the long
> log generated, which could have been running infinitely.
> 
> To confirm? By any chance do you have a part of the long log generated that
> filled the FS?

Unfortunately I don't have the Sendmail spool file from /var/log/mqueue/ any longer, because I was desperately trying to solve the FS full condition and deleted the spool file.  I didn't think to look inside the file.  Then I killed the sacctmgr process and the disk space was fortunately released.

> I will see what can I do to prevent that form happening again.

The stdin file descriptor of sacctmgr might be checked to see whether it's a terminal or some weird fd provided by the cron daemon that doesn't behave as sacctmgr expects.  It's definitely worth fixing the behavior seen in the present case.

Thanks a lot,
Ole
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2025-06-09 06:58:23 MDT
I'm wondering also about the root cause of the "runaway" jobid 8739696 reported by sacctmgr?  In the database I see this job information:

$  sacct -j 8739696 -o "jobid,Submit,Eligible,Start,elapsed,End,CPUTime,NNodes,NCPUS,ExitCode,nodelist,State,ConsumedEnergyRaw,AllocTRES"
JobID                     Submit            Eligible               Start    Elapsed                 End    CPUTime   NNodes      NCPUS ExitCode        NodeList      State ConsumedEnergyRaw  AllocTRES
------------ ------------------- ------------------- ------------------- ---------- ------------------- ---------- -------- ---------- -------- --------------- ---------- ----------------- ----------
8739696      2025-06-08T11:04:29             Unknown 2025-06-08T11:04:29   00:00:00 2025-06-08T11:04:29   00:00:00        1          0      0:0   None assigned  COMPLETED                 0

and in the slurmctld log the job is reported in a series of jobs by the user with a dependency problem:

[2025-06-08T11:03:36.807] _slurm_rpc_submit_batch_job: JobId=8739694 InitPrio=263488 usec=593
[2025-06-08T11:03:37.030] sched: Allocate JobId=8739694 NodeList=d048 #CPUs=56 Partition=xeon56
[2025-06-08T11:04:08.672] _slurm_rpc_submit_batch_job: JobId=8739695 InitPrio=263488 usec=561
[2025-06-08T11:04:29.419] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=8739696
[2025-06-08T11:04:29.419] _slurm_rpc_submit_batch_job: JobId=8739696 InitPrio=4294967294 usec=830
[2025-06-08T11:04:55.576] _slurm_rpc_submit_batch_job: JobId=8739697 InitPrio=263488 usec=642

I don't understand why this would lead to a runaway job 8739696?

Thanks,
Ole
Comment 8 Oscar Hernández 2025-06-09 10:19:31 MDT
Thanks for providing those extra logs. I will see if I can consistently recreate that runaway job following a similar use case.

>ID                 Name  Partition    Cluster      State          TimeSubmit           TimeStart             TimeEnd
>------------ ---------- ---------- ---------- ---------- ------------------- ------------------- -------------------
>8739696      bader_ser+     xeon56   niflheim  CANCELLED 2025-06-08T11:04:29 2025-06-08T11:04:29             Unknown
Typically, runaway jobs are in RUNNING state in the DB, since the job cancellation never reached the DB. But your case seems a bit different, since this job was cancelled before it even started. Maybe there is some flaw in this codepath that can be fixed. Will make sure to review that.

Cheers,
Oscar
Comment 9 Oscar Hernández 2025-06-10 05:04:11 MDT
Hi Ole,

>[2025-06-08T11:04:29.419] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=8739696
>[2025-06-08T11:04:29.419] _slurm_rpc_submit_batch_job: JobId=8739696 InitPrio=4294967294 usec=830
In my testing, I have not been able to reproduce this exact sequence. If I try to submit a job without a valid dependency, it gets rejected at submit time:

>[2025-06-10T12:54:07.649] _slurm_rpc_submit_batch_job: Job dependency problem
I am currently checking possible codepaths that could lead to your logged messages, but it would be really helpful if you could get some information on the dependency this job requested. If available, the job script or submit line for job 8739696 should contain the dependency details:

$sacct -j 8739696 -p -o submitline

or

$sacct -j 8739696 -B

Thanks,
Oscar
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2025-06-10 06:20:23 MDT
Hi Oscar,

(In reply to Oscar Hernández from comment #9)
> >[2025-06-08T11:04:29.419] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=8739696
> >[2025-06-08T11:04:29.419] _slurm_rpc_submit_batch_job: JobId=8739696 InitPrio=4294967294 usec=830
> In my testing, I am not being able to achieve this exact sequence. If I try
> to submit a job without a valid dependency, it gets rejected at submit time:
> 
> >[2025-06-10T12:54:07.649] _slurm_rpc_submit_batch_job: Job dependency problem
> Currently checking possible codepaths that could lead to your logged
> messages. But it would be really helpful if you could get some information
> on the dependency this job requested. If available, the jobscript or
> submitline for job 8739696 should contain the dependency details:
> 
> $sacct -j 8739696 -p -o submitline

I can get this information:

# sacct -j 8739696 -p -o submitline
SubmitLine|
sbatch --dependency=afterok:8739693 BADER.sh|

The script BADER.sh contains:

# cat /home/energy/s182755/cu111_corrected/data_analysis/Bader/Li/SJM/10/BADER.sh 
#!/bin/bash
#SBATCH --job-name=bader_serial
#SBATCH --mail-type=END,FAIL
#SBATCH --partition=xeon56
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=56
#SBATCH --time=50:00:00

cd $SLURM_SUBMIT_DIR
(lines deleted)

The dependency jobid 8739693 had both completed and also failed with this sacct information:

         JobID            JobName    User Account Partition  Timelimit 
-------------- ------------------ ------- ------- --------- ---------- 
       8739693             opt_ga s182755 ecsstud    xeon56 2-02:00:00 
 8739693.batch              batch         ecsstud                      
8739693.extern             extern         ecsstud                      
     8739693.0 hydra_bstrap_proxy         ecsstud                      

Job details information for jobid 8739693:
         JobID              Submit            Eligible               Start  Elapsed                 End  CPUTime NNodes NCPUS ExitCode NodeList     State ConsumedEnergyRaw                                          AllocTRES 
-------------- ------------------- ------------------- ------------------- -------- ------------------- -------- ------ ----- -------- -------- --------- ----------------- -------------------------------------------------- 
       8739693 2025-06-08T11:02:15 2025-06-08T11:02:15 2025-06-08T11:02:16 00:00:15 2025-06-08T11:02:31 00:14:00      1    56     42:0     d048    FAILED             11145 billing=168,cpu=56,energy=11145,mem=487200M,node=1 
 8739693.batch 2025-06-08T11:02:16 2025-06-08T11:02:16 2025-06-08T11:02:16 00:00:15 2025-06-08T11:02:31 00:14:00      1    56     42:0     d048    FAILED              4545                          cpu=56,mem=487200M,node=1 
8739693.extern 2025-06-08T11:02:16 2025-06-08T11:02:16 2025-06-08T11:02:16 00:00:15 2025-06-08T11:02:31 00:14:00      1    56      0:0     d048 COMPLETED             11145              billing=168,cpu=56,mem=487200M,node=1 
     8739693.0 2025-06-08T11:02:16 2025-06-08T11:02:16 2025-06-08T11:02:16 00:00:15 2025-06-08T11:02:31 00:14:00      1    56      0:0     d048 COMPLETED              4545              billing=168,cpu=56,mem=487200M,node=1 



> $sacct -j 8739696 -B

It seems that no script was used (perhaps a pipe?):

# sacct -j 8739696 -B
Batch Script for 8739696
--------------------------------------------------------------------------------
NONE

I hope this helps the troubleshooting?  Should I ask the user about how he submits these jobs?

Best regards,
Ole
Comment 11 Oscar Hernández 2025-06-10 06:53:11 MDT
Thanks Ole,

I think I got enough relevant info, will work from that.

># sacct -j 8739696 -B
>Batch Script for 8739696
>--------------------------------------------------------------------------------
>NONE
This -B will only show information about the script if the following setting is configured in slurm.conf [1]: 

AccountingStoreFlags=job_script

So it is expected that nothing is shown if that option is not configured. It is disabled by default due to the potential space job scripts could take up in the DB.

In your case, the dependency was already visible in the submit line, so that is enough for me, though I appreciate you also providing the content of the BADER.sh script.

>I hope this helps the troubleshooting?  Should I ask the user about how he 
>submits these jobs?
Yes, I think I have enough info for now. Will let you know if I need anything else.

Kind regards,
Oscar

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_job_script
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2025-06-11 02:15:10 MDT
> (In reply to Oscar Hernández from comment #5)
> > I have been running some tests. I have seen that the 30s timer gets
> > restarted every time some input that is not "Y","N" or "/n" is provided. 
> > 
> > >	while(c != 'Y' && c != 'y'
> > >	      && c != 'N' && c != 'n'
> > >	      && c != '\n') {
> > >		if (c) {
> > >			printf("Y or N please\n");
> > >		}

My 2 cents: I now did a simple test of the stdin file descriptor as seen in a crontab job to clarify my point:

> The stdin file descriptor of sacctmgr might be checked whether it's a
> terminal, or some weird fd provided by the cron daemon which doesn't respond
> as expected by sacctmgr.  It's definitely worth fixing the behavior seen in
> the present case.

A crontab job checking file descriptor 0 (stdin) resulted in this mail message:

Subject: Cron <ohni@gamow> ls -l /proc/self/fd/0
lr-x------. 1 ohni camdvip 64 Jun 11 09:50 /proc/self/fd/0 -> pipe:[7095519]

In another crontab job I try to read some lines from fd 0 and get an error:

Subject: Cron <ohni@gamow> head -10 /proc/self/fd/0
head: cannot open '/proc/self/fd/0' for reading: Permission denied

For comparison this is what I see in a Bash login shell:

$ ls -l /proc/self/fd/0
lrwx------. 1 ohni camdvip 64 Jun 11 09:47 /proc/self/fd/0 -> /dev/pts/2

IMHO, it seems that sacctmgr ought to verify whether stdin is readable before it engages in the above infinite loop.  And perhaps the loop ought to be limited to a small number of iterations?
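To illustrate the kind of guard I mean (a sketch only, with a hypothetical helper name, not a proposed patch), the read loop could simply stop and assume "no" as soon as stdin reports EOF or a read error, instead of treating that as yet another invalid answer:

/* sketch: break out of the Y/N loop when stdin is no longer readable */
#include <stdio.h>

static int prompt_yes_no(void)
{
	int c = 0;
	while (c != 'Y' && c != 'y' &&
	       c != 'N' && c != 'n' && c != '\n') {
		if (c)
			printf("Y or N please\n");
		c = getchar();
		if (c == EOF) {	/* closed pipe, /dev/null, cron, ... */
			fprintf(stderr, "No readable input on stdin, assuming 'N'\n");
			return 0;
		}
	}
	return (c == 'Y' || c == 'y');
}

int main(void)
{
	return prompt_yes_no() ? 0 : 1;
}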

Thanks,
Ole
Comment 14 Oscar Hernández 2025-06-16 04:52:37 MDT
Hi Ole,

>IMHO, it seems that sacctmgr ought to verify whether stdin is readable before it 
>engages in the above infinite loop.  And perhaps the loop ought to be limited to 
>a small number of iterations?
Just to double-confirm that the issue is related to cron executions, I did some testing on Friday and indeed got into a state in which the output file from cron was left growing indefinitely.

In fact, as a fun fact, I forgot to disable the test cron job I had set up to run every day, and this morning when I got to my workstation I found an ugly situation, with it burning CPU and a saturated FS :)

Will see the best way to fix that and come back to you.

Additionally, with the info provided, I was able to replicate the exact same logs (so, presumably, the same codepath) as those that led to your job appearing as a runaway. But in my case the job was successfully stored in the DB and never appeared as a runaway. I will inspect that path in depth to see if I can find the failure point, but it currently seems to be an isolated case.

Kind regards,
Oscar
Comment 15 Ole.H.Nielsen@fysik.dtu.dk 2025-06-16 05:28:55 MDT
Hi Oscar,

Thanks for the progress report:

(In reply to Oscar Hernández from comment #14)
> Just to double confirm the bug the issue is related with cron executions, I
> did some testing on Friday and indeed, got into a state in which the output
> file from Cron was left growing indefinitely.
> 
> In fact, as a fun fact, I forgot to disable the test cron I had set up to
> run every day, and this morning, when I got to my workstation, I found
> myself an ugly situation with it burning CPU and a saturated FS :)
> 
> Will see the best way to fix that and come back to you.

Hehe, this bug bites badly when it occurs :-)  It will be good to get it fixed, hopefully also in 24.11?

> Additionally, with the info provided, I was able to replicate the exact same
> logs (so, presumably the same codepath followed) that caused your job to
> appear as a runaway. But in my case, it was successfully stored in the DB,
> never appearing as runaway. Will inspect that path in depth to see if I can
> find the failure point. But it currently seems an isolated case.

I don't know what happened, but it would not be a concern if it happens very rarely.

Best regards,
Ole
Comment 16 Oscar Hernández 2025-06-17 08:25:06 MDT
Hi Ole,

>Hehe, this bug bites badly when it occurs :-)  It will be good to get it fixed, 
>hopefully also in 24.11?
I already have a fix proposal under review. But, unfortunately, the fix won't be landing in 24.11. At this point, after the 25.05 release, our policy states that only major regressions or security issues may be fixed in 24.11, and the bug we are fixing here has been around for years :).

Will keep you updated on the progress.

Kind regards,
Oscar
Comment 17 Oscar Hernández 2025-06-30 05:48:56 MDT
Hi Ole,

The fix for this is already upstream [1] in master and 25.05. We added a check on whether the FD is actually a valid readable source, printing an error and breaking the loop when it is not.

With the fix, the reported issue is no longer reproducible when using "/dev/null" as input or when running inside a cron job.

As mentioned, 24.11 is no longer eligible for this kind of change, but the patch should apply cleanly. Do let me know if you are interested in it.

Many thanks for reporting the issue. It has been around for a surprising amount of time.

Kind regards,
Oscar

[1]https://github.com/SchedMD/slurm/commit/0d4e339538161317a24b2b7d385c4ba708f9c719
Comment 18 Ole.H.Nielsen@fysik.dtu.dk 2025-06-30 06:03:14 MDT
Hi Oscar,

Thanks a lot for the clean patch!  We will be careful about sacctmgr in cron jobs with 24.11 (and earlier), and look forward to upgrading to 25.05 before long.

Best regards,
Ole
Comment 19 Oscar Hernández 2025-06-30 09:23:49 MDT
Great! Resolving.