Ticket 14103

Summary: srun.prolog formatting issue in 21.08.8-2
Product: Slurm Reporter: Alex Mamach <alex.mamach>
Component: Other    Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: mcmullan
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: Northwestern
Attachments: Slurm 20 prolog formatting
Slurm 21 prolog formatting
srun.prolog
slurm.conf
srun_cat_file

Description Alex Mamach 2022-05-18 12:33:09 MDT
Hi,

We're working on upgrading from Slurm 20.11.8 to 21.08.8-2, and found that our srun.prolog formatting is displayed somewhat bizarrely after upgrading.

I've attached images of the formatting change between 20.11.8 and 21.08.8-2; for some reason Slurm 21 is indenting each successive line of the prolog, which makes it difficult for our users to read.

We are using the same srun.prolog script on 21 as we were on 20 (also attached). I've tried using printf instead of echo, but the result is the same.

Any advice you could give would be welcome!

Thanks!

Alex
Comment 1 Alex Mamach 2022-05-18 12:33:34 MDT
Created attachment 25097 [details]
Slurm 20 prolog formatting
Comment 2 Alex Mamach 2022-05-18 12:42:52 MDT
Created attachment 25099 [details]
Slurm 21 prolog formatting
Comment 3 Alex Mamach 2022-05-18 12:43:05 MDT
Created attachment 25100 [details]
srun.prolog
Comment 4 Scott Hilton 2022-05-19 16:40:00 MDT
Alex,

That is odd. Did you change anything other than Slurm when you upgraded?

If you run the script outside of Slurm, do you see the same issue?

I'm getting properly formatted output on my machine with 21.08.8-2.

Can I get your slurm.conf?

-Scott
Comment 5 Alex Mamach 2022-05-25 12:43:51 MDT
Hi Scott,

We only changed a few lines that were deprecated between 20 and 21:

We removed the following lines:

CheckpointType=checkpoint/none
JobCheckpointDir=/etc/slurm/checkpoint
AccountingStoreJobComment=YES

and added this line:

AccountingStoreFlags=job_comment,job_env,job_script

When running the srun.prolog manually as the user on the compute node, it is formatted correctly.

I've also attached our slurm.conf.

Thanks for your help on this!
Comment 6 Alex Mamach 2022-05-25 12:44:02 MDT
Created attachment 25219 [details]
slurm.conf
Comment 7 Scott Hilton 2022-05-25 14:35:22 MDT
Does this just affect srun.prolog or does the actual srun output get messed up as well?
Comment 8 Scott Hilton 2022-05-25 14:54:03 MDT
Alex,

Can you also run printenv in your prolog (both through Slurm and manually)?

Besides the Slurm variables, is anything else different between the two environments?

-Scott
Comment 9 Alex Mamach 2022-06-03 12:38:20 MDT
Hi Scott,

Bizarrely, I'm getting the same environment variables, with the exception of the Slurm variables (like the node hostname and such). I'm really stumped as to why this is happening.
Comment 10 Scott Hilton 2022-06-06 09:31:04 MDT
Alex,

I am rather stumped as well. Here are a few more ideas though.

Do you get the same behavior if you run the script as PrologSlurmctld, Prolog, or SrunEpilog (as opposed to SrunProlog)? This might give a few clues.

As a workaround, does this work?
#!/bin/bash
now=$(date)
printf "----------------------------------------\n"
printf "srun job start: $now\n"
printf "Job ID: $SLURM_JOB_ID\n"
printf "Username: $USER\n"
printf "Queue: $SLURM_JOB_PARTITION\n"
printf "Account: $SLURM_JOB_ACCOUNT\n"
printf "----------------------------------------\n"
printf "The following variables are not\n"
printf "guaranteed to be the same in\n"
printf "prologue and the job run script\n"
printf "----------------------------------------\n"
printf "PATH (in prologue) : $PATH\n"
printf "WORKDIR is: /home/$USER\n"
printf "----------------------------------------\n"

-Scott
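(Editor's aside: the workaround above places shell expansions inside the printf format string, which can misbehave if a variable's value contains `%` or backslash escapes. A variant that keeps a fixed format string and passes the values as arguments — same variables as the script above, purely illustrative — would be:)

```shell
#!/bin/bash
# Same output as the workaround above, but with a fixed '%s\n' format
# string so any '%' or '\' inside a variable's value is printed literally.
now=$(date)
printf '%s\n' \
    "----------------------------------------" \
    "srun job start: $now" \
    "Job ID: $SLURM_JOB_ID" \
    "Username: $USER" \
    "Queue: $SLURM_JOB_PARTITION" \
    "Account: $SLURM_JOB_ACCOUNT" \
    "----------------------------------------"
```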
Comment 11 Alex Mamach 2022-06-06 13:51:38 MDT
Hi Scott,

Thanks for the suggestion! I tried your printf solution and got the same result; I also saw the odd format when running the script through PrologSlurmctld, Prolog, and SrunEpilog.

One note of interest is that I tried this:

#!/bin/bash
now=$(date)
echo "----------------------------------------" > /tmp/$SLURM_JOB_ID
echo "srun job start: $now" >> /tmp/$SLURM_JOB_ID
echo "Job ID: $SLURM_JOB_ID" >> /tmp/$SLURM_JOB_ID
echo "Username: $USER" >> /tmp/$SLURM_JOB_ID
echo "Queue: $SLURM_JOB_PARTITION" >> /tmp/$SLURM_JOB_ID
echo "Account: $SLURM_JOB_ACCOUNT" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
echo "The following variables are not" >> /tmp/$SLURM_JOB_ID
echo "guaranteed to be the same in" >> /tmp/$SLURM_JOB_ID
echo "prologue and the job run script" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
echo "PATH (in prologue) : $PATH" >> /tmp/$SLURM_JOB_ID
echo "WORKDIR is: /home/$USER" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
cat /tmp/$SLURM_JOB_ID

and the file itself looks normal, but when catted out on the compute node it has the weird formatting. So it looks like the issue occurs at the very last step, when the script's output is displayed, rather than during the echo phase.
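(Editor's aside: a quick way to confirm at this point that the file's bytes really are clean — an illustrative diagnostic, not from the ticket — is to dump it byte-by-byte with od. If every line ends in a bare `\n` with no stray `\r` or control bytes, the staircase effect must be happening at display time, not in the script.)

```shell
# Illustrative diagnostic: write a small sample file the way the prolog
# does, then dump it byte-by-byte. Clean output shows each line ending
# in \n (LF) only, pointing the blame at terminal display, not the file.
printf 'hello\nworld\n' > /tmp/prolog_bytes
od -c /tmp/prolog_bytes
```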
Comment 12 Scott Hilton 2022-06-07 09:44:28 MDT
Alex, 

Does the file /tmp/$SLURM_JOB_ID have the weird formatting issue when catted out on a different node? Could you upload the raw file?

-Scott
Comment 13 Scott Hilton 2022-06-30 11:23:29 MDT
Alex,

Any update on this issue?

-Scott
Comment 14 Scott Hilton 2022-07-25 10:09:46 MDT
Alex,

Is this issue still happening?

-Scott
Comment 15 Alex Mamach 2022-07-28 11:29:20 MDT
Hi Scott,

Sorry for the delayed response. Our maintenance period was delayed, so I'm currently working on a build of Slurm 22.05.2 to see if the issue happens on that build. I should have an update in the next day or so.

Thanks!
Comment 16 Alex Mamach 2022-08-09 13:10:59 MDT
Hi Scott,

I tested with 22.05.2 but unfortunately we're still seeing the same behavior.

To update and clarify my post from 06-06: when the prolog script fires off, the formatting is borked. However, manually catting the script out during the interactive job shows it as expected. This makes me think that the way the prolog output is displayed when fired off by srun is somehow different from when it is manually catted out.
Comment 17 Scott Hilton 2022-08-16 13:05:38 MDT
Alex, 

When you upgraded from 20.11.8 to 21.08.8-2 did you upgrade your OS or any other software that may be relevant? What OS and version are you running now?

Do you have any other thoughts that may help reproduce the issue? 

-Scott
Comment 18 Alex Mamach 2022-09-08 11:27:38 MDT
Hi Scott,

We didn't update the OS or any software other than Slurm (we're currently running RHEL 7.5), and I'm really not sure what could be causing this issue. We're moving to 22.05.3 next week with an OS update as well; I'll be curious to see if that resolves the issue.
Comment 19 Scott Hilton 2022-09-08 15:06:14 MDT
Alex,

Does the file /tmp/$SLURM_JOB_ID have the weird formatting issue when catted out on a different node?

Could you upload the raw file to me? 

Does this issue happen when you cat other files to your terminal?

What if you call "stty sane" before printing the file or running srun?

-Scott
Comment 20 Alex Mamach 2022-09-18 09:02:46 MDT
Hi Scott,

The /tmp/$SLURM_JOB_ID file looks normal when catted out on both the login node and any compute nodes I've tried; I've uploaded it here.

If I cat out a file manually everything looks fine, however if I change the srun.prolog script to simply cat out a known-good file, the formatting issue occurs:

srun: job 3372 queued and waiting for resources
srun: job 3372 has been allocated resources
hello

     this is a test

                   testing testing testing
                                          [aml4540@qgpu0207 ~]$ exit

Calling stty sane doesn't change the output for me when running srun; when catting out the file manually, it looks fine with or without running stty sane beforehand.

One thing I noticed during a test is that this issue does not occur if I roll back to Slurm 20.11.8; the moment I move to 21.08 or 22.05, the issue begins.

Did something change in how Slurm 21.08 handles prologs?
Comment 21 Alex Mamach 2022-09-18 09:03:16 MDT
Created attachment 26852 [details]
srun_cat_file
Comment 22 Scott Hilton 2022-09-19 10:52:19 MDT
(In reply to Alex Mamach from comment #20)
> One thing I noticed during a test is that this issue does not occur if I
> roll back to Slurm 20.11.8; the moment I move to 21.08 or 22.05, the issue
> begins.
> 
> Did something change in how Slurm 21.08 handles prologs?
This looks like it could be relevant, but since I can't reproduce your issue, I can't be sure:
https://github.com/SchedMD/slurm/commit/4b5241dda2e317a90ffd50d57434db6314c7ae6a

-Scott
Comment 23 Tim McMullan 2022-11-11 14:54:26 MST
Hi Alex,

I was able to reproduce this and track down the change that caused this for you (https://github.com/SchedMD/slurm/commit/77755b6563).

The issue here is that, as of 21.08, the srun prolog script runs in a raw terminal. This change actually fixes some other display issues once you're in the terminal, most notably display issues with emacs in split windows.

A relatively easy way to get the prolog to display correctly would be to add the carriage returns that would normally be added for you.

EG, change:
> echo "----------------------------------------"
to
> echo -e "----------------------------------------\r"
or
> printf "----------------------------------------\r\n"

Let me know if this works for you while I do some thinking and chatting with folks on our end about this particular use case.

Thanks!
--Tim
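(Editor's aside: putting Tim's suggestion together, the whole srun.prolog could be rewritten along these lines — a sketch using the variables from the original script; the `crlf_echo` helper name is hypothetical, not part of the ticket.)

```shell
#!/bin/bash
# Sketch of srun.prolog for Slurm >= 21.08, where the prolog runs on a
# raw terminal and LF is no longer translated to CRLF on output. The
# crlf_echo helper (hypothetical name) appends the \r that the terminal
# would otherwise add for you, avoiding the staircase effect.
crlf_echo() {
    printf '%s\r\n' "$*"
}

now=$(date)
crlf_echo "----------------------------------------"
crlf_echo "srun job start: $now"
crlf_echo "Job ID: $SLURM_JOB_ID"
crlf_echo "Username: $USER"
crlf_echo "Queue: $SLURM_JOB_PARTITION"
crlf_echo "Account: $SLURM_JOB_ACCOUNT"
crlf_echo "----------------------------------------"
crlf_echo "The following variables are not"
crlf_echo "guaranteed to be the same in"
crlf_echo "prologue and the job run script"
crlf_echo "----------------------------------------"
crlf_echo "PATH (in prologue) : $PATH"
crlf_echo "WORKDIR is: /home/$USER"
crlf_echo "----------------------------------------"
```

The same script still renders correctly when run by hand in a normal (cooked) terminal, since an extra `\r` before `\n` is harmless there.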
Comment 24 Alex Mamach 2022-11-11 17:18:26 MST
Hi Tim,

Thank you so much! This fixed the issue!

Thanks!!

Alex