| Summary: | srun.prolog formatting issue in 21.08.8-2 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Alex Mamach <alex.mamach> |
| Component: | Other | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | mcmullan |
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Northwestern | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Slurm 20 prolog formatting, Slurm 21 prolog formatting, srun.prolog, slurm.conf, srun_cat_file | | |
Description
Alex Mamach
2022-05-18 12:33:09 MDT
Created attachment 25097 [details]
Slurm 20 prolog formatting
Created attachment 25099 [details]
Slurm 21 prolog formatting
Created attachment 25100 [details]
srun.prolog
Alex, that is odd. Did you change anything other than Slurm when you upgraded? If you run the script outside of Slurm, do you get the issue? I am getting properly formatted output on my machine with 21.08.8-2. Can I get your slurm.conf? -Scott

Hi Scott, we only changed a few lines that were deprecated between 20 and 21. We removed the following lines:

CheckpointType=checkpoint/none
JobCheckpointDir=/etc/slurm/checkpoint
AccountingStoreJobComment=YES

and added this line:

AccountingStoreFlags=job_comment,job_env,job_script

When running srun.prolog manually as the user on the compute node, it is formatted correctly. I've also attached our slurm.conf. Thanks for your help on this!

Created attachment 25219 [details]
slurm.conf
Does this just affect srun.prolog, or does the actual srun output get messed up as well?

Alex, can you also run printenv in your prolog (both through Slurm and manually)? Besides the Slurm variables, is anything else different between them? -Scott

Hi Scott, bizarrely I'm getting the same environment variables, with the exception of the Slurm variables (like the node hostname and such). I'm really stumped as to why this is happening.

Alex, I am rather stumped as well. Here are a few more ideas, though. Do you get the same behavior if you run the program as PrologSlurmctld, Prolog, or SrunEpilog (as opposed to SrunProlog)? This might give a few clues. As a workaround, does this work?

#!/bin/bash
now=$(date)
printf "----------------------------------------\n"
printf "srun job start: $now\n"
printf "Job ID: $SLURM_JOB_ID\n"
printf "Username: $USER\n"
printf "Queue: $SLURM_JOB_PARTITION\n"
printf "Account: $SLURM_JOB_ACCOUNT\n"
printf "----------------------------------------\n"
printf "The following variables are not\n"
printf "guaranteed to be the same in\n"
printf "prologue and the job run script\n"
printf "----------------------------------------\n"
printf "PATH (in prologue) : $PATH\n"
printf "WORKDIR is: /home/$USER\n"
printf "----------------------------------------\n"

-Scott

Hi Scott, thanks for the suggestion! I tried your printf version and got the same result; I also saw the odd format when running the script through PrologSlurmctld, Prolog, and SrunEpilog.
One note of interest is that I tried this:

#!/bin/bash
now=$(date)
echo "----------------------------------------" > /tmp/$SLURM_JOB_ID
echo "srun job start: $now" >> /tmp/$SLURM_JOB_ID
echo "Job ID: $SLURM_JOB_ID" >> /tmp/$SLURM_JOB_ID
echo "Username: $USER" >> /tmp/$SLURM_JOB_ID
echo "Queue: $SLURM_JOB_PARTITION" >> /tmp/$SLURM_JOB_ID
echo "Account: $SLURM_JOB_ACCOUNT" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
echo "The following variables are not" >> /tmp/$SLURM_JOB_ID
echo "guaranteed to be the same in" >> /tmp/$SLURM_JOB_ID
echo "prologue and the job run script" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
echo "PATH (in prologue) : $PATH" >> /tmp/$SLURM_JOB_ID
echo "WORKDIR is: /home/$USER" >> /tmp/$SLURM_JOB_ID
echo "----------------------------------------" >> /tmp/$SLURM_JOB_ID
cat /tmp/$SLURM_JOB_ID

The file itself looks normal, but when catted out on the compute node it has the weird formatting. So it looks like the issue occurs on the very last leg of the script, when its output is displayed, rather than during the echo phase.

Alex, does the file /tmp/$SLURM_JOB_ID have the weird formatting issue when catted out on a different node? Could you upload the raw file? -Scott

Alex, any update on this issue? -Scott

Alex, is this issue still happening? -Scott

Hi Scott, sorry for the delayed response. Our maintenance period was delayed, so I'm currently working on a build of Slurm 22.05.2 to see if the issue happens on that build. I should have an update in the next day or so. Thanks!

Hi Scott, I tested with 22.05.2, but unfortunately we're still seeing the same behavior. To update and clarify my post from 06-06: when the prolog script fires off, the formatting is borked. However, manually catting the script out during the interactive job shows the script as expected.
This makes me think that the way the prolog script is displayed when fired off by srun is somehow different from when it is manually catted out.

Alex, when you upgraded from 20.11.8 to 21.08.8-2, did you upgrade your OS or any other software that may be relevant? What OS and version are you running now? Do you have any other thoughts that may help reproduce the issue? -Scott

Hi Scott, we didn't update the OS or any software other than Slurm (we're currently running RHEL 7.5). I'm really not sure what could be causing this issue. We're moving to 22.05.3 next week with an OS update as well; I'll be curious to see if that resolves the issue.

Alex, does the file /tmp/$SLURM_JOB_ID have the weird formatting issue when catted out on a different node? Could you upload the raw file to me? Does this issue happen when you cat other files to your terminal? What if you call "stty sane" before printing the file or running srun? -Scott

Hi Scott,
The /tmp/$SLURM_JOB_ID file looks normal when catted out on both the login node and any compute nodes I've tried; I've uploaded it here.
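As a byte-level sanity check of the same observation (this check is an addition for illustration, not something run in the ticket), one can dump the file and count carriage returns; a clean Unix text file like the one Alex describes contains only LF line endings, which is exactly what confirms the file itself is not the problem:

```shell
# Hypothetical check that a prolog output file contains plain LF line
# endings and no stray control characters. The sample file below stands
# in for /tmp/$SLURM_JOB_ID from the ticket.
tmpfile=$(mktemp)
printf 'hello\nthis is a test\ntesting testing testing\n' > "$tmpfile"

# od -c shows every byte; each line should end in \n with no \r bytes.
od -c "$tmpfile"

# Count carriage-return bytes: 0 means plain Unix (LF-only) endings.
cr_count=$(tr -dc '\r' < "$tmpfile" | wc -c)
rm -f "$tmpfile"
```

If `cr_count` is 0 but the file still displays staggered through srun, the terminal state, not the file contents, is the remaining suspect.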
If I cat out a file manually, everything looks fine; however, if I change the srun.prolog script to simply cat out a known-good file, the formatting issue occurs:
srun: job 3372 queued and waiting for resources
srun: job 3372 has been allocated resources
hello
this is a test
testing testing testing
[aml4540@qgpu0207 ~]$ exit
Calling stty sane doesn't change the output for me when running srun; when catting the file manually, it looks fine with or without running stty sane beforehand.
One thing I noticed during a test is that this issue does not occur if I roll back to Slurm 20.11.8; the moment I move to 21.08 or 22.05, the issue begins.
Did something change in how Slurm 21.08 handles prologs?
Created attachment 26852 [details]
srun_cat_file
(In reply to Alex Mamach from comment #20)
> One thing I noticed during a test is that this issue does not occur if I
> roll back to Slurm 20.11.8; the moment I move to 21.08 or 22.05, the issue
> begins.
>
> Did something change in how Slurm 21.08 handles prologs?

This looks like it could be relevant, but because I can't reproduce your issue, I can't be sure: https://github.com/SchedMD/slurm/commit/4b5241dda2e317a90ffd50d57434db6314c7ae6a -Scott

Hi Alex, I was able to reproduce this and track down the change that caused it (https://github.com/SchedMD/slurm/commit/77755b6563). The issue here is that as of 21.08, when the srun prolog script runs, it's running in a raw terminal. This actually fixes some other display issues once you are in the terminal, most notably display issues with emacs in split windows. A relatively easy way to get your script to display correctly is to add the carriage returns that would normally be added for you. E.g., change:

> echo "----------------------------------------"

to

> echo -e "----------------------------------------\r"

or

> printf "----------------------------------------\r\n"

Let me know if this works for you while I do some thinking and chatting with folks on our end about this particular use case. Thanks! --Tim

Hi Tim, thank you so much! This fixed the issue! Thanks!! Alex
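Tim's fix amounts to emitting explicit CRLF line endings from the prolog, since the raw terminal no longer translates "\n" into "\r\n" for the script. A minimal sketch of a prolog rewritten this way (the crlf helper and the :-unset fallbacks are illustrative additions, not from the ticket):

```shell
#!/bin/bash
# Sketch of a raw-terminal-safe srun.prolog. As of Slurm 21.08 the prolog
# runs with the terminal in raw mode, so output-processing such as the
# usual \n -> \r\n translation is off; each line must carry its own
# carriage return or the next line starts mid-screen ("staircase" output).
# The crlf helper and the ${VAR:-unset} defaults are assumptions for the
# sketch, not part of the original script.
crlf() { printf '%s\r\n' "$1"; }

crlf "----------------------------------------"
crlf "srun job start: $(date)"
crlf "Job ID: ${SLURM_JOB_ID:-unset}"
crlf "Username: ${USER:-unset}"
crlf "Queue: ${SLURM_JOB_PARTITION:-unset}"
crlf "Account: ${SLURM_JOB_ACCOUNT:-unset}"
crlf "----------------------------------------"
```

Outside of srun (e.g. catting a log), the extra \r bytes are harmless on most terminals but will show as ^M in some editors, which is the trade-off of hard-coding the line ending in the script.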