22172 – User Unable to request 9 nodes even nodes available in the partition.

Ticket 22172 - User Unable to request 9 nodes even nodes available in the partition.

Summary: User Unable to request 9 nodes even nodes available in the partition.

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Limits (show other tickets)
Version:	25.05.x
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Ricard Zarco Badia
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2025-02-24 08:32 MST by Waqas Hanif
Modified:	2025-04-23 03:27 MDT (History)
CC List:	5 users (show)

See Also:
Site:	Columbia University
Slinky Site:	---
Alineos Sites:	---
Atos/Eviden Sites:	---
Confidential Site:	---
Coreweave sites:	---
Cray Sites:	---
DS9 clusters:	---
Google sites:	---
HPCnow Sites:	---
HPE Sites:	---
IBM Sites:	---
NOAA SIte:	---
NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---
OCF Sites:	---
Recursion Pharma Sites:	---
SFW Sites:	---
SNIC sites:	---
Tzag Elita Sites:	---
Linux Distro:	---
Machine Name:
CLE Version:
Version Fixed:
Target Release:	---
DevPrio:	---
Emory-Cloud Sites:	---

Attachments
Slurm.conf file (10.98 KB, text/plain) 2025-02-24 08:32 MST, Waqas Hanif	Details
Job_submit.lua (3.81 KB, text/plain) 2025-02-25 12:19 MST, Waqas Hanif	Details
Job submitted pending status (50.08 KB, image/png) 2025-03-04 10:01 MST, Waqas Hanif	Details
Sinfo output (8.90 KB, text/plain) 2025-03-11 08:57 MDT, Waqas Hanif	Details
Squeue output (88.65 KB, text/plain) 2025-03-11 08:57 MDT, Waqas Hanif	Details
Sacctmgr output (1.04 KB, text/plain) 2025-03-11 08:58 MDT, Waqas Hanif	Details
sinfo (9.30 KB, text/plain) 2025-03-26 13:11 MDT, Waqas Hanif	Details
squeue (51.15 KB, text/plain) 2025-03-26 13:11 MDT, Waqas Hanif	Details
scontrol (1.13 KB, text/plain) 2025-03-26 13:11 MDT, Waqas Hanif	Details
Script file (26.80 KB, text/plain) 2025-03-28 08:10 MDT, Waqas Hanif	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this ticket.

Description Waqas Hanif 2025-02-24 08:32:03 MST

Created attachment 40908 [details]
Slurm.conf file

Dear team,  

why the user is unable to request 9 nodes exclusively, even though there are more than 9 nodes in her partition in an idle state?







[lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00

salloc: error: Job submit/allocate failed: Requested node configuration is not available

salloc: Job allocation 1090297 has been revoked.

[lm3391@ruth ~]$ salloc -N 9 -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00

salloc: Pending job allocation 1090298

salloc: job 1090298 queued and waiting for resources

salloc: job 1090298 has been allocated resources

salloc: Granted job allocation 1090298

salloc: Waiting for resource configuration

salloc: Nodes g[001-009] are ready for job

[lm3391@g001 ~]$ scontrol show partition mckinley1

PartitionName=mckinley1

   AllowGroups=ALL AllowAccounts=mckinley AllowQos=ALL

   AllocNodes=ALL Default=NO QoS=N/A

   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO

   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED

   Nodes=g[001-038,125-128,049-050,097-098,164-167,184]

   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO

   OverTimeLimit=NONE PreemptMode=OFF

   State=UP TotalCPUs=1632 TotalNodes=51 SelectTypeParameters=NONE

   JobDefaults=(null)

   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Comment 1 Ricard Zarco Badia 2025-02-24 09:04:37 MST

Hello,

Could you gather the output of "scontrol show job -d <jobID>" of your example jobs? One for the salloc that works and another for the one that gets rejected.

We will start with this, I will request more information if needed.

Best regards, Ricard.

Comment 2 Waqas Hanif 2025-02-24 10:48:41 MST

Sure, please see below: 

[lm3391@ruth ms6472]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=2:00:00
salloc: Pending job allocation 1095578
salloc: job 1095578 queued and waiting for resources


[ms6472@ruth ~]$ scontrol show job -d 1095578
JobId=1095578 JobName=interactive
  UserId=lm3391(495777) GroupId=user(500) MCS_label=N/A
  Priority=3376 Nice=0 Account=mckinley QOS=h012
  JobState=PENDING Reason=Resources Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
  DerivedExitCode=0:0
  RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
  SubmitTime=2025-02-24T12:37:39 EligibleTime=2025-02-24T12:37:39
  AccrueTime=2025-02-24T12:37:39
  StartTime=2025-02-25T04:21:44 EndTime=2025-02-25T06:21:44 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-02-24T12:37:58 Scheduler=Main
  Partition=mckinley1,short AllocNode:Sid=ruth:733728
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=(null) SchedNodeList=g[104,123-128,155-156]
  NumNodes=9-9 NumCPUs=9 NumTasks=9 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=9,mem=52200M,node=9,billing=9
  Socks/Node=* NtasksPerN:b:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryCPU=5800M MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
  Command=(null)
  WorkDir=/burg/home/ms6472
  Power=
  MailUser=lm3391@columbia.edu MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT





when increasing the job walltime from 2 hours to 5 days:

[lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: error: Job submit/allocate failed: Requested node configuration is not available
[lm3391@ruth ~]$

Comment 3 Ricard Zarco Badia 2025-02-25 01:57:14 MST

Hello,

This is different. The original case was that the job could not be submitted doing with "-N 9 --exclusive --time=5-00:00:00" and could be submitted by taking out the exclusive flag. Now the issue is specifically with the time limit? Check if you can retrieve the output of "scontrol show job" of the failed job specifically, that is the one I am most interested about.

I know that your site has performed some changes recently with its job_submit.lua script, so I cannot rely on my old copy to check for correctness. Could I get a recent copy of that script in this ticket?

Best regards, Ricard.

Comment 4 Waqas Hanif 2025-02-25 12:19:39 MST

Created attachment 40925 [details]
Job_submit.lua

Hi Richard, 

Attached is the requested file, **job_submit.lua**.  

Please let me know if you need any additional information from our side.


Thanks 
Waqas Hanif

Comment 5 Ricard Zarco Badia 2025-02-26 04:29:13 MST

Hello,

I think that I know what is happening here. Your job_submit.lua appends the "short" partition to the job when the specified time limit is low enough. This can be seen too in the "scontrol show job" output you provided. This short partition has access to ALL the nodes of the cluster.

If we go back to the allocated nodes of your working job (g[104,123-128,155-156]), we can see that all of them are high-memory nodes (RealMemory=738000). Those are accessed using the "short" partition, since "mckinley1" only has 4 high memory nodes assigned (g[125-128]).

The issue here is that your normal nodes have a RealMemory of 171800, but in your slurm.conf I can see that you have this global parameter defined:

>> DefMemPerCPU=5800

All nodes accessible by the "mckinley1" partition have 32 cores. 32 cores x 5800 = 185600, which is higher than the total memory of your normal nodes. That is the reason why it gets allocated to high memory nodes, because this does not fit into normal nodes.

When you increase the time limit of the job, the job_submit.lua does not append the "short" partition anymore, meaning that you can only get access to 4 whole high memory nodes in the "mckinley1" partition. Since you requested 9 whole nodes, it is not enough and the job gets rejected.

I would recommend adjusting your DefMemPerCPU to be the memory of the node divided by its number of cores. If possible, I would put this parameter at the partition level so you can have different values for different partitions.

As a workaround, you can just call your srun/salloc/sbatch with "--mem=0", which tells the client that you are requesting the whole memory of the node instead of relying on memory calculations based on your DefMemPerCPU. Try it and let me know how it went, just to know if this resolves your issue.

Best regards, Ricard.

Comment 6 Waqas Hanif 2025-02-26 11:41:39 MST

Hi Ricard, 


Thanks for your prompt response and I really appreciate the quick fix you suggested. However, I’d prefer not to use that workaround and instead focus on permanently fixing the Lua script. 

Could you please help me fixing the script permanently?







Thanks
Waqas Hanif

Comment 7 Ricard Zarco Badia 2025-02-27 02:19:34 MST

Hello,

The script is not the root of the issue, unless you want to change your policy on assigning partitions and qos. The problem is that "DefMemPerCPU=5800" in your slurm.conf.

You have to change the value to something that all nodes can use without going over the total node memory. I think that all your nodes have 32 cores, and the lowest amount of memory is 171800.

So, it should be 171800 / 32 = 5368.75, you can round it up to 5360 to keep things simple. The important thing is to keep it low enough for when you request a whole node, so it does not go above its configured RealMemory.

Best regards, Ricard.

Comment 8 Ricard Zarco Badia 2025-03-03 07:51:56 MST

Hello Waqas,

Quick check-up, were you able to test my suggestion?

Best regards, Ricard.

Comment 9 Waqas Hanif 2025-03-04 07:23:51 MST

Hi Ricard, 


I will test your suggestions, and will get back to you as soon as I can. 



Thanks 
Waqas

Comment 10 Ricard Zarco Badia 2025-03-04 08:06:42 MST

Hello Waqas,

Understood. I will mark this ticket as closed to keep my queue organized, but if you find out that the issue persists after doing that change, please feel free to reopen it again and update.

Best regards, Ricard.

Comment 11 Waqas Hanif 2025-03-04 10:00:18 MST

Hey Ricard, 



We’ve implemented the changes you recommended, but we're still unable to allocate the specified number of nodes. I’m attaching a screenshot of the job I attempted to run for your reference.

Is there a reason why slurm allocates these 9 nodes, while 30 nodes are available?

SchedNodeList=g[001-009]


For instance, there are multiple nodes available for the research group. Please refer to the following: 

Nodes=g[001-038,125-128,049-050,097-098,164-167,184]
 

But for some reason, it only assigns 9 nodes. 

Please advice.

Comment 12 Waqas Hanif 2025-03-04 10:01:05 MST

Created attachment 40989 [details]
Job submitted pending status

Comment 13 Ricard Zarco Badia 2025-03-05 03:16:19 MST

Hello Waqas,

Can you share the output of "sinfo", "squeue" and "sacctmgr show qos h120" while this job is still pending? Which is the exact command you used to submit this job?

Best regards, Ricard.

Comment 14 Waqas Hanif 2025-03-11 08:57:27 MDT

Created attachment 41086 [details]
Sinfo output

Comment 15 Waqas Hanif 2025-03-11 08:57:58 MDT

Created attachment 41087 [details]
Squeue output

Comment 16 Waqas Hanif 2025-03-11 08:58:33 MDT

Created attachment 41088 [details]
Sacctmgr output

Comment 17 Waqas Hanif 2025-03-11 09:00:26 MDT

Hi Ricard, 


I’ve uploaded the attachments for sacct, sinfo, and sacctmgr show qos h120.  

Let me know if you need anything else!



Thanks 
Waqas

Comment 18 Ricard Zarco Badia 2025-03-11 10:00:24 MDT

Hello Waqas,

I do not see the 1197184 job in the squeue output, or any job targeting the mckinley1 partition for that matter. Is this issue happening right now for another job that I am not aware of?

Best regards, Ricard.

Comment 19 Waqas Hanif 2025-03-11 13:53:09 MDT

Ricard, 

[root@ruth ms6472]# sacct -u lm3391
JobID      JobName Partition  Account AllocCPUS   State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1240872   interacti+ mckinley1  mckinley     9  TIMEOUT   0:0 
1240872.int+ interacti+       mckinley     1 COMPLETED   0:0 
1240873   run_MITgcm mckinley1  mckinley    242  TIMEOUT   0:0 
1240873.bat+   batch       mckinley     15 CANCELLED   0:15 
1240873.0    MITgcm       mckinley    242   FAILED   1:0




The Jobs often fail or time out because it takes so long to get the resources. 

The user responded back with the following: 

"Quick update to share that this latest run is still ongoing. I am actively comparing the current run to the previous run I wrote about in my last email. For context, here is how I'm defining these experiments:

Previous run: /burg/mckinley/users/lm3391/runs/tmpdir_dic-budget-trial2/dic-budget-trial2/
Current run: /burg/mckinley/users/lm3391/runs/tmpdir_dic-budget-trial2_try2/dic-budget-trial2_try2/
For both runs, it is taking ~7 hours for a single month of output to be written (in the ./diags/ folder). 

This is a considerable slowdown from previously successful runs, prior to our latest Ginsburg troubles, which only took ~20 minutes to produce each month of output."

I've requested user to start a new job and give me the pending job ID, once they will provide, I will get back to you. 


Thanks 
Waqas

Comment 20 Ricard Zarco Badia 2025-03-12 03:23:48 MDT

Hello Waqas,

>> The Jobs often fail or time out because it takes so long to get the resources. 

If there is a timeout in a batch job, it means that the job actually ran. The time spent waiting in the queue does not have an impact on the job time limit or walltime computation.

>> For both runs, it is taking ~7 hours for a single month of output to
>> be written (in the ./diags/ folder). 
>> This is a considerable slowdown from previously successful runs, prior to our
>> latest Ginsburg troubles, which only took ~20 minutes to produce each month of
>> output.

I have a feeling that the user is not talking about the jobs getting stuck as pending here. Without any more context, I would say that the user is saying that their workload is taking a lot more execution time than usual. However, I do not have all the details. If the user is defining these slowdowns in *execution* time (excluding wait times in the queue), this is an entirely different issue.

Coming back to the original topic, I still have to see a clear example of the job getting stuck in the queue as pending with proof of sufficient resources being available. That means having:

1 - The output of "squeue" + "scontrol show job <job_id>", to see that the job is pending with reason "Resources".

2 - The output of "sinfo", to see that the partition actually has resources.

3 - Can I assume that your slurm.conf is still the same with *only* the DefMemPerCPU change?

I need this information at the same time so I can correlate it. Confirm what the user's issue actually is and let me know when you have an update with what I have requested.

Best regards, Ricard.

Comment 21 Max Shortte 2025-03-12 03:41:57 MDT

Hi Richard,

Could you please join us on a Zoom call today to troubleshoot this issue? We have an appointment scheduled with the researcher at 3 PM EST today, and I would appreciate it if you could join the call. The researcher has been waiting for over 6 weeks for a resolution and is becoming quite frustrated.

McKinley Partition | Follow-Up
Wednesday, March 12 · 3:00 – 4:00pm
Time zone: America/New_York
Google Meet joining info
Video call link: https://meet.google.com/kwh-vahz-act
Or dial: (US) +1 530-593-0132 PIN: 566 858 948#
More phone numbers: https://tel.meet/kwh-vahz-act?pin=9456588918597

Comment 22 Ricard Zarco Badia 2025-03-12 05:04:38 MDT

Hello Max,

We normally do not do calls and keep everything ticket-based to have a complete track of what has been done. Furthermore, I personally will not be available at that time (8PM in my time zone).

We need to focus on the following:

* Determining the exact problem we are facing, with concrete outputs supporting the claims. This ticket initially started with jobs being straight out revoked, which has already been dealt with. The problem allegedly has shifted to jobs being queued but stuck in the pending status waiting for resources.

* If the issue is what I have just said, I need everything I have requested in comment 20 (and also the exact command used for the job launch, just in case). I want the whole outputs clearly representing the problem, so I can analyze them. Do not skip anything. If that is not enough, we will request more things from there.

* If the issue is not that, please provide the details of what the actual unexpected behavior is. I am saying this because in comment 19 it is not clear if the user is talking about total time (waiting + execution) or just execution time. If it is the latter, this is not related to the current topic of the ticket.

Best regards, Ricard.

Comment 23 Max Shortte 2025-03-12 05:28:08 MDT

We need a resolve to this question:

why the user is unable to request 9 nodes exclusively, even though there are more than 20 nodes in her partition in an idle state?

Comment 24 Ricard Zarco Badia 2025-03-12 06:40:16 MDT

Hello Max,

>> why the user is unable to request 9 nodes exclusively
>> even though there are more than 20 nodes in her partition in an idle state?

This is why I am asking for all the outputs requested in comment 20 *and* the exact command used to submit a job that exhibits this behavior. I understand that after the DefMemPerCPU change, the jobs are not revoked and they are just stuck in the pending state, so it is very likely that we are in a different scenario than the one described at the beginning of the ticket.

Once I have all that, if I still do not see a clear cause for this, would you be open to add specific debug flags / increase the debug level of slurmctld and try to reproduce the issue again so I can get more diagnostic data from the logs?

Best regards, Ricard.

Comment 25 Max Shortte 2025-03-12 12:05:56 MDT

Richard,

When Lauren, our end user, attempts to launch a job, it fails immediately.
The scontrol show jobs command does not provide any indication of why the
job failed, and there are no relevant entries in the logs. Please refer to
the latest feedback from Lauren, which I have included. I have also
requested Lauren to be granted authorized admin access so she can assist
with troubleshooting her workflow in real-time. Lauren has been CC'd on
this email.


-------------[FEEEDBACK]--------------------------
The job actually aborted very quickly. I am now having issues with using
all of the nodes I allocate. i.e., I request 9 nodes to complete my job but
only 8 are being used - thus the job fails instantly. I have just tried to
troubleshoot by requesting a larger amount of nodes (10) in my initial
allocation, but I encounter the same error.

This is the allocation command I am using: salloc -N 9 -A mckinley
--mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu
 --time=5-00:00:00

This is a new error on the same job I was successfully (but with
painstaking slowness) running ~5 days ago. I have not made any code changes
that could explain why this failure is occurring. This is quite
representative of my experience on Ginsburg over the past 6 months -
inexplicable errors on routine jobs that have halted my research progress.

Max or Waqas, can you please advise? I would like to help the SchedMD
engineers with diagnosing my workflow issues but it's not clear to me how I
can do so without being able to successfully launch a 5-day job.

Thank you,
Lauren

----------

Could you please advise on the next steps for Lauren to take on her end?
Additionally, how would you like me to increase the debug level for further
investigation?


On Wed, Mar 12, 2025 at 8:40 AM <bugs@schedmd.com> wrote:

> Comment # 24 on ticket 22172 from Ricard Zarco Badia Hello Max, >> why the
> user is unable to request 9 nodes exclusively >> even though there are more
> than 20 nodes in her partition in an idle state? This is why I am asking
> for all
> ZjQcmQRYFpfptBannerStart
> This Message Is From an External Sender
> This message came from outside your organization.
>
> ZjQcmQRYFpfptBannerEnd
>
> *Comment # 24
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172-23c24&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=63m3moQvRHJG4ybJ_0nBaNEt2tqPCk3LfHQy4f4RyFw&m=mnvZS08-MfNXReCAt_aDn8F0uRajoduuNThyNzLnU_j6sZHnpyG7a8qMnX1YO9Qh&s=U55ufMTLc1JnXxVSE36bx-9V60-EBmF-O4OVGPLtZeM&e=>
> on ticket 22172
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=63m3moQvRHJG4ybJ_0nBaNEt2tqPCk3LfHQy4f4RyFw&m=mnvZS08-MfNXReCAt_aDn8F0uRajoduuNThyNzLnU_j6sZHnpyG7a8qMnX1YO9Qh&s=1Mu7oNqB-L9-taY7Bo4Kz7vU2_c310R5VFDr980xTBQ&e=>
> from Ricard Zarco Badia <ricard@schedmd.com> *
>
> Hello Max,
> >> why the user is unable to request 9 nodes exclusively
> >> even though there are more than 20 nodes in her partition in an idle state?
>
> This is why I am asking for all the outputs requested in comment 20 <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172-23c20&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=63m3moQvRHJG4ybJ_0nBaNEt2tqPCk3LfHQy4f4RyFw&m=mnvZS08-MfNXReCAt_aDn8F0uRajoduuNThyNzLnU_j6sZHnpyG7a8qMnX1YO9Qh&s=qzHxNzdBoOwgfJNZTwyOhxII-ugwVGoIAU_oziR1ZNs&e=> *and* the
> exact command used to submit a job that exhibits this behavior. I understand
> that after the DefMemPerCPU change, the jobs are not revoked and they are just
> stuck in the pending state, so it is very likely that we are in a different
> scenario than the one described at the beginning of the ticket.
>
> Once I have all that, if I still do not see a clear cause for this, would you
> be open to add specific debug flags / increase the debug level of slurmctld and
> try to reproduce the issue again so I can get more diagnostic data from the
> logs?
>
> Best regards, Ricard.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You are on the CC list for the ticket.
>
>

Comment 26 Ricard Zarco Badia 2025-03-13 05:25:01 MDT

Hello everyone,

This will be a long message, but I really need everyone to be thorough with it and follow along. Do not take this as harshness, I am saying this because it is in our mutual best interest to deal with the issues as quickly and efficiently as possible.

Please let us take a moment to step back a bit and organize our thoughts on what needs to be addressed in manageable chunks. The topic/issue at hand has been shifting around for a while, we need to focus on *clearly defined and specific* issues, one at a time.
Please remember that I have no context outside of what has been said in this conversation. With that being said, here is a list of all the topics I have been picking up since the start of the ticket alongside their current status:

1 - Allocations with specific parameter combinations being *revoked*, aka rejected and not even sent to the queue for execution. See the first message in the ticket:
>> [lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 1090297 has been revoked.

Status: This supposedly has been adressed by changing DefMemPerCPU in slurm.conf. I understand that this *does not happen anymore*. Please correct me if that is not the case.


2 - Jobs getting sent to the queue, but getting stuck as "Pending", concretely with the reason "Resources". Allegedly, there are enough resources in the target partition for these jobs to be allocated.

Status: I still have to see concrete outputs showing this behavior. Refer to what I have requested in comment 20. I cannot stress enough the importance of seeing the outputs as they are, to rule out possible user errors and see the state of the cluster so I can analyze it. I will put my requests here again so we have everything in a single message:

* Make sure that the job is pending and supposedly there are enough resources for it.

* Provide the exact command used to submit that job. Can I assume that it is always "salloc -N 9 -A mckinley
--mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu
 --time=5-00:00:00"?

* Provide the complete outputs of "squeue" and "scontrol show <job_id>".

* Provide the complete output of "sinfo".

* Confirm that your slurm.conf is still the same and that the only change done to it is the DefMemPerCPU value.

This is the bare minimum I need, *all of it*. If nothing is found using those outputs, then we will proceed to increase the debug level and flags of the controller, repeat the reproducer and analyze the logs. I will provide the specific instructions for this if it comes to that.

After this, I have seen new topics coming from the user feedback:


3 - Jobs timing out, which I assume is related with what the user describes as jobs taking a lot more time executing than they used to (executing, not waiting in queue + executing). This has been explained in comment 19 and 25.

Status: This does not look related with points 1 and 2. The job enters execution and takes more time than expected. This issue should be *treated separately after we deal with points 1 and 2*, since there are a lot of possible causes for this, which can be related or (most likely) unrelated to scheduler usage or policies.


4 - Jobs instantly "failing" as soon as they are allocated. As per comment 25:

>> "When Lauren, our end user, attempts to launch a job, it fails immediately."

Status: This is a new topic for which I have no context. How are we defining "failing" here? The job entering execution and then the workload itself fails in some way, thus ending the job early as FAILED? Or are we talking about the job getting revoked in some way, which would be related to points 1 and 2?

Again, we need to tackle points 1 and 2 first. After that is out of the way, we need to see if there is any relation with point 3 too. This is probably related to the workload itself and not the scheduler, so you should be checking the job's outputs. If you still find nothing, I can try to provide further guidance *after point 1 and 2 are dealt with"*.


5 - Jobs where only some of the nodes get used. This may be a cause for point 4, but I have no context about this.

Status: Again, I have no context for this. This is possibly related to the workload the user is launching, for which I have no information other than the allocation command used. The user workload itself should be revised for correctness and confirm that there are indeed nodes that show no activity. Again, this comes after point 1 and 2.


6 - This is just something I have noticed, but I do not know what the user workload is. Running any kind of workload interactively (via salloc) that requires 5 days is not something I would recommend at all. Is there a reason why sbatch is not used instead? If possible, these types of long-time computations should be converted to sbatch scripts.


This is everything I have been picking up. Points 1 and 2 are scheduler-related and need my attention first. Points 3-6 seem related to the user workload or the infrastructure surrounding it. Please update the status of point 1, and after that give me everything I have requested in point 2 if you have a job exhibiting the behavior described there. *Ignore points 3-6 until points 1 and 2 are clear*.

After that is out of the way, we can try to sort out points 3-6, but it is very possible that those are not related to slurm itself. We will use this message to have a clear structure and organize what needs to be done.

Best regards, Ricard.

Comment 27 Waqas Hanif 2025-03-14 11:15:46 MDT

Ricard,

 
Why does the user land specifically on nodes g[001-009]? Is there a particular reason they are restricted to these nodes? 

The issue arises when any of these nodes [g001-009] are occupied, causing the user to wait for resources — even when there are 25+ available nodes in the partition.



[lm3391@bader ~]$ salloc -N 9  -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: Pending job allocation 1351665
salloc: job 1351665 queued and waiting for resources
salloc: job 1351665 has been allocated resources
salloc: Granted job allocation 1351665
salloc: Waiting for resource configuration
salloc: Nodes g[001-009] are ready for job

Comment 28 Waqas Hanif 2025-03-14 11:18:14 MDT

Lastly, if a user requests exclusive access to any node in the partition, the system waits specifically for these 9 nodes (g[001-009]), ignoring all other available resources in the partition.

root@justice:/burg/home/wh2612# scontrol show partition mckinley1
PartitionName=mckinley1
   AllowGroups=ALL AllowAccounts=mckinley AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=g[001-038,125-128,049-050,097-098,164-167,184]
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1632 TotalNodes=51 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Comment 29 Ricard Zarco Badia 2025-03-17 03:39:45 MDT

Hello Waqas,

Please refer to point 2 of comment 26. I have explicitly detailed everything I need to perform my initial diagnostic. Your partitions are heterogeneous and I cannot hypothesize with only partial and incomplete information.

*Follow my instructions* and *do not skip anything requested/asked there*. Make sure that the outputs that I have requested are relevant and coherent before sending them.

Best regards, Ricard.

Comment 30 Ricard Zarco Badia 2025-03-24 04:56:21 MDT

Hello Waqas,

Are there any updates with this issue? Do you have a reproducer with the outputs and information I requested (see comment 26, point 2)?

Best regards, Ricard.

Comment 31 Waqas Hanif 2025-03-24 08:55:35 MDT

Hi Ricard,

I will have the request by COB today.


Thanks

On Mon, Mar 24, 2025 at 6:56 AM <bugs@schedmd.com> wrote:

> Comment # 30 on ticket 22172 from Ricard Zarco Badia Hello Waqas, Are
> there any updates with this issue? Do you have a reproducer with the
> outputs and information I requested (see comment 26, point 2)? Best
> regards, Ricard. You are receiving
> ZjQcmQRYFpfptBannerStart
> This Message Is From an External Sender
> This message came from outside your organization.
>
> ZjQcmQRYFpfptBannerEnd
>
> *Comment # 30
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172-23c30&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=qTO1ErNBhX-OUmfQL7YR-jE5l2uWsaHoc1FllOQlitI&m=YFHavgTTXKdx6_QWwwG_ei6WITaMIgRzxjXsyHlGq5lg421T_YBqZTSKo7OEyDR4&s=cm5s5HnctCLRazSp76HBkQsSm9x9BL2pLIrJPnPzbvE&e=>
> on ticket 22172
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=qTO1ErNBhX-OUmfQL7YR-jE5l2uWsaHoc1FllOQlitI&m=YFHavgTTXKdx6_QWwwG_ei6WITaMIgRzxjXsyHlGq5lg421T_YBqZTSKo7OEyDR4&s=D5y2EpJ0d_TJe_Y7-eeEst1n8ZH9u62Hdon0znHPjRc&e=>
> from Ricard Zarco Badia <ricard@schedmd.com> *
>
> Hello Waqas,
>
> Are there any updates with this issue? Do you have a reproducer with the
> outputs and information I requested (see comment 26 <https://urldefense.proofpoint.com/v2/url?u=https-3A__support.schedmd.com_show-5Fbug.cgi-3Fid-3D22172-23c26&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=qTO1ErNBhX-OUmfQL7YR-jE5l2uWsaHoc1FllOQlitI&m=YFHavgTTXKdx6_QWwwG_ei6WITaMIgRzxjXsyHlGq5lg421T_YBqZTSKo7OEyDR4&s=UJQlujE1Wd0zE6X3t3jT5bTCK936x1z2Zx0kssOr5eQ&e=>, point 2)?
>
> Best regards, Ricard.
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the ticket.
>    - You are on the CC list for the ticket.
>
>

Comment 32 Waqas Hanif 2025-03-26 13:10:38 MDT

Hi Ricard,

When we run the command you mentioned in comment 26, the job gets submitted too quickly for us to capture exactly what was requested. However, for the sake of completeness, I have gathered the outputs of `sinfo`, `squeue`, and `scontrol`.

The `slurm.conf` remains the same as previously shared, with only the suggested change to the `DefMemPerCPU` value.

Additionally, I'm sharing the user's latest response regarding the issue they are currently facing.

**I deleted all screen sessions and opened a new one. In there, I requested exclusive access to 9 nodes. This created job 1493068. I then sbatch submitted the run_MITgcm.sh script to begin the experiment, which began job 1493073. This is the squeue printout:
Screen Shot 2025-03-21 at 1.39.02 PM.png

This is the first time I have seen a (REASON) printed next to the nodelist. In case you can't read that easily, next to the nodes of the run_MITg job there is the following: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions.

This is also the first time that my experiment hasn't initiated but also hasn't failed. It seems to be stuck in a holding pattern, at least for the past few minutes since I tried to start it.

The issue, as I see it, is that the sbatch command inherently runs a parallel allocation for the nodes that I request separately prior to beginning an experiment. My previously successful workflow circumvented this by initiating experiments using " ./run_MITgcm.sh " - which only used my requested nodes, and never interfered with another allocation.

What do you think? I'm happy to meet now if you are able to discuss, but we can also continue emailing until our next regularly scheduled weekly meeting on 3/26.

Best,
Lauren**

Comment 33 Waqas Hanif 2025-03-26 13:11:06 MDT

Created attachment 41273 [details]
sinfo

Comment 34 Waqas Hanif 2025-03-26 13:11:31 MDT

Created attachment 41274 [details]
squeue

Comment 35 Waqas Hanif 2025-03-26 13:11:53 MDT

Created attachment 41275 [details]
scontrol

Comment 36 Ricard Zarco Badia 2025-03-27 06:12:46 MDT

Hello Waqas,

>>   SubmitTime=2025-03-26T14:56:53 
>>   StartTime=2025-03-26T14:56:53

This job entered execution instantaneously, this does not represent the case to be investigated. If I remember correctly, it was mentioned some time ago that the pending for resources issue happened if NodeList=g[001-009] was not available at the time of submission. Have you tried allocating some of those nodes yourself beforehand to try to force the same scenario?

>> **I deleted all screen sessions and opened a new one. In there, I requested
>> exclusive access to 9 nodes. This created job 1493068. I then sbatch submitted
>> the run_MITgcm.sh script to begin the experiment, which began job 1493073.

I do not understand this. If the actual workload to be performed is in the sbatch job, the first salloc would be wasting 9 nodes (the actual work is done in job 1493073, which has its own allocation). Is there a reason for this?

>> next to the nodes of the run_MITg job there is the following: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions.

With the exception of "reserved for job in higher priority partitions" (it is a bit tricky to confirm, and even then I think you do not have job preemption enabled), have you checked if the other reasons were accurate?

>> The issue, as I see it, is that the sbatch command inherently runs a parallel
>> allocation for the nodes that I request separately prior to beginning an
>> experiment. My previously successful workflow circumvented this by initiating
>> experiments using " ./run_MITgcm.sh " - which only used my requested nodes,
>> and never interfered with another allocation.

I really have a feeling that the user is misunderstanding how salloc, sbatch or allocations operate in general. I have no knowledge of what he is actually running or how it would be interfering with another allocation, but the main issue here is that it looks like he is using two different allocations (one from salloc, another from sbatch) for a reason that still is not apparent. What is stopping him from just launching his workload via sbatch and ignore salloc completely?

These are my comments and questions for now, but I cannot shake the feeling that we are dragging this ticket around with no real direction or purpose. Each time that there is an interaction, the case and context to be investigated changes completely. We should stick to a concrete topic and thoroughly deal with it before jumping to the next.

I will try to reproduce the original "job pending with enough resources available" issue myself in the meantime. If I manage to do that, it will facilitate things. 

Best regards, Ricard.

Comment 37 Waqas Hanif 2025-03-27 07:14:01 MDT

Hi Ricard, 

We would like to schedule a Zoom call to expedite the resolution of the ongoing issue. The researcher has been waiting for six months to resolve the workflow problem, and we need to confirm that this is not related to a SLURM misconfiguration. As you can appreciate, the researcher is located on the West Coast, and the time zone difference has caused delays in troubleshooting. Real-time assistance would be far more effective, so we are requesting a training session or live troubleshooting.

The researcher has limited availability and is only available on Mondays and Fridays from 3 - 4 PM. Could you please confirm if one of your engineers would be available during this time to join the Zoom session and assist with troubleshooting?

If it is determined that SLURM is not the root cause, we will assign a developer to work directly with the researcher to adjust their workflow as needed.
Additionally, could you please escalate the severity of this ticket? The issue has now exceeded the terms of our SLA.


Thank you for your prompt attention to this matter.

Best, 
Waqas

Comment 38 Ricard Zarco Badia 2025-03-27 09:42:44 MDT

Hello Waqas,

As I said previously, calls fall outside our regular methodology. They are usually an exception and, if done, need to follow this:

1 - There must be a reason e.g. we can not get the information via normal means like uploads/logs/cli output.

2 - There must be an agenda of what we will specifically cover/do.

3 - A time limit of 30 minuets. Engineers can extend it if they need extra time.

Right now, points 1 and 2 are very unclear. I have talked with Jason (director of support) and he said that we can discuss this. Please confirm the timezone of that 3-4 PM slot. After that, discuss the three points above directly with him to see if we could schedule something.

Please use jbooth@schedmd.com to contact him about this, and let him know that the context for this is ticket 22172. Let me know as soon as you do that, just to be on the same page.

Best regards, Ricard.

Comment 39 Waqas Hanif 2025-03-27 12:50:40 MDT

Hi Ricard, 

FYI, I'm scheduling a call with Jason. 



Respectfully 
Waqas Hanif

Comment 40 Ricard Zarco Badia 2025-03-28 02:36:25 MDT

Hello,

Understood, please give me a brief summary of the meeting after you do it to know the situation and what needs to be done.

Best regards, Ricard.

Comment 41 Waqas Hanif 2025-03-28 08:10:24 MDT

Created attachment 41304 [details]
Script file

Comment 42 Waqas Hanif 2025-03-28 08:10:53 MDT

I definitely will.  

In the meantime, please review the attached script. Currently, the job is being submitted using the following `salloc` syntax:  


$ salloc -N 9 -A mckinley --exclusive --mail-type=ALL --mail-type=BEGIN --mail-user=wh2612@columbia.edu --time=5-00:00:00


Based on our observations, we suspect the issue might be related to how the job is being submitted. You should be able to run some tests in his test environment to prepare for Monday.

Comment 43 Ricard Zarco Badia 2025-03-28 09:54:02 MDT

Hello Waqas,

Just some comments. I have read the script and, correctness aside, I still do not see which role does salloc have here. The first section of the script contains SBATCH pragmas, so this looks intended to be launched directly via sbatch.

Having an allocation first via salloc does not contribute anything if later the script will be launched using sbatch. It would be another thing if, once you have the salloc allocation, you executed the script as is (./runMITgcmn instead of using sbatch). It looks like it can detect if it is being run inside an allocation already, and in that case it would use salloc's allocation.

However, I mentioned some time ago that it is preferable to execute long workloads using sbatch instead of interactively via salloc. That means getting rid of salloc entirely and use an equivalent sbatch command in its place. As it stands, I still have a feeling that there is a misunderstanding here on how allocations are meant to be used, unless there is an actual reason for this salloc+sbatch thing that we do not know about.

Best regards, Ricard.

Comment 44 Max Shortte 2025-03-28 10:21:53 MDT

Ricard,

The script requires a minimum allocation of 9 nodes. To successfully execute the job, the researcher follows these procedural steps:

1. Initialize a screen session** to manage job execution in the background:

   $ screen -dR


2. Request resources via `salloc` to allocate 9 nodes exclusively for the job, ensuring the correct project and necessary configurations are applied:

   $ salloc -N 9 -A mckinley --exclusive --mail-type=ALL --mail-type=BEGIN --mail-user=wh2612@columbia.edu --time=5-00:00:00


3. Execute the job** on one of the allocated nodes by running the following command:

   $ ./runMITgcmn

Let me know if further clarification is needed.

Comment 45 Max Shortte 2025-03-30 08:56:32 MDT

Ricard,

I wanted to give you my theory regarding the resource allocation issue before tomorrow’s Zoom session. It appears that the use of salloc and srun may be causing the jobs to remain in a pending state due to resources being requested twice, which could explain why two JOB IDs are generated.

Here’s the breakdown:

An interactive job is initiated with salloc and srun, requesting 9 nodes.

The job script (./runMITgcmn) then requests an additional 9 nodes, effectively doubling the requested resources.

This causes ambiguity as the script generates a local node list file, and the system is unsure whether to use the 9 nodes assigned by the interactive job or the 9 nodes requested by the script, leaving the job pending.

If this theory is correct, no changes would be required to the job script itself. Instead, the solution would be to adjust how the job is submitted, ensuring that resources are only requested once.

Looking forward to discussing this in tomorrow’s session.

Max

Comment 46 Ricard Zarco Badia 2025-03-31 08:52:36 MDT

Hello Max,

>> An interactive job is initiated with salloc and srun, requesting 9 nodes.

I think you mean just "salloc". This "srun" would be inside the script.

>> The job script (./runMITgcmn) then requests an additional 9 nodes,
>> effectively doubling the requested resources.

If we only do "./runMITgcm" (important to differentiate between "./runMITgcm" and "sbatch runMITgcm") once we are inside the salloc session, this will not create any more allocations. If the script is still the same as the one you provided earlier, the script internals amount to the following srun call:

>> export SLURM_HOSTFILE=./slurm_nodes.txt
>> export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/21.08.8/lib64/libpmi.so
>> export I_MPI_DEBUG=5
>> export I_MPI_FABRICS="shm:ofi"
>> srun --nodes $nNodes \
>>      --ntasks $nProcs \
>>      --job-name=MITgcm \
>>      --export=SLURM_HOSTFILE,I_MPI_PMI_LIBRARY,I_MPI_DEBUG,I_MPI_FABRICS,ALL \
>>      --output=MITgcm.o%j \
>>      --error=MITgcm.e%j \
>>      --label \
>>      --verbose \
>>      --exclusive \
>>      --sockets-per-node=$socketsPerNode \
>>      --cores-per-socket=$coresPerSocket \
>>      --cpu-bind=verbose,rank_ldom \
>>      --mem-bind=verbose,local \
>>      --distribution=arbitrary \
>>      --mem=0 \
>>      ./${MITgcmExec}

There are two things to unpack here:

1 - That slurm_nodes.txt will only contain nodes from the salloc allocation. They are used for some sort of load_balancing logic implemented in that script (which can probably be simplified by just using srun parameters).

2 - This srun, as it is right now, should use the salloc's allocation resources. It will not create another allocation or go through the scheduler. It is what we call a "step" inside a job.

In short, there will not be two allocations, only the one from salloc. I have not seen any "sbatch" inside the script, but I know that it has been mentioned beforehand in the ticket.

I think that there are some concepts that need to be clarified before we continue, so the call should help with that. FYI, I will not be present in the meeting, since it will be 8PM in my time zone. Jason will be the one coming over if I am not mistaken.

Best regards, Ricard.

Comment 49 Ricard Zarco Badia 2025-04-09 08:45:00 MDT

Hello,

Are there any updates since the call? Should we still consider this as an ongoing ticket?

Best regards, Ricard.

Comment 50 Waqas Hanif 2025-04-09 10:06:05 MDT

Hi Richard,

I just wanted to let you know that the issue we were troubleshooting has been resolved.

Also, I was hoping to get your input on Galen’s current workflow. Her script is currently utilizing both salloc and sbatch, and I was wondering if you’d recommend splitting it into two separate scripts—one for the SLURM job submission and the other for the actual code execution.

This might streamline things by allowing researchers to use just sbatch, avoiding the need for an interactive salloc session altogether.

Would appreciate any thoughts or recommendations you might have.


Best, 
Waqas Hanif

Comment 51 Ricard Zarco Badia 2025-04-10 02:41:07 MDT

Hello Waqas,

Sure, is the script attached to the ticket still exactly the same used by Galen? If not, provide an updated version. But basically, it will most likely boil down to either use salloc or sbatch, but not both.

For long jobs, sbatch should be prefered over salloc for multiple reasons. I will not get into specifics because they may or may not apply to your site or case, but it is the general rule of thumb.

Best regards, Ricard.

Comment 52 Ricard Zarco Badia 2025-04-17 08:41:02 MDT

Hello Waqas,

This is a reminder of my last comment, please remember to confirm if the script in this ticket is still the same that Galen currently uses. 

If that is not a concern anymore, please let me know too so I can mark this ticket as closed (to keep my queue organized).

Best regards, Ricard.

Comment 53 Ricard Zarco Badia 2025-04-22 02:45:05 MDT

Hello,

I will be closing this ticket to keep the queue organized. Please reopen if necessary.

Best regards, Ricard.

Comment 54 Waqas Hanif 2025-04-22 11:43:47 MDT

Hi Ricard,



We'd appreciate a final clarification on this matter. Could you provide a summary of your findings and confirm whether the issues we've encountered are primarily related to the system or the workflow?


Specifically, we would value your input on the use of sbatch versus salloc in Galen's script. Do you recommend one approach over the other, or suggest any modifications to how the workflow is structured?
Furthermore, do you agree that if the root cause were a system-level issue, we would expect to see impacts beyond just Galen's workflow, potentially affecting other jobs or users as well?


We look forward to your conclusion.

Comment 55 Ricard Zarco Badia 2025-04-23 03:27:45 MDT

Hello Waqas,

My summary is the following:

1 - Are the issues related to the system or the workflow?

* As far as I am concerned, in the call you had with Jason everything came down to how the user submitted the workload. As soon the "sbatch" was removed from *within* the script, it started working.

Other than that, there is the issue I discussed in comment 26, point 2. Some jobs were allegedly stuck waiting for specific nodes. However, we never got to receive the diagnostic data we needed and this issue has not been brought up for a while now. This was a bit inconclusive, so if you manage to hit this behavior again, please provide everything I requested in that comment while the issue is happening.


2 - There seems to be a misunderstanding on how allocations work, at least by the way the user submitted the workload. We recommend refreshing some concepts via documentation (like this [1]).

* Specifically, we see no reason to use salloc alongside sbatch in this context. This is basically requesting double the resources needed and then leaving half of those idling, since effectively you are requesting two allocations. If there is no reason for it, please stick to either salloc or sbatch, but not both at the same time.


3 - Following up on this:

>> Specifically, we would value your input on the use of sbatch versus salloc in 
>> Galen's script. Do you recommend one approach over the other?

* Both are valid options for their intended uses, it comes down to knowing when to use them. My general recommendations are the following:

Use sbatch when...
 - The workload is long (more than a few hours).
 - User interaction is not needed.
 - The workload does not rely on X11 forwarding (interactive GUIs).
 - Basically, it should be the default option.

Use salloc when...
 - You are running very small workloads that need trial and error (to avoid going through the scheduler each time).
 - User interaction is needed.
 - The workload is GUI-based.
 - In my experience, it should not be the default mode of operation for workloads. This tends to get abused by users by hoarding resources for a long time just to avoid the scheduler.

There is also one thing to consider, salloc needs to mantain the shell session alive during its operation, and those could end up dying for various reasons. You can mitigate this by using tools like "screen" or "tmux", but even those are not 100% reliable (example: you could have a CPU time limit for processes in the login node, which can kill a screen server).

Having said that, Galen's script should be used with sbatch and not rely on salloc.


4 - About this:

>> Do you suggest any modifications to how the workflow is structured?

My recommendations are: 

- Leave salloc out of the equation and just launch the script via sbatch. 

- Reduce the script's verbosity. It is very easy to get lost.

- Simplify operations if possible. There is *a lot* of filesystem operations and maybe they could be reduced. Please keep in mind that I have no expertise in what the script is supposed to do, so the user should evaluate the feasibility of this.

- I saw that there is quite a bit of logic to calculate task placement in that script. It could be interesting to see if that could be reforumlated to slurm directives.

I could comment a bit more, but I do not know if the script being used now is still the same that we have in this ticket.


5 - Finally:

>> Furthermore, do you agree that if the root cause were a system-level issue, 
>> we would expect to see impacts beyond just Galen's workflow, potentially 
>> affecting other jobs or users as well?

Yes, I would say that too. Just keep in mind that this is never a guarantee and there is the possibility that the user could be hitting a very specific corner case. This always has to be taken with a grain of salt.

I think that this covers everything you requested.

Best regards, Ricard.

[1] https://slurm.schedmd.com/job_launch.html