Created attachment 40908 [details]
Slurm.conf file

Dear team,

Why is the user unable to request 9 nodes exclusively, even though more than 9 nodes in her partition are in an idle state?

[lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1090297 has been revoked.

[lm3391@ruth ~]$ salloc -N 9 -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: Pending job allocation 1090298
salloc: job 1090298 queued and waiting for resources
salloc: job 1090298 has been allocated resources
salloc: Granted job allocation 1090298
salloc: Waiting for resource configuration
salloc: Nodes g[001-009] are ready for job

[lm3391@g001 ~]$ scontrol show partition mckinley1
PartitionName=mckinley1
   AllowGroups=ALL AllowAccounts=mckinley AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=g[001-038,125-128,049-050,097-098,164-167,184]
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1632 TotalNodes=51 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Hello,

Could you gather the output of "scontrol show job -d <jobID>" for your example jobs? One for the salloc that works and another for the one that gets rejected. We will start with this; I will request more information if needed.

Best regards, Ricard.
Sure, please see below:

[lm3391@ruth ms6472]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=2:00:00
salloc: Pending job allocation 1095578
salloc: job 1095578 queued and waiting for resources

[ms6472@ruth ~]$ scontrol show job -d 1095578
JobId=1095578 JobName=interactive
   UserId=lm3391(495777) GroupId=user(500) MCS_label=N/A
   Priority=3376 Nice=0 Account=mckinley QOS=h012
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2025-02-24T12:37:39 EligibleTime=2025-02-24T12:37:39
   AccrueTime=2025-02-24T12:37:39
   StartTime=2025-02-25T04:21:44 EndTime=2025-02-25T06:21:44 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-02-24T12:37:58 Scheduler=Main
   Partition=mckinley1,short AllocNode:Sid=ruth:733728
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=g[104,123-128,155-156]
   NumNodes=9-9 NumCPUs=9 NumTasks=9 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=9,mem=52200M,node=9,billing=9
   Socks/Node=* NtasksPerN:b:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/burg/home/ms6472
   Power=
   MailUser=lm3391@columbia.edu MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

When increasing the job walltime from 2 hours to 5 days:

[lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: error: Job submit/allocate failed: Requested node configuration is not available
[lm3391@ruth ~]$
Hello,

This is different. The original case was that the job could not be submitted with "-N 9 --exclusive --time=5-00:00:00" but could be submitted after taking out the exclusive flag. Now the issue is specifically with the time limit?

Check if you can retrieve the output of "scontrol show job" for the failed job specifically; that is the one I am most interested in.

I know that your site has performed some changes recently with its job_submit.lua script, so I cannot rely on my old copy to check for correctness. Could I get a recent copy of that script in this ticket?

Best regards, Ricard.
Created attachment 40925 [details]
Job_submit.lua

Hi Ricard,

Attached is the requested file, **job_submit.lua**. Please let me know if you need any additional information from our side.

Thanks
Waqas Hanif
Hello,

I think that I know what is happening here. Your job_submit.lua appends the "short" partition to the job when the specified time limit is low enough. This can be seen in the "scontrol show job" output you provided. This short partition has access to ALL the nodes of the cluster.

If we go back to the allocated nodes of your working job (g[104,123-128,155-156]), we can see that all of them are high-memory nodes (RealMemory=738000). Those are accessed using the "short" partition, since "mckinley1" only has 4 high-memory nodes assigned (g[125-128]).

The issue here is that your normal nodes have a RealMemory of 171800, but in your slurm.conf I can see that you have this global parameter defined:

>> DefMemPerCPU=5800

All nodes accessible by the "mckinley1" partition have 32 cores. 32 cores x 5800 = 185600, which is higher than the total memory of your normal nodes. That is the reason why the job gets allocated to high-memory nodes: it does not fit into the normal ones.

When you increase the time limit of the job, job_submit.lua does not append the "short" partition anymore, meaning that you can only get access to 4 whole high-memory nodes in the "mckinley1" partition. Since you requested 9 whole nodes, that is not enough and the job gets rejected.

I would recommend adjusting your DefMemPerCPU to be the memory of the node divided by its number of cores. If possible, I would put this parameter at the partition level so you can have different values for different partitions.

As a workaround, you can call your srun/salloc/sbatch with "--mem=0", which tells the client that you are requesting the whole memory of the node instead of relying on memory calculations based on your DefMemPerCPU.

Try it and let me know how it went, just to know if this resolves your issue.

Best regards, Ricard.
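P.S. For illustration, a minimal sketch of that workaround applied to the user's original command; the only change is the added --mem=0:

salloc -N 9 --exclusive -A mckinley --mem=0 \
       --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu \
       --time=5-00:00:00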
Hi Ricard,

Thanks for your prompt response, and I really appreciate the quick fix you suggested. However, I'd prefer not to use that workaround and instead focus on permanently fixing the Lua script. Could you please help me fix the script permanently?

Thanks
Waqas Hanif
Hello,

The script is not the root of the issue, unless you want to change your policy on assigning partitions and QOS. The problem is the "DefMemPerCPU=5800" in your slurm.conf. You have to change the value to something that all nodes can use without going over the total node memory.

I think that all your nodes have 32 cores, and the lowest amount of memory is 171800. So it should be 171800 / 32 = 5368.75; you can round it down to 5360 to keep things simple. The important thing is to keep it low enough so that, when you request a whole node, it does not go above its configured RealMemory.

Best regards, Ricard.
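P.S. A minimal slurm.conf sketch of what I am describing (the partition line is abbreviated and illustrative; keep your real node list and the rest of your existing options):

# Global default, sized for the smallest nodes (171800 MB / 32 cores):
DefMemPerCPU=5360

# Preferably, set it per partition instead, so different node types can use different values:
PartitionName=mckinley1 Nodes=g[001-038,125-128,049-050,097-098,164-167,184] DefMemPerCPU=5360 MaxTime=5-00:00:00 State=UP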
Hello Waqas, Quick check-up: were you able to test my suggestion? Best regards, Ricard.
Hi Ricard, I will test your suggestions, and will get back to you as soon as I can. Thanks Waqas
Hello Waqas, Understood. I will mark this ticket as closed to keep my queue organized, but if you find out that the issue persists after doing that change, please feel free to reopen it again and update. Best regards, Ricard.
Hey Ricard,

We've implemented the changes you recommended, but we're still unable to allocate the specified number of nodes. I'm attaching a screenshot of the job I attempted to run for your reference.

Is there a reason why Slurm allocates these 9 nodes while 30 nodes are available?

SchedNodeList=g[001-009]

For instance, there are multiple nodes available for the research group. Please refer to the following:

Nodes=g[001-038,125-128,049-050,097-098,164-167,184]

But for some reason, it only assigns 9 nodes. Please advise.
Created attachment 40989 [details] Job submitted pending status
Hello Waqas, Can you share the output of "sinfo", "squeue" and "sacctmgr show qos h120" while this job is still pending? Which is the exact command you used to submit this job? Best regards, Ricard.
Created attachment 41086 [details] Sinfo output
Created attachment 41087 [details] Squeue output
Created attachment 41088 [details] Sacctmgr output
Hi Ricard,

I've uploaded the attachments for sinfo, squeue, and sacctmgr show qos h120. Let me know if you need anything else!

Thanks
Waqas
Hello Waqas, I do not see the 1197184 job in the squeue output, or any job targeting the mckinley1 partition for that matter. Is this issue happening right now for another job that I am not aware of? Best regards, Ricard.
Ricard,

[root@ruth ms6472]# sacct -u lm3391
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1240872      interacti+  mckinley1   mckinley          9    TIMEOUT      0:0
1240872.int+ interacti+              mckinley          1  COMPLETED      0:0
1240873      run_MITgcm  mckinley1   mckinley        242    TIMEOUT      0:0
1240873.bat+      batch              mckinley         15  CANCELLED     0:15
1240873.0        MITgcm              mckinley        242     FAILED      1:0

The jobs often fail or time out because it takes so long to get the resources. The user responded back with the following:

"Quick update to share that this latest run is still ongoing. I am actively comparing the current run to the previous run I wrote about in my last email. For context, here is how I'm defining these experiments:

Previous run: /burg/mckinley/users/lm3391/runs/tmpdir_dic-budget-trial2/dic-budget-trial2/
Current run: /burg/mckinley/users/lm3391/runs/tmpdir_dic-budget-trial2_try2/dic-budget-trial2_try2/

For both runs, it is taking ~7 hours for a single month of output to be written (in the ./diags/ folder). This is a considerable slowdown from previously successful runs, prior to our latest Ginsburg troubles, which only took ~20 minutes to produce each month of output."

I've requested the user to start a new job and give me the pending job ID; once they provide it, I will get back to you.

Thanks
Waqas
Hello Waqas,

>> The jobs often fail or time out because it takes so long to get the resources.

If there is a timeout in a batch job, it means that the job actually ran. The time spent waiting in the queue does not have an impact on the job time limit or walltime computation.

>> For both runs, it is taking ~7 hours for a single month of output to
>> be written (in the ./diags/ folder).
>> This is a considerable slowdown from previously successful runs, prior to our
>> latest Ginsburg troubles, which only took ~20 minutes to produce each month of
>> output.

I have a feeling that the user is not talking about the jobs getting stuck as pending here. Without any more context, I would say that the user is saying that their workload is taking a lot more execution time than usual. However, I do not have all the details. If the user is describing these slowdowns in *execution* time (excluding wait times in the queue), this is an entirely different issue.

Coming back to the original topic, I still have to see a clear example of the job getting stuck in the queue as pending with proof of sufficient resources being available. That means having:

1 - The output of "squeue" + "scontrol show job <job_id>", to see that the job is pending with reason "Resources".
2 - The output of "sinfo", to see that the partition actually has resources.
3 - Can I assume that your slurm.conf is still the same with *only* the DefMemPerCPU change?

I need this information gathered at the same time so I can correlate it. Confirm what the user's issue actually is and let me know when you have an update with what I have requested.

Best regards, Ricard.
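P.S. A minimal sketch of how those outputs could be captured in one shot while the job is still pending (the job ID and output file name are placeholders):

JOBID=<pending_job_id>
{
  date
  squeue
  scontrol show job -d "$JOBID"
  sinfo
} > pending_${JOBID}.txt 2>&1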
Hi Ricard,

Could you please join us on a Zoom call today to troubleshoot this issue? We have an appointment scheduled with the researcher at 3 PM EST today, and I would appreciate it if you could join the call. The researcher has been waiting for over 6 weeks for a resolution and is becoming quite frustrated.

McKinley Partition | Follow-Up
Wednesday, March 12 · 3:00 – 4:00pm
Time zone: America/New_York

Google Meet joining info
Video call link: https://meet.google.com/kwh-vahz-act
Or dial: (US) +1 530-593-0132 PIN: 566 858 948#
More phone numbers: https://tel.meet/kwh-vahz-act?pin=9456588918597
Hello Max,

We normally do not do calls and keep everything ticket-based to have a complete track of what has been done. Furthermore, I personally will not be available at that time (8 PM in my time zone).

We need to focus on the following:

* Determining the exact problem we are facing, with concrete outputs supporting the claims. This ticket initially started with jobs being outright revoked, which has already been dealt with. The problem allegedly has shifted to jobs being queued but stuck in the pending status waiting for resources.

* If the issue is what I have just described, I need everything I have requested in comment 20 (and also the exact command used for the job launch, just in case). I want the whole outputs clearly representing the problem, so I can analyze them. Do not skip anything. If that is not enough, we will request more things from there.

* If the issue is not that, please provide the details of what the actual unexpected behavior is. I am saying this because in comment 19 it is not clear if the user is talking about total time (waiting + execution) or just execution time. If it is the latter, this is not related to the current topic of the ticket.

Best regards, Ricard.
We need a resolution to this question: why is the user unable to request 9 nodes exclusively, even though there are more than 20 nodes in her partition in an idle state?
Hello Max,

>> why the user is unable to request 9 nodes exclusively
>> even though there are more than 20 nodes in her partition in an idle state?

This is why I am asking for all the outputs requested in comment 20 *and* the exact command used to submit a job that exhibits this behavior. I understand that after the DefMemPerCPU change, the jobs are not revoked and they are just stuck in the pending state, so it is very likely that we are in a different scenario than the one described at the beginning of the ticket.

Once I have all that, if I still do not see a clear cause for this, would you be open to adding specific debug flags / increasing the debug level of slurmctld and trying to reproduce the issue again so I can get more diagnostic data from the logs?

Best regards, Ricard.
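P.S. If it comes to that, the change would be along these lines (a sketch only; I will confirm the exact level and flags when the time comes):

# Temporarily raise slurmctld verbosity and enable scheduling-related debug flags:
scontrol setdebug debug2
scontrol setdebugflags +Backfill
# ... reproduce the pending job, then revert:
scontrol setdebug info
scontrol setdebugflags -Backfill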
Ricard,

When Lauren, our end user, attempts to launch a job, it fails immediately. The scontrol show jobs command does not provide any indication of why the job failed, and there are no relevant entries in the logs. Please refer to the latest feedback from Lauren, which I have included. I have also requested that Lauren be granted authorized admin access so she can assist with troubleshooting her workflow in real time. Lauren has been CC'd on this email.

-------------[FEEDBACK]--------------------------

The job actually aborted very quickly. I am now having issues with using all of the nodes I allocate, i.e., I request 9 nodes to complete my job but only 8 are being used - thus the job fails instantly. I have just tried to troubleshoot by requesting a larger amount of nodes (10) in my initial allocation, but I encounter the same error.

This is the allocation command I am using:

salloc -N 9 -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00

This is a new error on the same job I was successfully (but with painstaking slowness) running ~5 days ago. I have not made any code changes that could explain why this failure is occurring. This is quite representative of my experience on Ginsburg over the past 6 months - inexplicable errors on routine jobs that have halted my research progress.

Max or Waqas, can you please advise? I would like to help the SchedMD engineers with diagnosing my workflow issues but it's not clear to me how I can do so without being able to successfully launch a 5-day job.

Thank you,
Lauren
----------

Could you please advise on the next steps for Lauren to take on her end? Additionally, how would you like me to increase the debug level for further investigation?
Hello everyone,

This will be a long message, but I really need everyone to be thorough with it and follow along. Do not take this as harshness; I am saying this because it is in our mutual best interest to deal with the issues as quickly and efficiently as possible.

Please let us take a moment to step back a bit and organize our thoughts on what needs to be addressed in manageable chunks. The topic/issue at hand has been shifting around for a while; we need to focus on *clearly defined and specific* issues, one at a time. Please remember that I have no context outside of what has been said in this conversation. With that being said, here is a list of all the topics I have been picking up since the start of the ticket alongside their current status:

1 - Allocations with specific parameter combinations being *revoked*, aka rejected and not even sent to the queue for execution. See the first message in the ticket:

>> [lm3391@ruth ~]$ salloc -N 9 --exclusive -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
>> salloc: error: Job submit/allocate failed: Requested node configuration is not available
>> salloc: Job allocation 1090297 has been revoked.

Status: This supposedly has been addressed by changing DefMemPerCPU in slurm.conf. I understand that this *does not happen anymore*. Please correct me if that is not the case.

2 - Jobs getting sent to the queue, but getting stuck as "Pending", concretely with the reason "Resources". Allegedly, there are enough resources in the target partition for these jobs to be allocated.

Status: I still have to see concrete outputs showing this behavior. Refer to what I have requested in comment 20. I cannot stress enough the importance of seeing the outputs as they are, to rule out possible user errors and see the state of the cluster so I can analyze it. I will put my requests here again so we have everything in a single message:

* Make sure that the job is pending and there are supposedly enough resources for it.
* Provide the exact command used to submit that job. Can I assume that it is always "salloc -N 9 -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00"?
* Provide the complete outputs of "squeue" and "scontrol show job <job_id>".
* Provide the complete output of "sinfo".
* Confirm that your slurm.conf is still the same and that the only change done to it is the DefMemPerCPU value.

This is the bare minimum I need, *all of it*. If nothing is found using those outputs, then we will proceed to increase the debug level and flags of the controller, repeat the reproducer and analyze the logs. I will provide the specific instructions for this if it comes to that.

After this, I have seen new topics coming from the user feedback:

3 - Jobs timing out, which I assume is related to what the user describes as jobs taking a lot more time executing than they used to (executing, not waiting in queue + executing). This has been explained in comments 19 and 25.

Status: This does not look related to points 1 and 2. The job enters execution and takes more time than expected. This issue should be *treated separately after we deal with points 1 and 2*, since there are a lot of possible causes for this, which can be related or (most likely) unrelated to scheduler usage or policies.

4 - Jobs instantly "failing" as soon as they are allocated. As per comment 25:

>> "When Lauren, our end user, attempts to launch a job, it fails immediately."
Status: This is a new topic for which I have no context. How are we defining "failing" here? The job entering execution and then the workload itself failing in some way, thus ending the job early as FAILED? Or are we talking about the job getting revoked in some way, which would be related to points 1 and 2? Again, we need to tackle points 1 and 2 first. After that is out of the way, we need to see if there is any relation with point 3 too. This is probably related to the workload itself and not the scheduler, so you should be checking the job's outputs. If you still find nothing, I can try to provide further guidance *after points 1 and 2 are dealt with*.

5 - Jobs where only some of the nodes get used. This may be a cause for point 4, but I have no context about this.

Status: Again, I have no context for this. This is possibly related to the workload the user is launching, for which I have no information other than the allocation command used. The user workload itself should be revised for correctness to confirm that there are indeed nodes that show no activity. Again, this comes after points 1 and 2.

6 - This is just something I have noticed, but I do not know what the user workload is. Running any kind of workload interactively (via salloc) that requires 5 days is not something I would recommend at all. Is there a reason why sbatch is not used instead? If possible, these types of long-running computations should be converted to sbatch scripts.

This is everything I have been picking up. Points 1 and 2 are scheduler-related and need my attention first. Points 3-6 seem related to the user workload or the infrastructure surrounding it.

Please update the status of point 1, and after that give me everything I have requested in point 2 if you have a job exhibiting the behavior described there. *Ignore points 3-6 until points 1 and 2 are clear*. After that is out of the way, we can try to sort out points 3-6, but it is very possible that those are not related to Slurm itself.

We will use this message to have a clear structure and organize what needs to be done.

Best regards, Ricard.
Ricard,

Why does the user land specifically on nodes g[001-009]? Is there a particular reason they are restricted to these nodes? The issue arises when any of the nodes g[001-009] are occupied, causing the user to wait for resources - even when there are 25+ available nodes in the partition.

[lm3391@bader ~]$ salloc -N 9 -A mckinley --mail-type=ALL --mail-type=BEGIN --mail-user=lm3391@columbia.edu --time=5-00:00:00
salloc: Pending job allocation 1351665
salloc: job 1351665 queued and waiting for resources
salloc: job 1351665 has been allocated resources
salloc: Granted job allocation 1351665
salloc: Waiting for resource configuration
salloc: Nodes g[001-009] are ready for job
Lastly, if a user requests exclusive access to any node in the partition, the system waits specifically for these 9 nodes (g[001-009]), ignoring all other available resources in the partition.

root@justice:/burg/home/wh2612# scontrol show partition mckinley1
PartitionName=mckinley1
   AllowGroups=ALL AllowAccounts=mckinley AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=5-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=g[001-038,125-128,049-050,097-098,164-167,184]
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1632 TotalNodes=51 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Hello Waqas, Please refer to point 2 of comment 26. I have explicitly detailed everything I need to perform my initial diagnostic. Your partitions are heterogeneous and I cannot hypothesize with only partial and incomplete information. *Follow my instructions* and *do not skip anything requested/asked there*. Make sure that the outputs that I have requested are relevant and coherent before sending them. Best regards, Ricard.
Hello Waqas, Are there any updates with this issue? Do you have a reproducer with the outputs and information I requested (see comment 26, point 2)? Best regards, Ricard.
Hi Ricard,

I will have the request by COB today.

Thanks
Hi Ricard,

When we run the command you mentioned in comment 26, the job starts too quickly for us to capture the pending state you asked about. However, for the sake of completeness, I have gathered the outputs of `sinfo`, `squeue`, and `scontrol`. The `slurm.conf` remains the same as previously shared, with only the suggested change to the `DefMemPerCPU` value.

Additionally, I'm sharing the user's latest response regarding the issue they are currently facing:

**I deleted all screen sessions and opened a new one. In there, I requested exclusive access to 9 nodes. This created job 1493068. I then sbatch submitted the run_MITgcm.sh script to begin the experiment, which began job 1493073. This is the squeue printout:

Screen Shot 2025-03-21 at 1.39.02 PM.png

This is the first time I have seen a (REASON) printed next to the nodelist. In case you can't read that easily, next to the nodes of the run_MITg job there is the following: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions.

This is also the first time that my experiment hasn't initiated but also hasn't failed. It seems to be stuck in a holding pattern, at least for the past few minutes since I tried to start it.

The issue, as I see it, is that the sbatch command inherently runs a parallel allocation for the nodes that I request separately prior to beginning an experiment. My previously successful workflow circumvented this by initiating experiments using " ./run_MITgcm.sh " - which only used my requested nodes, and never interfered with another allocation.

What do you think? I'm happy to meet now if you are able to discuss, but we can also continue emailing until our next regularly scheduled weekly meeting on 3/26.

Best, Lauren**
Created attachment 41273 [details] sinfo
Created attachment 41274 [details] squeue
Created attachment 41275 [details] scontrol
Hello Waqas,

>> SubmitTime=2025-03-26T14:56:53
>> StartTime=2025-03-26T14:56:53

This job entered execution instantaneously; it does not represent the case to be investigated. If I remember correctly, it was mentioned some time ago that the pending-for-resources issue happened if NodeList=g[001-009] was not available at the time of submission. Have you tried allocating some of those nodes yourself beforehand to try to force the same scenario?

>> **I deleted all screen sessions and opened a new one. In there, I requested
>> exclusive access to 9 nodes. This created job 1493068. I then sbatch submitted
>> the run_MITgcm.sh script to begin the experiment, which began job 1493073.

I do not understand this. If the actual workload to be performed is in the sbatch job, the first salloc would be wasting 9 nodes (the actual work is done in job 1493073, which has its own allocation). Is there a reason for this?

>> next to the nodes of the run_MITg job there is the following: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions.

With the exception of "reserved for jobs in higher priority partitions" (it is a bit tricky to confirm, and even then I think you do not have job preemption enabled), have you checked if the other reasons were accurate?

>> The issue, as I see it, is that the sbatch command inherently runs a parallel
>> allocation for the nodes that I request separately prior to beginning an
>> experiment. My previously successful workflow circumvented this by initiating
>> experiments using " ./run_MITgcm.sh " - which only used my requested nodes,
>> and never interfered with another allocation.

I really have a feeling that the user is misunderstanding how salloc, sbatch, or allocations operate in general. I have no knowledge of what she is actually running or how it would be interfering with another allocation, but the main issue here is that it looks like she is using two different allocations (one from salloc, another from sbatch) for a reason that is still not apparent. What is stopping her from just launching her workload via sbatch and ignoring salloc completely?

These are my comments and questions for now, but I cannot shake the feeling that we are dragging this ticket around with no real direction or purpose. Each time that there is an interaction, the case and context to be investigated changes completely. We should stick to a concrete topic and thoroughly deal with it before jumping to the next. I will try to reproduce the original "job pending with enough resources available" issue myself in the meantime. If I manage to do that, it will facilitate things.

Best regards, Ricard.
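P.S. A sketch of how the scenario could be forced, assuming a test account with access to the mckinley1 partition (the specific nodes and time limit are only examples):

# Hold a few of the g[001-009] nodes so they are busy:
salloc -p mckinley1 -A mckinley -w g[001-004] --exclusive --time=01:00:00 sleep 3600 &
# Then resubmit the user's 9-node request and, while it is pending,
# capture the squeue/scontrol/sinfo outputs requested earlier.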
Hi Ricard,

We would like to schedule a Zoom call to expedite the resolution of the ongoing issue. The researcher has been waiting for six months to resolve the workflow problem, and we need to confirm that this is not related to a SLURM misconfiguration.

As you can appreciate, the researcher is located on the West Coast, and the time zone difference has caused delays in troubleshooting. Real-time assistance would be far more effective, so we are requesting a training session or live troubleshooting. The researcher has limited availability and is only available on Mondays and Fridays from 3 - 4 PM.

Could you please confirm if one of your engineers would be available during this time to join the Zoom session and assist with troubleshooting? If it is determined that SLURM is not the root cause, we will assign a developer to work directly with the researcher to adjust their workflow as needed.

Additionally, could you please escalate the severity of this ticket? The issue has now exceeded the terms of our SLA.

Thank you for your prompt attention to this matter.

Best,
Waqas
Hello Waqas,

As I said previously, calls fall outside our regular methodology. They are usually an exception and, if done, need to follow this:

1 - There must be a reason, e.g., we cannot get the information via normal means like uploads/logs/CLI output.
2 - There must be an agenda of what we will specifically cover/do.
3 - A time limit of 30 minutes. Engineers can extend it if they need extra time.

Right now, points 1 and 2 are very unclear. I have talked with Jason (director of support) and he said that we can discuss this. Please confirm the timezone of that 3-4 PM slot. After that, discuss the three points above directly with him to see if we can schedule something. Please use jbooth@schedmd.com to contact him about this, and let him know that the context for this is ticket 22172.

Let me know as soon as you do that, just so we are on the same page.

Best regards, Ricard.
Hi Ricard, FYI, I'm scheduling a call with Jason. Respectfully Waqas Hanif
Hello, Understood, please give me a brief summary of the meeting after you do it to know the situation and what needs to be done. Best regards, Ricard.
Created attachment 41304 [details] Script file
I definitely will. In the meantime, please review the attached script. Currently, the job is being submitted using the following `salloc` syntax:

$ salloc -N 9 -A mckinley --exclusive --mail-type=ALL --mail-type=BEGIN --mail-user=wh2612@columbia.edu --time=5-00:00:00

Based on our observations, we suspect the issue might be related to how the job is being submitted. You should be able to run some tests in his test environment to prepare for Monday.
Hello Waqas,

Just some comments. I have read the script and, correctness aside, I still do not see what role salloc has here. The first section of the script contains SBATCH pragmas, so this looks intended to be launched directly via sbatch. Having an allocation first via salloc does not contribute anything if the script is later launched using sbatch.

It would be another thing if, once you have the salloc allocation, you executed the script as is (./runMITgcmn instead of using sbatch). It looks like the script can detect if it is being run inside an allocation already, and in that case it would use salloc's allocation. However, I mentioned some time ago that it is preferable to execute long workloads using sbatch instead of interactively via salloc. That means getting rid of salloc entirely and using an equivalent sbatch command in its place.

As it stands, I still have a feeling that there is a misunderstanding here about how allocations are meant to be used, unless there is an actual reason for this salloc+sbatch combination that we do not know about.

Best regards, Ricard.
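P.S. For illustration, a minimal sketch of what an sbatch-only submission could look like, reusing the flags from the salloc command quoted earlier in the ticket (the script name is as mentioned above; adjust the path as needed). Note that command-line options override any matching #SBATCH pragmas inside the script:

sbatch -N 9 -A mckinley --exclusive \
       --mail-type=ALL --mail-type=BEGIN --mail-user=wh2612@columbia.edu \
       --time=5-00:00:00 \
       ./runMITgcmn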
Ricard,

The script requires a minimum allocation of 9 nodes. To successfully execute the job, the researcher follows these procedural steps:

1. Initialize a screen session to manage job execution in the background:

$ screen -dR

2. Request resources via `salloc` to allocate 9 nodes exclusively for the job, ensuring the correct project and necessary configurations are applied:

$ salloc -N 9 -A mckinley --exclusive --mail-type=ALL --mail-type=BEGIN --mail-user=wh2612@columbia.edu --time=5-00:00:00

3. Execute the job on one of the allocated nodes by running the following command:

$ ./runMITgcmn

Let me know if further clarification is needed.
Ricard,

I wanted to give you my theory regarding the resource allocation issue before tomorrow's Zoom session. It appears that the use of salloc and srun may be causing the jobs to remain in a pending state due to resources being requested twice, which could explain why two job IDs are generated.

Here's the breakdown:

An interactive job is initiated with salloc and srun, requesting 9 nodes. The job script (./runMITgcmn) then requests an additional 9 nodes, effectively doubling the requested resources. This causes ambiguity, as the script generates a local node list file, and the system is unsure whether to use the 9 nodes assigned by the interactive job or the 9 nodes requested by the script, leaving the job pending.

If this theory is correct, no changes would be required to the job script itself. Instead, the solution would be to adjust how the job is submitted, ensuring that resources are only requested once.

Looking forward to discussing this in tomorrow's session.

Max
Hello Max,

>> An interactive job is initiated with salloc and srun, requesting 9 nodes.

I think you mean just "salloc". The "srun" would be inside the script.

>> The job script (./runMITgcmn) then requests an additional 9 nodes,
>> effectively doubling the requested resources.

If we only do "./runMITgcm" (important to differentiate between "./runMITgcm" and "sbatch runMITgcm") once we are inside the salloc session, this will not create any more allocations. If the script is still the same as the one you provided earlier, the script internals amount to the following srun call:

>> export SLURM_HOSTFILE=./slurm_nodes.txt
>> export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/21.08.8/lib64/libpmi.so
>> export I_MPI_DEBUG=5
>> export I_MPI_FABRICS="shm:ofi"
>> srun --nodes $nNodes \
>>      --ntasks $nProcs \
>>      --job-name=MITgcm \
>>      --export=SLURM_HOSTFILE,I_MPI_PMI_LIBRARY,I_MPI_DEBUG,I_MPI_FABRICS,ALL \
>>      --output=MITgcm.o%j \
>>      --error=MITgcm.e%j \
>>      --label \
>>      --verbose \
>>      --exclusive \
>>      --sockets-per-node=$socketsPerNode \
>>      --cores-per-socket=$coresPerSocket \
>>      --cpu-bind=verbose,rank_ldom \
>>      --mem-bind=verbose,local \
>>      --distribution=arbitrary \
>>      --mem=0 \
>>      ./${MITgcmExec}

There are two things to unpack here:

1 - That slurm_nodes.txt will only contain nodes from the salloc allocation. They are used for some sort of load-balancing logic implemented in that script (which can probably be simplified by just using srun parameters).

2 - This srun, as it is right now, should use the salloc allocation's resources. It will not create another allocation or go through the scheduler. It is what we call a "step" inside a job.

In short, there will not be two allocations, only the one from salloc. I have not seen any "sbatch" inside the script, but I know that it has been mentioned beforehand in the ticket. I think that there are some concepts that need to be clarified before we continue, so the call should help with that.

FYI, I will not be present in the meeting, since it will be 8 PM in my time zone. Jason will be the one coming over if I am not mistaken.

Best regards, Ricard.
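P.S. A quick illustration of the step-versus-allocation point (node count reduced for brevity; the job ID shown is hypothetical):

$ salloc -N 2 -A mckinley --time=00:10:00
salloc: Granted job allocation 1234567
$ srun -N 2 hostname     # runs as step 1234567.0 inside the existing allocation
$ squeue -u $USER        # shows a single job, 1234567; no second allocation is created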