Summary: | Fix regression: `SLURM_NTASKS` is not set in the job environment if `--ntasks-per-node` is specified | ||
---|---|---|---|
Product: | Slurm | Reporter: | Olivier Fisette <ofisette> |
Component: | Documentation | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | bill, bmundim, csamuel, janna.nugent, kaizaad, nathan.wielenga, tim |
Version: | 23.02.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18217 | ||
Site: | Simon Fraser University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | CentOS | Machine Name: | Cedar |
CLE Version: | | Version Fixed: | 23.02.5 23.11.0rc1 |
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
Olivier Fisette
2023-07-04 14:04:34 MDT
As a follow-up to my remark about an easy way to get the expected total number of tasks in a resource allocation, perhaps a new variable (e.g. `SLURM_NTASKS_IN_ALLOC`) could be used to hold the value that was previously available? This would not conflict with job steps that are allocated a different number of tasks. Please note that this is for the Cedar cluster at Simon Fraser University, which holds a support contract.

I'd like to add a "me too". Many of our researchers have relied upon SLURM_NTASKS, and either I missed this removal or it is a bug.

I'll make sure this is documented.

Note on the change: although SLURM_NTASKS was automatically set, it was not always correct. Consider the example from the commit message: https://github.com/SchedMD/slurm/commit/ef513023ad87a3870bf575efd2329672819c59f0

```
Only send the number of tasks to the batch script if they were
explicitly requested.

i.e. sbatch -N2 --ntasks-per-node=1 --wrap="srun -N1 -v env | grep SLURM_STEP"

Before this patch the srun would run 2 tasks instead of just 1.

Bug 15690
```

This example shows how it can be wrong for a job step. Beyond that example, SLURM_NTASKS was calculated as a guess based on some job request parameters and did not use the actual number of tasks in the job allocation, so it is possible that it was not always correct for the job allocation either.

Adding a new environment variable SLURM_NTASKS_IN_ALLOC is an interesting idea, but we need to discuss it further and also guarantee that it is always correct. As a workaround, you can use a job_submit plugin as described here: https://bugs.schedmd.com/show_bug.cgi?id=16278#c2. However, you'll run into the same problems that I just described, where SLURM_NTASKS is not always correct.
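As a rough illustration of that kind of workaround, here is a minimal batch-script sketch (not something shipped with Slurm, and not from this ticket): it assumes the job requested `--ntasks-per-node`, so that sbatch exports `SLURM_NTASKS_PER_NODE` alongside `SLURM_JOB_NUM_NODES`, and it carries the same caveat that the derived value reflects the request rather than the actual allocation. The `-N 2` / `--ntasks-per-node=4` values are only example numbers.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4

# Fallback sketch: if SLURM_NTASKS is missing from the job environment,
# derive it from the per-node task count and node count that sbatch exports.
if [ -z "${SLURM_NTASKS:-}" ] && [ -n "${SLURM_NTASKS_PER_NODE:-}" ]; then
    export SLURM_NTASKS=$(( SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES ))
fi

echo "Assuming ${SLURM_NTASKS} tasks in this job"
srun -n "${SLURM_NTASKS}" hostname
```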
We have pushed a documentation fix in commit 3ae60c6b2e. It will be live on the website when 23.02.5 is released.

Our dev lead has been busy, so we have not yet discussed adding a new environment variable (SLURM_NTASKS_IN_JOB or something like that). However, I have some additional notes about the behavior before version 23.02. SLURM_NTASKS was only set when one of the following options was requested:

--ntasks-per-gpu
--ntasks-per-node
--ntasks

If none of these options were requested, then SLURM_NTASKS was not set. In addition, we set SLURM_NTASKS based on the job request, not the actual job allocation. There are at least two potential problems with that:

(1) The job allocation can have more tasks than the job request (as with --exclusive node allocations). A simple example:

```
salloc --exclusive
```

This job requests one task, but it requests a whole node, so it could be given any number of CPUs. It can run as many tasks as it has CPUs, although the default is one:

```
srun -n1
srun -n2
```

Or a more complicated example:

```
salloc --ntasks-per-gpu=1 --gpus-per-node=2 --exclusive
```

This job requests a node with at least 2 GPUs per node. But it could be allocated a node with more than 2 GPUs, and it would be given all of those GPUs:

```
srun hostname                    # Default: 2 tasks
srun --gpus-per-node=4 hostname  # On a node with 4 GPUs, this would give 4 tasks
```

(2) A range of nodes can be requested:

```
salloc -N1-4
```

So, setting an environment variable with the number of tasks allocated to a job is not as trivial as simply restoring the 22.05 behavior, since SLURM_NTASKS was not always set and was sometimes incorrect when set.

Thanks for the explanation, Marshall. It helped me understand that the number of tasks cannot be inferred in the general case; thus, `SLURM_NTASKS_IN_JOB` would be ill-defined in some situations. At best, it would be possible to compute `SLURM_MAX_NTASKS_IN_JOB` unambiguously, but that sounds silly. I guess the best options are to leave `SLURM_NTASKS` undefined unless `--ntasks` is set (the current behaviour), or to set `SLURM_NTASKS` only when it is unambiguous. The latter feels like more trouble than it would be worth.

Update: We are looking into this request. I'll let you know when I have more information.

Hi there,

Another "me too" from NERSC. We just upgraded to Slurm 23.02.4 and are getting a bunch of people reporting that their scripts no longer work, including our vendor, who is trying to run some tests. I'm going to see if I can put a workaround in via the task prolog for the moment to stem the bleeding, but I would definitely like to see something set that users can refer to if SLURM_NPROCS and SLURM_NTASKS won't get set in this situation.

All the best,
Chris

Chris, another easy workaround is to use the job_submit plugin as described here: https://bugs.schedmd.com/show_bug.cgi?id=16278#c2

*** Ticket 17451 has been marked as a duplicate of this ticket. ***

Hi folks,

We have decided to revert the change in 23.02: in 23.02.5, SLURM_NTASKS will be set in the job's environment if you request --ntasks-per-node. I am updating the title of this bug. We are also updating the documentation to say that SLURM_NTASKS will be set in the job's environment if any of the following options are requested:

--ntasks
--ntasks-per-node
--ntasks-per-gpu

If none of these options are requested, then SLURM_NTASKS is not set. SLURM_NTASKS is set correctly if a node range is requested along with --ntasks-per-node: it will be set to however many tasks are in the job allocation, which depends on the number of nodes allocated to the job. All of this should restore the 22.05 behavior for SLURM_NTASKS. One thing to note if you use a job_submit plugin: num_tasks will not be set if --ntasks-per-node is requested. I will update you when we have pushed changes upstream.

We have pushed fixes upstream ahead of 23.02.5. See commits f3b93ea3e7..cc6f49d1f4.

(In reply to Marshall Garey from comment #41)
> We are also updating the documentation to
> say that SLURM_NTASKS will be set in the job's environment if any of the
> following options are requested:
>
> --ntasks
> --ntasks-per-node
> --ntasks-per-gpu
>
> If none of these options are requested, then SLURM_NTASKS is not set.

Correction: I found a few edge cases where SLURM_NTASKS is still set. We opted not to document them because they are rare, and for now we do not plan on changing the behavior of these edge cases. We are wary of making further changes to how SLURM_NTASKS is set.

I'm closing this as fixed in 23.02.5.
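As a quick sanity check of the restored behavior, a minimal sketch (assuming an upgraded 23.02.5 cluster and that the controller grants exactly the two nodes requested):

```bash
# With only --ntasks-per-node (no --ntasks), SLURM_NTASKS should again be
# present in the batch job's environment: 2 nodes x 4 tasks per node = 8.
sbatch -N2 --ntasks-per-node=4 --wrap='echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"'
# Expected line in the job's output file: SLURM_NTASKS=8
```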