| Summary: | Unable to satisfy cpu bind request when launching nested jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | Scheduling | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek |
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | CentOS |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | ||
|
Description
GSK-ONYX-SLURM
2022-09-07 00:45:56 MDT
Could you please attach your slurm.conf?

Created attachment 26640
slurm.conf
Radek,

I am not able to reproduce the issue with your example. The jobs are not nested: sbatch submits a separate new job even when run from inside another job.

The error you are seeing comes from mask or map CPU binding. If the environment variable SLURM_CPU_BIND is set, it could lead to this error even without using --cpu-bind. This would make sense because --cpu-bind=none or quiet would override SLURM_CPU_BIND.

See the documentation: https://slurm.schedmd.com/srun.html#OPT_cpu-bind

-Scott

Hi Scott,

I forgot to add srun to the outer.sh file. The original version is:

    #!/bin/sh
    #SBATCH --mem=8192
    #SBATCH --time=120
    #SBATCH --cpus-per-task=1
    srun sbatch --array=0-3 inner.sh

As you can see, there is srun and then sbatch. The user says that it worked in the previous version of Slurm and has now stopped working. Once he removes srun, it works fine. Even though it looks odd, could you please advise?

Thanks,
Radek

Radek,

srun sbatch doesn't make sense and shouldn't make a difference. sbatch will still launch a new, non-nested job. Here srun launches a step in the first job only to submit a new job to slurmctld. It doesn't make sense to do that. Testing it, I don't see any difference.

-Scott

Hi Scott,

I know, and this is exactly what we told the user. Once he modified the script, everything seems to work without any errors. I think we can close the ticket.

Thanks,
Radek

Closing ticket |
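For anyone hitting the same error, the two suggestions in this thread (check for an inherited SLURM_CPU_BIND, and call sbatch directly instead of through srun) could be combined roughly as follows. This is only a sketch: the helper name `clear_cpu_bind` is hypothetical, and it assumes an inherited SLURM_CPU_BIND is the cause, which was suggested in the ticket but never confirmed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): report and clear an inherited
# SLURM_CPU_BIND so that later srun calls do not pick it up implicitly.
clear_cpu_bind() {
    if [ -n "${SLURM_CPU_BIND:-}" ]; then
        echo "SLURM_CPU_BIND was set to: $SLURM_CPU_BIND"
        unset SLURM_CPU_BIND
    else
        echo "SLURM_CPU_BIND is not set"
    fi
}

clear_cpu_bind

# Submit the array job directly; wrapping sbatch in srun adds nothing,
# because sbatch always creates a separate, non-nested job.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --array=0-3 inner.sh
else
    echo "sbatch not found; skipping submission"
fi
```

Running `env | grep SLURM_CPU_BIND` inside the failing job is a quick way to check whether the variable is present at all before changing any scripts.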