| Summary: | Jobs failing in parallel cluster | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | J.P. Waller <jwaller> |
| Component: | Cloud | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | schaudhari |
| Version: | 23.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Acadian Asset | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sacct output for a jobid that failed today, scontrol show config, slurmctld log file | | |
Description
J.P. Waller, 2023-05-11 14:20:19 MDT

(The original report is quoted in the first reply below.)
**Broderick Gardner:**

(In reply to J.P. Waller from comment #0)
> Summary: We think jobs are completing, but there is a validation step at the end which is failing and marking jobs as failed. This is happening on compute worker nodes in the parallel cluster.

What kind of validation step are you thinking of? Something in Slurm?

> SLURM says partition 3789 has failed.

What does "partition" mean here? Partitions don't really "fail" in Slurm. The node? The job?

Is this happening with all jobs or just some? It looks like you are using array jobs. Are all of the individual jobs failing, or just some? Please show the output of:

```
sacct -j <job id>
```

Thanks

**Shalmali Chaudhari:**

Hi Broderick,

To give more information on what we are facing: when we submit a job on our AWS ParallelCluster, it completes all of its steps, but at the very end, when it should send out a job status notification, the job fails. There is no indication in the log files the job generates of why it is failing.

> What kind of validation step are you thinking of? Something in Slurm?

When we submit jobs to Slurm on the ParallelCluster, each job performs multiple steps (most of them are S3-bucket write steps). The very first step, `__validation__`, makes sure there are no code errors such as bad config YAML files or imports. Then the rest of the steps are performed.

> What does "partition" mean here? Partitions don't really "fail" in Slurm.

In the example of the failed job that JP provided, by "partition" we meant one of the S3 write steps in the DFWFactor_daily_risk step of our submitted job. The log file for that step, however, indicated it completed without issue.

> Is this happening with all jobs or just some? Are all of the individual jobs failing, or just some?

Most of the jobs submitted on this particular parallel cluster are failing. We ran a test limiting the number of worker processes to 200 while submitting that same job, and that succeeds.
But then if that limit is removed again, it fails.

> Show sacct -j <job id>

This is for one of the jobs that failed today. I am attaching a text file with this comment containing the complete output of `sacct --job=<jobid>`:

```
adm_schaudhari@use1-pc2110:~$ sacct --job=27909825_4646 --format=JobID,JobName,Partition,State
JobID           JobName  Partition      State
------------ ---------- ---------- ----------
27909825_46+ neo.Prod2+      batch     FAILED
27909825_46+      batch            CANCELLED
```

Please let me know what other logs you might want to look into. Thank you.

Created attachment 30296 [details]
sacct output for a jobid that failed today
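For array jobs at this scale, it can help to tally task states rather than reading the sacct output row by row. A minimal Python sketch; the sample text and the `state_counts` helper are illustrative only, not part of the ticket, and in practice the input would come from `sacct --parsable2 --noheader --format=JobID,JobName,Partition,State`:

```python
# Sketch: tally per-task states from parsable sacct output.
# Sample data is hypothetical, mirroring the ticket's format.
from collections import Counter

sample = """\
27909825_4646|neo.Prod2+|batch|FAILED
27909825_4646.batch|batch||CANCELLED
27909825_4647|neo.Prod2+|batch|COMPLETED
"""

def state_counts(sacct_text: str) -> Counter:
    """Count job states, skipping per-step records (JobIDs containing '.')."""
    counts = Counter()
    for line in sacct_text.splitlines():
        jobid, _name, _partition, state = line.split("|")
        if "." not in jobid:          # keep only top-level array-task records
            counts[state] += 1
    return counts

print(state_counts(sample))
```

A summary like this makes it easy to see whether all 4979 array tasks fail or only a handful.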
**Broderick Gardner:**

Okay, please attach the whole slurmctld.log, and also the output of this command:

```
scontrol show config
```

(In reply to Shalmali Chaudhari from comment #2)
> When we submit jobs to Slurm on the ParallelCluster, each job performs multiple steps (most of them are S3-bucket write steps). The very first step, `__validation__`, makes sure there are no code errors such as bad config YAML files or imports. Then the rest of the steps are performed.

Can you provide an example of a job batch script and how you submit it?

> Most of the jobs submitted on this particular parallel cluster are failing. We ran a test limiting the number of worker processes to 200 while submitting that same job, and that succeeds. But then if that limit is removed again, it fails.

How did you "[limit] the number of worker processes"? In Slurm, or in the OS?

This line:

```
[2023-05-10T03:18:56.273] _job_complete: JobId=27404372_3789(27408164) WTERMSIG 11
```

and this sacct line:

```
27909825_520 neo.Prod2+      batch          1     FAILED     0:11
```

indicate that the job process seg faulted (signal 11), meaning it crashed by accessing invalid memory.

**Shalmali Chaudhari:**

slurmctld.log attached; scontrol config file attached.

> Can you provide an example of a job batch script and how you submit it?

We have a wrapper called "runner", which is what we use to submit our jobs to Slurm from the master node of the ParallelCluster.
Below is the job status email with the job submission command:

```
Subject: FAILURE: Job neo.<modelcode>@r7t2ykzo@<username>.mdw (started @2023-05-16 11:55:26, duration = 21 mins)

Overall status: FAILURE

Submission command (executed on general-dy-m512xlarge-1):
python /home/<username>/git_ws/neo/pyenv/lib/python3.7/site-packages/runner/Runner.py --conf /home/<username>/git_ws/neo/neo/conf/models/<model_code>.yaml --options /home/<username>/git_ws/neo/neo/conf/NeoOptions.yaml --config_xsd /home/<username>/git_ws/neo/neo/conf/xsd/Neo.xsd --lite --newrelic --skip_view_creation --cluster --bucket=aam-<username>-avm --database=<username> --server=<psql_db_server> --newrelic --run_id=10 --start_date=20040101 --end_date=20230131 --DFWFactor_daily_stab_coeff --weekday

Processed steps:
Step __validation__ (1 jobs, TOT = 12 secs, PRT = 0 secs) rss=0.21GB vms=0.47GB, mmax=nanGB
  logs: /home/<username>/tmp/slurm/neo.<modelcode>/r7t2ykzo/log/__validation__
Step DFWFactor_daily_stab_coeff (4979 jobs, TOT = 17 mins, PRT = 1 secs) rss=0.22GB vms=0.62GB, mmax=-1.00GB
  logs: /home/<username>/tmp/slurm/neo.<modelcode>/r7t2ykzo/log/DFWFactor_daily_stab_coeff
  1/4979 failed
```

> How did you "[limit] the number of worker processes"? In Slurm, or in the OS?

The same wrapper has a parameter that limits the worker processes. This operates at the Slurm level with runner partitions: with this parameter set, runner will not release pending tasks from a job pool unless the number of actively running tasks from that pool is below 200. Hope that answers the question.

> This line:
> [2023-05-10T03:18:56.273] _job_complete: JobId=27404372_3789(27408164) WTERMSIG 11
> and this sacct line:
> 27909825_520 neo.Prod2+ batch 1 FAILED 0:11
> indicate that the job process seg faulted (signal 11), meaning it crashed by accessing invalid memory.

Oh, let me see if I can find memory-related messages in any of the system logs on the nodes.

Created attachment 30322 [details]
scontrol show config
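The throttling behaviour described above (runner releases pending tasks only while fewer than 200 are actively running) could be sketched roughly as follows. This is a sketch under assumptions, not runner's actual code: `pending`, `active_count`, and `submit` are hypothetical stand-ins for the wrapper's internals.

```python
import time

MAX_ACTIVE = 200   # the limit the wrapper reportedly enforces at the Slurm level

def drain_pool(pending, active_count, submit, poll_interval=5.0):
    """Release pending tasks only while fewer than MAX_ACTIVE are running.

    pending      -- list of task descriptors still to be submitted
    active_count -- callable returning the number of currently running tasks
    submit       -- callable that submits one task to Slurm
    All three are hypothetical stand-ins for runner's internals.
    """
    queue = list(pending)
    while queue:
        if active_count() < MAX_ACTIVE:
            submit(queue.pop(0))       # a slot is free: release the next task
        else:
            time.sleep(poll_interval)  # pool is full: back off and re-poll
```

If jobs succeed only when this throttle is enabled, that points at some resource exhausted when all ~5000 tasks run at once, which fits the seg-fault symptom discussed below in the ticket.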
Created attachment 30323 [details]
slurmctld log file
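For reference, sacct's ExitCode field has the form `return_code:signal`, so the `0:11` above decodes to termination by signal 11 (SIGSEGV). A small illustrative helper, not part of the ticket:

```python
import signal

def decode_exit(field: str) -> str:
    """Decode a sacct ExitCode field of the form 'rc:sig'.

    '0:0' is a clean exit; '0:11' means killed by signal 11 (SIGSEGV).
    """
    rc, sig = (int(x) for x in field.split(":"))
    if sig:
        name = signal.Signals(sig).name   # e.g. 11 -> 'SIGSEGV'
        return f"killed by signal {sig} ({name})"
    return f"exited with return code {rc}"

print(decode_exit("0:11"))   # killed by signal 11 (SIGSEGV)
```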
**Broderick Gardner:**

Have you found anything on the seg fault issue? I've looked at the logs, but since the crash was in the job process, there's not a lot to see at this point. Downgrading the severity.

**Broderick Gardner:**

Do you have any update on this issue?

**Broderick Gardner:**

Closing for now. If you see this issue again and have more information, please reopen or file a new ticket. Thanks