Ticket 8504

Summary: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
Product: Slurm
Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Component: slurmd
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED CANNOTREPRODUCE
Severity: 3 - Medium Impact
Version: 18.08.7
Hardware: Linux
OS: Linux
Linux Distro: RHEL
Site: deCODE
Machine Name: ru-hpc-0361.decode.is
Attachments: slurmd log

Description Hjalti Sveinsson 2020-02-13 02:47:26 MST
Hello, we are seeing some strange errors on two compute nodes this morning, but there is no error in the .err file and nothing in the .out file. The only thing we see in the logs on the compute nodes is this:

[2020-02-12T17:31:30.632] _run_prolog: prolog with lock for job 59854450 ran for 0 seconds
[2020-02-12T17:31:30.632] Launching batch job 59854450 for UID 1065
[2020-02-12T17:31:30.927] [59854450.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2020-02-12T17:31:30.936] [59854451.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2020-02-12T17:31:30.992] [59854450.batch] done with job
[2020-02-12T17:31:31.015] [59854451.batch] done with job

And on the head node we are getting:

root@ru-lhpc-head:~# grep 59854450 /var/log/slurm/slurmctld.log
[2020-02-12T17:31:30.416] _slurm_rpc_submit_batch_job: JobId=59854450 InitPrio=0 usec=5686
[2020-02-12T17:31:30.521] sched: _release_job_rec: release hold on JobId=59854450 by uid 1065
[2020-02-12T17:31:30.521] _slurm_rpc_update_job: complete JobId=59854450 uid=1065 usec=1949
[2020-02-12T17:31:30.618] sched: Allocate JobId=59854450 NodeList=ru-hpc-0361 #CPUs=1 Partition=cpu_hog
[2020-02-12T17:31:30.971] _job_complete: JobId=59854450 WEXITSTATUS 1
[2020-02-12T17:31:30.992] _job_complete: JobId=59854450 done

These nodes, along with over 150 others, were added to the cluster a few weeks back and have been working fine. There is nothing in journalctl from that time (17:31:*) that gives an indication of why these jobs failed, and there were enough resources.

Please help with solving this issue.
Comment 1 Hjalti Sveinsson 2020-02-13 02:54:05 MST
Created attachment 13040 [details]
slurmd log
Comment 3 Marcin Stolarek 2020-02-13 10:09:38 MST
Hjalti,

A job status of 256 is set when Slurm is unable to set up the job's output. Looking at your slurmd log, there are multiple lines like:
>error: Could not open stdout file /nfs/odinn/datavault/assoc/

Is the NFS filesystem mounted on the nodes in question?

It would be standard good practice to verify this with a check in your HealthCheckProgram.
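A minimal sketch of such a check, assuming the job output lives under /nfs/odinn (the logged path is truncated, so the exact mount point is a guess) and that scontrol is available on the node:

```shell
#!/bin/bash
# Hypothetical HealthCheckProgram sketch: drain the node when the NFS
# mount backing job output paths is missing, so Slurm stops scheduling
# new jobs onto it. Mount point and reason text are illustrative.

check_mount() {
    # mountpoint(1) exits 0 only if the directory is an active mount point
    mountpoint -q "$1"
}

main() {
    local mp=/nfs/odinn
    if ! check_mount "$mp"; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="healthcheck: $mp not mounted"
    fi
}

# main "$@"    # uncomment when installing as HealthCheckProgram
```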

cheers,
Marcin
Comment 4 Marcin Stolarek 2020-02-17 00:50:17 MST
Hjalti,

Were you able to check whether the mentioned filesystem is correctly mounted on the affected nodes?

Is there anything else I can help you with?

cheers,
Marcin
Comment 5 Hjalti Sveinsson 2020-02-18 08:29:58 MST
Hi, the filesystem mount errors you mention are from the 20th of January and earlier. The errors I sent were the ones we were seeing last week.

Yes everything is mounted and the node has not been rebooted.

We do have a Slurm health check program that runs every 30 minutes and checks for these things.
Comment 7 Marcin Stolarek 2020-02-19 01:35:36 MST
Hjalti,

You're right - I missed the date of the logs. I'm checking the code to find the probable root cause. 

Are you still experiencing the issue? If yes, could you please increase SlurmdDebug to verbose (or, if possible, debug, which is preferred)? This should provide additional information around task handling, which will be very helpful for understanding the source of the return code/status.
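For reference, a sketch of that change, assuming the usual slurm.conf location (your path may differ):

```
# /etc/slurm/slurm.conf
SlurmdDebug=debug      # preferred; use "verbose" if debug is too noisy
```

After editing, `scontrol reconfigure` should propagate the new debug level to the nodes.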

Do you know if this is application-specific? Is this an MPI job, or does it use PMI to launch tasks?

cheers,
Marcin
Comment 8 Hjalti Sveinsson 2020-02-23 07:29:59 MST
Hi, no, we are not experiencing this anymore. Maybe we close this one for now, and I will re-open a case when it happens next time?
Comment 9 Marcin Stolarek 2020-03-05 03:54:48 MST
Hjalti,

I failed to find a way to trigger this. I'll close it now with "cannot reproduce" status, as you suggested in comment 8. However, if you notice it happening again, please reopen.

cheers,
Marcin