Ticket 8504

Summary: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
Product: Slurm
Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Component: slurmd
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED CANNOTREPRODUCE
Severity: 3 - Medium Impact
Version: 18.08.7
Hardware: Linux
OS: Linux
Linux Distro: RHEL
Site: deCODE
Machine Name: ru-hpc-0361.decode.is
Attachments: slurmd log

Description Hjalti Sveinsson 2020-02-13 02:47:26 MST
Hello, we are seeing some strange errors on two compute nodes this morning, but there is no error in the .err file and nothing in the .out file. The only thing we see in the logs on the compute nodes is this:

[2020-02-12T17:31:30.632] _run_prolog: prolog with lock for job 59854450 ran for 0 seconds
[2020-02-12T17:31:30.632] Launching batch job 59854450 for UID 1065
[2020-02-12T17:31:30.927] [59854450.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2020-02-12T17:31:30.936] [59854451.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2020-02-12T17:31:30.992] [59854450.batch] done with job
[2020-02-12T17:31:31.015] [59854451.batch] done with job

And on the head node we are getting:

root@ru-lhpc-head:~# grep 59854450 /var/log/slurm/slurmctld.log
[2020-02-12T17:31:30.416] _slurm_rpc_submit_batch_job: JobId=59854450 InitPrio=0 usec=5686
[2020-02-12T17:31:30.521] sched: _release_job_rec: release hold on JobId=59854450 by uid 1065
[2020-02-12T17:31:30.521] _slurm_rpc_update_job: complete JobId=59854450 uid=1065 usec=1949
[2020-02-12T17:31:30.618] sched: Allocate JobId=59854450 NodeList=ru-hpc-0361 #CPUs=1 Partition=cpu_hog
[2020-02-12T17:31:30.971] _job_complete: JobId=59854450 WEXITSTATUS 1
[2020-02-12T17:31:30.992] _job_complete: JobId=59854450 done

These nodes, along with over 150 others, were added to the cluster a few weeks back and have been working fine. There is nothing in journalctl from that time (17:31:*) that gives an indication of why these jobs failed, and there were enough resources.

Please help with solving this issue.
Comment 1 Hjalti Sveinsson 2020-02-13 02:54:05 MST
Created attachment 13040 [details]
slurmd log
Comment 3 Marcin Stolarek 2020-02-13 10:09:38 MST
Hjalti,

A job status of 256 is set when Slurm is unable to set up the job's output. Looking at your slurmd log, there are multiple lines like:
>error: Could not open stdout file /nfs/odinn/datavault/assoc/

Is the NFS filesystem mounted on the nodes in question?

It would be standard good practice to verify this with a check in your HealthCheckProgram.
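A minimal sketch of such a check, assuming the job output lives under /nfs/odinn (the logged path is truncated, so the exact mount point is a guess) and that scontrol is available on the node:

```shell
#!/bin/bash
# Hypothetical HealthCheckProgram sketch: drain the node when the NFS
# mount backing job output paths is missing, so Slurm stops scheduling
# new jobs onto it. Mount point and reason text are illustrative.

check_mount() {
    # mountpoint(1) exits 0 only if the directory is an active mount point
    mountpoint -q "$1"
}

main() {
    local mp=/nfs/odinn
    if ! check_mount "$mp"; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="healthcheck: $mp not mounted"
    fi
}

# main "$@"    # uncomment when installing as HealthCheckProgram
```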

cheers,
Marcin
Comment 4 Marcin Stolarek 2020-02-17 00:50:17 MST
Hjalti,

Were you able to check whether the mentioned filesystem is correctly mounted on the affected nodes?

Is there anything else I can help you with?

cheers,
Marcin
Comment 5 Hjalti Sveinsson 2020-02-18 08:29:58 MST
Hi, the filesystem mount errors you mention are from the 20th of January and earlier. The errors I sent were the ones we were seeing last week.

Yes everything is mounted and the node has not been rebooted.

We do have a Slurm health check program that runs every 30 minutes and checks for these things.
Comment 7 Marcin Stolarek 2020-02-19 01:35:36 MST
Hjalti,

You're right - I missed the date of the logs. I'm checking the code to find the probable root cause. 

Are you still experiencing the issue? If yes, could you please increase SlurmdDebug to verbose (or, if possible, debug, which is preferred)? This should provide additional information around task handling, which will be very helpful for understanding the source of the return code/status.
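For reference, a sketch of that change, assuming the usual slurm.conf location (your path may differ):

```
# /etc/slurm/slurm.conf
SlurmdDebug=debug      # preferred; use "verbose" if debug is too noisy
```

After editing, `scontrol reconfigure` should propagate the new debug level to the nodes.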

Do you know if this is application-specific? Is this an MPI job, or does it use PMI to launch tasks?

cheers,
Marcin
Comment 8 Hjalti Sveinsson 2020-02-23 07:29:59 MST
Hi, no, we are not experiencing this anymore. Maybe we close this one for now, and I will re-open a case when it happens next time?
Comment 9 Marcin Stolarek 2020-03-05 03:54:48 MST
Hjalti,

I failed to find a way to trigger this. I'll close it now with "cannot reproduce" status, as you suggested in comment 8. However, if you notice it happening again, please reopen.

cheers,
Marcin