Ticket 4686

Summary:	sbatch doesn't print jobid but a job record is created
Product:	Slurm	Reporter:	Marshall Garey <marshall>
Component:	User Commands	Assignee:	Unassigned Developer <dev-unassigned>
Status:	OPEN ---	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	marshall, ustbdante
Version:	18.08.x
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=4687
Site:	SchedMD	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description Marshall Garey 2018-01-25 12:58:55 MST

See bug 4660.

The sbatch job was rejected and didn't print out the jobid, but a job record was created and could be viewed with scontrol show job and sacct. sbatch should print the job id if the job record is created.

Comment 1 Marshall Garey 2018-02-05 12:21:44 MST

I think this is possible - here's how I would approach it:

I believe it would involve changing the logic in slurm_submit_batch[_pack]_job to make sure a response is always sent back (never NULL). The logic in sbatch.c (main()) would also need to change to distinguish between a successful and a failed batch submission, since right now resp == NULL is how it knows that the batch submission failed. So it would probably need a field added to the submit_response_msg_t struct that says whether the batch submission failed. And changing that would mean changing the protocol to pack/unpack that new field.
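
To make the protocol change concrete, here is a rough, standalone sketch of what adding such a field and packing/unpacking it could look like. It does not use Slurm's real pack/unpack code. The struct name sketch_submit_resp_t, the submit_failed field, and both helper functions are hypothetical names made up for illustration, and job_submit_user_msg is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */

/* Hypothetical variant of submit_response_msg_t.
 * submit_failed does NOT exist in Slurm; it is the proposed new field. */
typedef struct {
	uint32_t job_id;	/* job ID */
	uint32_t step_id;	/* step ID */
	uint32_t error_code;	/* error code for warning message */
	uint32_t submit_failed;	/* hypothetical: nonzero if the job was rejected */
} sketch_submit_resp_t;

#define SKETCH_RESP_WIRE_SIZE (4 * sizeof(uint32_t))

/* Pack all four fields in network byte order.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t sketch_pack_resp(const sketch_submit_resp_t *msg,
			       unsigned char *buf, size_t buf_len)
{
	uint32_t wire[4];

	assert(msg && buf);
	if (buf_len < SKETCH_RESP_WIRE_SIZE)
		return 0;
	wire[0] = htonl(msg->job_id);
	wire[1] = htonl(msg->step_id);
	wire[2] = htonl(msg->error_code);
	wire[3] = htonl(msg->submit_failed);	/* the new field goes last */
	memcpy(buf, wire, sizeof(wire));
	return sizeof(wire);
}

/* Unpack the four fields. Returns 0 on success, -1 on a short buffer. */
static int sketch_unpack_resp(sketch_submit_resp_t *msg,
			      const unsigned char *buf, size_t buf_len)
{
	uint32_t wire[4];

	assert(msg && buf);
	if (buf_len < SKETCH_RESP_WIRE_SIZE)
		return -1;
	memcpy(wire, buf, sizeof(wire));
	msg->job_id = ntohl(wire[0]);
	msg->step_id = ntohl(wire[1]);
	msg->error_code = ntohl(wire[2]);
	msg->submit_failed = ntohl(wire[3]);
	return 0;
}
```

In the real code the new field would presumably have to be packed only for protocol versions that know about it, so that older clients and daemons keep working. That version check is not shown here.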

But I highly suspect this would mess up a bunch of other things that would then need to be fixed.

Since this isn't trivial and we don't have a SOW or sponsorship for this, I'm going to reassign this to dev-unassigned and let it sit on the back burner.

Maybe there's a simpler way, but I don't see it.

Comment 2 Moe Jette 2018-02-05 13:06:41 MST

One thing we have done through the years is add more error codes to more precisely identify the reason that a job is waiting or rejected. I'm not sure how easy it would be to identify the root cause of a job being rejected as being network topology related, but if that is not too difficult to do then I would recommend adding it.

Also note the data structure contents:
typedef struct submit_response_msg {
	uint32_t job_id;	/* job ID */
	uint32_t step_id;	/* step ID */
	uint32_t error_code;	/* error code for warning message */
	char *job_submit_user_msg; /* job submit plugin user_msg */
} submit_response_msg_t;

"job_submit_user_msg" can include an arbitrary message. It could be fairly verbose if desired, for example: "Rejected job 1234 since insufficient resources exist in nodes within a single network domain".
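
As a rough illustration of how a client might use those fields, the sketch below formats one line of output from a response. The struct is copied from above; the helper sketch_format_result and the "Rejected job" wording are hypothetical, and treating a nonzero error_code as a rejection is an assumption made only for this example (the struct comment describes it as a warning code).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct submit_response_msg {
	uint32_t job_id;	/* job ID */
	uint32_t step_id;	/* step ID */
	uint32_t error_code;	/* error code for warning message */
	char *job_submit_user_msg; /* job submit plugin user_msg */
} submit_response_msg_t;

/* Hypothetical helper, not part of sbatch: write a one-line result into out.
 * Returns the value of snprintf(). */
static int sketch_format_result(const submit_response_msg_t *resp,
				char *out, size_t out_len)
{
	assert(out && out_len > 0);

	/* No response at all: nothing to report except the failure. */
	if (!resp)
		return snprintf(out, out_len, "Batch job submission failed");

	/* Assumption: nonzero error_code means the job was rejected,
	 * but job_id is still filled in and can be shown to the user. */
	if (resp->error_code != 0)
		return snprintf(out, out_len, "Rejected job %u (error %u): %s",
				(unsigned) resp->job_id,
				(unsigned) resp->error_code,
				resp->job_submit_user_msg ?
				resp->job_submit_user_msg : "no details");

	return snprintf(out, out_len, "Submitted batch job %u",
			(unsigned) resp->job_id);
}
```

With this shape the job ID reaches the user in both the accepted and the rejected case, and the free-form message carries the reason.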

Comment 3 Marshall Garey 2018-02-09 10:26:56 MST

I've started working on adding an additional error message. It doesn't quite work yet but I think I'm close. I'll definitely want some feedback. I'll go ahead and take the bug again.

Comment 5 Wesley 2023-12-15 01:47:36 MST

(In reply to Marshall Garey from comment #3)
> I've started working on adding an additional error message. It doesn't quite
> work yet but I think I'm close. I'll definitely want some feedback. I'll go
> ahead and take the bug again.

Hi Marshall, did you ever resolve this problem?
We have recently encountered the same issue. We want the sbatch command or the REST API (POST /slurm/v0.0.39/job/submit) to print the job ID of failed jobs. Is there a solution?
Thanks!