| Summary: | sbatch doesn't print jobid but a job record is created | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Marshall Garey <marshall> |
| Component: | User Commands | Assignee: | Unassigned Developer <dev-unassigned> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | marshall, ustbdante |
| Version: | 18.08.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4687 | ||
| Site: | SchedMD | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Marshall Garey
2018-01-25 12:58:55 MST
I think this is possible. Here's how I would approach it: I believe it would involve changing the logic in slurm_submit_batch[_pack]_job to make sure a response is always sent back (never NULL). The logic in sbatch.c (main()) would also need to change to distinguish between a successful and a failed batch submission, since right now resp == NULL is how it knows that the batch submission failed. So it would probably need a field added to the submit_response_msg_t struct that says whether the batch submission failed. Changing that would mean changing the protocol to pack/unpack the new field. I highly suspect this would break a number of other things that would then need to be fixed.

Since this isn't trivial and we don't have a SOW or sponsorship for this, I'm going to reassign this to dev-unassigned and let it sit on the back burner. Maybe there's a simpler way, but I don't see it.

One thing we have done through the years is add more error codes to more precisely identify the reason that a job is waiting or rejected. I'm not sure how easy it would be to identify the root cause of a job being rejected as being network topology related, but if that is not too difficult to do then I would recommend adding it.
Also note the data structure contents:
typedef struct submit_response_msg {
	uint32_t job_id;		/* job ID */
	uint32_t step_id;		/* step ID */
	uint32_t error_code;		/* error code for warning message */
	char *job_submit_user_msg;	/* job submit plugin user_msg */
} submit_response_msg_t;
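To make the approach from the description more concrete, here is a rough, standalone sketch of what adding a failure flag to the response and packing/unpacking it might look like. Everything in it is an assumption for illustration: the struct name submit_response_sketch_t, the submit_failed field, the helper functions, the wire layout, and the failure message format are hypothetical and are not Slurm's actual code or protocol. Only the "Submitted batch job" line mirrors what sbatch prints today.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical variant of submit_response_msg_t. The submit_failed
 * field does NOT exist in Slurm; it is the extra field proposed above. */
typedef struct submit_response_sketch {
	uint32_t job_id;		/* job ID */
	uint32_t step_id;		/* step ID */
	uint32_t error_code;		/* error code for warning message */
	uint32_t submit_failed;		/* hypothetical: nonzero if rejected */
	char *job_submit_user_msg;	/* job submit plugin user_msg */
} submit_response_sketch_t;

#define FIXED_BYTES 20	/* four uint32_t fields + uint32_t message length */

/* Store a 32-bit value in network byte order; returns bytes written. */
static size_t put_u32(uint8_t *buf, uint32_t v)
{
	buf[0] = (uint8_t)(v >> 24);
	buf[1] = (uint8_t)(v >> 16);
	buf[2] = (uint8_t)(v >> 8);
	buf[3] = (uint8_t)v;
	return 4;
}

/* Read a 32-bit value stored in network byte order. */
static uint32_t get_u32(const uint8_t *buf)
{
	return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

/* Pack the fixed fields followed by a length-prefixed message.
 * Returns the number of bytes used, or 0 if the buffer is too small. */
size_t pack_response(const submit_response_sketch_t *r, uint8_t *buf,
		     size_t cap)
{
	const char *msg = r->job_submit_user_msg ? r->job_submit_user_msg : "";
	size_t len = strlen(msg);
	size_t off = 0;

	assert(buf != NULL);
	if (len > UINT32_MAX || cap < FIXED_BYTES || cap - FIXED_BYTES < len)
		return 0;
	off += put_u32(buf + off, r->job_id);
	off += put_u32(buf + off, r->step_id);
	off += put_u32(buf + off, r->error_code);
	off += put_u32(buf + off, r->submit_failed);
	off += put_u32(buf + off, (uint32_t)len);
	memcpy(buf + off, msg, len);
	return off + len;
}

/* Unpack into *r; the message is copied into msg_out and NUL-terminated.
 * Returns 0 on success, -1 on a truncated buffer or too-small msg_out. */
int unpack_response(const uint8_t *buf, size_t size,
		    submit_response_sketch_t *r, char *msg_out, size_t msg_cap)
{
	uint32_t len;

	if (size < FIXED_BYTES)
		return -1;
	r->job_id = get_u32(buf);
	r->step_id = get_u32(buf + 4);
	r->error_code = get_u32(buf + 8);
	r->submit_failed = get_u32(buf + 12);
	len = get_u32(buf + 16);
	if (size - FIXED_BYTES < len || len >= msg_cap)
		return -1;
	memcpy(msg_out, buf + FIXED_BYTES, len);
	msg_out[len] = '\0';
	r->job_submit_user_msg = msg_out;
	return 0;
}

/* Client side: since a response is always present, the flag (rather
 * than resp == NULL) decides which line an sbatch-like tool prints. */
int format_result(const submit_response_sketch_t *r, char *out, size_t cap)
{
	if (r->submit_failed)
		return snprintf(out, cap,
				"Batch job submission failed for job %u (error %u): %s",
				(unsigned)r->job_id, (unsigned)r->error_code,
				r->job_submit_user_msg ?
				r->job_submit_user_msg : "");
	return snprintf(out, cap, "Submitted batch job %u",
			(unsigned)r->job_id);
}
```

In this sketch a rejected job still carries its job_id back to the client, so the caller can report the ID alongside the error code and any plugin message. A real change would additionally need protocol version checks so that older clients and daemons keep working, which is part of why this is not trivial.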
"job_submit_user_msg" can include an arbitrary message. It could be fairly verbose if desired, for example: "Rejected job 1234 since insufficient resources exist in nodes within a single network domain".
I've started working on adding an additional error message. It doesn't quite work yet but I think I'm close. I'll definitely want some feedback. I'll go ahead and take the bug again.

(In reply to Marshall Garey from comment #3)
> I've started working on adding an additional error message. It doesn't quite
> work yet but I think I'm close. I'll definitely want some feedback. I'll go
> ahead and take the bug again.

Hi Marshall,

Have you ultimately resolved this problem? We have recently encountered the same problem. We want the sbatch command or the REST API (POST /slurm/v0.0.39/job/submit) to print the job ID of failed jobs. Is there a solution?

Thanks!