Ticket 12714

Summary: slurm/v0.0.37/job/submit causes the slurmrestd daemon to crash.
Product: Slurm Reporter: brown kestrel <qing.na>
Component: slurmrestd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: nate
Version: 21.08.1   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description brown kestrel 2021-10-21 00:21:52 MDT
Everything works fine when I use slurmrestd, but as soon as I add the argv parameter to the request body, slurmrestd terminates with a segmentation fault.
Is this a bug?

Request URL: http://{{slurmUrl}}/slurm/v0.0.37/job/submit

Request Body (working):
{
    "job": {
        "name":"demo_test",
        "current_working_directory":"/gfs/jobs",
        "tasks": 1,
        "nodes": [1,2],
        "environment": {
            "PATH":"/bin:/usr/bin/:/usr/local/bin/",
            "LD_LIBRARY_PATH":"/lib/:/lib64/:/usr/local/lib"
        },
        "standard_output": "demo.%j.out"
    },
    "script":"#!/bin/bash\n sbatch demo.sh"
}
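For context, a request like the one above is POSTed to the slurmrestd endpoint with user-name/token headers. A minimal Python sketch of building that request, assuming JWT authentication; the host, port, user name, and token below are placeholders, not values from this report:

```python
import json
import urllib.request

# Placeholder endpoint and credentials; adjust for your site. With JWT auth
# enabled, a token can typically be obtained via `scontrol token`.
SLURM_URL = "http://localhost:6820"
SLURM_JWT = "placeholder-token"

# The working request body from the report.
payload = {
    "job": {
        "name": "demo_test",
        "current_working_directory": "/gfs/jobs",
        "tasks": 1,
        "nodes": [1, 2],
        "environment": {
            "PATH": "/bin:/usr/bin/:/usr/local/bin/",
            "LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib",
        },
        "standard_output": "demo.%j.out",
    },
    "script": "#!/bin/bash\n sbatch demo.sh",
}

req = urllib.request.Request(
    f"{SLURM_URL}/slurm/v0.0.37/job/submit",
    data=json.dumps(payload).encode(),
    headers={
        "X-SLURM-USER-NAME": "demo",  # placeholder user
        "X-SLURM-USER-TOKEN": SLURM_JWT,
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually submit
print(req.full_url)
```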


Request body that causes slurmrestd to crash:

{
    "job": {
        "name":"demo_test",
        "current_working_directory":"/gfs/jobs",
        "tasks": 1,
        "nodes": [1,2],
        "environment": {
            "PATH":"/bin:/usr/bin/:/usr/local/bin/",
            "LD_LIBRARY_PATH":"/lib/:/lib64/:/usr/local/lib"
        },
        "standard_output": "demo.%j.out",
        "argv": [
            "hello"
        ]
    },
    "script":"#!/bin/bash\n sbatch demo.sh"
}
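To make the trigger explicit, the two request bodies can be compared programmatically; a small sketch confirming that the only difference between the working and crashing payloads is the "argv" field in the "job" object:

```python
# Shared fields of the "job" object from both request bodies in the report.
base_job = {
    "name": "demo_test",
    "current_working_directory": "/gfs/jobs",
    "tasks": 1,
    "nodes": [1, 2],
    "environment": {
        "PATH": "/bin:/usr/bin/:/usr/local/bin/",
        "LD_LIBRARY_PATH": "/lib/:/lib64/:/usr/local/lib",
    },
    "standard_output": "demo.%j.out",
}
script = "#!/bin/bash\n sbatch demo.sh"

working = {"job": dict(base_job), "script": script}
crashing = {"job": {**base_job, "argv": ["hello"]}, "script": script}

# The minimal delta: only "argv" is added in the crashing payload.
extra_keys = set(crashing["job"]) - set(working["job"])
print(extra_keys)  # {'argv'}
```

This confirms the report's claim: the segfault is triggered purely by the presence of the argv field.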


slurmrestd log

=======
21 14:14:08 10-23-145-163 kernel: slurmrestd[1327]: segfault at 2f ip 00007feaf6ef9f35 sp 00007feaf52e9440 error 4 in libslurmfull.so[7feaf6e3a000+1d0000]
21 14:14:08 10-23-145-163 systemd[1]: slurmrestd.service: main process exited, code=killed, status=11/SEGV
21 14:14:08 10-23-145-163 systemd[1]: Unit slurmrestd.service entered failed state.
21 14:14:08 10-23-145-163 systemd[1]: slurmrestd.service failed.

Can anyone help with this?
Comment 2 Jacob Jenson 2021-10-21 09:31:37 MDT
This has been verified as a bug. If ucloud.cn purchases Slurm support, our professional support team can work with you to resolve this bug.

Thank you,
Jacob
Comment 3 brown kestrel 2021-10-21 19:59:51 MDT
(In reply to Jacob Jenson from comment #2)
> This has been verified as a bug. If ucloud.cn purchases Slurm support,
> our professional support team can work with you to resolve this bug.
> 
> Thank you,
> Jacob


Hi Jacob,

Thanks for your kind reply.

I'm evaluating/learning Slurm at the moment and hope to run some demo-level computing jobs.

I understand that SchedMD does not provide non-commercial support, since time is precious for everyone.

I have read through the Slurm contribution guide (https://github.com/SchedMD/slurm/blob/master/CONTRIBUTING.md); it seems I could attach a patch here if I find the solution myself, right?

Of course, if I find that commercial support is necessary during the evaluation, we will get one.

Best regards,
kestrel
Comment 4 Jason Booth 2021-10-22 09:31:17 MDT
> I have read through the Slurm contribution guide
> (https://github.com/SchedMD/slurm/blob/master/CONTRIBUTING.md);
> it seems I could attach a patch here if I find the solution
> myself, right?

Correct, this is the procedure you would use to submit a patch to us for inclusion in Slurm.