| Summary: | sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 18.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | ru-hpc-0361.decode.is | CLE Version: | |
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd log | ||
Description
Hjalti Sveinsson
2020-02-13 02:47:26 MST
Created attachment 13040 [details]
slurmd log
Hjalti,
Job status 256 is set when Slurm is unable to configure the job's output. Looking at your slurmd log, there are multiple lines like:
>error: Could not open stdout file /nfs/odinn/datavault/assoc/
Is the NFS filesystem mounted on the nodes in question? It would be standard good practice to verify this with a check in your HealthCheckProgram.
cheers,
Marcin
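The HealthCheckProgram check suggested above could look roughly like the following sketch. The mount point, the drain command, and the reason string are illustrative assumptions, not the site's actual script; "/" is used as the default only so the sketch runs anywhere.

```shell
#!/bin/sh
# Hypothetical HealthCheckProgram sketch: fail the health check when a
# filesystem needed for job stdout/stderr is not mounted. Substitute the
# real NFS mount point (e.g. the export under /nfs/odinn) for the default.
REQUIRED_MOUNT="${REQUIRED_MOUNT:-/}"

# mountpoint(1) exits 0 only if the path is an active mount point
if ! mountpoint -q "$REQUIRED_MOUNT"; then
    echo "health check FAILED: $REQUIRED_MOUNT is not mounted" >&2
    # In production you would also drain the node so no new jobs land on it:
    # scontrol update NodeName="$(hostname -s)" State=DRAIN \
    #     Reason="$REQUIRED_MOUNT not mounted"
    exit 1
fi
echo "health check OK: $REQUIRED_MOUNT is mounted"
```

Slurm runs HealthCheckProgram on each node at the configured HealthCheckInterval; a drained node stops accepting jobs until the mount is restored and the node is resumed.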
Hjalti,

Were you able to check whether the mentioned filesystem is correctly mounted on the affected nodes? Is there anything else I can help you with?

cheers,
Marcin

Hi,

The filesystem mount errors you mention are from the 20th of January and earlier. We were seeing the errors I sent last week. Yes, everything is mounted and the node has not been rebooted. We do have a Slurm health check program that runs every 30 minutes and checks for these things.

Hjalti,

You're right - I missed the date of the logs. I'm checking the code to find the probable root cause. Are you still experiencing the issue? If so, could you please increase SlurmdDebug to verbose (debug is preferred if possible)? This should provide additional information around task handling, which will be very helpful for understanding the source of the return code/status.

Do you know if this is application-specific? Is this an MPI job / does it use PMI to launch tasks?

cheers,
Marcin

Hi,

No, we are not experiencing this anymore. Maybe when it happens next time I will re-open a case, and we can close this one for now?
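For reference, the SlurmdDebug increase Marcin requested would be a slurm.conf change along these lines (the log file path is an assumption; keep the site's existing value). After editing, the change is typically picked up with `scontrol reconfigure` or a slurmd restart.

```
# slurm.conf excerpt (sketch) - raise slurmd verbosity while troubleshooting
SlurmdDebug=debug                    # or "verbose" if debug is too noisy
SlurmdLogFile=/var/log/slurmd.log    # assumed path; keep your existing value
```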