Ticket 4540 - Consider changing argv[0] on srun for clarity on error root cause
Summary: Consider changing argv[0] on srun for clarity on error root cause
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.02.9
Hardware: Cray XC Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2017-12-18 16:18 MST by S Senator
Modified: 2017-12-19 19:56 MST
Site: LANL

Description S Senator 2017-12-18 16:18:05 MST
Since moving to Slurm, we have noticed that our users classify job errors as "srun bugs" much more often than they did with the previous scheduler. Our hypothesis is that the previous scheduler would either grant an allocation, in which the user's application would run, or accept a submitted job, in which their application would run. If that application encountered an error, it was clear which executable required debugging.

With Slurm, the many ways in which users can invoke a job are commingled. As a result, users report that "srun encountered a Bus error" or "srun ran out of memory" rather than "the program that srun was running within the batch script or allocation {encountered a bus error, ran out of memory, etc.}."

Perhaps the argv[0] of the srun application could be changed to imply the root cause of the problem? Alternatively, similar to the JVM, a stack trace could be emitted, where appropriate, to show the cause of the problem.
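
To make the kind of attribution being asked for concrete, here is a minimal sketch of a user-side wrapper that runs a program and reports any failure under that program's own name. It is not part of Slurm and does not change what srun prints; the function names `describe_exit` and `run_and_report` are hypothetical, and the sketch assumes a POSIX system where Python's `subprocess` module reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(command, returncode):
    """Build a message that names the launched program, not the launcher.

    Hypothetical helper for illustration only; not a Slurm interface.
    """
    program = command[0]
    if returncode == 0:
        return f"{program}: completed successfully"
    if returncode < 0:
        # On POSIX, subprocess encodes "killed by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"{program}: terminated by {name}"
    return f"{program}: exited with status {returncode}"


def run_and_report(command):
    """Run the command, print an attributed status line to stderr,
    and return the child's return code unchanged."""
    completed = subprocess.run(command)
    print(describe_exit(command, completed.returncode), file=sys.stderr)
    return completed.returncode
```

A site could have users launch their application through a script built on this idea, so that a line such as "./my_app: terminated by SIGBUS" appears alongside whatever the launcher itself reports. Whether that is worth the extra layer is a local judgment call.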
Comment 1 Tim Wickberg 2017-12-19 19:56:41 MST
(In reply to S Senator from comment #0)

This is a user-education issue more than anything, and not something I plan to tackle.

Altering the argv array as you suggest is not a trivial change, and would lead to plenty of further confusion.

Producing stack traces when not asked for is not Slurm's prerogative - the srun command itself does not have that level of insight into the actual command or script that is being launched.

- Tim