Since moving to Slurm we have noticed that our users classify job errors as "srun bugs" much more often than they did with the previous scheduler. Our hypothesis is that the previous scheduler would either grant an allocation, in which the user's application would run, or the user would submit a job, in which their application would run. If that application encountered an error, it was clear which executable required debugging.

With Slurm, the many ways in which users can invoke a job are commingled. So users see "srun encountered a Bus error" or "srun ran out of memory" rather than "the program that srun was running within the batch script or allocation {encountered a bus error, ran out of memory, etc}."

Perhaps the argv[0] of the srun application could be changed to imply the root cause of the problem? Alternatively, similar to the JVM, a stack trace could be emitted, where appropriate, to show the cause of the problem.
(In reply to S Senator from comment #0)
> Since moving to slurm we have noticed that our users are classifying job
> errors as "srun bugs" much more commonly than with the previous scheduler.
> Our hypothesis on why this would be is that the previous scheduler would
> either grant an allocation, in which the user's application would run, or
> the user would submit a job, in which their application would run. If that
> application encountered an error, it was clear which executable required
> debugging.
>
> With slurm, the many ways in which users can invoke a job are comingled. So,
> the users will see that "srun encountered a Bus error", "srun ran out of
> memory" rather than "the program that srun was running within the batch
> script or allocation {encountered a bus error, ran out of memory, etc}."
>
> Perhaps the argv[0] of the srun application could be changed to imply the
> root cause of the problem? Alternatively, similar to the JVM, a stack trace
> could be emitted, where appropriate, to show the cause of the problem.

This is a user-education issue more than anything, and not something I plan to tackle.

Altering the argv array as you suggest is not a trivial change, and would lead to plenty of further confusion.

Producing stack traces when not asked for is not Slurm's prerogative - the srun command itself does not have that level of insight into the actual command or script that is being launched.

- Tim
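
For sites that want the failing executable named explicitly, one user-side option is a small wrapper that runs the real program and reports how it ended. The following is a minimal sketch, not part of Slurm: the names `describe_exit` and `run_and_report` are hypothetical, and it assumes a POSIX system, where Python's `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(argv, returncode):
    """Return a message that names the launched program, not the launcher."""
    program = argv[0]
    if returncode < 0:
        # On POSIX, subprocess reports termination by signal N as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"{program}: terminated by {name}"
    if returncode != 0:
        return f"{program}: exited with status {returncode}"
    return f"{program}: completed successfully"


def run_and_report(argv):
    """Run argv as a child process and report its fate on stderr."""
    result = subprocess.run(argv)
    print(describe_exit(argv, result.returncode), file=sys.stderr)
    return result.returncode
```

Saved under a hypothetical name such as report_exit.py and given a small command-line entry point, this could be placed between srun and the application in a batch script, so that the message in the job output names the application rather than srun.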