Hello,

We have a user who submitted a job and was assigned a node. The job ran for about 6 days and then exited with NODE_FAIL. The same job ID then resumed on another node and is still running. I see no trace of the job on the original node (I looked for the user's UID and the job ID in slurmd.log). The only place I found the job ID "existing" twice is in the DB:

```
MariaDB [slurm_acct_db]> select derived_ec,time_start,time_end,state,nodelist,nodes_alloc,exit_code from greatlakes_job_table where id_job='675242';
+------------+------------+------------+-------+----------+-------------+-----------+
| derived_ec | time_start | time_end   | state | nodelist | nodes_alloc | exit_code |
+------------+------------+------------+-------+----------+-------------+-----------+
|          0 | 1569282469 | 1569732851 |     7 | gl3111   |           1 |         0 |
|          0 | 1569732972 |          0 |     1 | gl3028   |           1 |         0 |
+------------+------------+------------+-------+----------+-------------+-----------+
```

When I do `scontrol show job` on the job itself, I do see:

```
Requeue=1 Restarts=1
```

Is this unusual? Is there a way to programmatically tell jobs to restart or requeue if they land on a bum node?

Thanks!

David
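As an aside, the duplicate accounting records can also be inspected without querying MariaDB directly: sacct's `-D`/`--duplicates` flag shows every record stored for a job ID, including the pre-requeue entry that is normally hidden (a sketch using the job ID from the output above):

```shell
# Show all accounting records for job 675242, one row per requeue
# (-D / --duplicates includes the superseded NODE_FAIL record).
sacct -D -j 675242 --format=JobID,State,Start,End,NodeList,ExitCode
```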
It appears we DO have `JobRequeue` set to 1, according to `scontrol show config`, so I think that accounts for it. But I'll leave this open to be sure.

David
After reading a bit, I see that a user can request this option on their own as well, so I'm going to close this. Thanks for listening, though :)

David
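For anyone finding this later, the relevant knobs look like this (a sketch of the settings discussed above; `myjob.sh` is a hypothetical script name, not from this ticket):

```shell
# Cluster-wide default in slurm.conf: allow batch jobs to be
# requeued, e.g. after a node failure.
#   JobRequeue=1

# Per-job: a user can request (or refuse) requeue explicitly,
# either on the command line or via an #SBATCH directive.
sbatch --requeue myjob.sh      # or: #SBATCH --requeue

# An admin can also requeue a job by hand:
scontrol requeue 675242
```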