Ticket 8276

Summary: Jobs not starting after change to cons_tres
Product: Slurm    Reporter: Marcus Boden <mboden>
Component: slurmctld    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 3 - Medium Impact
Priority: ---
Version: 19.05.5
Hardware: Linux
OS: Linux
Site: GWDG
Attachments: slurmctld log file
slurm.conf

Description Marcus Boden 2019-12-30 08:36:23 MST
Created attachment 12641 [details]
slurmctld log file

Hey guys,

I have the following problem: If I change the SelectType from cons_res to cons_tres and restart the slurmctld, jobs won't start anymore. Or rather, they start, then immediately fail and cause the node to drain.

I've attached the slurmctld log and our config file. The only thing I changed in this test was the SelectType line; nothing else was touched. The job I tried to run was a simple 'srun hostname' (id 2337599), not even on a GPU node or anything.
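For reference, the change amounts to a single line in slurm.conf (a minimal before/after sketch; the SelectTypeParameters value shown here is illustrative, the real value is in the attached config):

```
# before:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# after -- only this plugin name changed:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```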

The following snippet from the logs shows some errors I found rather unusual:
[2019-12-30T16:04:48.248] sched: Allocate JobId=2293860 NodeList=gwdd075 #CPUs=6 Partition=medium-fmz
[2019-12-30T16:04:48.249] sched: Allocate JobId=2293861 NodeList=gwdd079 #CPUs=6 Partition=medium-fmz
[2019-12-30T16:04:48.250] sched: Allocate JobId=2293862 NodeList=gwdd029 #CPUs=6 Partition=medium-fmz
[2019-12-30T16:04:48.251] sched: Allocate JobId=2293863 NodeList=gwdd103 #CPUs=6 Partition=medium-fmz
[2019-12-30T16:04:48.386] prolog_running_decr: Configuration for JobId=2337599 is complete
[2019-12-30T16:04:48.388] prolog_running_decr: Configuration for JobId=2293860 is complete
[2019-12-30T16:04:48.481] prolog_running_decr: Configuration for JobId=2293861 is complete
[2019-12-30T16:04:48.505] Killing non-startable batch JobId=2293860: Header lengths are longer than data received
[2019-12-30T16:04:48.538] prolog_running_decr: Configuration for JobId=2293862 is complete
[2019-12-30T16:04:48.553] prolog_running_decr: Configuration for JobId=2293863 is complete
[2019-12-30T16:04:48.563] Killing non-startable batch JobId=2293861: Header lengths are longer than data received
[2019-12-30T16:04:48.575] Killing non-startable batch JobId=2293862: Header lengths are longer than data received
[2019-12-30T16:04:48.586] _job_complete: JobId=2293860 WEXITSTATUS 1
[2019-12-30T16:04:48.587] _job_complete: JobId=2293860 done
[2019-12-30T16:04:48.604] _job_complete: JobId=2293861 WEXITSTATUS 1
[2019-12-30T16:04:48.627] _job_complete: JobId=2293861 done
[2019-12-30T16:04:48.628] _job_complete: JobId=2293862 WEXITSTATUS 1
[2019-12-30T16:04:48.629] _job_complete: JobId=2293862 done
[2019-12-30T16:04:48.647] Killing non-startable batch JobId=2293863: Header lengths are longer than data received
[2019-12-30T16:04:48.731] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2019-12-30T16:04:48.736] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2019-12-30T16:04:48.744] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2019-12-30T16:04:48.745] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2019-12-30T16:04:48.821] _job_complete: JobId=2293863 WEXITSTATUS 1
[2019-12-30T16:04:48.823] _job_complete: JobId=2293863 done
[2019-12-30T16:04:48.840] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2019-12-30T16:04:48.841] _slurm_rpc_requeue: 2337599: Only batch jobs are accepted or processed
[2019-12-30T16:04:48.843] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=2337599, this should never happen
[2019-12-30T16:04:48.843] error: slurmd error running JobId=2337599 on node(s)=gwdd052: Unspecified error
[2019-12-30T16:04:48.843] drain_nodes: node gwdd052 state set to DRAIN
[2019-12-30T16:04:48.865] _job_complete: JobId=2337599 done

Any idea what's causing this?

Best wishes and a happy new year,
Marcus
Comment 1 Christian Köhler 2019-12-30 08:36:38 MST
Thanks for your message! I'm on leave until January 13th, 2020. During this time answers can be delayed. In case of HPC support requests please contact hpc@gwdg.de or support@hlrn.de, respectively, otherwise christian.boehme@gwdg.de or philipp.wieder@gwdg.de

Viele Grüße / best regards
Christian Köhler
Comment 2 Marcus Boden 2019-12-30 08:37:26 MST
Created attachment 12642 [details]
slurm.conf
Comment 4 Michael Hinton 2019-12-30 14:24:42 MST
Hi Marcus,

(In reply to Marcus Boden from comment #0)
> I have the following problem: If I change the SelectType from cons_res to
> cons_tres and restart the slurmctld, jobs won't start anymore. Or rather,
> they start, then immediately fail and cause the node to drain.
Well, the first thing I would suggest is to shut down slurmctld, restart the slurmds, and then start slurmctld again. Then all the daemons will be running with the same slurm.conf, with cons_tres set.
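On a systemd-managed cluster, that sequence might look like the following sketch (the unit names are standard, but pdsh and the node list are assumptions; adapt to your own setup):

```shell
# Stop the controller first so it cannot talk to mismatched slurmds
systemctl stop slurmctld

# Restart slurmd on every compute node (pdsh and the host range are illustrative)
pdsh -w gwdd[001-103] 'systemctl restart slurmd'

# Start the controller again; all daemons now read the same new slurm.conf
systemctl start slurmctld
```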

If the slurmctld is running with a different slurm.conf than the slurmds, all bets are off for coherent functionality. That’s why Slurm emits these errors:

[2019-12-30T16:15:02.685] error: Node dmp078 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
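For completeness, that flag is a one-line slurm.conf setting (it only suppresses the hash-mismatch warning; it does not make running with differing configs safe):

```
# slurm.conf -- suppress the config-hash mismatch warning
DebugFlags=NO_CONF_HASH
```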

I’m not exactly sure what would happen if slurmctld is running with cons_tres while slurmds are running with cons_res. However, I suspect that if slurmctld deals with jobs in the cons_tres format, but the slurmds expect to run jobs with the cons_res format, that might cause an RPC mismatch error, which might explain “Header lengths are longer than data received”.

So make sure that all daemons are on cons_tres, try again, and if there are still errors, could you send the slurmd log of node gwdd052 during the same time period as the slurmctld log already attached?

Thanks,
Michael

P.S. Note that while cons_tres should be backwards compatible with cons_res, moving the system from cons_tres back down to cons_res could give you problems.
Comment 5 Marcus Boden 2020-01-02 05:56:43 MST
Hey Michael,

thanks for the quick reply. Restarting the slurmd processes worked, we can now use the cons_tres plugin.

> If the slurmctld is running with a different slurm.conf than the slurmds, all bets are off for coherent functionality.

Yeah, we have our config on a shared fs, but restarting all slurmds after every small change is cumbersome, which is why we mostly ignore those log messages. In this case, we shouldn't have... But anyway, it's working now, so thanks for your help!

Best,
Marcus
Comment 6 Michael Hinton 2020-01-02 08:28:09 MST
(In reply to Marcus Boden from comment #5)
> thanks for the quick reply. Restarting the slurmd processes worked, we can
> now use the cons_tres plugin.
Great!
 
> > If the slurmctld is running with a different slurm.conf than the slurmds, all bets are off for coherent functionality.
> 
> Yeah, we have our config on a shared fs, but restating all slurmds after
> every small change is cumbersome, that's why we mostly ignore those log
> messages. In this case, we shouldn't have... But anyway, it's working now,
> so thanks for your help!
Ok. In most cases, a simple `scontrol reconfigure` should be enough to reload all daemons with the most recent slurm.conf file, and that probably would have worked in this case. But when that doesn't work, restarting all the daemons is the safe way to do it.
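A minimal sketch of that lighter-weight workflow, run from any node with admin rights:

```shell
# Push the edited slurm.conf to all running daemons without restarting them
scontrol reconfigure

# Spot-check that the controller picked up the intended value
scontrol show config | grep -i selecttype
```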