| Summary: | slurmctld fails to start. AcceleratorArray depth incorrect | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Cray ALPS | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da, david.gloe |
| Version: | 2.6.x | ||
| Hardware: | Cray XT/XE | ||
| OS: | Linux | ||
| Site: | CRAY | Slinky Site: | --- |
| Attachments: | patch to fix "depth" with basil 1.3 on nodes with accelerators | ||
John Metzner 2013-08-12 10:10:43 CDT
I tried changing parser_basil_5.1:
[BT_ACCELARRAY] = {
.tag = "AcceleratorArray",
.depth = 7,
.uniq = true,
.hnd = NULL
},
To:
[BT_ACCELARRAY] = {
.tag = "AcceleratorArray",
.depth = 5,
.uniq = true,
.hnd = NULL
},
Rebuilt slurm 2.6.0, reinstalled, started slurmctld on the sdb node and got:
[2013-08-12T10:07:03.267] sched: Backfill scheduler plugin loaded
[2013-08-12T10:07:03.268] debug3: Success.
[2013-08-12T10:07:03.298] fatal: Tag 'Accelerator' appeared at depth 6 instead of 8
I'm on the right track. I'll try changing the Accelerator entry as well.
John Metzner 2013-08-12 10:21:42 CDT
slurmctld no longer exits with a fatal error on startup after changing:
[BT_ACCEL] = {
.tag = "Accelerator",
.depth = 8,
.uniq = false,
.hnd = eh_accel
},
To:
[BT_ACCEL] = {
.tag = "Accelerator",
.depth = 6,
.uniq = false,
.hnd = eh_accel
},
in src/plugins/select/cray/libalps/parser_basil_5.1.c
Who knows if it will actually schedule something on a GPU, but at least the daemon doesn't exit.
I think someone with more knowledge of the apbasil 1.3 structure needs to evaluate the parser_basil_5.1.c code to see what else could be wrong.
Thanks for the analysis Jason, I will check this out right now and report back.
Danny
Jason, can I get access to galaxy (or any other machine) with nodes that have accelerators? I think what you did is the correct thing, but I would like to verify.
Created attachment 375
patch to fix "depth" with basil 1.3 on nodes with accelerators
Jason, if you can test this patch I believe it fixes the problem completely. Let me know. The only difference is that I also "guessed" that the AcceleratorAllocation depth changed as well.
If you could allocate a few of the nodes with GPUs that would be good as well.
> I did a couple of apbasil queries, using both basil 1.2 and 1.3. It looks
> like there is a difference in which column <AcceleratorArray> appears
> between 1.2 and 1.3.
>
> apbasil 1.2:
>
> <AcceleratorArray>
> <Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X"
> memory_mb="6144" clock_mhz="732">
> <AcceleratorAllocation reservation_id="390522"/>
> </Accelerator>
> </AcceleratorArray>
>
> apbasil 1.3:
>
> <AcceleratorArray>
> <Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X"
> memory_mb="6144" clock_mhz="732">
> <AcceleratorAllocation reservation_id="390522"/>
> </Accelerator>
> </AcceleratorArray>
>
Just so you understand this a little better...
I believe the "depth" variable here represents how many levels into the xml we are. As you can see we are 1 less in the 1.3 than we were in 1.2. Good sleuthing.
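To make the "depth" idea concrete, here is a minimal sketch of a nesting-depth check in C. It is not the libalps parser: `tag_rule`, `first_depth_of`, `check_rule`, and `sample_inventory` are hypothetical names, the wrapper elements around `Node` in the sample are assumed for illustration, and counting the root element as depth 0 is an assumption chosen so that the numbers line up with the patched values (5 for AcceleratorArray, 6 for Accelerator).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a tag-table entry (illustration only). */
struct tag_rule {
    const char *tag; /* element name */
    int depth;       /* expected nesting depth, root element = 0 (assumed) */
};

/* Illustrative inventory fragment; element names outside Node are assumed. */
static const char sample_inventory[] =
    "<BasilResponse><ResponseData><Inventory><NodeArray><Node>"
    "<SocketArray></SocketArray>"
    "<AcceleratorArray>"
    "<Accelerator ordinal=\"0\" type=\"GPU\">"
    "<AcceleratorAllocation reservation_id=\"1\"/>"
    "</Accelerator>"
    "</AcceleratorArray>"
    "</Node></NodeArray></Inventory></ResponseData></BasilResponse>";

/*
 * Return the nesting depth at which the first <name> element opens,
 * or -1 if it never appears. Handles only plain start tags, end tags
 * and self-closing tags; no comments, CDATA, or '>' inside attributes.
 */
static int first_depth_of(const char *xml, const char *name)
{
    assert(xml != NULL && name != NULL);

    int depth = 0;
    size_t nlen = strlen(name);
    const char *p = xml;

    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;
        if (p[1] == '/') {
            depth--; /* closing tag: step back out one level */
        } else if (p[1] != '?' && p[1] != '!') {
            char after = p[1 + nlen];
            /* Require a delimiter after the name so that "Accelerator"
             * does not match "AcceleratorArray". */
            if (strncmp(p + 1, name, nlen) == 0 &&
                (after == ' ' || after == '>' || after == '/'))
                return depth;
            if (end[-1] != '/')
                depth++; /* opening tag: children are one level deeper */
        }
        p = end + 1;
    }
    return -1;
}

/* Return 0 if the tag sits at the expected depth, -1 otherwise. */
static int check_rule(const char *xml, const struct tag_rule *rule)
{
    int found = first_depth_of(xml, rule->tag);
    if (found != rule->depth) {
        fprintf(stderr, "Tag '%s' appeared at depth %d instead of %d\n",
                rule->tag, found, rule->depth);
        return -1;
    }
    return 0;
}
```

With this sample, a rule of `{ "AcceleratorArray", 7 }` fails the check and a rule of `{ "AcceleratorArray", 5 }` passes, which mirrors the fatal error and the fix discussed above. The depth depends only on how many elements are open, not on how the XML text is indented.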
Jason, have you had a chance to test this? We are waiting for your results so we can tag a new 2.6. Thanks
I have applied the patch to 2.6 and it will be in 2.6.1. If in the future you find this patch didn't work please let me know and I'll fix it.
*** Ticket 433 has been marked as a duplicate of this ticket. ***
At first we interpreted this as a change in 1.3, but after reviewing the specs this appears to be an issue with the original 1.3 code, as the 1.2 BASIL code has the correct info set. Not sure why it was changed, but putting this back the way it was in the 1.2 code appears to fix it correctly. Thanks for testing it.
This is fixed correctly in b9bc66dce38b389376d88f44848451adfc8dd38f
This makes the 1.3 code the same as 1.2.
The confusion here might have been prompted by an indentation bug in apbasil's 1.3 Accelerator output. Note all the Accelerator output is shifted to the left of where it should be by one space.
</SocketArray>
<AcceleratorArray>
<Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X" memory_mb="6144" clock_mhz="732">
<AcceleratorAllocation reservation_id="3043930"/>
</Accelerator>
</AcceleratorArray>
</Node>
I'll look into fixing that from our end.
While running CLE 5.1.14 on galaxy, I am trying to bring up SLURM 2.6.0 with the newly installed Kepler (GPU) blades in c1-0c0s14 and c1-0c2s3. slurmctld gets a fatal error on startup:
...
[2013-08-09T09:15:05.577] debug3: Trying to load plugin /opt/slurm/2.6.0/lib/slurm/sched_backfill.so
[2013-08-09T09:15:05.577] sched: Backfill scheduler plugin loaded
[2013-08-09T09:15:05.577] debug3: Success.
[2013-08-09T09:15:05.623] fatal: Tag 'AcceleratorArray' appeared at depth 5 instead of 7
A quick Google search showed this message probably came out of the parser_basil code. I found several parser_basil versions in src/plugins/select/cray/libalps:
-rw-r--r-- 1 1001 1001 2272 Jul 10 09:55 parser_basil_1.0.c
-rw-r--r-- 1 1001 1001 3420 Jul 10 09:55 parser_basil_1.1.c
-rw-r--r-- 1 1001 1001 6554 Jul 10 09:55 parser_basil_3.1.c
-rw-r--r-- 1 1001 1001 8865 Jul 10 09:55 parser_basil_4.0.c
-rw-r--r-- 1 1001 1001 6141 Jul 10 09:55 parser_basil_5.1.c
In the 5.1 version, which I believe is the version we are using with CLE 5.1.14, I found:
[BT_ACCELARRAY] = {
.tag = "AcceleratorArray",
.depth = 7,
.uniq = true,
.hnd = NULL
},
I did a couple of apbasil queries, using both basil 1.2 and 1.3. It looks like there is a difference in which column <AcceleratorArray> appears between 1.2 and 1.3.
apbasil 1.2:
<AcceleratorArray>
<Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X" memory_mb="6144" clock_mhz="732">
<AcceleratorAllocation reservation_id="390522"/>
</Accelerator>
</AcceleratorArray>
apbasil 1.3:
<AcceleratorArray>
<Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X" memory_mb="6144" clock_mhz="732">
<AcceleratorAllocation reservation_id="390522"/>
</Accelerator>
</AcceleratorArray>
------------------
I'm not sure how this relates to the "depth" mentioned in the fatal error, but I think a change is necessary to parser_basil_5.1.c. I may try changing the "7" above to "5", rebuild, and see what happens.