Created attachment 11517 [details]
git patch

Please review our patch that adds support for AMD GPUs.
Comment on attachment 11517 [details]
git patch

This is now in a branch 'bug7714'. Please base any future patches on this branch.
This is Mike Li and I'm working on AMD plugin phase 2. I have some questions regarding the requirements for phase 2:

RE: power consumption, RAPL (GPU, CPU):
Is this power consumption per node or per GPU?
AcctGatherEnergy Cray/Aries and rapl are both per node. Both of the above include only CPU + memory. We can add all GPUs to the sum; that will be per node.

RE: GPU utilization, GPU memory utilization, temperature:
Can you point me to the interfaces where they can be added?

Thanks
Mike
(In reply to Mike Li from comment #4)
> This is Mike Li and I'm working on AMD plugin phase 2.
> I have some questions regarding the requirement for phase 2:
>
> RE: power consumption, RAPL (GPU, CPU):
> Is this power consumption per node or per GPU

The smaller the resolution the better, but I am not the one coming up with the requirements. If it were me I would want it by core ;). Obviously that is most likely not going to happen, so as low as you are able to go would be preferred.

> AcctGatherEnergy Cray/Aries, rapl are both per node.
> Both above include only CPU + Memory. We can add all GPUs to sum. That will
> be per node.

Yep, we deal with what we deal with. As stated above, the lower you can go the better. It would be nice to not have to rely on a node basis; we can easily add the GPUs up if given per-GPU values.

> RE: GPU utilization, GPU memory utilization, temperature
> Can you point me to the interfaces where they can be added.

I think the best way to accomplish this is by stacking different acct_gather_energy plugins. At the moment we don't "stack" acct_gather_energy plugins, but it wouldn't be hard to add that. An example of a stackable plugin can be found in src/common/slurm_acct_gather_interconnect.c. Commit da1e837a0e6570e10 will most likely be of great help. It is fairly straightforward.

After that you can add a new acct_gather_energy plugin and go from there.

Good luck :).
thanks
Mike, I just pushed a simple patch to allow stacking of the energy plugins to get you started.
Thanks Danny

Mike
Let's see if I understood correctly. Based on the stacking patch, I can do:

AcctGatherEnergyType=acct_gather_energy/gpu0,acct_gather_energy/gpu1,...,acct_gather_energy/gpu8

Then I would have to add 8 plugins, for gpu0 … gpu8.

Mike
No. I would expect you to have 1 new plugin, acct_gather_energy/amd_gpu (whatever you call it). Then I would expect that to handle all the different GPUs. You can see how we currently do that in ipmi (which handles multiple sensors).

So what I would expect is a configuration like:

AcctGatherEnergyType=rapl,amd_gpu

This would use rapl and the new amd_gpu plugin to give you the power consumption of the node.

What I would expect the plugin to do is figure out which GPUs are being used by the job (hopefully using cgroups) and then only query the ones that matter. If I were looking for stats on the node outside of a job, then I would query all GPUs on the node. Profiling should be done on a per-GPU basis.

Does that make sense?
Yes, that's clearer. Thanks

Mike
RE: I would expect you to have 1 new plugin acct_gather_energy/amd_gpu (whatever you call it)
acct_gather_energy/amd_gpu

RE: Then I would expect that to handle all the different gpus. You can see how we currently do that in ipmi (which handles multiple sensors).
The amd_gpu plugin will be designed similarly to acct_gather_energy/ipmi, with the following details:
1) The RSMI interface will provide the number of GPUs on the node and the current watts for each GPU.
2) The amd_gpu plugin will initiate a node-level thread which will periodically read the current watts for all GPUs through RSMI.
3) One acct_gather_energy_t structure will be created for each GPU and will be updated and maintained by the thread to provide the power consumed by each GPU.
4) The plugin will figure out which GPUs are being used by the job using cgroups and then only query the ones that matter.
5) For stats on the node outside of a job, the power consumed by all GPUs on the node will be provided.
6) Profiling will be done on a per-GPU basis.
7) No configuration like IPMI's is currently required.

RE: So what I would expect is a configuration like AcctGatherEnergyType=rapl,amd_gpu to stack rapl and the amd_gpu plugin. This would use rapl and the new amd_gpu to give you the power consumption of the node.
Supported

RE: What I would expect the plugin to do is figure out what gpus are being used by the job (hopefully using cgroups) and then only query the ones that matter.
Supported

RE: If I was looking for stats on the node outside of a job then I would query all on the node.
Supported

RE: Profiling should be done on a per gpu basis.
Supported
4) The plugin will figure out which GPUs are being used by the job using cgroups and then only query the ones that matter.

I have an issue finding out which GPUs are being used by the job. I'm trying to use ROCR_VISIBLE_DEVICES, which should be "0,1", but I got ROCR_VISIBLE_DEVICES=NULL.

static bool _check_device_visibility(uint16_t dv_ind)
{
	char *local_list;

	local_list = getenv("ROCR_VISIBLE_DEVICES");
	info("ROCR_VISIBLE_DEVICES=%s", local_list);
	if (local_list) {
		char *tmp, *tok;
		uint16_t i;

		tmp = xstrdup(local_list);
		tok = strtok(tmp, ",");
		while (tok) {
			i = strtol(tok, NULL, 10);
			info("tok=%s, i=%d, dv_ind=%d", tok, i, dv_ind);
			if (dv_ind == i) {
				xfree(tmp); /* don't leak on the match path */
				return true;
			}
			tok = strtok(NULL, ",");
		}
		xfree(tmp);
	}
	return false;
}

static void _get_node_energy_up(acct_gather_energy_t *energy)
{
	uint16_t i;
	acct_gather_energy_t *e;

	/* sum the energy of all gpus for this job */
	memset(energy, 0, sizeof(acct_gather_energy_t));
	for (i = 0; i < gpus_len; i++)
		if (_check_device_visibility(i) &&
		    (gpus[i].energy.current_watts != NO_VAL)) {
			e = &gpus[i].energy;
			energy->base_consumed_energy += e->base_consumed_energy;
			energy->ave_watts += e->ave_watts;
			energy->consumed_energy += e->consumed_energy;
			energy->current_watts += e->current_watts;
			energy->previous_consumed_energy += e->previous_consumed_energy;
			/* node poll_time is computed as the oldest poll_time
			 * of the gpus */
			if (energy->poll_time == 0 ||
			    energy->poll_time > e->poll_time)
				energy->poll_time = e->poll_time;
			if (debug_flags & DEBUG_FLAG_ENERGY)
				info("_get_node_energy_up: gpu: %d, current_watts: %u, consumed %"PRIu64" Joules %"PRIu64" new, ave watts %u",
				     i, energy->current_watts,
				     energy->consumed_energy,
				     energy->base_consumed_energy,
				     energy->ave_watts);
		}
}

Is there another way to do it?

Thanks
Mike
I have talked to Harish. He thinks that AMD will provide instantaneous values for all GPUs, and Slurm will aggregate this information for the user and also correlate it with jobs.

acct_gather_energy_g_get_sum() needs to be changed.
(In reply to Mike Li from comment #17) > 4)The plugin will figure out which gpus are being used by the job using > cgroups and then only query the ones that matter. > > I have a issue to find out which gpus are being used by the job. > I'm trying to use ROCR_VISIBLE_DEVICES which should be "0,1". But I got > ROCR_VISIBLE_DEVICES=NULL > > static bool _check_device_visibility(uint16_t dv_ind) > { > char *local_list; > > local_list = getenv("ROCR_VISIBLE_DEVICES"); > info("ROCR_VISIBLE_DEVICES=%s", local_list); > if(local_list) > { > char *tmp, *tok; > uint16_t i; > > tmp = xstrdup(local_list); > info("ROCR_VISIBLE_DEVICES=%s", tmp); > tok = strtok(tmp, ","); > while (tok) { > i = strtol(tok, NULL, 10); > info("tok=%s, i=%d, dv_ind=%d", tok, i, dv_ind); > if (dv_ind == i) > return true; > tok = strtok(NULL, ","); > } > xfree(tmp); > } > return false; > } > > static void _get_node_energy_up(acct_gather_energy_t *energy) > { > uint16_t i; > acct_gather_energy_t *e; > > // sum the energy of all gpus for this job > memset(energy, 0, sizeof(acct_gather_energy_t)); > for (i = 0; i < gpus_len; i++) > if (_check_device_visibility(i) && > (gpus[i].energy.current_watts != NO_VAL)) { > e = &gpus[i].energy; > energy->base_consumed_energy += e->base_consumed_energy; > energy->ave_watts += e->ave_watts; > energy->consumed_energy += e->consumed_energy; > energy->current_watts += e->current_watts; > energy->previous_consumed_energy += e->previous_consumed_energy; > /* node poll_time is computed as the oldest poll_time of > the gpus */ > if (energy->poll_time == 0 || energy->poll_time > e->poll_time) > energy->poll_time = e->poll_time; > if (debug_flags & DEBUG_FLAG_ENERGY) > info("_get_node_energy_up: gpu: %d, current_watts: %u, consumed > %"PRIu64" Joules %"PRIu64" new, ave watts %u", > i, > energy->current_watts, > energy->consumed_energy, > energy->base_consumed_energy, > energy->ave_watts); > } > } > > Is there another way to do it? 
> Thanks
> Mike

Does it work to use the constrained devices in the cgroup? That might be the easier way to do it.
(In reply to Mike Li from comment #16)
> 2)The amd_gpu plugin will initiates a node-level thread which will read the
> current watts periodically for all GPUs through RSMI

This is fine to do (copying ipmi, I presume), but the only reason it was done this way for ipmi is that the ipmi call could take up to 5 seconds, which was something we didn't want in the step. It is fine to do it this way; I'm just giving you the history to avoid making things more complicated than they need to be. If the AMD call isn't that heavy, then you might consider running it locally in the stepd.
(In reply to Mike Li from comment #18) > I have talked to Harish. He think that > AMD will provide instantaneous value for all gpus and > Slurm will aggregate these information for the user and also correlate it > with jobs. > > acct_gather_energy_g_get_sum() need to be changed. It depends on what you want to do I suppose. At the moment we only profile or keep track of energy used on the node. If we want to put this into a different bucket then we will probably need to do something like you are talking about. It is unclear what your end game is though.
I understand slurmd is running on each node. Where is stepd running?
The slurmstepd (stepd) is what runs the step, i.e. the actual application. It is started from the slurmd, one per step on a node.
I have implemented the amd_gpu energy plugin based on IPMI and got most of it working. I have an issue finding out which GPUs are allocated to the job. There are 2 options that I can see, listed in order of my preference:

1) When the stepd calls acct_gather_energy_p_get_data() with type ENERGY_DATA_NODE_ENERGY_UP, pass in a pointer to an array of all GPUs. The AMD plugin will fill the array with energy for all GPUs, and the stepd will aggregate and correlate them with jobs.
2) Pass the GPUs allocated to the job (i.e. usable_gpus, like the freq set) to the amd_gpu plugin, which will correlate and add only the GPUs allocated to the job.

Thanks
Mike
I think you are experiencing your issues because you are running things from the slurmd instead of the stepd. The stepd should have this info. If the amd calls aren't heavy I would suggest you call things in the stepd instead. If they are heavy then you will have to change the request.
For getting the energy value of an AMD GPU, the library call isn't heavy and can be done within 100 ms.

I think I'm calling from the stepd, but maybe not. How do I know? Where is that information? Here is my code:

	case ENERGY_DATA_NODE_ENERGY_UP:
		slurm_mutex_lock(&amd_gpu_mutex);
		if (_is_thread_launcher()) {
			if (_thread_init() == SLURM_SUCCESS)
				_thread_update_node_energy();
		} else {
			_get_joules_task(10);
		}
		_get_node_energy_up(energy);
		slurm_mutex_unlock(&amd_gpu_mutex);
		break;

The following are the logs:

[2019-11-05T10:56:47.534] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2019-11-05T10:56:47.534] [169.0] _get_joules_task: consumed 270 Joules (received 270(9 watts) from slurmd)
[2019-11-05T10:56:47.534] [169.0] _get_joules_task: consumed 145 Joules (received 145(6 watts) from slurmd)
[2019-11-05T10:56:47.534] [169.0] ROCR_VISIBLE_DEVICES=(null)
[2019-11-05T10:56:47.534] [169.0] ROCR_VISIBLE_DEVICES=(null)
[2019-11-05T10:56:47.534] [169.0] debug2: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
You should just know ;). Read your logs. It should be very obvious if you have any debug around it.
It seems that _get_joules_task is called by the stepd. I dumped the environment, and ROCR_VISIBLE_DEVICES isn't set for the stepd, so I can't use this env var.

[2019-11-07T10:58:00.110] [177.0] _get_joules_task: consumed 0 Joules (received 0(9 watts) from slurmd)
[2019-11-07T10:58:00.110] [177.0] _get_joules_task: consumed 0 Joules (received 0(3 watts) from slurmd)
[2019-11-07T10:58:00.110] [177.0] This is stepd
[2019-11-07T10:58:00.110] [177.0] environ=LANG=en_CA.UTF-8
[2019-11-07T10:58:00.110] [177.0] LANG=en_CA.UTF-8
[2019-11-07T10:58:00.110] [177.0] LANGUAGE=en_CA:en
[2019-11-07T10:58:00.110] [177.0] PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2019-11-07T10:58:00.110] [177.0] INVOCATION_ID=f2570bdac01c41adb709484b05cf90d1
[2019-11-07T10:58:00.110] [177.0] JOURNAL_STREAM=9:181020
[2019-11-07T10:58:00.110] [177.0] SLURM_CONF=/etc/slurm/slurm.conf
[2019-11-07T10:58:00.110] [177.0] SLURMSTEPD_OOM_ADJ=-1000
[2019-11-07T10:58:00.110] [177.0] SLURM_MPI_TYPE=none
[2019-11-07T10:58:00.110] [177.0] ROCR_VISIBLE_DEVICES=(null)

How do I access gres_bit_alloc from the stepd in the energy accounting plugin? Can you pass it in like the gpu plugin does when setting the GPU freq?
Mike, the stepd will not have this environment set; you are correct. If you want to trust this envvar you should use **env of the stepd_step_rec_t. At the moment that is not exported to that plugin, though. But note it will always start with 0 (if it behaves like NVIDIA does when added to cgroups).

The GPUs would also be in the step_gres_list of the same structure, which might be easier/better to use, but you still have the same problem of that structure not being exposed to the plugin.

Perhaps it makes sense to call rocm-smi in the init of the plugin to grab the GPUs it lists, as it should report the ones you have access to. Would that work?

If we need to adjust this to send in the stepd_step_rec_t, that would work as well, but I would rather explore these other options first.

Let me know how it goes.
The current rocm-smi is based on sysfs and only provides the total GPUs available on the node. The only way I can see is to pass step_gres_list to the energy accounting plugin.
1) We still need to send in the stepd_step_rec_t.

2) I got:
CurrentWatts=12 AveWatts=6
when I call scontrol show node. I expected:
CurrentWatts=# LowestJoules=# ConsumedJoules=#

3) Another question: how do I show the energy consumption of a job?
(In reply to Mike Li from comment #30)
> The current rocm-smi is based on sysfs and only provides the total GPUs
> available on that node. The only way I can see is to pass step_gres_list to
> energy accounting plugin.

This is surprising. Will this change in the future? I would expect it to only show those you have access to.

Passing it shouldn't be a hard thing. I would recommend sending in the entire stepd_step_rec_t instead of just the step_gres_list, in case we need something else from it. I would probably modify acct_gather_energy_[g|p]_set_data() to handle this new type and set a static pointer to the main struct. I am guessing you will be able to figure that out.
(In reply to Mike Li from comment #31)
> 1) We still need to send in the stepd_step_rect_t.
> 2) I got:
> CurrentWatts=12 AveWatts=6
> when I call scontrol show node. I was expected:
> CorrentWatts=# LowestJoules=# ConsumedJoules=#

This was changed in 19.05; I am guessing you were looking at an older version of Slurm?

> 3) Another question is how to show the energy consumption of a job?

I would expect you could use sstat while the job is running, or you can run with hdf5 and get a profile while the job is running. After the job is done you can use sacct to look at this.
Just FYI, there will be a code freeze for 20.02 on January 15th, so any new code not written by us should be in no later than Jan 1st if there is to be any hope of getting it in.

NOTE: Getting the code to us by Jan 1st will not guarantee it will make it into 20.02, but getting it to us after that date will guarantee it will not.
Created attachment 12659 [details] Add brand type for AMD GPU
Created attachment 12660 [details] Add PCI Info for AMD GPU
Created attachment 12661 [details] Add energy accounting plugin for AMD GPU
I have added 3 patches. Sorry about the delay over the holidays; we had to get AMD internal approval. Let me know if there are any questions.
We have an issue with sstat, it happens without the patches. $ sstat -j 252 -a JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot ------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- --------------- sstat: symbol lookup error: /usr/lib/slurm/auth_munge.so: undefined symbol: slurm_get_log_level
I would expect you aren't compiling it correctly, as this should work just fine.
I was using:

./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm --with-rsmi

Are there any other options that are required?
Not sure why you are using --without-shared-libslurm (we have never recommended that), but that shouldn't matter. Hopefully you figure it out. I have never heard of this issue before.
What is the status of the 20.02 version? Have you reviewed and included the changes in the new release?
As pointed out, since the deadline was missed, the odds of these making it in are low. I will do my best to review them, but I would not count on these being in 20.02.
Do you have a date for the 20.02 release? What was merged into 20.02? Is the phase 1 code with auto-detection of AMD GPUs in 20.02?

Thanks
Mike
In the coming weeks we will release a few release candidates and then near the end of the month we will do an actual tag for 20.02. At this moment nothing has been merged.
Mike, would it be wrong to call this new plugin acct_gather_energy/rsmi to match instead of amd_gpu?
Also, I noticed that in _get_node_energy() we don't check for cgroups, but in the similar code in _get_node_energy_up() we do. Is that divergence expected?
This next question is a follow-up to comment 48. It looks like you are overloading acct_gather_energy_g_set_data(ENERGY_DATA_RECONFIG, job); to get the gres bits you are using when not using cgroups. I am wondering if there is a different way you can do this. There currently isn't anything that sends in the stepd_step_rec_t in question, so this is a good way to do it without changing function signatures; it just seems like we could do it in a cleaner way. For the time being I'll allow it, but I would rather add a new enum or something so the code doesn't get messy from this kind of overloading.
> would it be wrong to call this new plugin acct_gather_energy/rsmi to match instead of amd_gpu?

OK for rsmi, since we already call another plugin rsmi.
> Also, I notice in _get_node_energy() we don't check for cgroups, but in similar code in _get_node_energy_up() we do. Is that expected to have divergence like that?

_get_node_energy() is for the energy consumed by the node for all GPUs. _get_node_energy_up() is for a job or job step. I understand that the cgroup setup is per job submission.
> It looks like you are overloading acct_gather_energy_g_set_data(ENERGY_DATA_RECONFIG, job);

Yes, it is. I couldn't find a better way to do it.

> I would rather add a new enum or something.

OK. Do you want me to do it, or will you?

Thanks
Mike
Thanks for your prompt response Mike. All this code has now been merged into Master 40d323d899ea (20.02). If anything new needs adding please open a new bug on the matter. I ended up adding a new enum ENERGY_DATA_STEP_PTR in reference to comment 52. I also called the plugin rsmi in reference to comment 50.
Thanks Danny

I saw that 20.02.0 is released. What features are still missing?

Thanks
Mike
You tell me :). Perhaps open a new bug as this one is closed if there are some.
*** Ticket 7560 has been marked as a duplicate of this ticket. ***