How can energy data collection and reporting be implemented by a cloud provider? #4

adrianco · 2023-07-28T15:13:52Z

Cloud providers may not currently collect energy use at the system level across their fleet of machines, so providing this information could carry a development and deployment cost. Raw energy data can't be provided at the virtual machine instance level because it's only collected for the full system, and there are security implications, as described in an Intel CVE advisory: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html. This issue provides a place to discuss workarounds and solutions for this problem.

adrianco · 2023-07-28T20:59:36Z

Workaround - Cloud providers may be able to supply what are known as "bare metal" instances that are a complete machine, with no hypervisor and no partitioning. On those instance types it may be ok to allow access to interfaces such as Intel RAPL that would allow energy monitoring for the whole instance. Questions: Which cloud providers supply bare metal instances, and do they currently allow or block RAPL?

adrianco · 2023-08-09T16:57:33Z

How is energy collected in datacenters? PDUs instrument power usage at each outlet, and the API differs depending on which vendor is used; APC is a common vendor. I was talking to Rob Hirschfeld of RackN, who knows these APIs well and may be able to help us figure out how to collect the data.

adrianco · 2023-08-10T15:47:18Z

It appears that AWS EC2 bare metal instances do not block RAPL. One next step is to make a list of those bare metal instance types and see if Kepler's model can be calibrated based on real bare metal data.
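
For anyone who wants to try this on a bare metal instance, here is a minimal sketch of reading RAPL through the Linux powercap sysfs interface. It assumes an Intel machine where `/sys/class/powercap/intel-rapl:0` exists and is readable (on recent kernels that typically means running as root); the helper names are made up for illustration, and only package 0 is sampled.

```python
from pathlib import Path
import time

# Package 0 RAPL domain in the Linux powercap tree (assumed present and readable).
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Microjoules used between two counter readings, allowing for one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter passed max_range_uj and restarted near zero (approximate).
    return (max_range_uj - start_uj) + end_uj


def average_power_watts(start_uj: int, end_uj: int, max_range_uj: int, seconds: float) -> float:
    """Convert an energy delta in microjoules into average watts over the interval."""
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / seconds


def read_int(path: Path) -> int:
    return int(path.read_text().strip())


def sample_package_power(interval_s: float = 1.0) -> float:
    """Sample the package-0 energy counter twice and return average watts."""
    max_range = read_int(RAPL_DIR / "max_energy_range_uj")
    start = read_int(RAPL_DIR / "energy_uj")
    time.sleep(interval_s)
    end = read_int(RAPL_DIR / "energy_uj")
    return average_power_watts(start, end, max_range, interval_s)
```

Calling `sample_package_power()` on a machine that exposes RAPL returns the average package power over one second. Multi-socket machines have further domains (`intel-rapl:1` and so on) that would need to be summed, and RAPL does not cover the whole system (fans, disks, PSU losses), so this is only a lower bound on what a PDU would report.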

ArneTR · 2023-08-17T10:24:54Z

Hey @adrianco, just stumbled over this post as we were writing an overview post for ourselves lately.

Did you know that Teads has a list with RAPL data which also includes machines from AWS, Scaleway, Equinix etc. that supposedly allow RAPL access? This could prove very helpful: https://docs.google.com/spreadsheets/d/1DqYgQnEDLQVQm5acMAhLgHLD8xXCG9BIrk-_Nv6jF3k/edit#gid=985503428

Also, as mentioned, we have written up a short piece on which MSRs are available at some cloud vendors and which hypervisors they run. It may also be helpful: https://www.green-coding.berlin/blog/cloud-energy-usage-data/

I also linked the awesome project you are leading here :)

adrianco · 2024-01-02T17:47:19Z

We discussed this a bit and decided that we need to investigate the Redfish API in more detail: it is more general than RAPL, it's a DMTF standard, and Kepler has figured out how to use it. The next step is to coordinate with the Kepler team to see if we can share in what they have learned.
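
As a starting point for that investigation, here is a rough sketch of pulling a power reading from a Redfish service. It assumes the BMC exposes the classic `Power` resource at `/redfish/v1/Chassis/{id}/Power` with `PowerControl[].PowerConsumedWatts`; newer implementations may expose power through `PowerSubsystem` instead, and this is not necessarily how Kepler does it. The function names, the chassis ID, and the auth header are placeholders.

```python
import json
import urllib.request


def total_power_watts(power_resource: dict) -> float:
    """Sum PowerConsumedWatts over the PowerControl entries of a Redfish Power resource."""
    readings = [
        entry["PowerConsumedWatts"]
        for entry in power_resource.get("PowerControl", [])
        if entry.get("PowerConsumedWatts") is not None
    ]
    if not readings:
        raise ValueError("no PowerConsumedWatts reading in Power resource")
    return float(sum(readings))


def fetch_power_resource(base_url: str, chassis_id: str, auth_header: str) -> dict:
    """GET the Power resource for one chassis.

    base_url, chassis_id, and auth_header are deployment-specific assumptions,
    e.g. "https://bmc.example", "1", and a Basic or session token header value.
    """
    url = f"{base_url}/redfish/v1/Chassis/{chassis_id}/Power"
    request = urllib.request.Request(
        url,
        headers={"Authorization": auth_header, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

In use, `total_power_watts(fetch_power_resource(...))` gives whole-chassis watts as reported by the BMC. Many BMCs ship with self-signed certificates, so a real collector would also need to decide how to handle TLS verification.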

adrianco · 2024-01-02T17:53:26Z

Cloud providers may not currently be logging energy data for all their machines; if so, the additional cost of providing it through an API would be high. On-demand logging of energy data would have less overhead, but could still be a significant engineering project to implement. A lighter-weight alternative would be for each cloud provider to publish a calibration curve that maps utilization to power consumption. This works fairly well for simple CPUs, has issues with Hyper-Threading, and doesn't work for GPUs, which are of particular interest now that they are becoming common and use a lot more power than CPUs. Calibration curves are available for CPU types that map to datacenter usage or bare metal instances, but there are a lot of custom chips in use at cloud providers: special versions of Intel and AMD parts, fully custom ARM designs, and GPU/TPU accelerators.
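
To make the calibration curve idea concrete, here is a small sketch of how a published curve could be applied: piecewise-linear interpolation between (utilization, watts) points. The curve values below are invented for illustration and are not measurements of any real CPU or instance type.

```python
from bisect import bisect_right

# Hypothetical calibration points: (CPU utilization in percent, watts).
# Placeholder numbers only, not measured data.
EXAMPLE_CURVE = [(0, 100.0), (10, 150.0), (50, 250.0), (100, 300.0)]


def estimate_watts(utilization_pct: float, curve=EXAMPLE_CURVE) -> float:
    """Estimate power at a utilization level by linear interpolation on the curve."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    utilizations = [u for u, _ in curve]
    i = bisect_right(utilizations, utilization_pct)
    if i == 0:
        return curve[0][1]   # below the first calibration point
    if i == len(curve):
        return curve[-1][1]  # at or above the last calibration point
    (u0, w0), (u1, w1) = curve[i - 1], curve[i]
    return w0 + (w1 - w0) * (utilization_pct - u0) / (u1 - u0)
```

With the placeholder curve, `estimate_watts(30)` falls between the 10% and 50% points. A single CPU-utilization axis is exactly where the Hyper-Threading and GPU limitations mentioned above show up, since neither is captured by one utilization number.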

seanmcilroy29 added the action item label Apr 5, 2024

seanmcilroy29 assigned adrianco Apr 5, 2024
