How can energy data collection and reporting be implemented by a cloud provider? #4

Open
Tracked by #5 ...
adrianco opened this issue Jul 28, 2023 · 6 comments

@adrianco
Contributor

Cloud providers may not currently collect energy use at the system level across their fleet of machines, so providing this information could carry a development and deployment cost. Raw energy data can't be provided per virtual machine instance because it is only collected for the full system, and exposing it has security implications (see the Intel CVE guidance on RAPL energy reporting: https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html). This issue provides a place to discuss workarounds and solutions for this problem.

@adrianco
Contributor Author

Workaround: Cloud providers may be able to supply "bare metal" instances, which are complete machines with no hypervisor and no partitioning. On those instance types it may be acceptable to allow access to interfaces such as Intel RAPL, which would allow energy monitoring for the whole instance. Questions: Which cloud providers supply bare metal instances, and do they currently allow or block RAPL?

@adrianco
Contributor Author

adrianco commented Aug 9, 2023

How is energy collected in datacenters? The power distribution units (PDUs) measure power usage at each outlet, and the API differs depending on which vendor is used. APC is a common vendor. I was talking to Rob Hirschfeld of RackN, who knows these APIs well and may be able to help us figure out how to collect the data.

@adrianco
Contributor Author

It appears that AWS EC2 bare metal instances do not block RAPL. One next step is to list those bare metal instance types and see whether Kepler's model can be calibrated against real bare metal data.
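
For anyone who wants to check RAPL access on a bare metal instance, here is a minimal sketch, assuming a Linux host that exposes the powercap sysfs interface. The `intel-rapl:0` domain path is an assumption (it is usually CPU package 0, but the layout varies by machine), the helper names are mine, and on recent kernels reading `energy_uj` typically requires root. It samples the cumulative energy counter twice and converts the difference to average watts, allowing for one counter wraparound.

```python
import time
from pathlib import Path

# Assumed location of one RAPL domain (usually CPU package 0) in the
# Linux powercap tree; the layout differs between machines.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name: str, domain: Path = RAPL_DOMAIN) -> int:
    """Read an integer microjoule value from a powercap sysfs file."""
    return int((domain / name).read_text().strip())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy used between two counter readings, in microjoules.

    The counter is cumulative and wraps at max_range, so a smaller
    second reading is treated as a single wraparound (approximate).
    """
    if after >= before:
        return after - before
    return (max_range - before) + after


def average_power_watts(interval_s: float = 1.0,
                        domain: Path = RAPL_DOMAIN) -> float:
    """Sample the energy counter twice and return average power in watts."""
    max_range = read_uj("max_energy_range_uj", domain)
    before = read_uj("energy_uj", domain)
    time.sleep(interval_s)
    after = read_uj("energy_uj", domain)
    return energy_delta_uj(before, after, max_range) / 1e6 / interval_s
```

Calling `average_power_watts()` on a bare metal instance should either return a plausible wattage or raise `FileNotFoundError` / `PermissionError`, which is a quick way to tell whether RAPL is exposed or blocked on that instance type.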

@ArneTR

ArneTR commented Aug 17, 2023

Hey @adrianco, just stumbled over this post as we were writing an overview post for ourselves lately.

Did you know that Teads has a list with RAPL data which also includes machines from AWS, Scaleway, Equinix etc. that supposedly allow RAPL access? This could prove very helpful: https://docs.google.com/spreadsheets/d/1DqYgQnEDLQVQm5acMAhLgHLD8xXCG9BIrk-_Nv6jF3k/edit#gid=985503428

Also, as mentioned, we have written up a short piece looking into which MSRs are available at some cloud vendors and which hypervisors they are running. Maybe also helpful: https://www.green-coding.berlin/blog/cloud-energy-usage-data/

I also linked the awesome project you are leading here :)

@adrianco
Contributor Author

adrianco commented Jan 2, 2024

We discussed this a bit and decided that we need to investigate the Redfish API in more detail: it is more general than RAPL, it's a DMTF standard, and Kepler has figured out how to use it. The next step is to coordinate with the Kepler team to see if we can share in what they have learned.
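
As a starting point for that investigation, here is a rough sketch of reading power from a Redfish service. It assumes the BMC exposes the classic `Power` resource at `/redfish/v1/Chassis/{id}/Power` with `PowerControl[].PowerConsumedWatts`; some newer implementations report power through different resources, so the path is an assumption. The BMC URL, chassis ID, credentials and function names are hypothetical placeholders, and this is not necessarily how Kepler does it.

```python
import base64
import json
import urllib.request


def power_readings(power_resource: dict) -> list[tuple[str, float]]:
    """Extract (label, watts) pairs from a Redfish Power resource payload.

    Entries without a PowerConsumedWatts value are skipped. Readings are
    returned per entry rather than summed, since entries may overlap.
    """
    readings = []
    for entry in power_resource.get("PowerControl", []):
        watts = entry.get("PowerConsumedWatts")
        if watts is not None:
            label = entry.get("Name") or entry.get("MemberId") or "unknown"
            readings.append((label, float(watts)))
    return readings


def fetch_power_resource(bmc_url: str, chassis_id: str,
                         user: str, password: str) -> dict:
    """GET the Power resource for one chassis using HTTP basic auth.

    bmc_url and chassis_id are placeholders, e.g. "https://bmc.example"
    and "1"; real chassis IDs are listed under /redfish/v1/Chassis.
    """
    url = f"{bmc_url}/redfish/v1/Chassis/{chassis_id}/Power"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Keeping the parsing separate from the HTTP call means the same `power_readings` function can be run against saved payloads from different vendors' BMCs to compare what each one reports.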

@adrianco
Contributor Author

adrianco commented Jan 2, 2024

Cloud providers may not currently log energy data for all their machines, in which case the additional cost of providing it through an API would be high. On-demand logging of energy data would carry less overhead, but could still be a significant engineering project to implement. A lighter-weight alternative would be for each cloud provider to publish a calibration curve that maps utilization to power consumption. This works fairly well for simple CPUs, has issues with Hyper-Threading, and doesn't work for GPUs, which are of particular interest now that they are becoming common and use a lot more power than CPUs. Calibration curves are available for CPU types that map to datacenter usage or bare metal instances, but cloud providers use a lot of custom chips: special versions of Intel and AMD parts, fully custom ARM designs, and GPU/TPU accelerators.
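
To make the calibration curve idea concrete, here is a small sketch of how a consumer might use a published curve: piecewise-linear interpolation between (utilization, watts) points. The curve values below are made up for illustration and do not describe any real CPU or instance type; the function name is also mine.

```python
from bisect import bisect_right

# Hypothetical calibration points as (CPU utilization %, watts).
# Illustrative values only, not measurements of any real machine.
EXAMPLE_CURVE = [(0.0, 100.0), (10.0, 150.0), (50.0, 250.0), (100.0, 350.0)]


def estimate_watts(utilization: float, curve=EXAMPLE_CURVE) -> float:
    """Estimate power draw by linear interpolation between curve points.

    The curve must be sorted by utilization; values outside its range
    are rejected rather than extrapolated.
    """
    if not curve[0][0] <= utilization <= curve[-1][0]:
        raise ValueError("utilization outside calibrated range")
    utilizations = [u for u, _ in curve]
    i = bisect_right(utilizations, utilization)
    if i == len(curve):
        # Exactly at the last calibration point.
        return curve[-1][1]
    (u0, w0), (u1, w1) = curve[i - 1], curve[i]
    return w0 + (w1 - w0) * (utilization - u0) / (u1 - u0)
```

With the example points, `estimate_watts(30.0)` falls halfway between the 10% and 50% points. The limitations above still apply: a single utilization number hides Hyper-Threading effects, and a CPU curve says nothing about GPU power.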
