-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] Prometheus metric export #134
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! This is an important piece in making Zeus more usable in a real world scenario. I looked over it at a mid- to high-level (not the nitty gritty details yet) and left some comments. Let me know what you think.
Co-authored-by: Jae-Won Chung <[email protected]>
- Changed metric instantiation to accept CPU and GPU indices directly instead of class objects. - Improved multiprocessing logic to address and fix pickle-related errors. - Added consistent handling for sync_execution across begin_window and end_window calls for all metrics. - Centralized bucket range validation and default handling for EnergyHistogram. - Improved error handling and logging for multiprocessing processes. - Standardized Prometheus metric labels (e.g., window and index) across Histogram, Counter, and Gauge. - Updated docstrings for consistency and clarity across all Metric subclasses.
Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.
examples/prometheus/train_single.py
Outdated
from PIL import Image, ImageFile, UnidentifiedImageError | ||
#set_start_method("fork", force=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the random image-related stuff that were added and completely remove the power limit optimizer, not commenting it out. Overall, this file needs cleanup.
examples/prometheus/train_single.py
Outdated
# Histogram to track energy consumption over time | ||
energy_histogram = EnergyHistogram(cpu_indices=[0,1], gpu_indices=[0], prometheus_url='http://localhost:9091', job='training_energy_histogram') | ||
# Gauge to track power consumption over time | ||
power_gauge = PowerGauge(gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_power_gauge') | ||
# Counter to track energy consumption over time | ||
energy_counter = EnergyCumulativeCounter(cpu_indices=[0,1], gpu_indices=[0], update_period=2, prometheus_url='http://localhost:9091', job='training_energy_counter') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are very long. They're better off using multiple lines, like before the change.
examples/prometheus/train_single.py
Outdated
energy_histogram.begin_window("training_energy") | ||
energy_histogram.end_window("training_energy") | ||
train(train_loader, model, criterion, optimizer, epoch, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this begin_window
, train
, then end_window
?
examples/prometheus/train_single.py
Outdated
print(f"Top-1 accuracy: {acc1}") | ||
|
||
# Allow metrics to capture remaining data before shutting down monitoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are useful. Please bring them back.
@@ -430,3 +418,4 @@ def accuracy(output, target, topk=(1,)): | |||
|
|||
if __name__ == "__main__": | |||
main() | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline.
tests/test_metric.py
Outdated
gpu_energy={0: 30.0, 1: 35.0, 2: 40.0}, | ||
cpu_energy={0: 20.0, 1: 25.0}, | ||
gpu_energy={0: 50.0, 1: 100.0, 2: 200.0}, | ||
cpu_energy={0: 40.0, 1: 50.0}, | ||
dram_energy={}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If mock CPU 0 supports DRAM energy measurement (in mock_get_cpus
), shouldn't this be something like dram_energy={0: 10.0}
?
The metrics would be expecting the monitor to provide DRAM energy measurements for CPU 0, but if the Measurement object has nothing, shouldn't it raise an error?
zeus/metric.py
Outdated
|
||
Args: | ||
name (str): Name of the measurement window. | ||
sync_execution (bool): Whether to execute synchronously. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. See ZeusMonitor
.
zeus/metric.py
Outdated
@@ -54,6 +73,9 @@ class EnergyHistogram(Metric): | |||
gpu_bucket_range: Histogram buckets for GPU energy. | |||
cpu_bucket_range: Histogram buckets for CPU energy. | |||
dram_bucket_range: Histogram buckets for DRAM energy. | |||
gpu_histograms: A single Prometheus Histogram metric for all GPU energy consumption, indexed by window and GPU index. | |||
cpu_histograms: A single Prometheus Histogram metric for all CPU energy consumption, indexed by window and CPU index. | |||
dram_histograms: A single Prometheus Histogram metric for all DRAM energy consumption, indexed by window and DRAM index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the entire Attributes
section. They're not intended to be public attributes AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For every class.
Add link to the push gateway Co-authored-by: Jae-Won Chung <[email protected]>
@@ -63,7 +85,7 @@ def __init__( | |||
prometheus_url: str, | |||
job: str, | |||
gpu_bucket_range: Sequence[float] = [50.0, 100.0, 200.0, 500.0, 1000.0], | |||
cpu_bucket_range: Sequence[float] = [10.0, 20.0, 50.0, 100.0, 200.0], | |||
cpu_bucket_range: Sequence[float] = [10.0, 50.0, 100.0, 500.0, 1000.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You updated the default range, but you never reflected the change in any docstring or doc. (1) Update the docstring, and (2) remove the defaults listed in measure.md
and instead point people to the API reference page for defaults.
Generalize the device as {component} with the note that Gauge only supports GPU Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py
Outdated
self.energy_monitor.begin_window( | ||
f"__EnergyHistogram_{name}", sync_execution=True | ||
) | ||
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution) | |
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution=sync_execution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for end_window.
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py
Outdated
@@ -288,28 +319,36 @@ def begin_window(self, name: str) -> None: | |||
self.update_period, | |||
self.prometheus_url, | |||
self.job, | |||
sync_execution, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. If sync_execution
is True
, you need to call zeus.utils.framework.sync_execution
on the main thread there the application is running. On the other hand, the power/energy monitor process's ZeusMonitor
should always be invoked with sync_execution=False
.
Read ZeusMonitor
to see how sync_execution
(sometimes a boolean parameter and other times a function in zeus.utils.framework
) is being used.
zeus/metric.py
Outdated
self.window_state[name] = MonitoringProcessState( | ||
queue=self.queue, proc=self.proc | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you putting these in self.queue
and self.proc
??
zeus/metric.py
Outdated
if self.queue is not None: | ||
self.queue.put("stop") | ||
else: | ||
raise RuntimeError("Queue is not initialized") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.queue
can the queue from any random window??? More specifically, it's going to be the queue that belongs to the most recently started window.
This level of quality is completely unacceptable. Please re-check the correctness of every line of code and documentation, and then ask for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments on changes to make. I think they will be more or less straightforward ones. Let's hope this is the final round of change requests. Thanks!
…ng window operations
…ogic and multiprocessing window state management.
This pull request introduces Prometheus-based metric tracking for energy and power usage within the Zeus framework. It includes functionality for monitoring GPU, CPU, and DRAM energy usage via Histograms, Cumulative Counters, and Gauges.
zeus/metric.py
:A new module that introduces
EnergyHistogram
,EnergyCumulativeCounter
, andPowerGauge
classes. These classes enable real-time monitoring of CPU, GPU, and DRAM energy and power consumption by integrating with Prometheus.zeus/prometheus.yml
:Configuration file for setting up Prometheus monitoring.
zeus/docker-compose.yml
:A Docker Compose file for easily setting up Prometheus with the project for local or cloud-based monitoring.
Modified
pyproject.toml
:Added
prometheus-client
as an optional dependency for Prometheus metric integration.