Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Implement CPU and DRAM monitoring for zeusd #137

Merged
merged 27 commits into from
Nov 28, 2024
Merged

Conversation

wbjin
Copy link
Collaborator

@wbjin wbjin commented Oct 27, 2024

Additions

  • /cpu_id/get_index_energy route for getting CPU and DRAM Rapl energy measurements. Request body includes "cpu" and "dram" fields that indicate whether zeusd should poll cpu and/or dram. Response body contains "cpu_energy_uj" and "dram_energy_uj".

Changes

  • devices/gpu/linux.rs device_count method. Changed so that returns 0 if NVML is unavailable or an error occurs. TODO: change method so that it doesn't squash NVML errors that users will have to be aware of.
  • startup.rs start_cpu_device_tasks method. Create CpuManagementTasks used in main.rs. Also added CpuManagementTasks to app_data in start_server_uds and start_server_tcp methods.
  • main.rs. Added a CpuManagementTasks stop_monitoring method call to stop monitoring tasks.

@wbjin wbjin requested a review from jaywonchung October 27, 2024 18:37
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your work! This is a super valuable addition to the Zeus daemon. I've left some comments and questions. Let me know what you think.

zeusd/Cargo.toml Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/linux.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/linux.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/linux.rs Show resolved Hide resolved
zeusd/src/devices/cpu/mod.rs Outdated Show resolved Hide resolved
zeusd/src/startup.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/mod.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/mod.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/linux.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/linux.rs Outdated Show resolved Hide resolved
@wbjin
Copy link
Collaborator Author

wbjin commented Nov 4, 2024

I'm getting this compile error with the set_persistent method all of a sudden.

error[E0599]: no method named `set_persistent` found for struct `nvml_wrapper::Device` in the current scope
  --> src/devices/gpu/linux.rs:45:24
   |
45 |         Ok(self.device.set_persistent(enabled)?)
   |                        ^^^^^^^^^^^^^^ method not found in `Device<'static>`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `zeusd` (lib) due to 1 previous error

Was there a change with Nvml?

@jaywonchung
Copy link
Member

The error indicates a Rust NVML wrapper side issue. If actual NVML (libnvidia-ml.so) removed it, you'll get a runtime symbol not found error or alike from dynamic loading. See if stuff like nvml-wrapper or nvml-wrapper-sys were updated. I think it's very unlikely that that API will be removed from NVML.

@jaywonchung
Copy link
Member

Look at nvml-wrapper's source code of set_persistent and see if there are any #[cfg] directives that may have removed the method during compilation. That may hint an some unintended change in your compilation environment.

@wbjin wbjin requested a review from jaywonchung November 5, 2024 18:54
@jaywonchung jaywonchung changed the title Implement CPU and DRAM monitoring for the Zeusd [Feat] Implement CPU and DRAM monitoring for zeusd Nov 13, 2024
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major! Thanks a lot for the great work! This will be the last round of reviewing.

zeusd/src/devices/cpu/linux.rs Show resolved Hide resolved
zeusd/tests/helpers/mod.rs Outdated Show resolved Hide resolved
zeusd/tests/cpu.rs Outdated Show resolved Hide resolved
zeusd/src/devices/cpu/mod.rs Show resolved Hide resolved
@jaywonchung jaywonchung linked an issue Nov 23, 2024 that may be closed by this pull request
Co-authored-by: Jae-Won Chung <[email protected]>
@wbjin wbjin requested a review from jaywonchung November 28, 2024 17:34
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jaywonchung
Copy link
Member

jaywonchung commented Nov 28, 2024

Oh, there seems to be a merge conflict that needs to be resolved.

Resolved merge conflict.

@jaywonchung
Copy link
Member

Failing test is unrelated to this PR. Merging, thanks again @wbjin!

@jaywonchung jaywonchung merged commit 5a419c4 into master Nov 28, 2024
2 of 3 checks passed
@jaywonchung jaywonchung deleted the jw-zeusd-rapl branch November 28, 2024 19:46
parthraut pushed a commit that referenced this pull request Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Intel RAPL support to zeusd
2 participants