Benchmarking Results: Energy Efficiency and Wall Time Analysis #3400

alanakihn · 2023-12-06T06:18:28Z


I wanted to share a few performance results from an undergraduate research project I conducted, supervised by @BrodiePearson and @amrapallig. I evaluated energy efficiency and wall time across different hardware available to us:

- AMD 23xx CPU
- Nvidia V100 GPU
- Nvidia A100 GPU
- Nvidia H100 GPU (incoming – we just got access to our H100 servers)

We used the Langmuir turbulence example and explored simulations of different sizes by adjusting the number of grid points while keeping the resolution the same across simulations. We also performed a similar analysis for GeophysicalFlows.jl, a Julia package for spectral simulation of 2D/layered geophysical systems (some lines included below; @navidcy), and for the convecting plankton Oceananigans example (adjusted to be 3D – results not shown here).

Speed of simulations

Our data indicate that the Nvidia A100 GPUs are markedly faster than the V100s in computational wall time. The A100s show significantly lower normalized wall times at every grid size tested and allow larger simulations than the V100s, showcasing the A100's more advanced memory and computational capabilities. The wall times are normalized by the fastest simulation (across all hardware) for a given software package. For Oceananigans, the fastest simulation at the largest number of simulated grid points took 247 seconds (about 4 minutes); for GeophysicalFlows it took 100 seconds.
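As a rough illustration of the normalization described above, here is a minimal Python sketch. It assumes that "normalized by the fastest simulation" means dividing every wall time for one software package by the smallest wall time in that set. The function name, the labels, and the numbers are hypothetical placeholders, not measured results.

```python
def normalize_wall_times(wall_times):
    """Divide each wall time by the fastest (smallest) one in the mapping.

    `wall_times` maps a label, e.g. (hardware, grid_points), to seconds.
    The fastest run ends up with a normalized value of 1.0.
    """
    if not wall_times:
        raise ValueError("need at least one wall time")
    fastest = min(wall_times.values())
    if fastest <= 0:
        raise ValueError("wall times must be positive")
    return {label: t / fastest for label, t in wall_times.items()}


# Placeholder numbers for illustration only (NOT measured results).
example = {
    ("gpu_a", 64**3): 10.0,
    ("gpu_b", 64**3): 20.0,
    ("cpu", 64**3): 500.0,
}
normalized = normalize_wall_times(example)
```

With this convention a value of 2.0 simply means "twice as slow as the fastest run for that package", which makes curves from different hardware directly comparable.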

Energy Efficiency

In terms of energy efficiency, the A100 GPUs consume less energy than the V100s or the CPUs for a given simulation. This benefit stems from the increased speed of the simulations (shorter simulations require hardware to run for less time), rather than from changes in the rate of hardware energy usage (power consumption) as the number of grid points is changed. The energy consumption per grid point for the A100 GPUs is consistently lower than that for the V100 GPUs and CPUs, highlighting the A100's ability to run more energy-efficient (faster!) simulations despite having the highest power consumption of the three pieces of hardware.
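The reasoning above (energy is power multiplied by run time, so a faster device can use less energy even at a higher power draw) can be sketched in a few lines of Python. This assumes a roughly constant mean power over the run. The function name and all numbers are hypothetical and chosen only to show the arithmetic; they are not benchmark data.

```python
def energy_per_grid_point(mean_power_watts, wall_time_seconds, grid_points):
    """Return joules per grid point, assuming a roughly constant power draw."""
    if grid_points <= 0:
        raise ValueError("grid_points must be positive")
    return mean_power_watts * wall_time_seconds / grid_points


# Hypothetical devices (NOT measured results): the second draws 4x the
# power but finishes 10x sooner, so it uses less energy per grid point.
slow_low_power = energy_per_grid_point(100.0, 1000.0, 1_000_000)
fast_high_power = energy_per_grid_point(400.0, 100.0, 1_000_000)
```

In this made-up case the faster device uses 0.04 J per grid point against 0.1 J for the slower one, even though its power draw is four times higher.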

Quantitative Insights

For large simulations, the Nvidia A100 GPUs are approximately 1000 times faster and consume roughly 100 times less energy per grid point compared to the AMD CPUs. In addition, the A100 GPUs are about twice as fast and consume nearly half as much energy as the Nvidia V100 GPUs.

Future Directions

We aim to extend this benchmarking effort to include the 2023 Nvidia H100 GPUs and multi-GPU LES benchmarks, and will post updates as we get results. We currently have access to ~8 H100s and a small number of V100s and A100s. We're also expecting some Grace Hoppers and a larger number of H100s in the medium term.

If anyone has suggestions for new directions/extensions of this work, please let us know! @glwagner has already suggested that we look at diagnostics and explore whether there are ways to speed up the code when a large number of diverse or similar diagnostics are being computed.

vchuravy · 2023-12-06T14:24:34Z

Collaborator

Nice analysis!

> For large simulations, the Nvidia A100 GPUs are approximately 1000 times faster and consume roughly 100 times less energy per grid point compared to the AMD CPUs.

Without knowing more about your AMD CPU I would be very careful with such statements. Right now you are comparing a TGV against a tram. You call it an AMD 23xx CPU, but the only one I can find that matches that description is the Ryzen 3 2300X.
So that's a 100 USD part vs a 40k USD part.

How did you measure the energy utilization?

1 reply

BrodiePearson Feb 12, 2024
Collaborator

Thanks for your comments @vchuravy, and sorry for not responding sooner. You're right that the CPUs used are vastly different to the GPUs. We're currently repeating benchmarks with some other CPUs (and with larger thread counts) to create a more complete comparison.

For the energy utilization we use `nvidia-smi` to query the power draw of GPUs. For the CPUs we indirectly estimate energy utilization using the manufacturer's maximum power draw for the specific CPU and the percentage CPU usage. We aren't aware of a simple direct method for measuring CPU energy.
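For readers who want to try something similar, here is a minimal Python sketch of the two estimates described above. It is an assumption about how such a calculation could look, not the scripts used for these benchmarks. It assumes GPU power is sampled periodically (for example with `nvidia-smi --query-gpu=timestamp,power.draw --format=csv,noheader,nounits -l 1`) and integrated over time, and that CPU energy is approximated as maximum power times utilization. Both function names are hypothetical.

```python
def energy_from_power_samples(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule.

    Returns energy in joules. Samples could come from periodic
    nvidia-smi power.draw queries (an assumption, see lead-in).
    """
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have equal length")
    if len(times_s) < 2:
        raise ValueError("need at least two samples")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


def estimate_cpu_energy(max_power_w, utilization_fractions, sample_interval_s):
    """Indirect CPU estimate: max power scaled by utilization per sample.

    `utilization_fractions` holds CPU usage as values between 0 and 1,
    one per sampling interval of length `sample_interval_s` seconds.
    """
    for u in utilization_fractions:
        if not 0.0 <= u <= 1.0:
            raise ValueError("utilization must be between 0 and 1")
    return sum(max_power_w * u * sample_interval_s for u in utilization_fractions)


# Illustrative inputs only (NOT measured results).
gpu_joules = energy_from_power_samples([0.0, 1.0, 2.0], [100.0, 200.0, 100.0])
cpu_joules = estimate_cpu_energy(100.0, [0.5, 1.0], 2.0)
```

Scaling maximum power by utilization is a coarse proxy: it does not account for idle power or frequency scaling, so the CPU figure is best read as an estimate, not a measurement.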
