10-15% Speedup by enqueuing more at a time #231

Open
Meerkov opened this issue Sep 27, 2024 · 2 comments

Meerkov commented Sep 27, 2024

for(uint d=0u; d<get_D(); d++) lbm_domain[d]->enqueue_stream_collide(); // run LBM stream_collide kernel after domain communication

Tested on the 2D Taylor-Green vortex setup.

By default, I get around 2400-2500 steps per second. I'll use 2490 steps/s as my baseline.

I added the following simple modification.

	for (uint d = 0u; d < get_D(); d++)
	{
		for (uint step = 0; step < 4; step++) {
			lbm_domain[d]->increment_time_step();
			lbm_domain[d]->enqueue_stream_collide(); // run LBM stream_collide kernel after domain communication
		}
		
	}

This enqueues 4 steps at a time, before doing a blocking synchronization step.

On my PC, this now reports 692 Steps/s. The counter still assumes one step per iteration while the domain actually runs 4, so the real rate is 692 x 4 = 2768 steps/s.

2768/2490 is about an 11% speedup.

You can enqueue more at a time, say 100 steps per iteration.

Now the output reports 29 Steps/s, which implies a real rate of about 2900 steps/s (a 16% speedup). The downside is that you're probably only rendering just under 30 FPS (one frame per batch of 100 steps, 29 batches per second) instead of 60 FPS.
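The conversion used in both cases above can be written down as a small helper. This is only a sketch of the arithmetic; `actual_steps_per_second` and `relative_speedup` are hypothetical names, not FluidX3D functions.

```cpp
#include <cassert>

// Hypothetical helper, not part of FluidX3D: converts the displayed Steps/s
// into real LBM steps per second when each loop iteration enqueues
// batch_size stream_collide kernels but the counter only sees one step.
inline double actual_steps_per_second(const double displayed_steps_per_second, const unsigned int batch_size) {
	assert(batch_size > 0u);
	return displayed_steps_per_second * static_cast<double>(batch_size);
}

// Relative gain over a baseline rate, e.g. about 0.11 for an 11% speedup.
inline double relative_speedup(const double actual, const double baseline) {
	assert(baseline > 0.0);
	return actual / baseline - 1.0;
}
```

For example, `relative_speedup(actual_steps_per_second(692.0, 4u), 2490.0)` evaluates to roughly 0.11, and the same call with `29.0` and `100u` to roughly 0.16.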

An ideal solution would probably adjust the number of enqueued steps dynamically, enqueuing more whenever the frame rate is above 60 FPS.
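One way such a dynamic adjustment might look is sketched below. `BatchSizeController` is a hypothetical struct, not something that exists in FluidX3D, and the 60 FPS target and the upper limit of 100 steps are assumptions taken from the numbers in this comment. The idea is to size each batch so that it takes roughly one frame.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller, not part of FluidX3D: picks how many LBM steps to
// enqueue before the next blocking synchronization, so that one batch takes
// roughly one frame at the target frame rate.
struct BatchSizeController {
	double target_fps = 60.0;      // assumed rendering target
	unsigned int min_batch = 1u;   // never enqueue fewer than one step
	unsigned int max_batch = 100u; // assumed upper limit

	// measured_steps_per_second is the real simulation rate,
	// i.e. the displayed Steps/s already multiplied by the current batch size
	unsigned int next_batch_size(const double measured_steps_per_second) const {
		assert(target_fps > 0.0 && min_batch >= 1u && min_batch <= max_batch);
		if(measured_steps_per_second <= 0.0) return min_batch; // no measurement yet
		const double steps_per_frame = measured_steps_per_second / target_fps;
		const double clamped = std::min(std::max(steps_per_frame, static_cast<double>(min_batch)), static_cast<double>(max_batch));
		return static_cast<unsigned int>(clamped); // truncate so a batch stays within one frame's budget
	}
};
```

With the roughly 2900 steps/s measured above, `BatchSizeController{}.next_batch_size(2900.0)` returns 48, which would keep rendering near 60 FPS while still removing most of the synchronizations. In the modified loop, the returned value would replace the hard-coded `4`.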

Edit: Make sure you remove the lbm_domain[d]->increment_time_step(); call that runs after synchronization, otherwise the time step count will no longer be correct.

Meerkov commented Sep 28, 2024

Benchmark:
Before (enqueue 1 at a time) - 1000 x 1000 runs

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                      32 x 32 x 32 = 32768 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                     CPU 0 MB, GPU 1x 2 MB |
| Max Alloc Size  |                                                      2 MB |
| Time Steps      |                                                      1000 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                            Re < 18.475208 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|     934 |    143 GB/s |     28514 |       999687  69% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 905                                                    |

After (enqueue 100 at a time) - 1000 x 10 runs

|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                      32 x 32 x 32 = 32768 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                     CPU 0 MB, GPU 1x 2 MB |
| Max Alloc Size  |                                                      2 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                            Re < 18.475208 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|      24 |      4 GB/s |       718 |       996723 7230% |                  0s ||
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 24                                                     |

Notice that the current step is about the same (roughly 1 million total steps), but the reported MLUPs, bandwidth, and Steps/s are all 100x lower than the real values. Multiplied by 100, the row becomes:


|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2400 |    400 GB/s |     71800 |       996723 7230% |                  0s ||
|---------'-------------'-----------'-------------------'---------------------|

This implies that batching the enqueues gives about a 2.5x speedup (71800 vs. 28514 steps/s) in this toy example, where each step takes nearly no GPU time.
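To get a feeling for where the time goes, the two measurements can be fed into a simple cost model. The model is an assumption, not something FluidX3D reports: each step costs a fixed kernel time, and each blocking synchronization adds a fixed overhead that is shared by all steps in the batch, so the time per step is `kernel_seconds + sync_seconds / batch_size`. `StepCostModel` and `fit_step_cost_model` are hypothetical names.

```cpp
#include <cassert>

// Assumed two-parameter model: time per step = kernel_seconds + sync_seconds / batch_size
struct StepCostModel {
	double kernel_seconds; // per-step cost that batching cannot remove
	double sync_seconds;   // cost paid once per blocking synchronization
};

// Fits the model to two measured real step rates:
// one with batch size 1 and one with batch size batch_size (> 1).
inline StepCostModel fit_step_cost_model(const double steps_per_second_unbatched, const double steps_per_second_batched, const unsigned int batch_size) {
	assert(batch_size > 1u && steps_per_second_unbatched > 0.0 && steps_per_second_batched > 0.0);
	const double t1 = 1.0 / steps_per_second_unbatched; // seconds per step at batch size 1
	const double tn = 1.0 / steps_per_second_batched;   // seconds per step at batch size n
	const double n = static_cast<double>(batch_size);
	StepCostModel model;
	model.sync_seconds = (t1 - tn) * n / (n - 1.0); // solve t1 - tn = sync * (1 - 1/n)
	model.kernel_seconds = t1 - model.sync_seconds;
	return model;
}
```

Calling `fit_step_cost_model(28514.0, 71800.0, 100u)` gives roughly 21 microseconds per synchronization and roughly 14 microseconds per step under this model, which would mean that at 32 x 32 x 32 more than half of the unbatched step time is overhead rather than GPU work.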

Meerkov commented Sep 28, 2024

If I fix the stats and then test on a more strenuous benchmark (e.g. the default 256x256x256), the benefit goes away. That makes sense, because the benefit should mainly affect smaller simulations that synchronize too often relative to the work they do per step:

|                                     \ /               FluidX3D Version 2.17 | // With custom enqueuing modification
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 3060                                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 3060                                    |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 552.22 (Windows)                                           |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 28 at 1867 MHz (3584 cores, 13.383 TFLOPs/s)               |
| Memory, Cache  | 12287 MB, 784 KB global / 48 KB local                      |
| Buffer Limits  | 3071 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| Info: Allocating memory. This may take a few seconds.                       |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    4012 |    309 GB/s |       239 |      1000000 10000% |                 13s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 4013                                                   |

This MLUPs figure matches other benchmarks for the 3060, so the benefit likely matters only for smaller sims that suffer from unnecessary CPU overhead.
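A rough estimate shows why the effect disappears. The sketch below assumes that the time saved per step by batching is the same at every grid size, which is a simplification (the two benchmarks also use different precision settings). `max_batching_gain` is a hypothetical helper, not a FluidX3D function.

```cpp
#include <cassert>

// Fraction of one step's runtime that batching could save at best,
// assuming the time saved per step does not depend on the grid size.
inline double max_batching_gain(const double saved_seconds_per_step, const double steps_per_second) {
	assert(saved_seconds_per_step >= 0.0 && steps_per_second > 0.0);
	const double step_seconds = 1.0 / steps_per_second; // runtime of one full step
	return saved_seconds_per_step / step_seconds;
}
```

At 32 x 32 x 32, batching saved about `1.0/28514.0 - 1.0/71800.0` seconds per step, roughly 21 microseconds. At 256 x 256 x 256 a step takes about 1/239 s, roughly 4.2 ms, so `max_batching_gain(1.0/28514.0 - 1.0/71800.0, 239.0)` comes out at about 0.005. A gain of around half a percent is within the noise of the benchmark.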
