New Rend3 4d10795 about 3x slower than old Rend3 f2b7df4 on low end GPU #477

Closed
John-Nagle opened this issue Mar 16, 2023 · 7 comments

@John-Nagle
Contributor

John-Nagle commented Mar 16, 2023

I just revised my "render-bench" program to work with Rend3 4d10795. It renders a city of identical buildings; after 10 seconds, half of the buildings are deleted, then re-added 10 seconds later. That's the test case for the WGPU locking issue, and I updated it to help with testing that.
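For reference, the churn pattern is roughly the following. This is a minimal sketch; `delete_half_of_city` and `restore_city` are hypothetical stand-ins for the real scene-management code, not render-bench's actual API.

```rust
use std::time::{Duration, Instant};

// Minimal sketch of the render-bench churn cycle described above.
// `delete_half_of_city` and `restore_city` are hypothetical stand-ins
// for the real scene-management code, not render-bench's actual API.
struct ChurnState {
    start: Instant,
    deleted: bool,
    restored: bool,
}

impl ChurnState {
    fn per_frame(&mut self) {
        let t = self.start.elapsed();
        if !self.deleted && t >= Duration::from_secs(10) {
            delete_half_of_city(); // drop half of the rend3 object handles
            self.deleted = true;
        } else if !self.restored && t >= Duration::from_secs(20) {
            restore_city(); // re-create the deleted objects
            self.restored = true;
        }
    }
}

fn delete_half_of_city() { /* drop the ObjectHandles for half the buildings */ }
fn restore_city() { /* re-add the deleted buildings */ }

fn main() {
    let mut churn = ChurnState { start: Instant::now(), deleted: false, restored: false };
    churn.per_frame(); // called once per rendered frame in the real benchmark
}
```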

Only the changes necessary to make it run again were made. The frame rate dropped from around 30 FPS to around 9 FPS on a machine with an old NVidia 640 GPU. On the big machine with an NVidia 3070, both old and new versions get around 60 FPS (except when loading new content, which is the WGPU locking issue).

  • Old version: branch "main".
  • New version: branch "rend3-mar2023".

Both are ready to clone, build with "cargo build --release", and run.

On the NVidia 640 machine,

  • Old version: 22 FPS, GPU utilization is 100%. GPU memory (2GB) is 61% full, CPU load is 100% of one CPU.
  • New version: 8 FPS, GPU utilization is 100%, GPU memory is 81% full, and CPU utilization is 50% of one CPU.

On the NVidia 3070 machine,

  • Old version: 60 FPS, GPU utilization is about 37%, GPU memory (8GB) is 17% full, and CPU utilization is maybe 30% of one CPU.
  • New version: 60 FPS, GPU utilization is about 45%, GPU memory is 37% full.

This is unexpected. All the metrics became worse. Am I doing something wrong?

@John-Nagle
Contributor Author

I'm totally mystified by the GPU memory consumption increase. Exactly the same meshes. Exactly the same textures. In the new version, vertices without rigging info are supposed to be smaller. Memory consumption should have decreased. That has to be a bug.
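To make that expectation concrete, here is a rough per-vertex size comparison. The attribute sizes below are assumptions about a typical layout, not rend3's actual vertex format.

```rust
// Back-of-envelope vertex sizes, assuming a typical attribute layout.
// These numbers are illustrative and are not rend3's actual vertex format.
const POSITION: usize = 12;      // 3 x f32
const NORMAL: usize = 12;        // 3 x f32
const TANGENT: usize = 12;       // 3 x f32
const UV: usize = 8;             // 2 x f32
const COLOR: usize = 4;          // 4 x u8
const JOINT_INDICES: usize = 8;  // 4 x u16
const JOINT_WEIGHTS: usize = 16; // 4 x f32

fn main() {
    let static_vertex = POSITION + NORMAL + TANGENT + UV + COLOR;
    let rigged_vertex = static_vertex + JOINT_INDICES + JOINT_WEIGHTS;
    // Dropping rigging data from non-rigged meshes should shrink each vertex
    // by roughly a third, so GPU memory should have gone down, not up.
    println!("static: {static_vertex} bytes, rigged: {rigged_vertex} bytes");
}
```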

@cwfitzgerald
Member

Memory usage difference is very surprising; there aren't significant changes in how memory is handled.

The performance delta of about 130% on the new machine is about what is to be expected on the current trunk. This is because the new culling shaders are running but not actually culling anything. The gpu-culling branch I'm working on is what enables culling and makes it improve performance rather than regress it.

The 3x difference on the 640 is quite surprising. I would need to figure out why that is. I suspect it's due to Kepler having issues with compute shaders and/or vertex pulling.

@John-Nagle
Contributor Author

Memory usage difference is very surprising; there aren't significant changes in how memory is handled.

Something big changed. I'm running the same test on two versions of Rend3. Exactly the same number of triangles and textures.
Same machines. Repeated trials. Something increased GPU memory consumption by about half a gigabyte. That's huge, especially on smaller GPUs. It wasn't in my code. You have both branches and can diff them.

For a while, all vertices were carrying full rigging information, even for non-rigged meshes. Which of the Rend3 versions listed above has that? I thought that was fixed and expected GPU memory usage to drop.

The performance delta of about 130% on the new machine is about what is to be expected on the current trunk.

The fastest version of Rend3 was 0.2.2. A year ago, my viewer using Rend3 was outperforming all the other Second Life viewers. After two rounds of Rend3 slowdowns, and several rounds of speedups in the C++ and Unity-based viewers, they now get higher frame rates than I do.
I can't ship a demo based on Rend3 now. It would be dismissed as a failure.

@John-Nagle
Contributor Author

I watched your video on occlusion culling. That has potential, if it works. The video seemed to indicate that occlusion groups all had to use the same "material". Does that mean "Material" in the Rend3 sense, or just using the same shader? Each of my objects has its own Rend3 "Material", because every object has its own base color and UV transform. (It's user-created content - no commonality. There's no uniformity in UV transforms.) Occlusion culling may not be a way out of the performance loss. Some special cases, mostly indoor scenes, may improve, but for big outdoor scenes it may be a net loss. Not sure. It would be useful to be able to turn all that off and go back to fast but dumb mode.

@cwfitzgerald
Member

I thought that was fixed and expected GPU memory usage to drop.

It is still expected to be lower. trunk does have those improvements to mesh memory usage. There could be a couple of causes of the increase:

  • rend3 no longer automatically defragments the data buffer. This proved to be extremely expensive and caused major hitching on the gpu side.
  • The culling solution effectively doubles the index buffers. Index buffers don't take up that much space, though, so this would be surprising (rough numbers sketched below).
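For a rough sense of scale on the second point: an extra copy of every index buffer adds about 12 bytes per triangle with u32 indices. The triangle count below is a made-up example, not render-bench's actual figure.

```rust
// Rough scale check: an extra copy of every index buffer adds about
// 12 bytes per triangle (u32 indices). The triangle count below is a
// made-up example, not render-bench's actual figure.
fn extra_index_bytes(triangles: u64) -> u64 {
    triangles * 3 * std::mem::size_of::<u32>() as u64
}

fn main() {
    let tris = 10_000_000; // hypothetical scene size
    println!("~{} MiB of extra index data", extra_index_bytes(tris) / (1024 * 1024));
}
```

Even at ten million triangles that works out to on the order of 100 MiB, which would not by itself explain a half-gigabyte jump.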

The video seemed to indicate that occlusion groups all had to use the same "material". Does that mean "Material" in the Rend3 sense, or just using the same shader?

In GpuPowered mode, as long as they have the same shader, they can be batched together. In CpuPowered mode, as long as they use the same textures (regardless of the rest of the material), they can be batched together.
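Illustratively, the batching key differs per mode something like the sketch below. The types are made up for the example and are not rend3's internal representation.

```rust
use std::collections::HashMap;

// Illustration of the batching rule above; these types are made up for
// the sketch and are not rend3's internal representation.
#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct ShaderId(u64);

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct TextureSetId(u64);

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum BatchKey {
    // GpuPowered: objects sharing a shader batch together, regardless of
    // per-object material parameters like base color or UV transform.
    Gpu { shader: ShaderId },
    // CpuPowered: objects must share the same set of bound textures.
    Cpu { textures: TextureSetId },
}

fn main() {
    // Objects are grouped by key; everything with the same key can be
    // drawn as one batch.
    let mut batches: HashMap<BatchKey, Vec<u32 /* object id */>> = HashMap::new();
    batches.entry(BatchKey::Gpu { shader: ShaderId(0) }).or_default().push(1);
    batches.entry(BatchKey::Gpu { shader: ShaderId(0) }).or_default().push(2);
    println!("{} batch(es)", batches.len()); // prints "1 batch(es)"
}
```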

Not sure. It would be useful to be able to turn all that off and go back to fast but dumb mode.

This is a potential option, and I know how I would do it. My first priority is to get the bugs out of the gpu-culling code and merge that in to see how the performance delta looks. There are also options to make the culling faster still, which I haven't explored yet.

@John-Nagle
Contributor Author

It's quite possible that the metaverse usage pattern (most instances are unique, GPU objects are constantly being created and removed) is causing problems. There's much more churn than in small-world games with pre-built content.

Rend3 no longer automatically defragments the data buffer. This proved to be extremely expensive and caused major hitching on the gpu side.

That might come up with "render-bench", but in that test, it's the same textures and meshes being created and destroyed. So fragmentation shouldn't occur. Memory usage as measured by the NVidia utility goes up to a peak value and stops.

The culling solution effectively doubles the index buffers. Index buffers don't take up that much space though, so this would be surprising.

Don't have enough info to evaluate that.

So the memory usage increase is still a puzzle.

Ref: gfx-rs/wgpu#2447
Ref: #348

Those are about being able to find out the memory situation from the application level. There are things I can do to cut memory consumption (reduce texture sizes, for example), and I need to know when that's needed.
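As a rough guide for when downsizing would pay off: an uncompressed RGBA8 texture with a full mip chain takes about 4/3 × width × height × 4 bytes. The sketch below is only that arithmetic, not a query of actual GPU memory.

```rust
// Rough estimate of GPU memory for an uncompressed RGBA8 texture with a
// full mip chain (~4/3 of the base level). Just arithmetic, as a guide
// for when reducing texture sizes would be worthwhile.
fn texture_bytes(width: u64, height: u64) -> u64 {
    let base = width * height * 4; // 4 bytes per RGBA8 texel
    base * 4 / 3                   // ~33% more for the mip chain
}

fn main() {
    // Halving each dimension cuts memory by roughly 4x.
    println!("1024x1024: ~{} KiB", texture_bytes(1024, 1024) / 1024);
    println!("512x512:   ~{} KiB", texture_bytes(512, 512) / 1024);
}
```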

The video seemed to indicate that occlusion groups all had to use the same "material". Does that mean "Material" in the Rend3 sense, or just using the same shader?
... in GpuPowered mode, as long as they have the same shader, they can be batched together. ...

Oh, good. Almost everything in my code uses the same default shader. Any future shaders will be for water, terrain, environment, sky, etc., which are special cases with very few instances. If it were per Rend3 "Material", all occlusion groups would have size 1.

The 3x difference on the 640 is quite surprising. I would need to figure out why that is. I suspect it's due to Kepler having issues with compute shaders and/or vertex pulling.

That makes sense. An NVidia 640 doesn't have compute shader hardware. Does running compute shaders without compute shader hardware fall back to CPU-side emulation in the driver, or do you have to emulate on your side?

@cwfitzgerald
Member

Closing this after #593
