Significant Performance Drop and High CPU Usage with BatchedMesh #28776
In the past few days, I've tried to find a solution, but without success. I've uploaded the relevant code and models to GitHub. The models consist of 12 million triangles and 16 million vertices. Is such a high CPU cost necessary for BatchedMesh? I don't think it should be. When rendering Batched3DModel in Cesium, I didn't encounter such issues, and I believe the mesh batching in Cesium and BatchedMesh should be quite similar. Additionally, I'd like to mention that after the update to version 166, the performance of BatchedMesh has worsened, and the frame rate has dropped further in the same scene. Here is the link to the code and models: batched-mesh-performance-test
For the sake of easily understanding the issue, please provide a live example that doesn't require pulling and running a separate GitHub project. You can host a demo page with GitHub Pages, for example. Recordings of the Chrome performance monitor would be helpful as well.
Here is the demo link: https://batched-mesh-performance-test.vercel.app The model is compressed using Draco and is approximately 44MB in size, with a total of 7.6 million triangles and 9.6 million vertices. It takes about 10 seconds to load the model. Initially, the page does not use BatchedMesh, and the frame rate on my computer is 60 FPS. You can switch to BatchedMesh by clicking the button on the bottom left, after which the frame rate drops to about 17 FPS.
I need to provide some additional details. When exporting the glTF model from Revit, I grouped meshes with the same materials. I added three extensions:
Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes, as well as a lot of custom glTF user code, which makes it difficult to understand what's going on. I think it would be best if we had an example that compared a single batched mesh to a merged mesh to show any performance differences, ideally without any external geometry file dependencies.
Replicating this issue with a single BatchedMesh would be difficult.
Turning off the `perObjectFrustumCulled` and `sortObjects` options helps. Additionally, I've noticed that enabling only one of the two options already causes a significant frame-rate drop.
I understand, but I'm asking for a minimal reproduction case to be provided. I think it's a more than reasonable ask for a simple demonstration case, separate from user code, to be made when reporting an issue and asking maintainers to spend time investigating. I can take a closer look once this minimal repro is available.
It depends on how many objects there are and where the bottleneck is. Frustum culling and sorting share a lot of the same logic, though, so enabling one or the other will have a larger apparent impact than if one is already enabled and you enable the other. If you provide a simple reproduction case it will be easier to understand what you're describing.
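For reference, a minimal sketch of the two toggles being discussed; `perObjectFrustumCulled` and `sortObjects` are real BatchedMesh properties, while the capacities passed to the constructor are illustrative only:

```js
import * as THREE from 'three';

// Capacities here are illustrative; the scenes in this thread are much larger.
const material = new THREE.MeshBasicMaterial();
const batched = new THREE.BatchedMesh( 1000, 500000, 1500000, material );

// The two per-item, per-frame CPU features being discussed. Both default to true.
batched.perObjectFrustumCulled = false; // skip per-item CPU culling
batched.sortObjects = false;            // skip per-item depth sorting
```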
Ah, the case I mentioned above is actually based on the sorting behavior.
Sorry about this. I'm not very good at English, so I often rely on ChatGPT to help me write. If there are any impolite words or phrases, please forgive me...
If the sort behavior is separate from the original performance question then I'd prefer to focus on one thing at a time. You can ask at the forum if you'd like help understanding the performance implications of sorting objects. Please provide a simple example in something like jsfiddle that shows the performance differences you're observing in #28776 (comment) without using any custom 3D model or complex feature processing logic.
I've set up a page where you can switch between "BatchedMesh" and "MergedMesh". Here's the link: https://batched-mesh-performance-example.vercel.app/. Switching to "MergedMesh" might take about ten seconds or so. What I've noticed is that when using "BatchedMesh", the CPU usage significantly increases, from 15% to 40% on my computer. I did a quick debug with Spector.js; the draw calls go through `multiDrawElementsWEBGL`.

Also, another issue is that when there are many materials in the scene (multiple BatchedMeshes or MergedMeshes), the "MergedMesh" method allows the GPU to perform at its best, nearing 100% utilization. But with the "BatchedMesh" method, the GPU utilization seems to be about the same as when there's only a single material, around 30%. I'm not sure if the above situations can be optimized, or if this is just the nature of the WebGL API.
I've made a simpler example that just uses JavaScript and cubes to understand things a bit better. The demo allows switching between a merged geometry, a batched mesh, and an instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to rule that out as a possible performance bottleneck.

I'm seeing that between the three options, BatchedMesh is the only one that suffers from this performance degradation. Instances and merged geometry both work fine otherwise: InstancedMesh and the merged geometry run at 120 FPS while the BatchedMesh runs at ~30 FPS on my 2021 M1 Pro MacBook.

In terms of why this is happening, my only guess is that it's due to the buffers of draw "starts" and draw "counts" that must be uploaded to the GPU for drawing every frame, which amounts to ~1.6 MB of data for 200,000 items (see the sketch below). It's hard to say for sure, though, because this isn't showing up in the profiler. It's possible that this GPU data upload is happening asynchronously and isn't reflected in the profiler, unlike some of the texture upload function calls.

In the original example all of the problematic BatchedMesh sub-geometry draws seem to be unique, so unfortunately, without something like indirect draw support (available in WebGPU), I think this is just pushing the limits of what we can do with BatchedMesh too far.
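To make the suspected cost concrete, here is a rough sketch of what a multi-draw submission involves, not three.js internals: the `WEBGL_multi_draw` extension and the `multiDrawElementsWEBGL` signature are real, while the surrounding data layout is assumed.

```js
// Assumes a WebGL2 context with a program and VAO already bound;
// the range data here is hypothetical.
const gl = document.createElement('canvas').getContext('webgl2');
const ext = gl.getExtension('WEBGL_multi_draw');

const DRAW_COUNT = 200000;
const starts = new Int32Array(DRAW_COUNT); // byte offsets into the index buffer
const counts = new Int32Array(DRAW_COUNT); // index counts per sub-geometry

function drawBatchedFrame(visibleRanges) {
  // Refilling and handing off these two arrays every frame amounts to
  // ~1.6 MB (2 arrays x 200,000 ints x 4 bytes) per multi-draw call.
  for (let i = 0; i < visibleRanges.length; i++) {
    starts[i] = visibleRanges[i].start * 4; // 4 bytes per Uint32 index
    counts[i] = visibleRanges[i].count;
  }
  ext.multiDrawElementsWEBGL(
    gl.TRIANGLES, counts, 0, gl.UNSIGNED_INT, starts, 0, visibleRanges.length
  );
}
```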
Thank you very much for your response and for creating a new example. Does this mean that the operation causing the increase in CPU usage on my computer could be the data upload to the GPU? Another phenomenon is that on my desktop with a dedicated GPU, GPU utilization can reach over 80% in the examples not using BatchedMesh, but with BatchedMesh it only peaks at 30%. Could this be due to the GPU waiting for data uploads?

It's frustrating that both the rising CPU usage and the GPU not performing at full capacity seem to be problems inherent to WebGL itself, and apparently unsolvable. However, you mentioned indirect draw support in WebGPU. If I switch to using WebGPURenderer, would that resolve these WebGL bottlenecks? If it's theoretically feasible, I might try switching the renderer in my current project to WebGPU.
If what I've suggested is the cause, then yes, it would explain the higher CPU usage and lower GPU utilization.
I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.
Thank you for your insights. I'll look into the current state of three.js' WebGPURenderer and see if it supports the features needed to overcome these limitations. If it's not currently supported, I'll keep an eye on updates. Your explanation has been very helpful in clarifying the potential causes of the performance issues I'm facing.
I switched to the WebGPURenderer in this example batched-mesh-performance-example, but unfortunately, I found that the frame rate with BatchedMesh is even lower now...
It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.
What graphics card and operating system are you using? Also, which browser are you using? My graphics card is an RTX 2080 Super, and I'm on Windows using Chrome.
I definitely didn't make a mistake there; of course, I created only one BatchedMesh for each material.
Are you referring to the "batched-mesh-performance-test" project? That example was too complex and is no longer in use. You can check this one instead: batched-mesh-performance-example. However, even in the batched-mesh-performance-test example, if you carefully read the code related to the creation of BatchedMesh, you would see that I created only one BatchedMesh for each identical material, not multiple BatchedMeshes.
My graphics card is an RTX 2050 4GB. I tested batched-mesh-performance-example on Edge and Chrome, and both ran at nearly 8 FPS in WebGL and 17-22 FPS in WebGPU! I'm not sure why my results differ from yours.
I deleted my previous post because I misunderstood something. I believe that in the current version of BatchedMesh, multiDrawArraysInstancedWEBGL is not used, and it is not used in the examples provided by @gkjohnson and @lanvada. So what is being compared in the examples above is multi-draw over many duplicated ranges versus instanced and merged draws.
IIUC, the results are NOT actually surprising or that bad. The specific workflow of @lanvada, which is Revit CAD data, should probably not use multiDrawElementsWEBGL in this way. One single mesh is a great approach if it is static. But alternatively it should use multiDrawArraysInstancedWEBGL (a sketch of this call follows below), since he has something like 800 unique geometries but many instances of each. We don't really have a benchmark of that, but according to this NVIDIA presentation it should work well: https://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf
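As a sketch of what that instanced multi-draw path could look like: the `multiDrawArraysInstancedWEBGL` call is part of the real `WEBGL_multi_draw` extension, while the data arrays here are assumptions.

```js
// Assumes all ~800 unique geometries are packed into one shared vertex
// buffer, and that the arrays below are pre-built Int32Arrays with one
// entry per unique geometry.
const gl = document.createElement('canvas').getContext('webgl2');
const ext = gl.getExtension('WEBGL_multi_draw');

function drawInstancedBatch(firsts, counts, instanceCounts) {
  // firsts[i]: first vertex of geometry i in the shared buffer
  // counts[i]: its vertex count
  // instanceCounts[i]: how many instances of geometry i to draw
  ext.multiDrawArraysInstancedWEBGL(
    gl.TRIANGLES, firsts, 0, counts, 0, instanceCounts, 0, firsts.length
  );
}
```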
Unless there's something odd in the way the model data is being stored, this isn't the case: the original demo in the OP creates InstancedMeshes for anything with instances, and then everything in BatchedMesh is a unique geometry. That's how it appears from the current parsing logic, at least.
Agreed, but the surprising thing is that this upload timing doesn't seem to show up at all in the measured performance metrics. It makes it difficult to understand where exactly this is coming from. But as I've mentioned, I assume it's from the starts and counts buffer uploads. If there are practical use cases shown that would benefit from the instanced multi-draw variant, it may be worth supporting.
I think as we get into these extreme performance cases, where hundreds of thousands of unique ranges are drawn every frame, we're pushing past what BatchedMesh was designed for.
Since the support of `gl_DrawID` comes with the `WEBGL_multi_draw` extension, it should already be available wherever BatchedMesh works today. Furthermore, using the same data textures for the per-item matrices should still be possible. So it's a bit tricky, but as long as we keep the shader-side index lookup consistent, an instanced multi-draw mode seems feasible.
Of course, but the goal here is to enable this without end users having to write custom shaders to take advantage of the functionality. It's been suggested multiple times that it would happen, but it would be nice if someone shared a public demonstration of how multi draw instanced performs. The big question for me is how you calculate the item index using the `gl_DrawID` and `gl_InstanceID` values.
The way that I am thinking about doing it is looking up a per-draw offset (indexed by `gl_DrawID`) and then adding `gl_InstanceID`.

I'm using this (experimental) translation of OffsetAllocator based on the work of Sebastian Aaltonen, which is how I manage buffers explicitly; it has extremely high occupancy. I'm not proposing to add this to three.js, though: https://gist.github.com/nkallen/f4ed889dc98e9a9da7283a01e3308450
But I should note also: it is possible to put everything in the same buffer and just issue thousands of calls to drawElementsInstanced. It's extremely fast because you do not need to switch the VAO. For example, in the code below, note that the draw loop just increments the offset of the vertexAttribPointer. I have benchmarked this on Apple and AMD GPUs, and it can do 5k calls to drawElementsInstanced in < 1 ms.
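A minimal reconstruction of that pattern, assuming a shared vertex/index buffer and a position-only attribute layout (the `item` fields are hypothetical):

```js
// One shared VAO and vertex/index buffer; only the attribute byte
// offset changes between draws.
function drawAllItems(gl, vao, sharedVertexBuffer, items) {
  gl.bindVertexArray(vao); // bound once, never switched
  gl.bindBuffer(gl.ARRAY_BUFFER, sharedVertexBuffer);
  for (const item of items) {
    // Re-point the position attribute at this item's region of the
    // shared buffer by incrementing only the byte offset.
    gl.vertexAttribPointer(0, 3, gl.FLOAT, false, 12, item.vertexByteOffset);
    gl.drawElementsInstanced(
      gl.TRIANGLES, item.indexCount, gl.UNSIGNED_INT,
      item.indexByteOffset, item.instanceCount
    );
  }
}
```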
I implemented this a while back, and if I recall correctly, I used an extra data texture to perform the lookup between the offset and the count, composed in the onBeforeRender hook. This approach seems similar to what nkallen described in their last two comments.

Also, in my implementation, for simplicity and to handle larger data sizes for batching matrices (such as batched instanced skinning, where the number of matrices multiplied by the number of bones is significant), I used sampler2DArray. This method does imply a limit of 2048 different geometries, but I remember struggling with the lookup and ultimately decided to use sampler2DArray to simplify the process.
Thanks for the explanation! I thought there might be a method for calculating this without an extra texture sample. It would be possible to pack these offsets into the beginning of the "indirect index" texture, though. We'd have to know how many geometries will be added up front to pack it perfectly tightly, but if the capacity is reached then the texture could be expanded. It would work like so:

```glsl
// all lookups use the indirect index texture's own width
int size = textureSize( indirectIndexTexture, 0 ).x;
ivec2 offsetPx = ivec2( gl_DrawID % size, gl_DrawID / size );
int offset = texelFetch( indirectIndexTexture, offsetPx, 0 ).r;
ivec2 indexPx = ivec2( ( offset + gl_InstanceID ) % size, ( offset + gl_InstanceID ) / size );
int index = texelFetch( indirectIndexTexture, indexPx, 0 ).r;
// use index to sample matrices, texture properties, etc.
```

This would have the downside of not allowing sorting for overdraw compensation or transparency between instance groups, but it could improve performance in extreme cases where a ton of instances need to be used in a BatchedMesh. Again, it's not clear that this is what's needed for the OP's use case, though.

Anyway, I won't be working on this, but it's something to keep in mind if this comes up again or if we want to make multi draw instanced more accessible. It may be possible to add a toggle to BatchedMesh to switch between the two modes, but I'm not sure how complicated that would be.
I also found performance issues with the use of BatchedMesh in one of my projects. In my example, only a single instance is used per mesh. When around 100 different materials are used, there are already significant performance differences between merged meshes and BatchedMesh. Draw calls are exactly the same (as expected). Here is the example: https://codesandbox.io/p/sandbox/three-js-forked-g69j8w
Yes, indeed. I just ran your example, and on my computer, using BatchedMesh got 47 FPS while MergedMesh got 60 FPS, and I observed a significant increase in CPU and GPU usage after switching to BatchedMesh.
For dedicated desktop GPUs it might be necessary to increase the geometry count (const c = 400) to get the frame rate below 60 FPS (or whatever your monitor's refresh rate is).
I am currently using an RTX 2080 Super. I feel that in this not very complex scene, a frame rate of 47 is already quite low; there is indeed a noticeable decrease.
If you don't have a dynamic scene, you shouldn't use BatchedMesh. The purpose of BatchedMesh is to be able to show/hide individual objects, transform individual objects, etc., as well as do frustum culling and sort objects by z in camera space. If you don't need any of that, then you are paying a significant cost for no reason. The total number of BatchedMeshes in any given scene should probably be much less than 100.
nkallen, thank you for your response! I'm aware of the advantages of BatchedMesh, and this is the reason why I want to use it. Since there is only one draw call per batched mesh and the array for the draw ranges is small (fewer than 100 entries), I'm wondering why there is so much difference between the merged mesh draw call and the multi-elements draw call. But maybe it is not about the draw calls; maybe it's something else under the hood of three.js preventing the BatchedMesh from performing like the merged geometry.
Each BatchedMesh does sorting and frustum culling, which is what dominates the CPU in your example.
I created another fork and updated the example. Now the batched mesh is only using a single material, so it is overall only one single draw call for the whole geometry. It is still performing worse than the merged geometry rendering, which now uses many more draw calls. I even disabled per-object frustum culling and sorting. https://codesandbox.io/p/sandbox/three-js-forked-rfkcgt Even with your solution of combining all materials into one single material, I would probably run into the same problems.
Well, the practical application uses a tweaked MeshStandardMaterial with textures and so on. I'm still thinking about your solution. Would it be feasible to combine all 40 materials with different color, normal, and roughness maps into one monster material (using a texture atlas with more than 40 entries and baked uniforms), and this way render all meshes using a single batched mesh? I mean, I know how to technically do it, but is it worth the effort? Is it a good/recommended idea to go this way to have a single batched mesh?
@QuisMagni realistically that sounds like a huge pain... I think you should be fine with ~50 draw calls. You will need to benchmark to see where the issue is and then go from there. I currently work with about 20 BatchedMeshes, with a few of them being extremely large and the rest quite small. I'm easily achieving 60 FPS. But I have some custom sorting/culling logic and I don't use textures for the matrix transform.
Please look at the comment above; currently, the performance degradation caused by using BatchedMesh with a large number of vertices and faces is unsolvable.
Well... this is not really true. You have to understand what is going on and what is being compared. When we are working with high-performance code, we need to use techniques specific to the problem at hand.

The example above compares multiDrawElementsWEBGL over 100k duplicated items against drawElementsInstanced with a few unique items and 100k instances. The latter is obviously much faster, and the comparison is irrelevant because there is also an instancing variant of multidraw, namely multiDrawElementsInstancedWEBGL. The two versions of instancing will have comparable performance.

The relevant comparison to an array of 100k multiDrawElementsWEBGL ranges is 100k calls to drawElements, or, if we're comparing using vanilla three.js, 100k calls to bindVertexArray and drawElements; the idea being rendering 100k UNIQUE geometries. In this latter case (bind+draw), multidraw is several thousand times faster, and in the former case it's faster by a factor of 10 or so.
I wanted to add this because I think it would be helpful for people. Assuming you have a dynamic scene (you need to transform, show/hide, or sort individual objects):
If your scene isn't dynamic and the geometry is small enough (say < 1 GB), merging everything into one buffer can sometimes be best. The key thing is to understand the problem you are trying to solve, and to understand what BatchedMesh is doing (#3 only!).
Yes, I saw the answer before. My example is quite different. In my first example there are only 60 geometries (later I increased it to 200) for every batched mesh, resulting in 6,000 items for 100 batched meshes if I am correct. So in comparison with your example it is only 3% of the item count, and there is a huge performance hit of around 30% or even more.
In my example, 6,000 geometries are rendered. The performance with BatchedMesh is 30-50% worse compared to merged geometry. This impact is so significant that it might be worthwhile to replicate the necessary dynamic operations (geometry updates, visibility updates, and culling) on top of merged geometry with the help of web workers. I was simply amazed at how poorly BatchedMesh performed in direct comparison, even in such small scenarios.
In your case, I would try to see where the bottleneck is. You should be able to disable frustum culling and sorting pretty easily; CPU usage should drop to basically zero. How much is the difference at that point? The remaining discrepancy should be GPU overhead. It must come from somewhere, but it could be the textures, the multidraw arrays, or, less likely, inherent overhead in calling multiDrawElementsWEBGL.

I would then compare it to merged geometry without rebinding the VAO: override onAfterRender and explicitly call gl.drawElements in a loop. Three.js will have already bound the VAO and set the program (material/shader). You can then decide where to go from there. I am skeptical that sorting and frustum culling in a worker will end up being the preferred approach... I can't know in advance, but it seems like the theoretical maximum WebGL performance would come from a loop like the following:
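A sketch of what such a loop might look like, assuming three.js has already bound the program and VAO for the shared geometry (the `ranges` fields are hypothetical):

```js
// "Raw" loop: no per-object JavaScript work, no state changes, just one
// gl.drawElements call per visible range.
function drawRanges(gl, ranges) {
  for (const r of ranges) {
    // r.byteOffset points into the shared element array buffer
    gl.drawElements(gl.TRIANGLES, r.count, gl.UNSIGNED_INT, r.byteOffset);
  }
}
```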
@nkallen
Yes, #4 is my best guess for you, although if you don't have a dynamic scene and the amount of data isn't enormous, you can also materialize everything into one buffer and just render it in one drawElements call (or one drawElements call per material). But since it all depends on how dynamic the scene is, how many triangles, how many materials, how many instances, how many unique objects, etc., we can't know what is best without benchmarking all of the options. I would test each option in a simple benchmark first.
Description
Hello,
I exported a building model from Revit in glTF format and merged meshes with the same materials to manage their visibility in Three.js using the BatchedMesh class. However, I've encountered a significant performance issue when rendering these merged meshes with BatchedMesh compared to using Mesh.
Performance Comparison:
This drastic difference in performance is concerning, especially the high CPU load and low frame rate when using BatchedMesh. I've already set `.perObjectFrustumCulled` and `.sortObjects` to `false` in BatchedMesh; setting them to `true` leads to an even more severe frame rate drop.

Additionally, I'm using the `three-csm` and `postprocessing` frameworks alongside Three.js.

System Configuration:
Could someone help me understand why BatchedMesh increases the CPU overhead so significantly and suggest any possible optimizations or solutions to improve the frame rate?
Thank you!
Reproduction steps
Use BatchedMesh to render more than 10 million triangles and vertices across about 100,000 different geometries.
Code
Code in the project: batched-mesh-performance-test
Live example
Code in the project: batched-mesh-performance-test
Screenshots
No response
Version
r165
Device
No response
Browser
No response
OS
No response