WebGLRenderer: Use faster program cache key computation. #22530

Closed
Usnul opened this issue Sep 11, 2021 · 30 comments

@Usnul
Contributor

Usnul commented Sep 11, 2021

I recently did a profile of my project; I hadn't done that in 4-5 months, and the three.js version has been upgraded several times since then. Currently on 0.132.2.

So, the profile shows that getProgramCacheKey takes up close to 20% of the entire time:
[profiler screenshot]

For a bit more context, this is in the scope of the entire game, with particle simulation, IK, sound and a bunch of other stuff going on, not just pure rendering. Here's a screenshot of what's being rendered:
[gameplay screenshot]

Here's the WebGLRenderer.info:
[WebGLRenderer.info screenshot]

If you have a look at the cache key computation code, there's a lot wrong there. It builds an array of strings, which is a lot of allocation, and then calls array.join(), which creates yet another string. All of that, just to throw the result away in the end (since no new materials are created during the profiling period).

I propose offering an override for the cache keys, or building keys in a way that avoids much of the allocation effort. Also, an equality check of a material against what's already cached can be done without materializing this "cache key" at all. I understand that it's "simple" to do what's currently being done, and I understand that what I propose is less easy, but we're talking about 20% of the CPU time being wasted in an application that's hardly pure rendering.

As an alternative, perhaps offer an override for cache key computation, something like "customComputeMaterialCacheKey". That would be dirty, but it would allow users like me to work around the issue.

@Mugen87
Collaborator

Mugen87 commented Sep 12, 2021

something like "customComputeMaterialCacheKey"

It's important to clarify that it is not a material but a program cache key. And identifying the correct program depends on many factors. I doubt it makes sense to implement the key computation on app level since you have to honor all the information from the rendering process like the engine currently does. Out of curiosity: Can you please share a live example that demonstrates how you would implement a callback like customComputeMaterialCacheKey()?

In general, the shader program key computation is a technical detail of the engine that should not be exposed to the app. I would rather try to figure out how to optimize the existing implementation.

or building keys in a way that avoids much of the allocation effort.

We often thought about this but could not come up with a different solution. Can you please make a concrete suggestion that explains how the cache key computation could be implemented differently?

@donmccurdy
Collaborator

donmccurdy commented Sep 13, 2021

If I understand correctly, getProgramCacheKey is not meant to be invoked on every frame — but with numbers like those, either it's taking >50ms to initialize per material, or it's being invoked repeatedly. Perhaps something in the application is triggering unnecessary getProgram calls with newer three.js versions?

@Usnul
Contributor Author

Usnul commented Sep 13, 2021

I doubt it makes sense to implement the key computation on app level since you have to honor all the information from the rendering process like the engine currently does.

I agree. I think having a custom key computation solution is a stopgap that lets people solve the problem themselves, but it opens up an avenue for people to shoot themselves in the foot.

Out of curiosity: Can you please share a live example that demonstrates how you would implement a callback like customComputeMaterialCacheKey()?

Sure, there are plenty of shaders that are not affected by light/shadow, for example post-process shaders or various compute shaders (transform feedback). Even for standard materials, I can keep track of their equality, and when a "new" material is created, I can simply return an incremented integer. Think "hash map": similar to the current implementation in the sense that it keeps a key, but unlike the three.js solution, which has to be generic and doesn't know when a material changes, I have full knowledge of how my app works, so I can decide when to update the key.
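
For illustration, three.js already exposes Material.customProgramCacheKey(), whose return value gets appended to the generated key (it supplements the built-in computation rather than replacing it). A minimal sketch of the de-dupe idea on top of that hook, where the incremented variantId counter is a hypothetical app-level convention, not three.js API:

import * as THREE from 'three';

// Rough sketch only: the app knows when a genuinely new shader variant is created,
// so a stable integer is enough to distinguish it in the program cache key.
let nextVariantId = 0;

function makeVariantMaterial( vertexShader, fragmentShader ) {

	const variantId = String( nextVariantId ++ );

	const material = new THREE.RawShaderMaterial( { vertexShader, fragmentShader } );

	material.customProgramCacheKey = function () {

		return variantId;

	};

	return material;

}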

Can you please make a concrete suggestion that explains how the cache key computation could be implemented differently?

something like this:

class MaterialFingerprint {
	// ...
	match( material: Material ): boolean; // check if a given material/shader matches the fingerprint
	fromMaterial( material: Material ): void; // populate the fingerprint from a material/shader
}

where MaterialFingerprint is a slightly more semantically rich version of the MaterialCacheKey concept. You could then keep a Map<Material, MaterialFingerprint> and validate the material against its last known fingerprint before usage, which would be way faster. After that you use the fingerprint the same way as before.

@Mugen87
Collaborator

Mugen87 commented Sep 13, 2021

The important question is how would you compute the program key differently. I'm afraid that is not yet clear from your post.

@Usnul
Contributor Author

Usnul commented Sep 13, 2021

The important question is how would you compute the program key differently. I'm afraid that is not yet clear from your post.

For my own app? I wouldn't. I'd keep something like an object ID and de-dupe shaders myself.

For a generic three.js solution? There are many options. One step would be to separate the existing giant array into logical pieces and assign those to separate fields of the MaterialFingerprint class, so you can compare those logical pieces to the respective values from a shader in the match implementation.

here's roughly how the key is being computed now:

const key_array : string[] = [];

const part_a : string = compute_program_property_A(program);
key_array.push(part_a);

const part_b : string = compute_program_property_B(program);
key_array.push(part_b);
...

const key_string = key_array.join();

One simple optimization here is not to build the final string and not to create the array at all. How? Like so:

class MaterialFingerprint{
	private property_a:string
	private property_b:string
	...
	
	match(program){
		if(this.property_a !== compute_program_property_A(program)){
			return false;
		}
		
		if(this.property_b !== compute_program_property_B(program)){
			return false;
		}
		
		return true;
	}
}

You get early rejection and don't have to materialize the key every time.
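
To make the intended flow concrete, here is a rough sketch of how a renderer-side cache could consult such a fingerprint before doing any expensive work (the fingerprints map and buildProgramForMaterial helper are hypothetical names, building on the MaterialFingerprint sketches above):

// Hypothetical usage sketch: keep the last known fingerprint per material and
// only rebuild the program when it no longer matches.
const fingerprints = new WeakMap(); // Material -> MaterialFingerprint

function acquireProgram( material ) {

	let fingerprint = fingerprints.get( material );

	if ( fingerprint !== undefined && fingerprint.match( material ) ) {

		// early rejection: nothing relevant changed, reuse the cached program
		return fingerprint.program;

	}

	fingerprint = new MaterialFingerprint();
	fingerprint.fromMaterial( material );
	fingerprint.program = buildProgramForMaterial( material ); // hypothetical helper

	fingerprints.set( material, fingerprint );

	return fingerprint.program;

}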

@Mugen87 Mugen87 changed the title Super slow material cache key computation WebGLRenderer: Use faster program cache key computation. Sep 14, 2021
@Usnul
Contributor Author

Usnul commented Sep 17, 2021

Here's a small addition to what was already mentioned: the memory allocation trace for a similar scene:
[allocation profiler screenshot]

You'll see that 68.5% of the garbage generated during normal running of the app comes from just that single area.

The trace duration is around 20s, with 1,246,672 bytes of related garbage from those two functions, or around 63 KB of garbage per second. It's not super significant by itself, but since there are a lot of strings, which take up varying amounts of space, it will cause a lot of heap fragmentation and make GC pauses longer in general.

@donmccurdy
Collaborator

donmccurdy commented Sep 17, 2021

We can presumably make the function faster, but if it's being called 5,000 times in 20s then something else seems wrong. Is it possible to tell whether it was invoked the same number of times in the earlier three.js version?

@Usnul
Contributor Author

Usnul commented Sep 19, 2021

I don't know if this is a regression issue. I don't keep backwards compatibility generally. Sorry.

@donmccurdy
Collaborator

I can reproduce a similar issue on the WebGL | Animation | Keyframes example, tracking back to #21788 (r129) and probably similar removals like vertexTangents. These changes allowed greater reuse of a Material, with the library managing associated shader programs automatically.

In WebGL | Animation | Keyframes, the same material is reused by a Mesh and a SkinnedMesh; that wasn't supported before. But a side effect of the new support is that the needsProgramChange flag is hit one or more times per frame, as we render the material with a different mesh...

} else if ( object.isSkinnedMesh && materialProperties.skinning === false ) {

	needsProgramChange = true;

...invoking getProgramCacheKey each time — causing the additional CPU time shown above (and probably some GC cycles not shown above). I think the same thing is likely happening with @Usnul's application; there may be other similar cases like reusing a Material on meshes with and without vertex tangents.

So realistically there are two problems here:

  1. getProgramCacheKey is expensive
  2. three.js now invokes getProgramCacheKey more often

I do like @Usnul's suggestion in #22530 (comment). It would certainly result in less GC, and might also be faster. But I am hoping we can find a way to avoid calling getProgramCacheKey at all if the materials themselves have not changed since the last render. As it is, this does represent a performance regression in r129.

@donmccurdy
Collaborator

I'm a little worried about the added complexity of multiple cache keys, but one option might be to keep a cheaper "fastProgramCacheKey" to map a single material to its program variants. Since there are not many possible combinations there, it could just be a cheaper bit mask instead of a string. For example:

const cacheKey = SKINNING | VERTEX_TANGENTS | MORPH_TARGETS;
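
A rough sketch of what that could look like, with hypothetical flag names and a per-material variant map (not existing three.js code):

// Single-bit flags for the few per-object variations of a material's program.
const SKINNING = 1 << 0;
const VERTEX_TANGENTS = 1 << 1;
const MORPH_TARGETS = 1 << 2;

function computeFastKey( object, geometry ) {

	let key = 0;

	if ( object.isSkinnedMesh ) key |= SKINNING;
	if ( geometry.attributes.tangent !== undefined ) key |= VERTEX_TANGENTS;
	if ( geometry.morphAttributes.position !== undefined ) key |= MORPH_TARGETS;

	return key;

}

// Each material would then map fast keys to its compiled program variants,
// e.g. Map<Material, Map<number, WebGLProgram>>.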

@WestLangley
Collaborator

@donmccurdy is correct. getProgramCacheKey() is called (multiple times) every frame in certain use cases. There appear to be many use cases that suffer from this.

Starting in r130, transparent, double-sided materials appear to suffer from this, for example.

@mrdoob
Owner

mrdoob commented Sep 22, 2021

Something I was considering after the introduction of the transmission, sheen, ... getters/setters is to do the same for all the maps.

It doesn't solve the particular problem being talked about here, but it helps a bit by moving the logic of deciding when a shader needs to be recomputed out of the renderer.

@Usnul
Contributor Author

Usnul commented Sep 24, 2021

Getters/setters updating .version would help.
You could then make a reliable assumption that the unique state of a material is describable by its ID and its version. Keep the private values out of the documentation, and if someone touches the _values directly, that's on them.
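
A rough sketch of how that assumption could be exploited, assuming every relevant property change bumps material.version through a setter (compileProgram is a hypothetical placeholder, not three.js API):

// Hypothetical cache keyed by material id + version.
const programCache = new Map();

function getProgramFor( material ) {

	const key = material.id + '_' + material.version;

	let program = programCache.get( key );

	if ( program === undefined ) {

		program = compileProgram( material ); // hypothetical placeholder
		programCache.set( key, program );

	}

	return program;

}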

@dbuck
Contributor

dbuck commented Oct 6, 2021

Just came across this here as well. It's especially notable in larger 3D Tiles tilesets with custom shaders: each new tile/material allocating the string-concat key adds up QUITE quickly with thousands of tiles when material instances can't be shared.

1: Is there a reason that something like NodeMaterial.getHash wouldn't work as a lighter-weight key than .join()ing the full fragment/vertex shaders?

2: Alternatively, for custom ShaderMaterial/RawShaderMaterials or other items not existing in the ShaderLib index, if a material implements the customProgramCacheKey function, can that be used instead of the large concat?

if ( parameters.shaderID ) {

	array.push( parameters.shaderID );

} else if ( parameters.customProgramCacheKey ) {

	array.push( parameters.customProgramCacheKey );

} else {

	array.push( parameters.fragmentShader );
	array.push( parameters.vertexShader );

}

edit: after a little more digging, I see that getHash() is not a general thing and only returns the material uuid there; I'm starting to see the crux ;)

@dbuck
Contributor

dbuck commented Oct 6, 2021

I did an experiment using a hashing function on the shader strings, and while I don't think it addresses the full set of problems mentioned in the initial issue, it certainly seems to address my own use case, where streaming large 3D Tiles tilesets with custom RawShaderMaterials attached would allocate an unwieldy amount of memory: #22650
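
The general shape of that experiment is hashing the shader sources instead of concatenating them into the key; a generic sketch (not the exact code from #22650):

// Generic 32-bit string hash (djb2-style), a stand-in for whatever hash the PR uses.
function hashString( str ) {

	let hash = 5381;

	for ( let i = 0; i < str.length; i ++ ) {

		hash = ( ( hash << 5 ) + hash + str.charCodeAt( i ) ) | 0;

	}

	return hash;

}

// Inside getProgramCacheKey, instead of pushing the full shader sources:
//   array.push( hashString( parameters.fragmentShader ) );
//   array.push( hashString( parameters.vertexShader ) );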

@gonnavis
Contributor

gonnavis commented Dec 28, 2021

Hello, looking forward to a fix.
I'm facing a severe performance issue in projects that have many bones/skins, even after updating to r136.
#23106

@donmccurdy
Collaborator

Continuing from #23106...

Sorry, the update to r136 does not reduce the call count much; I still get 13 ~ 17 calls.
That's not the same as r129, which had only 8.

Note that #22960 and #23022 should improve the performance of getProgramCacheKey, but won't reduce the number of times it is called back to the previous number. I'm not sure whether that is necessarily a goal. Do we have an example that still has a clear framerate regression in r136 vs r128?

@gonnavis
Contributor

gonnavis commented Dec 28, 2021

Do we have an example that still has a clear framerate regression in r136 vs r128?

At first I thought it would be very easy to test through my example: just change the mutants= query string and check the fps.
But it turned out there is no very big difference between them, though r136 has more calls to getProgram/getProgramCacheKey.
So just duplicating the same mesh many times should not be related to this problem.
https://raw.githack.com/gonnavis/annihilate/threejsR129GetProgram/index.html?mutants=50
[screenshot]
https://raw.githack.com/gonnavis/annihilate/threejsR136GetProgram/index.html?mutants=50
[screenshot]

But on another, more complex project I really do get worse performance (lower fps) after upgrading from r133 to r136; it's hard to publish an example, though.

The main difference between the two projects is this:
In the example project above, only getProgramCacheKey shows up as the top time-consuming function.
[profiler screenshot]
In the other project, both getProgram and getProgramCacheKey show up as the top time-consuming functions on r133, plus a new _getShaderStage on r136.
r136's performance/fps is even worse than r133's.
And if I switch to r129, they magically disappear; only setProgram is left, consuming a little time, and the fps increases dramatically compared to both r133 and r136.
r133:
[profiler screenshot]
r136:
[profiler screenshot]
r129:
[profiler screenshot]

@gero3
Contributor

gero3 commented Dec 28, 2021

@gonnavis I'll have a look at your example to see what's going on between r129 and r136.

@gero3
Contributor

gero3 commented Dec 29, 2021

@gonnavis I've looked into your example and noticed that you've added a console.log call in the getProgramCacheKey function. console.log is not a very performant function; it can take up to 100 microseconds per call when the devtools are open. I've created a testcase to show that this is a lot slower.

Testcase => link
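
The testcase is roughly of this shape (a generic micro-benchmark sketch, not the linked testcase itself):

// Compare a tight loop with and without console.log; numbers differ wildly
// depending on whether devtools are open.
const ITERATIONS = 1000;

let sum = 0;
let t0 = performance.now();
for ( let i = 0; i < ITERATIONS; i ++ ) sum += i; // baseline work
const baseline = performance.now() - t0;

t0 = performance.now();
for ( let i = 0; i < ITERATIONS; i ++ ) console.log( i );
const withLogging = performance.now() - t0;

// Report without console.log so the measurement itself isn't polluted.
document.body.textContent = 'baseline: ' + baseline.toFixed( 2 ) + ' ms, with console.log: ' + withLogging.toFixed( 2 ) + ' ms (sum ' + sum + ')';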

This is my result without devtools open:
[benchmark screenshot]

This is my result with devtools open:
[benchmark screenshot]

I hope this shows that calling console.log has a big impact on JavaScript performance.

So I've updated your examples to remove the call to console.log, and to me it seems r129 and r136 perform a lot better than r135.

r129 example => comparison source code
r135 example => comparison source code
r136 example => comparison source code

PS: the reason I don't look at JavaScript profiles is that they are inherently unreliable for comparisons between codebases. The JavaScript profiler causes functions to deoptimize, which makes their performance drop a lot compared to other functions. As far as I could tell, getProgramCacheKey, getShaderStage and getProgram in r136 are among those deoptimized functions.

You can see this clearly when profiling the following example: webgl_performance_shader. For me this example drops from a stable 60 FPS to an unstable 45-46 FPS. The drop in the middle of the image is when I started a JavaScript profile.
[FPS graph screenshot]

@gonnavis
Contributor

gonnavis commented Dec 29, 2021

Oh, I had somewhat forgotten the influence of devtools. I'll continue to test both projects, as well as your examples, and keep an eye on the impact of devtools.
Thank you very much for your testing and explanation! It's wonderful!

@donmccurdy
Collaborator

I was under the impression that the regression was introduced in r129 – if that's correct, the goal of this issue would be to get back to performance on this code path similar to r128, not r129? If #22960 and #23022 have fixed those issues then perhaps this can be closed.

@gero3
Contributor

gero3 commented Dec 30, 2021

I was under the impression that the regression was introduced in r129 – if that's correct, the goal of this issue would be to get back to performance on this code path similar to r128, not r129? If #22960 and #23022 have fixed those issues then perhaps this can be closed.

Like you said before, there were 2 issues surrounding getProgramCacheKey.

  • getProgramCacheKey is expensive
  • three.js now invokes getProgramCacheKey more often

Both #22960 and #23022 only resolve the first point, that getProgramCacheKey was very expensive in r135. This was also the target of the webgl_performance_shader example.

getProgramCacheKey is still called more often than in r129 and earlier versions.

@gonnavis
Contributor

gonnavis commented Dec 30, 2021

Added an r128 version, and applied gero3's improvement to all versions of the example:

r128: getProgramCacheKey is mostly called 0 times, sometimes 2 ~ 15 times.
https://raw.githack.com/gonnavis/annihilate/threejsR128GetProgram/index.html?mutants=100

r129: getProgramCacheKey is called at most 8 times.
https://raw.githack.com/gonnavis/annihilate/threejsR129GetProgram/index.html?mutants=100

r135: getProgramCacheKey is called at most 17 times.
https://raw.githack.com/gonnavis/annihilate/threejsR135GetProgram/index.html?mutants=100

r136: getProgramCacheKey is called at most 17 times.
https://raw.githack.com/gonnavis/annihilate/threejsR136GetProgram/index.html?mutants=100

Though the call counts of getProgramCacheKey are very different, the fps is around 35 on my PC for all versions when devtools are not open.
There seems to be no very big performance difference between the versions of this example.

I was under the impression that the regression was introduced in r129

According to my test results, it should be.

@Mugen87
Collaborator

Mugen87 commented Dec 30, 2021

I see no difference in FPS when using the above links.

TBH, the shared URLs are not a good test case for profiling the program cache key overhead since way too many things happen in your app. Please use an updated version of the simplified webgl_performance_shader example to demonstrate performance regressions.

In general, the number of times getProgramCacheKey() is called varies due to changes in the engine. The number of invocations alone is not an indicator of better or worse performance.

@gonnavis
Contributor

gonnavis commented Dec 30, 2021

I'm still surprised that the Chrome devtools profiles are sooo wrong, misleading Usnul and me in the wrong direction.

Using another approach to check the performance, I found that getProgramCacheKey really consumes very little time compared to other functions.
https://raw.githack.com/gonnavis/annihilate/GetProgramCacheKeyTimeVsUpdateMatrixWorldTime/index.html?mutants=50

getProgramCacheKeyTime: 0.10 updateMatrixWorldTime: 21.00
getProgramCacheKeyTime: 0.00 updateMatrixWorldTime: 27.20
getProgramCacheKeyTime: 0.10 updateMatrixWorldTime: 23.80
getProgramCacheKeyTime: 0.20 updateMatrixWorldTime: 37.10
getProgramCacheKeyTime: 0.10 updateMatrixWorldTime: 23.20
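
The measurement behind those numbers is essentially manual instrumentation with performance.now() instead of the profiler; a sketch (the wrapping approach and names are assumptions, not the exact code from the linked example):

import * as THREE from 'three';

// Accumulators, printed and reset once per frame from the app's render loop.
let getProgramCacheKeyTime = 0;
let updateMatrixWorldTime = 0;

// Generic helper: wrap a function and add its run time to an accumulator.
function timed( fn, record ) {

	return function ( ...args ) {

		const t0 = performance.now();
		const result = fn.apply( this, args );
		record( performance.now() - t0 );
		return result;

	};

}

// Example: patch the world matrix update (a real, patchable three.js method).
THREE.Object3D.prototype.updateMatrixWorld = timed(
	THREE.Object3D.prototype.updateMatrixWorld,
	( t ) => updateMatrixWorldTime += t
);

// getProgramCacheKey lives inside the renderer's internals, so timing it this way
// requires a locally modified three.js build; the idea is the same.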

And the profiles are very unstable; I can't reproduce the getProgramCacheKey-on-top result today on this example.
But on another, more complex project it still shows up, at the same time, in the same Chrome browser.

[profiler screenshot]

[profiler screenshot]

PS: the reason I don't look at JavaScript profiles is that they are inherently unreliable for comparisons between codebases. The JavaScript profiler causes functions to deoptimize, which makes their performance drop a lot compared to other functions. As far as I could tell, getProgramCacheKey, getShaderStage and getProgram in r136 are among those deoptimized functions.

@Mugen87
Collaborator

Mugen87 commented Dec 30, 2021

@gero3 did a great job improving the performance of getProgramCacheKey() in r136. This improvement is also clearly measurable with webgl_performance_shader.

I'm not surprised that the world matrix update routine consumes a lot of time, especially in complex scene graphs. However, this issue is known and has been discussed multiple times in the project. The current recommendation is to set matrixAutoUpdate to false and manually trigger matrix updates when necessary.
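
For reference, that recommendation looks like this in app code (standard three.js API):

import * as THREE from 'three';

const mesh = new THREE.Mesh( new THREE.BoxGeometry(), new THREE.MeshStandardMaterial() );

// Opt out of the per-frame automatic matrix recomposition for objects that rarely move.
mesh.matrixAutoUpdate = false;

// Later, only when the transform actually changes:
mesh.position.set( 0, 2, 0 );
mesh.updateMatrix(); // recomposes mesh.matrix and flags the world matrix for update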

However, I'm not a fan of this policy because it is complicated and thus error-prone. The engine should solve this problem and not delegate this task to apps. Unfortunately, solving this issue requires changes in the engine which are not accepted by all project members. I doubt that we can make progress in this issue anytime soon.

@gonnavis
Contributor

gonnavis commented Dec 30, 2021

Oh, I only focused on the Chrome devtools profiles' wrong result for getProgramCacheKey; updateMatrixWorld was just mentioned for comparison.

getProgramCacheKey really consumes very little time compared to other functions.

...but it is profiled as the top time-consuming function in some situations.

Let me find out whether Firefox or other browsers have similar functionality, and what the results are.

@gonnavis
Contributor

gonnavis commented Dec 30, 2021

Made another example: https://raw.githack.com/gonnavis/annihilate/GetProgramCacheKeyTimeVsUpdateMatrixWorldTime2/index.html
Once again it proves that the Chrome devtools profiles have extremely serious errors.
This is a screenshot of my current PC status; not sure if it's the same on yours:
[timing comparison screenshot]

I found that console.log() increases the inaccuracy; this example has so many console.log() calls that the inaccuracy is very obvious.
If I delete some console.log() calls, the inaccuracy decreases.
But console.log() is not the only cause, because other projects that don't have repeated console.log() calls also show serious inaccuracy.

Firefox has the same problem, but to a lesser degree.

@mrdoob
Owner

mrdoob commented Dec 30, 2021

Seems like the issue has been addressed.
Better to discuss browser issues in a different place.
