Optimizations
This page is intended to be a general running discussion of places with known hot-spots which could be optimized, as part of a deep dive into optimizing memory usage, load times, and general runtime performance.
All asset loading is handled by Elation Engine's asset manager. This system is responsible for fetching files from any number of storage backends (commonly HTTP, but also data URIs, File objects, or distributed filesystems like DAT and IPFS). Assets are fetched asynchronously by the assetdownloader, then handed off to loaders by type (image, model, video, audio, etc). Each one has its own performance characteristics which need to be considered separately.
Models are loaded using a pool of workers. The number of workers is determined by looking at navigator.hardwareConcurrency, which tells us the number of cores available to us. We don't have any control over CPU affinity like we might in native environments, so we just create n-1 asset workers to leave one core available for the main thread, and hope that the browser/OS handles scheduling intelligently.
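A minimal sketch of that sizing logic - the function and pool names here are illustrative, not the engine's actual API:

```javascript
// Size the worker pool from the number of logical cores, leaving one
// core free for the main thread. Fall back to 4 when
// navigator.hardwareConcurrency is unavailable (older browsers).
function getAssetWorkerCount() {
  var cores = (typeof navigator !== 'undefined' && navigator.hardwareConcurrency) || 4;
  // Always create at least one worker, even on single-core devices.
  return Math.max(1, cores - 1);
}

// Hypothetical pool setup - the real engine wires this up differently.
function createAssetWorkerPool(scriptURL) {
  var workers = [];
  for (var i = 0; i < getAssetWorkerCount(); i++) {
    workers.push(new Worker(scriptURL));
  }
  return workers;
}
```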
Inside the model loader worker script, we first determine whether the content is gzipped. If it is, we inflate it and pass it on to the next step, otherwise the raw data is passed on to a function which determines the content type. Based on this content type, we then pass it on to the model-specific loader - OBJ, glTF, FBX, DAE, etc. Each of these loaders maps to the underlying THREE.Loader, so again, each of those implementations has its own performance characteristics and memory usage patterns - some are more optimized than others.
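The gzip check itself is cheap: gzip streams always begin with the two magic bytes 0x1f 0x8b (RFC 1952). A sketch of that sniffing step, with illustrative function names and only a couple of the formats the real loader handles:

```javascript
// Gzip streams start with the magic bytes 0x1f 0x8b (RFC 1952).
function isGzipped(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Illustrative content sniffing for a couple of formats; the real
// loader checks many more types (OBJ, FBX, DAE, etc).
function sniffModelType(bytes) {
  var head = String.fromCharCode.apply(null, Array.prototype.slice.call(bytes, 0, 4));
  if (head === 'glTF') return 'glb';   // binary glTF container magic
  if (head[0] === '{') return 'gltf';  // JSON .gltf starts with '{'
  return 'unknown';
}
```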
Once the loader has done its job, we usually end up with a THREE.Object3D containing the whole hierarchy of objects represented by the model we've just loaded. We need to transfer this back to the main thread - here's one place where we get into trouble though. The easiest way to get this whole hierarchy from our worker thread back to our main thread is by serializing it to JSON, but it turns out that's very inefficient.
Here we see the CPU and memory usage of a worker that's loading 8.3MB of geometry data from a .gltf/.bin pair. Let's break it into stages.
- The first block of time we see is spent evaluating the asset worker's script. During profiling, this takes about 600ms - real-world usage is a bit faster, but this could be improved by trimming what gets pulled into our assetworker's JS. We're currently including a full copy of Three.js and the engine code, when we could pare this down to only the code that's necessary for loading models.
- Once the worker's scripting environment is up and ready to start processing, our onmessage handler fires, as the main thread has passed some work off to us to process. We very quickly identify the file type by looking at the first few bytes, and pass it off to THREE.GLTFLoader. It's interesting to note that for separate .gltf/.bin files, the way GLTFLoader works actually breaks some of our asset loading optimizations, by forcing the large data to download in a way which blocks the worker thread. Ideally we would have already fetched the .bin file before passing it to the worker to be processed, so this is one possible optimization already. Other formats generally don't suffer from this problem.
- Once the .bin file loads, THREE.GLTFLoader parses it into a THREE.Scene hierarchy for us in about 90ms. Not bad, but if this were on the main thread, that's up to 10 dropped frames in VR, which would be quite uncomfortable. The worker completely removes this problem - score!
- Now that we have a THREE.Scene object containing all the objects, geometries, and materials that make up our newly loaded model, we need to send that back to the main thread. This is where things get ugly. Right now, we're doing that by calling scene.toJSON(), which turns the whole hierarchy into a JSON object representing the sum total of all textures, images, geometry data, lights, cameras, and object relationships. In our example, this step takes a whopping 2.56 seconds.
- From here, we take the JSON object and do a little bit of postprocessing on it. We replace simple JavaScript arrays containing geometry data with TypedArrays where it makes sense - Float32Array, Int32Array, etc. In theory this makes it more efficient for us to transfer the data back to the main thread, but since we're doing this as an extra step at the end, it ends up costing us another 310ms.
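One way to address the .bin download issue noted above would be to prefetch external buffers on the main thread before handing the job to a worker. A sketch of what that could look like - the function names are hypothetical, not existing engine code:

```javascript
// Pull the external (non-data-URI) buffer URIs out of a parsed .gltf JSON.
function externalBufferURIs(gltf) {
  return (gltf.buffers || [])
    .map(function(b) { return b.uri; })
    .filter(function(uri) { return uri && uri.indexOf('data:') !== 0; });
}

// Hypothetical prefetch step: fetch the .bin files on the main thread so
// the worker receives them as ArrayBuffers and never blocks on the network.
function prefetchGLTFBuffers(gltfText, baseURL) {
  var uris = externalBufferURIs(JSON.parse(gltfText));
  return Promise.all(uris.map(function(uri) {
    return fetch(new URL(uri, baseURL).href).then(function(r) { return r.arrayBuffer(); });
  }));
}
```

The resulting ArrayBuffers could then be passed to the worker as transferables along with the job, so GLTFLoader never has to touch the network itself.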
So ignoring the time we spent downloading the .bin file (or at least waiting for the browser's download cache to give it to us), the total time spent in the worker to load this model is about 3 seconds of actual CPU processing time. Of this, 2.8 seconds is wasted in inefficient serialization. On top of the excess CPU usage and wall-clock time, we balloon the memory from about 6.7MB up to about 26MB during serialization, triggering several GC passes in the process. We can clearly do better!
It turns out the fix for this is pretty simple, and highly effective. Looking into the toJSON() function, serializing the Object3Ds themselves is pretty quick, and almost all of the time is spent in THREE.BufferGeometry's .toJSON(), which encodes the actual geometry data. The way this function works is by using Array.prototype.slice to clone the TypedArrays used internally by THREE.BufferGeometry / WebGL in general. This gives us a regular old non-typed JavaScript array, containing a copy of all the values that make up the geometry. While this would be necessary if we were truly outputting JSON, in our case it would actually be preferable to use the TypedArrays directly.
As an experiment, we tried overriding this function inside our asset loader worker. JavaScript makes this easy - we clone the function from the original implementation (https://github.com/mrdoob/three.js/blob/dev/src/core/BufferGeometry.js#L952-L1074), and tweak it slightly:
```javascript
THREE.BufferGeometry.prototype.toJSON = function() {
  ...
  data.data = { attributes: {} };

  var index = this.index;
  if ( index !== null ) {
    data.data.index = {
      type: index.array.constructor.name,
      array: index.array
    };
  }

  var attributes = this.attributes;
  for ( var key in attributes ) {
    var attribute = attributes[ key ];
    data.data.attributes[ key ] = {
      itemSize: attribute.itemSize,
      type: attribute.array.constructor.name,
      array: attribute.array,
      normalized: attribute.normalized
    };
  }
  ...
  return data;
};
```
Now that we've made this change, we can also remove the remapping step - now instead of creating new TypedArrays, we just build a list of transferable objects, so we can pass that as the second argument to postMessage():
```javascript
this.parse(modeldata, job).then(function(data) {
  var transferrables = [];
  // Build a list of ArrayBuffers that can be transferred to the main thread, to avoid memory copies
  try {
    if (data.geometries) {
      for (var i = 0; i < data.geometries.length; i++) {
        var geo = data.geometries[i];
        for (var k in geo.data.attributes) {
          transferrables.push(geo.data.attributes[k].array.buffer);
        }
      }
    }
    postMessage({message: 'finished', id: job.id, data: data}, transferrables);
  } catch (e) {
    postMessage({message: 'error', id: job.id, data: e.toString()});
  }
}, function(d) {
  postMessage({message: 'error', id: job.id, data: d.toString()});
});
```
After making this change, we profile again. The same asset now loads in 37ms, and uses only 6.7MB of memory to do so, with a single GC pass. Hooray! All the rooms we tried now load noticeably faster - this affects all model types, not just glTF, so it's a big win across the board.
FBX loading seems to be a bit of a degenerate case: the FBX loader regularly takes multiple seconds to parse models, and it uses 100MB of RAM and triggers hundreds of GC events when loading an 18.8MB model. Most of the GC seems to be triggered in the genFace() function, which is using what looks like a fairly unoptimized method of building vertex data lists:
```javascript
// Generate data for a single face in a geometry. If the face is a quad then split it into 2 tris
genFace: function ( buffers, geoInfo, facePositionIndexes, materialIndex, faceNormals, faceColors, faceUVs, faceWeights, faceWeightIndices, faceLength ) {

  for ( var i = 2; i < faceLength; i ++ ) {

    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ 0 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ 1 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ 2 ] ] );

    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ ( i - 1 ) * 3 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ ( i - 1 ) * 3 + 1 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ ( i - 1 ) * 3 + 2 ] ] );

    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ i * 3 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ i * 3 + 1 ] ] );
    buffers.vertex.push( geoInfo.vertexPositions[ facePositionIndexes[ i * 3 + 2 ] ] );
    ...
```
This continues for several other attributes, with about 60 separate calls to .push() across several different arrays. This is an inefficient way of building these arrays, since push() will cause the JS engine to dynamically reallocate the underlying array under the hood. It would be much more efficient if we knew the size of the arrays ahead of time and preallocated them to the right size. Combining multiple push() calls into a single call does help marginally, but there are greater gains to be had here.
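To illustrate the preallocation idea: a polygon with faceLength vertices fans into (faceLength - 2) triangles of 3 vertices each, so the output size is knowable up front and we can write into a preallocated Float32Array with a running cursor. This is a simplified sketch (it indexes a flat per-polygon position array directly rather than going through facePositionIndexes like the real FBXLoader), not a drop-in replacement:

```javascript
// Illustrative alternative to repeated push() calls: compute the output
// size first, then fill a preallocated Float32Array via a cursor.
// A polygon with faceLength vertices fans into (faceLength - 2) triangles.
function triangulatePositions(vertexPositions, faceLength) {
  var out = new Float32Array((faceLength - 2) * 9); // 3 verts * 3 components each
  var cursor = 0;
  for (var i = 2; i < faceLength; i++) {
    // Triangle fan: corners (0, i-1, i)
    var corners = [0, i - 1, i];
    for (var c = 0; c < 3; c++) {
      var base = corners[c] * 3;
      out[cursor++] = vertexPositions[base];
      out[cursor++] = vertexPositions[base + 1];
      out[cursor++] = vertexPositions[base + 2];
    }
  }
  return out;
}
```

The array is allocated exactly once per face batch, so the JS engine never has to grow a backing store mid-loop.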
Another thing to look into is that genBuffers() is called numerous times, seemingly repeating parsing of the same attributes tens of times. We should investigate whether this is intentional or not.
Now that we've got our models loading more efficiently, it's time to look at image loading. We know that loading a lot of images is one of the quickest routes to crashing mobile browsers, so the cleaner we can get this working, the better. We want to make maximum use of browser features like createImageBitmap when available, and we want to minimize the work that's done in the main thread. We also have to be wary of texImage2D calls blocking the GPU as the texture is uploaded, and we should probably clean up after ourselves on the CPU side once we no longer need the image data there.
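A hedged sketch of that feature-detect-and-decode path - the function names are illustrative, and the fallback shown is the conventional Blob-URL-plus-Image approach rather than the engine's actual code:

```javascript
// Feature-detect createImageBitmap so we can decode images off the main
// thread where the browser supports it.
function supportsImageBitmap() {
  return typeof createImageBitmap === 'function';
}

// Hypothetical decode path: prefer createImageBitmap (which decodes
// asynchronously, off the main thread), fall back to a Blob URL and an
// Image element otherwise. Both branches return a Promise.
function decodeImage(blob) {
  if (supportsImageBitmap()) {
    return createImageBitmap(blob);
  }
  return new Promise(function(resolve, reject) {
    var url = URL.createObjectURL(blob);
    var img = new Image();
    img.onload = function() { URL.revokeObjectURL(url); resolve(img); };
    img.onerror = function(err) { URL.revokeObjectURL(url); reject(err); };
    img.src = url;
  });
}
```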
Once we've parsed each model asset, it gets added to the scene as many times as needed. For each instance of the model, we clone the object and its materials. Cloning the materials is necessary in order to allow markup overrides for various textures, colors, and other material attributes, so we have a custom pass which remaps the materials that the asset loader gives us to materials that are set up for JanusWeb. However, I have a suspicion that this material remapping step is being called significantly more times than it should be for each material - it may be that it's triggering onload for each dependent asset, causing multiple shader recompiles for the same material. It's also possible that we're uploading the same textures to the GPU multiple times; we should double-check this with WebGLInspector. Either way, it feels like we're flooding the GPU with commands, which causes the whole system to bog down. I suspect there are big wins to be had here by carefully analyzing the logic of when shader recompiles are triggered.
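If the duplicate-remapping suspicion pans out, one possible fix would be to memoize the remap per source material, e.g. with a Map keyed on the loader's material instance. A sketch with hypothetical names, not the engine's actual remap function:

```javascript
// Hypothetical memoization of the material remap step: remap each
// source material once, and return the cached result on repeat calls,
// so duplicate invocations can't trigger redundant shader recompiles.
var remapCache = new Map();

function remapMaterialCached(sourceMaterial, remapFn) {
  if (!remapCache.has(sourceMaterial)) {
    remapCache.set(sourceMaterial, remapFn(sourceMaterial));
  }
  return remapCache.get(sourceMaterial);
}
```

The cache would need to be invalidated when an asset is unloaded, or the Map replaced with a WeakMap so entries are collected along with their source materials.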
Once we get past the asset loading stage, we're starting to look at what the engine is doing each frame. For our test scene, everything is running smoothly at 60fps, no problem, but we see a nasty sawtooth pattern in the memory graph. This generally points to us allocating memory somewhere in the game loop - you can see from the memory graph that our memory usage steps up by about 800kb 60 times per second, triggering roughly 3 rounds of minor GC per second. Ouch! In an ideal world, our memory graph would be almost flat from frame to frame. We'll need to deep-dive into the functions that get called each frame and see if we can spot any obvious optimizations here. The most common fix for this type of thing is to replace per-frame allocations with preallocated scratch variables - let's see what we find.
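As a concrete example of the scratch-variable pattern - shown here with plain-object vector math rather than THREE.Vector3, but the idea is the same:

```javascript
// Per-frame allocation: creates a new object on every call, feeding
// the minor GC 60 times per second.
function getOffsetAllocating(pos, vel, dt) {
  return { x: pos.x + vel.x * dt, y: pos.y + vel.y * dt, z: pos.z + vel.z * dt };
}

// Preallocated scratch variable reused across frames: zero allocations
// in the steady state.
var scratch = { x: 0, y: 0, z: 0 };
function getOffsetScratch(pos, vel, dt) {
  scratch.x = pos.x + vel.x * dt;
  scratch.y = pos.y + vel.y * dt;
  scratch.z = pos.z + vel.z * dt;
  return scratch; // caller must copy if it needs to keep the value past this frame
}
```

The trade-off is that the scratch version returns the same object every call, so callers that store the result must copy it - a common source of bugs when retrofitting this pattern, and worth keeping in mind as we audit the game loop.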