Ryan Lam (rwl) and Lucy Tan (jltan)
We accelerated lightmap compilation using the GPU. Valve's lightmap compiler, VRAD, is resource heavy and basically turns the user's computer into a potato while it processes everything, so it's a boon that 418RAD not only is on the GPU, but it's also faster. Now you can use your computer for other things while waiting for the lightmap to finish compiling!
Radiosity is a global illumination algorithm which simulates the effects of real light. Real-time radiosity computations are expensive, so many game engines pre-compute radiosity and overlay the resulting lightmap across their maps to give the sense of somewhat realistic lighting. As one might expect, the computation of the lightmap is quite costly as well. When buiding in the Source Engine, the lightmap compilation can easily take hours and become a bottleneck.
There are a few different ways to handle radiosity. We decided to use the gathering method as opposed to the shooting method, since with gathering, the only shared update was to the gatherer, as opposed with the shooting method where an update would have to be written to all incident patches. The assumption was that we would be able to parallelize across all patches with the gathering method for good performance, as opposed to the shooting method where we could only do shots from one patch at a time. Parallelizing across the patches seemed to be the obvious answer, and terribly data-parallel.
This algorithm took in an array of patches and updated the patches with the correct information. Creating and updating this array is highly important throughout the code, as all patch information is stored there and will eventually need to be returned.
As this project was intended to supplement the Source Engine, much of the code is based off the Source Engine code. We decided to use CUDA since people developing for Source Engine usually have some kind of graphics card available in their machine, as the typical gamer tends to have discrete graphics cards.
Most of the classes were modeled after the Source Engine's since we needed to read and store data in a file format specific to the Source Engine.
After importing all the data into the GPU, the data required for radiosity is transfered from BSP to CUDABSP (in device memory). From there, various manipulations are made to the CUDABSP class depending on the phase, and after all work is done, the modifications to the device CUDABSP are transferred back to the host BSP.
For the direct lighting phase, the level's faces are broken up into a set of triangles, which is then partitioned into a KD-tree. Each KD node represents a subdivision of 3D space, and each KD leaf contains a set of triangles (or no triangles). This forms a basic ray-trace acceleration structure that can be used to perform point-to-point line-of-sight tests. To calculate direct lighting, we parallelize over all faces in the level geometry, and use the KD tree to test whether each light sample on each face can see any of the level's light sources. If the sample can see the light source, an attenuation factor is applied to the light source's inherent brightness according to the inverse-square law, in order to determine the initial color of the sample.
Since this discrete sampling of lightmap points can result in aliased shadows, an anti-aliasing pass is then made over the lightmaps of each face. A simple gradient detection algorithm is used to determine which points are likely to require anti-aliasing, and supersampling is then performed on each of those candidate points. For faster (but slightly less aesthetically pleasing) results, basic box filtering can be applied to the candidate points instead of supersampling.
For the light bouncing phase, each face is first placed into an array for easy access. Then each face is divided into patches. These patches are put into a massive array, and the associated face is given a pointer to its first patch. Gathering, the parallel part of this business, begins afterwards. Each patch calculates the amount of light that is reflected to it from all other patches. It updates its color value accordingly. This repeats until the iteration cap is reached or the values converge. We parallelize over the patches, and sync between iterations. After all light data is collected, one final process takes all the data and updates CUDABSP, which will later be translated into BSP on the host side and written to file.
Our original plan was to include light bouncing radiosity computations, but unfortunately the version rolled out does not have this functionality properly implemented. However, the direct lighting implementation provides lightmaps comparable to those generated by Source Engine's VRAD (with its light bounces disabled).
lightmaptest
comparison -- VRAD on left, 418RAD on right
lightmaptest_hires
comparison -- VRAD on left, 418RAD on right
LJM
comparison -- VRAD on left, 418RAD on right
There are some differences in the shadows due to minor differences in the way the level's lighting rules are interpreted by the respective lightmap generators.
The baseline was performed on Ryan's laptop, which has an Intel i7 CPU, 2.4 GHZ, 4 cores (8 logical cores). The GTX 850M GPU benchmark was performed on the same machine; the GTX 1080 tests were conducted on the GHC machines. lightmaptest
is a very simple map; just a room with a few objects and lights placed into it. lightmaptest_hires
is the same map but with higher light sampling density, which is why the shadows are much sharper. LJM
is an excerpt from an actual level used in a real Source Engine game project (specifically, the Black Mesa Hazard Course, one of Ryan's hobby side-projects).
As can be seen, compilation time is almost always better on GPU. The exception is with lightmaptest on the GTX 850M; we assume the lengthier time to be from overhead. Speedup varies depending on the map, as there are a multitude of factors that affect how well the lighting algorithm does (such as the shape of the level's BSP structure, and the particular arrangement of triangles in the level), so it's not easy to tie the increased speed to any particular attribute, such as file size or number of models in the map. However, in general, the GTX 1080's achieve 2x speedup or better.
In builds that had included the flawed bouncing light, the light bouncing computations took up the majority of the time, but with only direct lighting, the antialiasing phase turned out to take up the most time. In fact, for the LJM test, the antialiasing phase took 4.5 times longer (9 seconds) than the direct lighting phase (2 seconds).
Further improvements could be made. Valve's BSP data structures have a lot of indirection. Since we essentially ported Valve's classes into CUDA, our code, too, has a lot of indirection, so we are missing out on many benefits of locality.
Source SDK Documentation on map compilation
Ryan: Reverse engineering file format, file read and write, direct and ambient lighting
Lucy: Light bouncing