Friday, August 7, 2009

Hierarchical Cone Tracing (half baked rough sketch)

High quality ray tracing involves computing huge numbers of ray-scene intersections. Since most scenes are not random, the set of rays traced for a particular frame is highly structured and spatially correlated. Just as real world images feature significant local spatial coherence which can be exploited by image compression, real world scenes feature high spatial coherence which ray tracers can exploit. Current real time ray tracers exploit spatial coherence at the algorithm level through hierarchical packet/cone/frustum tracing, and at the hardware level through wide SIMD lanes, wide memory transactions, and so on. Coherence is fairly easy to maintain for primary and shadow rays, but becomes increasingly difficult for specular reflection and refraction rays in the presence of high frequency surface normals, or for the wide dispersion patterns of diffuse tracing. However, taking inspiration from both image compression and collision detection, it should be possible to significantly reduce the number of traces per pixel and to trace all rays (or cones) in a structured, coherent, even hierarchical manner.

A set of rays with similar origins and directions can be conservatively bounded or approximated by a frustum or cone. Using these higher order shapes as direct intersection primitives can replace many individual ray traces, but usually at the expense of much more complex intersection tests against traditional primitives such as triangles. However, alternative primitives such as voxels (or distance fields) permit cheap sphere intersections, and thus fast approximate cone tracing through spherical stepping. Cones have the further advantage of permitting fairly quick bounding and clustering operations.
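To make the sphere stepping concrete, here is a rough C++ sketch of tracing a single cone through a distance field. The distanceField() function here is just a toy stand-in (a unit sphere) for whatever voxel/distance query the real system provides, and the hit test is conservative: the cone registers a hit as soon as the field distance drops below the cone's radius at the current depth.

#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Toy scene for illustration only: a unit sphere at the origin.  A real system
// would query the voxel octree / distance field here.
static float distanceField(Vec3 p) {
    return std::sqrt(p.x*p.x + p.y*p.y + p.z*p.z) - 1.0f;
}

// Returns the depth of the first conservative hit, or maxT if the cone escapes.
// dir is assumed to be normalized.
float coneTrace(Vec3 origin, Vec3 dir, float tanHalfAngle, float maxT)
{
    float t = 0.0f;
    while (t < maxT) {
        Vec3  p       = add(origin, mul(dir, t));
        float coneRad = t * tanHalfAngle;      // radius of the cone at this depth
        float d       = distanceField(p);
        if (d < coneRad)                       // the stepping sphere touches geometry
            return t;
        t += std::max(d, 1e-4f);               // safe step: nothing is closer than d
    }
    return maxT;
}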

Building on cones as the base primitive, we can improve on ray tracing in several directions. First, we can treat the entire set of cone traces for a frame as one large collision detection problem: intersecting a set of cones with the scene. Instead of intersecting each cone individually, hierarchical cone tracing (HCT) builds up a cone hierarchy on the fly and uses it to quickly exclude large sections of potential sphere-scene intersections. The second area of improvement is to use clustering approximations at the finer levels to reduce the total set of cones, replacing clusters of similar cones with approximations. Finally, the exposed hierarchy can be navigated in a coherent fashion which is well suited to modern GPUs.
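As a small illustration of the kind of bounding/clustering operation involved, here is a sketch of merging two direction cones into one conservative parent: take the bisector of the two axes and widen the half-angle until both children fit. This only bounds directions; the cone segments' origins would be bounded separately (for example by a bounding sphere), and the exact clustering scheme is left open here.

#include <algorithm>
#include <cmath>

struct DirCone { float ax, ay, az; float halfAngle; };   // unit axis + half-angle (radians)

static float dot3(const DirCone& a, const DirCone& b) { return a.ax*b.ax + a.ay*b.ay + a.az*b.az; }

// Conservative bound of two direction cones (assumes the axes are not nearly opposite).
DirCone mergeCones(DirCone a, DirCone b)
{
    float mx = a.ax + b.ax, my = a.ay + b.ay, mz = a.az + b.az;   // bisector of the axes
    float len = std::sqrt(mx*mx + my*my + mz*mz);
    DirCone m { mx / len, my / len, mz / len, 0.0f };
    // Widen the merged half-angle until each child cone is fully contained.
    float da = std::acos(std::min(1.0f, dot3(m, a)));
    float db = std::acos(std::min(1.0f, dot3(m, b)));
    m.halfAngle = std::max(da + a.halfAngle, db + b.halfAngle);
    return m;
}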

In short sketch, hierarchical cone tracing amounts to taking a cluster of cone segments (which are in turn clusters of points + rays), building up some sort of hierarchical cone tree organization, and then using this tree to trace and find intersections more efficiently. Since tracing a cone involves testing a set of spheres along the way, testing enclosing spheres higher up the hierarchy can be used to skip steps lower down. However, instead of building the hierarchy from the bottom up (which would require the solution to already be known), the cone tree is built from the top down, using adaptive subdivision. Starting with a single cone (or small set of cones) from the camera, it calculates an intersection slice (or range of slices) with the scene. These intersection spheres are tested for bounds on the normals, which allow computation of secondary cones, with origins bounding the intersection volumes and angular widths sufficient to bound the secondary illumination of interest.

This space forms a 3D (2 angular + depth) dependency tree structure, which can then be adaptively subdivided, eventually down to the level of near pixel width primary cones which typically intersect a small slice of the scene and have similar normals (a small radius normal bounding cone). Refraction secondary cones have a width similar to the incoming cone's width. Specular secondary cones have a width ranging from the incoming width to much wider, depending on the glossiness term. Diffuse bounce secondary cones are essentially equivalent to the widest specular, expanding the cone to maximum width. Subdivision can proceed in several ways at each step: in primary cone direction (2D), intersection depth, secondary cone direction (2D), or secondary intersection depth (considering more bounces in one step would add additional dimensions). The subdivision is error guided, terminating roughly when a low error approximation is reached. This framework is complete, and can simultaneously handle all illumination effects.
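To pin down what a node in such a tree might carry, here is a purely illustrative C++ sketch. All of the names and fields are invented; the post only fixes the dimensions involved (primary direction, depth along the cone, secondary direction and depth) and the fact that subdivision is chosen adaptively by error.

#include <vector>

enum class SplitAxis {
    PrimaryDirection,     // split the 2D angular extent of the primary cone
    PrimaryDepth,         // split the depth interval (cone segment) of the hit
    SecondaryDirection,   // split the 2D angular extent of the secondary cone
    SecondaryDepth        // split the depth interval of the secondary trace
};

struct ConeTreeNode {
    // Primary cone: bounded origin, direction cone, and a depth interval.
    float originBoundRadius;
    float primaryHalfAngle;
    float depthNear, depthFar;
    // Bound on surface normals inside the hit slice (a normal bounding cone),
    // which determines how wide the secondary cones must be.
    float normalConeHalfAngle;
    // Secondary (illumination) cone bound.
    float secondaryHalfAngle;

    float estimatedError;                 // drives which node to subdivide next
    std::vector<ConeTreeNode> children;   // filled in lazily, top-down
};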

First, consider just the simple case of primary rays. Here, hierarchical cone tracing is pretty straightforward. You trace the image pyramid from coarsest to finest. Each mip trace finds the first intersection or a conservative near-z for that mip level, and the next mip level's trace uses this coarse near-z as its starting depth instead of starting at the camera. Even this basic idea can result in more than a 2x speedup over regular tracing. Further speedup is possible if the lower res mip traces farther in, possibly outputting a set of segments instead of just the first intersection. This can be done rather easily with alpha tracing in a voxel system. Considering a range of segments, and not just the first intersection, amounts to treating it as a 3D subdivision problem instead of a 2D one. The global bounding cone is approximated by a few dozen spheres, each of which subdivides into about 8 child spheres, and so on. Testing a sphere high up in the hierarchy can exclude that entire subtree.
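Here is a rough sketch of that coarse-to-fine primary pass. The coneTrace callable is an assumed hook (for example the sphere-stepping trace above, run with the appropriate per-level cone width) taking (level, x, y, startDepth) and returning a conservative nearest-hit depth; each texel starts its trace at the depth its parent texel found one level up.

#include <functional>
#include <vector>

using ConeTraceFn = std::function<float(int, int, int, float)>;   // (level, x, y, startDepth) -> near-z

// Trace the image pyramid coarse to fine.  finestSize is assumed to be the
// coarsest level size times 2^(numLevels - 1), so each level doubles.
std::vector<std::vector<float>> traceImagePyramid(int finestSize, int numLevels,
                                                  const ConeTraceFn& coneTrace)
{
    std::vector<std::vector<float>> nearZ(numLevels);
    for (int level = 0; level < numLevels; ++level) {
        int size = finestSize >> (numLevels - 1 - level);          // coarsest level first
        nearZ[level].assign(size * size, 0.0f);
        for (int y = 0; y < size; ++y) {
            for (int x = 0; x < size; ++x) {
                // Start from the parent texel's conservative near depth instead of
                // the camera; the coarser cone fully contains this one.
                float start = 0.0f;
                if (level > 0) {
                    int parentSize = size / 2;
                    start = nearZ[level - 1][(y / 2) * parentSize + (x / 2)];
                }
                nearZ[level][y * size + x] = coneTrace(level, x, y, start);
            }
        }
    }
    return nearZ;
}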

Next let's consider a directional or point light. Starting with a very large cone for the entire scene, we get a set of intersection slices along the entire frustum, each of which has a range of normals. Since these regions are so huge, the normals will probably cover the full sphere, so the cones for any secondary effects will basically be large spheres emanating out across the entire world. Not surprisingly, the origin of the cone hierarchy doesn't tell you much beyond the fact that you have a camera frustum, it hits a bunch of stuff, and the secondary illumination could come from anywhere in the scene. Subdivision can proceed in several different ways. You could next subdivide the screen, or you could even subdivide in 3D (screen + depth), splitting into 8 child traces, where the farther depth slices are provisional (they may not accumulate anything). However, another dimension of subdivision, which makes sense here, is to subdivide the 2D direction of the secondary (illumination) cones. This should be error guided, subdividing cones that contain high frequency illumination (stored in the octree) and contribute the most error. A bright point light or a direct light will show up as a huge illumination spike in one cone direction, and subdivision in direction will then home in on the incoming light direction logarithmically. If there is no indirect illumination (all zeroed out in the octree), the error guidance will expand in the direction dimension until it finds the primary light direction, and ignore the other directions.

How would the error guidance work more specifically? Its goal is to minimize the final tree size (and thus total computation). It does this in a greedy, approximate fashion using the local information available at each step. When subdividing a cone, there are several splitting options to choose from, and one dimension (depth, i.e. the cone's length) is rather special because it involves a temporal dependency: the closer cone section potentially masks the farther section, so if the near section is completely opaque (blocks all light), the far section can be cut. Tracing a cone as a sphere approximation amounts to fully subdividing along just the depth dimension, and reveals the full alpha interval function along that depth (which could be saved as just a min and max, or in a more explicit format). Whether to subdivide a cone along the spatial dimension or the angular dimension then depends on the outcome of the trace, the illumination info found, and the current angle relative to the spherical origin bound.
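As a sketch of the overall expansion loop, the error guidance can be pictured as a priority queue of cone nodes keyed by estimated error contribution: pop the worst offender, pick a split dimension from the local information, trace and push the children. The Node layout and the chooseSplit()/subdivide() hooks below are placeholders, not a worked-out design; subdivide() is assumed to drop far-depth children when the nearer slice comes back fully opaque.

#include <queue>
#include <vector>

struct Node {
    float estimatedError;        // how much this cone could still change the image
    std::vector<Node> children;  // filled in by subdivide()
};

// Assumed hooks: pick a split dimension from local info, then trace and
// populate the children for that split.  Both are placeholders.
int  chooseSplit(const Node& n);
void subdivide(Node& n, int axis);

struct WorkItem {
    Node* node; float error;
    bool operator<(const WorkItem& o) const { return error < o.error; }
};

void expandTree(Node& root, float errorBudget, int maxNodes)
{
    std::priority_queue<WorkItem> queue;
    queue.push({&root, root.estimatedError});
    for (int expanded = 0; !queue.empty() && expanded < maxNodes; ++expanded) {
        WorkItem item = queue.top(); queue.pop();
        if (item.error < errorBudget)
            break;                                       // everything left contributes too little
        subdivide(*item.node, chooseSplit(*item.node));  // trace + fill children
        for (Node& child : item.node->children)
            queue.push({&child, child.estimatedError});
    }
}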

Careful error guidance results in an efficient traversal for the directional light case. Once the secondary cone direction is sufficiently subdivided, finding that a secondary cone trace towards the light direction returns 0 (fully shadowed) automatically stops further subdivision of that entire subtree. Likewise, subdividing in secondary cone depth reveals entire cone subsections that are empty and do not need to be traced. The fully lit regions then end up exploring increasingly short cone traces near the surface. Full cone traces down to pixel level are only required near shadow edges, and only when the shadow is near pixel resolution, since the error guidance can terminate softer shadows earlier. Secondary bounce diffuse illumination is similar, but the energy is smeared out (lower frequency), so it explores a broader subtree in cone direction while usually terminating much earlier in the spatial dimension. It again ends up terminating with just a few shallow traces at the finest resolution, representing local searches. Specular and reflection traces are handled in a similar fashion, and really aren't that different.


Surface Clustering

Further improvement comes from clustering approximation. The coarse level tests can use bounds on the normals and depths in an intersection region to approximate the intersection and terminate early (for example, finding that the intersection for a 4x4 pixel block is a smooth, nearly coplanar surface section allows early termination, with the intersections for the two finer levels computed through simple interpolation). This is related to the concept of shading in a compressed space. Consider a simple block based compression scheme such as DXTC, which essentially uses a PCA approach to compress a set of values (a 4x4 block) into a common 1D line segment through the space (the low frequency component) combined with a per pixel distance along the line (high frequency and lower accuracy). The scheme compresses smooth regions of the image with little error, and the error in noisy regions of the image is masked by the noise.
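For reference, here is the DXTC-style idea in its simplest (scalar) form: a block is reduced to two endpoints plus a per-pixel parameter along the segment between them. Real DXT encoders work on colors with quantized weights; this sketch just uses the block's min and max as a cheap stand-in for the PCA fit.

#include <algorithm>
#include <array>

struct BlockFit {
    float lo, hi;                   // segment endpoints (the low frequency part)
    std::array<float, 16> weight;   // per-pixel position along the segment, 0..1
};

BlockFit compressBlock(const std::array<float, 16>& block)
{
    BlockFit fit;
    fit.lo = *std::min_element(block.begin(), block.end());
    fit.hi = *std::max_element(block.begin(), block.end());
    float range = std::max(fit.hi - fit.lo, 1e-6f);
    for (int i = 0; i < 16; ++i)
        fit.weight[i] = (block[i] - fit.lo) / range;    // where this pixel sits on the line
    return fit;
}

float decompress(const BlockFit& fit, int i)
{
    return fit.lo + fit.weight[i] * (fit.hi - fit.lo);
}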

Now let's apply this compression scheme in the context of traditional shading. In fact, it is directly applicable in current raster engines for complex but lower frequency shading effects like screen space AO or GI. Downsampling the depth buffer to compute AO on fewer samples, and then upsampling with a bilateral filter, can be considered a form of compression that exploits the dominant low frequency of AO. A related scheme, closer to DXTC, is to perform a min/max depth downsampling, evaluate the AO on the min and max samples per block, and then use these to upsample - with a dual bilateral filter or even without. The min/max scheme better represents noisy depth distributions and works much better in those more complex cases (although similar results could be obtained by storing only 1 z-sample per block and stipple-alternating min and max).
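A sketch of that min/max variant for a single 4x4 block: evaluate the expensive term only at the block's min and max depths, then reconstruct every pixel by blending the two results according to where its depth sits between them. evalAO() is a placeholder for the real per-sample AO/GI evaluation, which of course needs more than a lone depth value.

#include <algorithm>
#include <array>

float evalAO(float depth);    // assumed expensive per-sample evaluation (placeholder)

void shadeBlockMinMax(const std::array<float, 16>& depth, std::array<float, 16>& ao)
{
    float dMin = *std::min_element(depth.begin(), depth.end());
    float dMax = *std::max_element(depth.begin(), depth.end());
    float aoMin = evalAO(dMin);              // only two expensive evaluations
    float aoMax = evalAO(dMax);              // per 16-pixel block
    for (int i = 0; i < 16; ++i) {
        // Endpoint blend: pixels near the min depth take aoMin, pixels near the
        // max depth take aoMax, in-between depths interpolate.
        float t = (dMax > dMin) ? (depth[i] - dMin) / (dMax - dMin) : 0.0f;
        ao[i] = aoMin + t * (aoMax - aoMin);
    }
}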

So the generalized form of this spatial clustering reduction is to: 1. reduce the set of samples into a compressed, smaller exemplar set, often reducing the number of dimensions; 2. run your per-sample algorithm on the reduced exemplar set; and 3. use a bilateral-type filter to interpolate the results from the exemplars back onto the original samples. If the exemplars are chosen correctly (and are sufficient) to capture the structure and frequency of the function on the sample set, there will be little error.
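Stated as code, the general pattern looks something like this; the three stages are passed in as callables, and the AO and specular examples in the surrounding text are concrete instances of it.

#include <functional>
#include <vector>

template <typename Sample, typename Result>
std::vector<Result> shadeViaExemplars(
    const std::vector<Sample>& samples,
    // 1. reduce the samples to a small exemplar set
    const std::function<std::vector<Sample>(const std::vector<Sample>&)>& reduce,
    // 2. run the expensive per-sample algorithm on the exemplars only
    const std::function<Result(const Sample&)>& evaluate,
    // 3. interpolate exemplar results back onto every original sample
    const std::function<Result(const Sample&, const std::vector<Sample>&,
                               const std::vector<Result>&)>& reconstruct)
{
    std::vector<Sample> exemplars = reduce(samples);
    std::vector<Result> exemplarResults;
    for (const Sample& e : exemplars)
        exemplarResults.push_back(evaluate(e));
    std::vector<Result> out;
    for (const Sample& s : samples)
        out.push_back(reconstruct(s, exemplars, exemplarResults));
    return out;
}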

As another example, let's consider somewhat blurry, lower frequency reflections on a specular surface. After primary ray hit points are generated for a block of pixels (or rasterized), you have a normal buffer. You then run a block compression on this buffer to find a best fit min and max normal for the block, plus the per pixel interpolators. Using this block information, you can compute specular illumination for the 2 per-block normal directions, and then interpolate the results for all of the pixels in the block. In regions where the normal is smooth, such as the hood of a car, the reduced block normals are a very close approximation. In regions with high frequency noise in the normals, the noise breaks up any coherent pattern in the specular reflection and hides any error due to the block compression. Of course, on depth edges there is additional error due to the depth/position. To handle this, we need to extend the idea to also block compress the depth buffer, resulting in a multi-dimensional clustering based on 2 dimensions along a sphere for the normal and one dimension of depth. This could still be approximated by two points along a line, but there are other possibilities, such as 3 points (forming a triangle), or even 2 depth clusters with 2 normals each (4 points in the 3D space). It would be most efficient to make the algorithm adaptive, using perhaps 1-3 candidate exemplars per block. The subsequent bilateral filtering would pick the appropriate neighbor candidates and weights for each full res sample.
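A sketch of that per-block specular path: pick two exemplar normals for the block (here crudely, the two most separated normals, standing in for a proper min/max or PCA fit), evaluate the specular term only for those two directions, and blend per pixel by closeness to each exemplar. evalSpecular() is a placeholder for the actual reflection cone trace; normals are assumed normalized.

#include <algorithm>
#include <array>

struct N3 { float x, y, z; };
static float dotN(N3 a, N3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

float evalSpecular(N3 normal);   // assumed expensive per-direction evaluation (placeholder)

void shadeBlockSpecular(const std::array<N3, 16>& normal, std::array<float, 16>& spec)
{
    // Pick the two most separated normals in the block as exemplars.
    int a = 0, b = 1;
    float worst = 1.0f;
    for (int i = 0; i < 16; ++i)
        for (int j = i + 1; j < 16; ++j)
            if (dotN(normal[i], normal[j]) < worst) { worst = dotN(normal[i], normal[j]); a = i; b = j; }
    float specA = evalSpecular(normal[a]);   // two traces per 16-pixel block
    float specB = evalSpecular(normal[b]);
    for (int i = 0; i < 16; ++i) {
        // Blend by relative closeness of this pixel's normal to each exemplar.
        float wa = std::max(dotN(normal[i], normal[a]), 0.0f);
        float wb = std::max(dotN(normal[i], normal[b]), 0.0f);
        spec[i] = (wa * specA + wb * specB) / std::max(wa + wb, 1e-6f);
    }
}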

Integrating this concept into the hierarchical cone tracer amounts to adding a new potential step when considering expansion subtasks. As described earlier for primary rays, you can stop subdivision early if you know that the hit intersection is simple and reducible to a plane. Generalizing that idea, the finer levels of the tree expansion can perform an approximation substitution instead of fully expanding and evaluating in each dimension, terminating early. A plane fit is one simple example, with an exemplar set being the more general case. The 5D cone subtree at a particular lower level node (like 4x4), which may have many dozens of fully expanded children, can be substituted with a lower dimensional approximation and a few candidate samples. This opens up a whole family of algorithms which adaptively compress and reduce the data and computation space. Triangle-like planar primitives can be considered a spot optimization that may be cheaper than subdividing and tracing additional spheres.
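A plane-fit early-out might look something like the following check, run on whatever hit samples a node has already gathered: if the points lie close to a common plane and the normals stay inside a narrow cone, the node's remaining subtree is replaced by interpolation over that plane instead of being expanded further. The HitSample layout and the thresholds are invented for illustration.

#include <cmath>
#include <vector>

struct HitSample { float px, py, pz; float nx, ny, nz; };   // hit position + unit normal

bool canSubstitutePlane(const std::vector<HitSample>& hits,
                        float maxPlaneError, float minNormalDot)
{
    if (hits.empty()) return false;
    // Average normal plus a reference point define the candidate plane.
    float nx = 0, ny = 0, nz = 0;
    for (const HitSample& h : hits) { nx += h.nx; ny += h.ny; nz += h.nz; }
    float len = std::sqrt(nx*nx + ny*ny + nz*nz);
    if (len < 1e-6f) return false;                  // normals cancel out: not planar
    nx /= len; ny /= len; nz /= len;
    const HitSample& ref = hits[0];
    for (const HitSample& h : hits) {
        // All normals must stay inside a narrow normal cone...
        if (h.nx*nx + h.ny*ny + h.nz*nz < minNormalDot) return false;
        // ...and all points must stay close to the plane through ref.
        float d = (h.px - ref.px)*nx + (h.py - ref.py)*ny + (h.pz - ref.pz)*nz;
        if (std::fabs(d) > maxPlaneError) return false;
    }
    return true;
}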

It's certainly complex, or potentially complex, but I think it sketches the direction of what an optimal or near optimal tracer would look like. Just as in image compression, increasing code complexity hits an asymptotic wall at some point. I've only considered a single frame here; going further would require integrating these ideas with temporal coherence. Temporal coherence is something I've discussed earlier, although there are many options for how to approach it in a hierarchical cone tracer.

What are the potential advantages? Perhaps an order of magnitude speedup or more if you really add a lot of the special case optimizations. Of course, this speedup is only at the algorithmic level, and depends very much on implementation details, which are complex. I think the queue based task generation models now being explored on GPUs are the way to go here, and could implement an adaptive tree subdivision system like this efficiently. Coherence can be maintained by organizing expansions in spatially related clusters, and enforcing this through clustering.

But the real advantage would be combining all lighting effects into a single unified framework. Everything would cast, receive, bounce and bend light, with no special cases. Lights would just be objects in the scene, and everything in the voxelized scene would store illumination info. A real system would have to cache this and what not, but it can be considered an intermediate computation in this framework, one of many things you could potentially cache.

I've only explored the baby steps of this direction so far, but it's interesting to ponder. Of course, the streaming and compression issues in a voxel system are a far higher priority. Complex illumination can come later.




