Saturday, December 13, 2008

Forward Reprojection for future hardware

The schemes described in my previous post are well suited for ps3/360 level hardware and a mix of CPU/GPU work. Namely, octree-screen tile intersection and point splatting done on CPU threads, and everything else done on the GPU.

But for future hardware, point splatting is less ideal than direct cone tracing. Why? Actually, the two end up being almost the same. The critical scene traversal loop, which I think is best done on the CPU at coarse resolution for the current generation, would be better done at high resolution to support higher quality illumination. If you want accurate reflections and GI effects, you need to be able to 'rasterize' very small frustums the size of a few pixel blocks - so it amounts to almost the same thing. Whether splatting or tracing (scatter or gather), you are doing fine-grained intersections of frustums (pixels/tiles) with a 3D tree structure (octree or what have you).

For CUDA-based hardware, I think cone tracing into a new type of adaptive distance field structure is the way to go, and some initial prototyping done at home has been very promising. Combine this with the forward projection system, and you can get away with tracing only around 30k pixels per frame on average for primary visibility! For dynamic lighting update effects, you want additional rays per pixel, but with the magic of fast cone tracing (which is far more efficient than rays for soft area queries) combined with the magic of frame-coherent reprojection, I think we can hit uber quality with far less than 30 million cone traces per second, which my prototype can already exceed on an 8800GT.
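To make the cone tracing idea concrete, here's a minimal CPU sketch in Python (the single-sphere SDF scene, the 128-step cap, and the cone angle are illustrative placeholders, not the actual CUDA prototype): the march advances by the distance field's safe bound, and a hit is declared once the field distance drops below the cone's radius at that depth.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    # Toy scene: signed distance to a single sphere (placeholder SDF).
    return math.dist(p, center) - radius

def cone_trace(origin, direction, tan_half_angle, sdf, max_t=100.0):
    # Sphere-trace a cone: advance by the field's distance bound, and
    # report a hit once the distance falls below the cone radius at t.
    t = 1e-3
    for _ in range(128):
        p = tuple(o + d * t for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < t * tan_half_angle:   # cone footprint touches the surface
            return t
        t += dist
        if t > max_t:
            break
    return None                         # cone escaped the scene

hit_t = cone_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.01, sphere_sdf)
```

A wider `tan_half_angle` makes the hit condition trigger sooner, so soft area queries (area lights, glossy reflections) terminate in fewer steps than sharp rays - the property the efficiency claim above relies on.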

However, one area which can be improved is the forward projection. Since tracing is relatively expensive, and incoherent tracing is vastly more so, it's worth the effort to get fine-grained, accurate projection and improve the coherence.

One simple way to improve the tracing coherence is to reorder the frame buffer after the projection pass. This can be done on the GPU, resulting in a reordered frame buffer with nice coherent blocks of pixels which need to be traced. This doesn't necessarily help for memory coherence, as these rays can still be scattered in space, especially after reflections, but it helps immensely with branch performance, which is critical.
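One way to get those coherent blocks is to compact the to-trace pixels and sort them along a space-filling curve; the Morton/Z-order choice below is my assumption of one reasonable ordering, not something the scheme requires. This small Python sketch orders scattered pixels so that screen-space neighbors land next to each other in the reordered buffer:

```python
def morton2(x, y):
    # Interleave the bits of 16-bit x and y to get a Z-order index.
    def spread(v):
        v &= 0xFFFF
        v = (v | (v << 8)) & 0x00FF00FF
        v = (v | (v << 4)) & 0x0F0F0F0F
        v = (v | (v << 2)) & 0x33333333
        v = (v | (v << 1)) & 0x55555555
        return v
    return spread(x) | (spread(y) << 1)

def reorder_for_tracing(pixels_needing_trace):
    # Sort the scattered (x, y) pixels into Z-order so that neighboring
    # pixels end up in the same thread block after compaction.
    return sorted(pixels_needing_trace, key=lambda p: morton2(p[0], p[1]))
```

On the GPU this would be a parallel sort over the compacted pixel list; branch coherence improves because nearby pixels tend to take similar paths through the tree.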

The coherence can also be improved by splatting at finer granularity. Ideally we would want to splat individual samples as point splats, but actually using point primitives is way too slow even on new GPUs, as it goes through the terrible polygon rasterizer hardware bottleneck, and doing a few million per frame, although possible on modern GPUs, would eat up a big chunk of the frame time.

We can do better in CUDA, which got me thinking about how to do this properly and in parallel. In short, I think a hierarchical tile sorting approach is the way to go. First, you project all the points and save out the 2D positions - easy and fast. In the next step, the points are again broken up and assigned to threads, and each thread then builds up its own per-tile list of points hitting each tile - essentially sorting its subset of the points onto all the tiles. This results in NumThreads separate point lists per tile. Then in the next pass, each thread gets a tile, and it merges all the lists for that tile from the 1st pass, resulting in a single list of points for each tile. There are a few details I'm skipping over, such as parallel memory allocation to build up the lists - but this just requires a separate counting/histogram pass. The tiles can't be too small, as that would eat up too much memory for all the lists. Nor can they be too big, as you need adequate threads.
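Here's a serial Python sketch of that two-pass scheme (thread count, chunking, and data layout are simplified stand-ins for the CUDA version): pass 1 has each 'thread' bin its slice of points into its own private per-tile lists, and pass 2 assigns one 'thread' per tile to merge the per-thread lists.

```python
def bin_points_to_tiles(points, tile_size, num_threads=4):
    # Pass 1: each thread bins its subset of points into private
    # per-tile lists (no write conflicts between threads).
    chunks = [points[i::num_threads] for i in range(num_threads)]
    per_thread = []
    for chunk in chunks:
        lists = {}
        for (x, y) in chunk:
            tile = (x // tile_size, y // tile_size)
            lists.setdefault(tile, []).append((x, y))
        per_thread.append(lists)
    # Pass 2: one thread per tile merges that tile's lists from pass 1,
    # yielding a single point list per tile.
    merged = {}
    for lists in per_thread:
        for tile, pts in lists.items():
            merged.setdefault(tile, []).extend(pts)
    return merged
```

On the GPU, the `setdefault`/`extend` bookkeeping would be replaced by the counting/histogram pass mentioned above, which sizes each list before any points are written.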

After the particles are thus sorted into per-tile lists, you can then subdivide and repeat until you get down to fine-grained tiles (perhaps it only takes a couple of iterations). The fine-grained tiles with their particle lists can then be rasterized, with each thread getting one such tile, so the whole operation is parallel and scalable, without any memory conflicts. You could actually store the final microtiles in local memory, so that z-buffering can be done without a lot of extra memory reads and writes.
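The final rasterization step might look like the following Python sketch (the splat format and tile size are illustrative): each 'thread' owns one microtile's small z-buffer, so depth testing happens entirely in that private buffer, the stand-in here for fast local memory.

```python
def rasterize_microtile(tile_origin, tile_size, splats):
    # Per-tile z-buffer and color buffer, kept 'local' to this thread.
    ox, oy = tile_origin
    depth = [[float("inf")] * tile_size for _ in range(tile_size)]
    color = [[None] * tile_size for _ in range(tile_size)]
    for (x, y, z, c) in splats:          # splats already binned to this tile
        px, py = x - ox, y - oy
        if 0 <= px < tile_size and 0 <= py < tile_size and z < depth[py][px]:
            depth[py][px] = z            # nearest splat wins the pixel
            color[py][px] = c
    return color
```

Since every thread writes only its own tile, no atomics or synchronization are needed during the depth test itself.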

This is probably similar or related to how they intend to do parallel polygon rasterization in Larrabee, but I didn't read all of that paper yet. Polygon rasterization doesn't really interest me much anymore, for that matter.

Hmm, come to think of it, this multi-stage hierarchical sorting can be generalized for any data-versus-tree-structure intersection problem, of which finding point-quad intersections is just one example.

Why is all this useful again? Because hopefully it could be much faster than using the triangle setup engine to render point primitives, and because reprojecting at pixel granularity could handle the difficult cases with fewer errors than reprojecting larger tiles, especially for difficult scenes with lots of z-edges (like foliage).

Point splatting with CUDA in this way is also a potential alternative to tracing, although my current feeling is that it's not as well suited to fine-granularity searches, such as the small frusta generated for reflections and advanced illumination. However, it is much better suited to handling animated objects, and may have an advantage there versus dynamic octree construction.

Forward Reprojection - current console hardware

Lately I have been thinking a lot about motion-compensation-inspired forward reprojection schemes, both for current console-generation hardware and for the next-generation cone-traced engine plans.

For the current generation, I think a reasonable approach is to store and track image macrotiles, just like MPEG does, at say 8x8 pixel granularity. The rasterizer (preferably an octree hybrid point-splatting/polygon renderer) would be designed to be very efficient at culling scene geometry down to this fine level of granularity. Tiles, once generated, are reused across frames and projected on the GPU, which can render them quite easily as simple quads with a relatively simple pixel shader filter. To handle depth edges, I would render each tile as two quads, one covering the near pixels, the other the far pixels (otherwise these edges will result in large stretched quads, rather than two small quads which move apart). The tiles would be managed with a caching policy, with weights assigned based on the number of correct pixels each tile provides per frame. Old invalidated tiles would then be evicted to make way for new tile generation.
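A minimal sketch of that caching policy in Python (the capacity, the weight-equals-correct-pixels rule, and evict-the-minimum choice are all assumptions about one reasonable implementation):

```python
class TileCache:
    # Macrotile cache: tiles earn weight from the number of correct
    # pixels they contributed last frame; when full, the lowest-weight
    # tile is evicted to make room for newly generated tiles.
    def __init__(self, capacity):
        self.capacity = capacity
        self.tiles = {}          # tile_id -> weight

    def credit(self, tile_id, correct_pixels):
        # Called after reprojection with this tile's correct-pixel count.
        self.tiles[tile_id] = correct_pixels

    def insert(self, tile_id):
        if len(self.tiles) >= self.capacity:
            victim = min(self.tiles, key=self.tiles.get)
            del self.tiles[victim]
        self.tiles[tile_id] = 0  # new tile starts with no credit
```

Tiles that keep landing on valid pixels accumulate weight and survive; stale or occluded tiles decay to the bottom of the heap and get recycled.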

The resulting image will have a small number of error regions that then need to be corrected - some pixels in regions of the scene newly revealed by camera motion and occlusion changes won't be hit by any previous tiles. If you also track motion vectors for the tiles, you can identify regions that need to be invalidated because of animation. In some cases moving tiles will happen to project into an occlusion gap but are actually behind something else in the scene (false occlusion). For a static scene, you can treat the result of the forward projection as a conservative z-buffer. Animation errors can then be handled by simply not projecting stored tiles that have too much animation error. A coarse z-pyramid rendering can then reliably identify new screen tiles which need to be regenerated.
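The coarse error map can be built as a simple reduction pyramid; here's a Python sketch (assuming a square, power-of-two mask) where a coarse texel is flagged if any of its 2x2 children contain error pixels:

```python
def build_error_pyramid(error_mask):
    # Level 0 is the per-pixel error mask; each coarser level ORs
    # together 2x2 blocks, so the top levels are tiny and cheap to
    # read back and scan on the CPU.
    levels = [error_mask]
    cur = error_mask
    while len(cur) > 1:
        n = len(cur) // 2
        nxt = [[cur[2 * y][2 * x] or cur[2 * y][2 * x + 1] or
                cur[2 * y + 1][2 * x] or cur[2 * y + 1][2 * x + 1]
                for x in range(n)] for y in range(n)]
        levels.append(nxt)
        cur = nxt
    return levels
```

The OR reduction is conservative: a coarse texel is clean only if every pixel under it is clean, which is exactly what the culling pass needs.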

The stencil and z-buffer are used to track invalid regions of the screen, resulting in a low-res version of these maps which is read back to the CPU. The CPU then does hierarchical intersection of the image z/stencil pyramid with the scene octrees to rasterize out new tiles which need to be generated. A bias is used to avoid re-rendering onto valid reprojected screen tiles - essentially it searches for octree cells that intersect error pixels or have moved in front of the previous projection. This is particularly well suited to the PS3's SPUs, but could also work reasonably well on the 360 with slightly different tile size tradeoffs. The key is that it is relatively coarse, operating at lower levels of an image pyramid and an octree.
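The CPU-side descent can be sketched as a recursive walk over an error pyramid (screen-space quadtree only here; the real version would simultaneously descend the scene octree and apply the z bias): clean subtrees are culled immediately, and only texels containing errors are refined down to the tiles that need regeneration.

```python
def find_dirty_tiles(pyramid, level, x, y, out):
    # pyramid[0] is the finest level; pyramid[-1] is a single texel.
    # True marks a texel whose footprint contains error pixels.
    if not pyramid[level][y][x]:
        return                      # subtree is clean: cull it
    if level == 0:
        out.append((x, y))          # finest level: a tile to regenerate
        return
    for dy in (0, 1):               # refine into the four children
        for dx in (0, 1):
            find_dirty_tiles(pyramid, level - 1, 2 * x + dx, 2 * y + dy, out)
```

Because whole clean quadrants are rejected at the top of the pyramid, the cost scales with the number of error tiles rather than with screen resolution.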

When combined with deferred point splat filtering, this system can tolerate and cover over small occlusion errors of a few pixels, which can improve performance at some small potential error cost. You want to avoid rasterizing a whole tile just because of a couple of error pixels. The screen interpolation filtering for the point splatting would fill in missing z-information by propagating splat surfaces using a hierarchical push-pull algorithm. In essence, small gaps of a couple of pixels caused by a moving edge would be filled in to match the background surface, and because even texturing is deferred in such a scheme, there would be no noticeable smearing - small gaps can thus easily be filled in. What you're left with then is a more coherent error mask and fewer tiles that need to be refreshed.
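For reference, here's the push-pull idea in one dimension (a Python sketch over a power-of-two row, with `None` marking hole pixels): the pull phase averages the valid samples into coarser levels, and the push phase fills each hole from its coarse parent, so a small gap inherits the surrounding surface.

```python
def push_pull_fill(row):
    # row length must be a power of two; None marks a hole pixel.
    if len(row) == 1:
        return row[:]
    # Pull: average each pair of samples, ignoring holes.
    coarse = []
    for i in range(0, len(row), 2):
        vals = [v for v in (row[i], row[i + 1]) if v is not None]
        coarse.append(sum(vals) / len(vals) if vals else None)
    coarse = push_pull_fill(coarse)
    # Push: fill remaining holes from the coarser level.
    return [row[i] if row[i] is not None else coarse[i // 2]
            for i in range(len(row))]
```

The 2D version works the same way over 2x2 blocks; either way the cost is linear in the pixel count, since each level is half (or a quarter) the size of the one below.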

Lighting changes would be handled separately with deferred shading. Static light interactions could be cached in the g-buffer and greatly benefit from the forward projection. So your deferred shading system could separate static and dynamic lights. Static lights could use the screen error mask so they only need to recompute for the small number of new pixels. There is one slight complication, which is moving shadow casters, but this can be handled by a rough low-res shadow map lookup which identifies screen regions that are shadowed by dirty regions of the shadow map and thus need to be resampled.
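A toy Python sketch of that static/dynamic split (the flat g-buffer, cache layout, and `shade` callback are placeholders for illustration): static lighting is cached per pixel and recomputed only where the error mask flags new pixels, while dynamic lights are evaluated every frame.

```python
def shade_frame(gbuffer, static_cache, error_mask,
                static_lights, dynamic_lights, shade):
    out = []
    for i, pixel in enumerate(gbuffer):
        if error_mask[i] or static_cache[i] is None:
            # New or invalidated pixel: recompute the static lighting sum.
            static_cache[i] = sum(shade(pixel, light) for light in static_lights)
        # Dynamic lights are always evaluated; static comes from the cache.
        out.append(static_cache[i] +
                   sum(shade(pixel, light) for light in dynamic_lights))
    return out
```

With a small error mask, the per-frame static lighting cost drops to roughly the number of new pixels times the static light count.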

This type of fine-grained micro-culling architecture is also exactly what you need for really high quality outdoor shadows through quadtree shadow maps, and the reprojection scheme can be employed to speed up the shadow map generation as well. And in this case, there is even more temporal coherence for typical outdoor scenes, as the sun can be treated as a static light (even if time of day is simulated, this would only change the projection every minute or so - not an issue). For this case, the only shadow map regions that need to be regenerated are new quadtree cells as they are expanded in response to scene updates, and cells which overlap moving objects. There are some crappy cases like a windy day in a jungle, but for most typical scenes this could be a vastly faster shadowing system that could scale to the ultra-detailed geometry of a frame-coherent point-splatting renderer.
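The moving-object test reduces to a box-overlap query in light space; a minimal Python sketch, with cell and object bounds as (x0, y0, x1, y1) tuples (an assumed representation):

```python
def dirty_shadow_cells(cells, moving_boxes):
    # A quadtree shadow-map cell must be regenerated only if its
    # light-space bounds overlap some moving object's bounds; the rest
    # of the sun shadow map is reused across frames.
    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
    return [cell for cell in cells
            if any(overlaps(cell, box) for box in moving_boxes)]
```

In practice the cells would come from walking the quadtree top-down, so whole static regions of the map are skipped without testing their children.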