Saturday, December 13, 2008

Forward Reprojection for future hardware

The schemes described in my previous post are well suited for ps3/360 level hardware and a mix of CPU/GPU work. Namely, octree-screen tile intersection and point splatting done on CPU threads, and everything else done on the GPU.

But for future hardware, point splatting is less ideal than direct cone tracing. Why? They end up being almost the same actually. The critical scene traversal loop, which I think is best done on the CPU at coarse resolution for the current-gen, would be better done at high resolution to support higher quality illumination. If you want accurate reflections and GI effects, you need to be able to 'rasterize' very small frustums the size of a few pixel blocks - so it amounts to almost the same thing. Whether splatting or tracing (scatter or gather), you are doing fine grained intersections of frustums (pixels/tiles) with a 3D tree structure (octree or what have you).

For cuda based hardware, I think cone tracing into a new type of adaptive distance field structure is the way to go, and some initial prototyping done at home has been very promising. Combine this with the forward projection system, and you can get away with tracing 30k or so pixels per frame on average for primary visibility! For dynamic lighting update effects, you want additional rays per pixel, but with the magic of fast cone tracing (which is far more effecient than rays for soft-area queries) combined with the magic of frame coherent reprojection, I think we can hit uber quality with far less than 30 million cone traces per second, which my prototype can already exceed on an 8800GT.

However, one area which can be improved is the forward projection. Since tracing is relatively expensive, and incoherent tracing is vastly more so, its worth it to really get fine grained accurate projection and improve the coherence.

One simple way to improve the tracing coherence is to reorder the frame buffer after the projection pass. This can be done on the GPU, resulting in a reordered frame buffer with nice coherent blocks of pixels which need to be traced. This doesn't necessarily help for memory coherence, as these rays can still be scattered in space, especially after reflections, but it helps immensely with branch performance, which is critical.

The coherence can also be improved by splatting at finer granularity. Ideally we would want to actually splat individual samples as point splats, but actually using point primitives is way too slow on even new GPU's, as it goes through the terrible polygon rasterizer hardware bottleneck, and doing a few million per frame, although possible on modern GPU's, would eat up a big chunk of the frame time.

We can do better in cuda, which got me thinking about how to do this properly and in parallel. In short, I think a hierachical tile sorting approach is the way to go. First, you project all the points and save out the 2d positions - easy and fast. In the next step, the points are again broken up and assigned to threads, and each thread then builds up its own per-tile list point list of points hitting that tile - essentially sorting its subset of the points onto all the tiles. This results in NumThreads seperate point lists per tile. Then in the next pass, each thread gets a tile, and it merges all the lists for that tile from the 1st pass, resulting in a single list of points for each tile. There's a few details that i'm skipping over, such as parallel memory allocation to build up the lists - but this just requires a seperate counting/histogram pass. The tiles can't be too small as it would eat up too much memory for all the lists. Nor can they be too big, as you need adequate threads.

After the particles are thus sorted into per-tile lists, you can then subdivide and repeat, until you get down to fine-grained tiles (perhaps it only takes a couple of iterations). The fine grained tiles with their particle lists can then be rasterized, wich each thread getting one such tile, so the whole operation is parallel and scaleable, without any memory conflicts. You could store the final microtiles in local memory actually, so that z-buffering can be done without alot of extra memory read-writes.

This is probably similar or related to how they intend to do parallel polygon rasterization in Larrabee, but I didn't read all of that paper yet. Polygon rasterization doesn't really interest me much anymore, for that matter.

Hmm, come to think of it, this multi-stage hierachy sorting can be generalized for any data vs tree structure intersection problem, of which finding point-quad intersections is just one example.

Why is all this useful again? Because hopefully it could be much faster than using the triangle setup engine to render point primitives, and because reprojecting at pixel granularity could better handle the difficult cases with less errors than reprojecting larger tiles, especially for some difficult scenes that have lots of z-edges (like foilage).

Point splatting with cuda in this way is also a potential alternative to tracing, although my current feeling is that its not as well suited to fine granularity searches, for small fustra generated for reflections and advanced illumination. However, it is much better suited to handle animated objects, and may have an advantage there vs dynamic octree construction.

1 comment:

Tomat said...

How do you calculate cone map for image? Bruteforce takes huge time to succees.
Or same question for distance field - even jump flood based takes many steps depending on resolution and you had describe a border for it.