Saturday, July 11, 2009

Voxel Cone Tracing

I'm willing to bet at this point that the ideal rendering architecture for the next hardware generation is going to be some variation of voxel cone tracing. Oh, there are many very valid competing architectures, and with the perfomance we are looking at for the next hardware cycle all kinds of techniques will look great, but this particular branch of the tech space has some special advantages.

Furthemore, I suspect that this will probably be the final rendering architecture of importance. I may be going out on a limb by saying that, but I mean it only in the sense that once that type of engine is fully worked out and running fast, it will sufficiently and effeciently solve the rendering equation. Other approachs may have advantages in dynamic update speed or memory or this or that, but once you have enough memory to store a sub-pixel unique voxelization at high screen resolution, you have sufficient memory. Once you have enough speed to fully update the entire structure dynamically, thats all you need. Then all that matters is end performance for simulating high end illumination effects on extremely detailed scene geometry.

There is and still will be much low-level work on effecient implementation of this architecture, especially in terms of animation, but I highly doubt there will be any more significant high level rendering paradigm shifts or breakthroughs past that. The hardware is already quite sufficient in high end PC GPU's, the core research is mainly there, most of the remaining work is actually building full engines around it, which won't happen in scale for a little while yet as the industry is still heavily focused on the current hardware generation. I actually think a limited voxel engine is almost feasible on current consoles (yes, really), at least in theory, but it would take a huge effort and probably require a significant CPU comittment to help the GPU. But PC hardware is already several times more powerful.

So why do I see this as the final rendering paradigm? It comes down to quality, scalability, generality, and (relative) simplicity. On the quality front, a voxel tracer can hit 'photoreal' quality with sufficent voxel resolution, sampling, and adequate secondary tracing for illumination. Traditional polygon rasterizers can approach this quality level, but only asymptotically. Still, this by itself isn't a huge win. Crysis looks pretty damn good - getting very close to being truly photoreal. On a side note, I think photoreal is an important, objective goal. You have hit photoreal when you can digitally reproduce a real world scene such that human observers can not determine which images were computer generated and which were not. Crysis actually built alot of its world based on real-world scenes and comes close to this goal.

But even if polygon techniques can approach photoreal for carefully crafted scenes, its much more difficult to scale this up to a large world using polygon techniques. Level of detail is trivially inherent and near perfect in a voxel system, and this is its principle perfomance advantage.

Much more importantly, a voxelization pipeline can and will eventually be built around direct 3D photography, and this will dramatically change our art production pipelines. With sufficient high resolution 3D cameras, you can capture massive voxel databases of real world scenes as they actually are. This raw data can then be processed as image data: unlit, assigned material and physical properties, and then packaged into libraries in a way similar to how we currently deal with 2D image content. Compared to the current techniques of polygon modelling, LOD production, texture mapping, and so on, this will be a dramatically faster production pipeline. And in the end, thats what matters most. It something like the transition from vector graphics to raster graphics in the 2D space. We currently use 3D vector graphics, and voxels represent the more natural 3D raster approach.

In terms of tracing vs rasterization or splatting, which you could simplify down to scatter vs gather, scatter techniques are something of a special case optimization for an aligned frustum, a regular grid. For high end illumination effects the queries of interest require cones or frustums down to pixel size, the output space becomes as irregular as the input space, so scatter and gather actually become the same thing. So in the limit, rasterization/scatter becomes indistinguishable from tracing/gather.

Ray tracing research is a pretty active topic right now, and there are several different paths being explored. The first branch is voxels vs triangles. Triangles are still receiving most of the attention, which I think is unwarranted. At the scalability limit (which is all we should care about now), storing and accessing data on a simple regular grid is more effecient in both time and space. Its simpler and faster to sample correctly, mip-map, compress, and so on. Triangles really are a special case optimization for smooth surfaces, and are less effecient for general data sets that break that assumption. Once voxels work at performance, they work for everything with one structure, from surface geometry to foilage to translucent clouds.

Once settled on voxels, there a several choices for what type of acceleration structure to use. I think the most feasible is to use an octree of MxM bricks, as in the GigaVoxel work of Cyril Cassin. I've been investigated and doing some prototyping with these types of structures off and on for about a year, and see it as the most promising path now. Another option is to forgo bricks and trace into a deeper octree that stores single voxels in nodes, as in Jon Olick's work. Even though Olick's technique seemed surprisingly fast, its much more difficult to filter correctly (as evident in his video). The brick tracing allows simple 3D hardware filtering, which simultaneously solves numerous problems. It allows you to do fast approximate cone tracing by sampling sphere steps. This is vastly more effecient than sampling dozens of rays - giving you anistropic filtering, anti-aliasing, translucency, soft shadows, soft GI effects, depth of field, and so on all 'for free' so to speak. I found Cassins papers after I had started working on this, and it was simultaneously invigorating but also slightly depressing, as he is a little ahead of me. I started in cuda with the ambition of tackling dynamic octree generation on the GPU, which it looks like he has moved to more recently.

There are a number of challenges getting something like this working at speed, which could be its own post. Minimizing divergence is very important, as is reducing octree walking time. With the M-brick technique, there are two main paths in the inner loop, stepping through the octree and sampling within bricks. Branch divergence could easily cost half or more perfomance because of this. The other divergence problem is the highly variable step length to ray termination. I think dynamic ray scheduling is going to be a big win, probably using the shared memory store. I've done a little precursor to this by scheduling threads to work on lists of pixels instead of one-thread per pixel, as is typical, and this was already a win. I've also come up with a nifty faster method of traversing the octree itself, but more on that some other time.

Dynamic updating is a challenge, but in theory pretty feasible. The technique I am investigate is based on combining dynamic data generation with streaming (treating them as the same problem) with a single unique octree. The key is that the memory caching management scheme also limits the dynamic data that needs to be generated per frame. It should be a subset of the working set, which in turn is a multiple of the screen resolution. Here a large M value (big bricks) is a disadvantage as it means more memory waste and generation time.

The other approach is to instance directly, which is what it looks like cyril is working on more recently. I look forward to his next paper and seeing how that worked out, but my gut reaction now is that having a two level structure (kd tree or bv tree on top of octree) is going to significantly complicate and slow down tracing. I suspect it will actually be faster to use the secondary, indexed structures only for generating voxel bricks, keeping the primary octree you trace from fully unique. With high end illumination effects, you still want numerous cone traces per pixel, so tracing will dominate the workload and its better to minimize the tracing time and keep that structure as fast as possible.


Tomat said...

Sorry for noising again, but want to ask a question:

How do you calculate cone map for image \ kdtree? Brute force takes huge time to succees.
Or same question for distance field - even jump flood based takes many steps depending on resolution and you had describe a border for it.

Or maybe a good references or paper?

Jake Cannell said...

I haven't tackled dynamic distance field generation yet, although I do plan to try that going forward. My plan relies on using the octree structure itself to accelerate the distance field computation. I think it can be done by checking only a small neighborhood around each voxel combined with interpolating lower resolution values from higher up the octree. If you think about it, this is similar to jump flooding, but more effecient.

Jon Olick said...

In general, I think highly of Cyril's work. He has done a really great job. As for "cannot filter" my voxel approach, thats just silly ;) Its easy to do in a post process.

Jake Cannell said...

Jon - I didn't mean "cannot filter" as in cannot filter the resulting image, but filterable as in discrete tracing vs alpha-accumulation tracing as in cyril's gigavoxels. Using native 3D texture filtering as you accumulate allows soft phenomenon such as clouds to be handled easily, and more importantly allows approximate cones to be traced. Cone tracing allows soft shadows, depth of field, and all kinds of goodness without tracing bunches of rays. This may still be possible with some smart post-filtering, but it seems way more tractable with 3D filtering & alpha accumulation.

Not that there isn't a potential price to pay for filtered cone tracing - the memory requirements for padding the voxel bricks, discrete tracing may be faster, etc. I don't know though - don't have an implementation of your technique to compare ;)