Tuesday, June 9, 2009

Single Pass MSAA Deferred Rendering - Dithered Deferred Rendering Idea

So what we'd like to have is a deferred shading technique that renders in a single pass, with MSAA, but without much additional memory and bandwidth.  In my previous post I described the MSAA Z Prepass idea, which adds MSAA on top of deferred shading using only a little extra memory for an MSAA z-buffer (8-16 megs) and a z-only prepass (fairly fast), plus a little up-sampling resolve work at the very end.  I think this is pretty good, and probably better overall than the inferred lighting idea currently pursued by Volition.  As a quick side note, I would also accumulate lighting buffers at a separate, lower resolution than the screen, just as in the inferred lighting technique, and would not necessarily even render the main DS buffer at full 720p (ideally it could even vary dynamically).  Only the initial Z-prepass itself needs full screen resolution with MSAA, as it determines coverage.
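As a rough sanity check on that 8-16 meg figure, here is a small sketch of the arithmetic.  It assumes a 1280x720 target and a 32-bit depth/stencil sample, and ignores any hardware padding or tiling; the function name is just for illustration.

```python
def msaa_depth_buffer_bytes(width, height, samples, bytes_per_sample=4):
    """Raw size of a multisampled depth buffer, ignoring padding and tiling."""
    return width * height * samples * bytes_per_sample


MIB = 1024 * 1024

# Assumed: 720p target, 32-bit depth/stencil per sample.
for samples in (2, 4):
    size = msaa_depth_buffer_bytes(1280, 720, samples)
    print(f"{samples}x MSAA 720p depth: {size / MIB:.1f} MiB")
```

Under those assumptions 2x MSAA comes to roughly 7 MiB and 4x to roughly 14 MiB before padding, which is in the same ballpark as the estimate above.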

The key concept here is one of spatial frequency optimization.  The numerous sub-components of shading and lighting do not contribute equally to the image at all frequency bands, and their cost of computation varies greatly, so there is a huge potential gain in separating them out and evaluating each at its own fraction of the full image resolution.  This is a huge speedup if we can quickly up-sample and combine them together later - which is the motivation for a fast bilateral (depth-aware) filter.  A typical example is evaluating SSAO at reduced resolution, but we can apply the idea more generally to everything.
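To make the up-sampling step concrete, here is a minimal CPU-side sketch of a depth-aware up-sample.  It is a simplification and not a tuned filter: it weights the nearby low-res samples purely by depth similarity (no spatial bilinear weights), and the function name, the `scale` and `sigma` parameters, and the 2x2 neighborhood are all assumptions for illustration.

```python
import math


def depth_aware_upsample(low_values, low_depth, full_depth, scale=2, sigma=0.1):
    """Upsample low-res values, weighting low-res neighbors by depth similarity."""
    lh, lw = len(low_values), len(low_values[0])
    fh, fw = len(full_depth), len(full_depth[0])
    out = [[0.0] * fw for _ in range(fh)]
    for y in range(fh):
        for x in range(fw):
            ly, lx = y // scale, x // scale
            total, weight_sum = 0.0, 0.0
            # Visit a 2x2 block of low-res samples, clamped at the border.
            for dy in (0, 1):
                for dx in (0, 1):
                    ny = min(ly + dy, lh - 1)
                    nx = min(lx + dx, lw - 1)
                    dz = full_depth[y][x] - low_depth[ny][nx]
                    # Gaussian falloff on depth difference; the small epsilon
                    # avoids a zero weight sum when no neighbor matches.
                    w = math.exp(-(dz * dz) / (2.0 * sigma * sigma)) + 1e-6
                    total += w * low_values[ny][nx]
                    weight_sum += w
            out[y][x] = total / weight_sum
    return out


# Two low-res samples on different depth layers; the full-res depth edge
# does not line up with the low-res grid.
low_values = [[0.0, 1.0]]
low_depth = [[0.0, 1.0]]
full_depth = [[0.0, 1.0, 1.0, 1.0]]
result = depth_aware_upsample(low_values, low_depth, full_depth)
```

In the example, the second full-res pixel sits in the first low-res cell but has the far layer's depth, so it takes its value almost entirely from the far-layer sample instead of blurring across the edge.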

For Deferred Shading, we need to output numerous shader inputs in the geometry pass, but we'd like to reduce the memory footprint down to reasonable levels.  What we'd really like to do is render directly into a compressed format.  This is exactly what dithered rendering accomplishes.  The key idea behind dithered deferred rendering is inspired by MPEG/JPEG's mosaic YUV decomposition, which separates luminance and chrominance, storing luminance at full resolution and the two chrominance values at reduced resolution on offset grids.  So the trick is to break up all the individual scalar components and output them on different interlaced grids, allocating space so that just a few of the most important components get full resolution, and the rest are rendered at reduced resolution with different dithering patterns.
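Here is a toy sketch of the interlaced-grid idea using just luminance and two chroma values.  It assumes the simplest possible pattern, a checkerboard where even pixels keep Cb and odd pixels keep Cr, and it reconstructs the missing channel with a plain neighbor average; a real version would want a depth-aware reconstruction and more channels.  The function names are made up for illustration.

```python
def pack_dithered(ycc):
    """ycc: 2D list of (y, cb, cr). Keep y everywhere, one chroma per pixel."""
    packed = []
    for py, row in enumerate(ycc):
        packed_row = []
        for px, (y, cb, cr) in enumerate(row):
            # Checkerboard: even cells store Cb, odd cells store Cr.
            chroma = cb if (px + py) % 2 == 0 else cr
            packed_row.append((y, chroma))
        packed.append(packed_row)
    return packed


def unpack_dithered(packed):
    """Rebuild (y, cb, cr) by averaging the missing chroma from 4-neighbors."""
    h, w = len(packed), len(packed[0])
    out = []
    for py in range(h):
        out_row = []
        for px in range(w):
            y, stored = packed[py][px]
            # On a checkerboard, every 4-neighbor holds the other channel.
            neighbors = [packed[ny][nx][1]
                         for ny, nx in ((py - 1, px), (py + 1, px),
                                        (py, px - 1), (py, px + 1))
                         if 0 <= ny < h and 0 <= nx < w]
            # A 1x1 image has no neighbors; fall back to neutral chroma.
            other = sum(neighbors) / len(neighbors) if neighbors else 0.0
            if (px + py) % 2 == 0:
                out_row.append((y, stored, other))
            else:
                out_row.append((y, other, stored))
        out.append(out_row)
    return out


image = [[(0.5, 0.1, -0.2)] * 4 for _ in range(4)]
packed = pack_dithered(image)      # two scalars per pixel instead of three
restored = unpack_dithered(packed)
```

Luminance survives exactly, and chroma is recovered exactly only where it is locally smooth, which is the same bet chroma subsampling makes.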

With deferred rendering, we typically have per-pixel storage requirements of depth, 3 albedo color components, 2-3 normal components, emissive, 2-6 specular components, and perhaps a few extra.  Depth needs sub-pixel resolution (as it represents MSAA coverage), luminance and perhaps one of the normal's components need full pixel resolution, and the rest need less.  Next in importance is probably specular scale and then chrominance - specular power, emissive, and specular chrominance (if present) are usually much lower frequency.  Now we could probably actually pack most of this into a single 32 bits with dithering, but with 2X MSAA on a 32-bit buffer, we actually have 64 bits to work with per pixel.
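To see how such a budget might add up, here is a small tally.  The layout below is entirely hypothetical: the component list follows the priorities described above, but the 8-bit widths and the per-component rates (the fraction of pixels that store each component) are invented for illustration, not measured or tuned.

```python
from fractions import Fraction

# Hypothetical layout: name -> (bits per stored value, fraction of pixels storing it).
HYPOTHETICAL_LAYOUT = {
    "luminance":      (8, Fraction(1)),
    "normal_x":       (8, Fraction(1)),
    "normal_y":       (8, Fraction(1, 2)),
    "specular_scale": (8, Fraction(1, 2)),
    "chroma_cb":      (8, Fraction(1, 4)),
    "chroma_cr":      (8, Fraction(1, 4)),
    "specular_power": (8, Fraction(1, 4)),
    "emissive":       (8, Fraction(1, 4)),
}


def average_bits_per_pixel(layout):
    """Average storage per pixel once each component is dithered at its rate."""
    return sum(bits * rate for bits, rate in layout.values())


print(average_bits_per_pixel(HYPOTHETICAL_LAYOUT))
```

With these made-up rates the eight components average out to 32 bits per pixel, versus 64 bits if every one of them were stored at full resolution.  Depth is not counted here since it lives in the z-buffer.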

Unfortunately, I'm not aware of a single pass method to output separate values to the different MSAA samples.  Maybe there is some low-level hackery that could accomplish this?  But regardless, that's actually ok, because not all materials actually use all of the parameters listed above.  So, the idea is to isolate out the more 'rare' components and render them in a 2nd pass that touches just 1 of the MSAA samples.  Albedo, normal, specular level, and perhaps specular power could fit in the 1st pass, and then emissive and any remaining specular in the 2nd pass.

The dithering would be undone and unpacked during or before lighting into non-MSAA (and lower res) textures, hopefully reusing some temp memory as it's only needed during the lighting stage.  It may also be worthwhile to still do lighting with a light buffer at lower resolution and then up-sample and combine with albedo to get better lighting fillrate.

Why this is potentially cool:
- single CPU-geometry pass for performance limited stuff like foliage, terrain, anything with a more basic shader
- extra pass for just the objects with fancy shaders with a lot of inputs (typically not many objects)
- low memory/bandwidth cost (potentially lowest of all schemes)
- full MSAA w/ fat deferred buffers.  Even 4x MSAA could be feasible, or perhaps 2X MSAA 1080p
- could output separate stencil in the 2nd pass, so 16 bits of stencil, more with additional passes for special objects

With a little extra work to store ID bits in the stencil or in one of the channels, you could also get this to work with alpha-to-coverage MSAA.  You need some sort of ID in that case as the alpha-to-coverage needs to use all the MSAA samples at once - so the sample's meaning can't just be determined by its grid position.

On the alpha-to-coverage MSAA note, I've been wondering of late if 4x MSAA with dithering is enough to render full order-independent translucency with DS for things like soft particles.  There are some issues combining the ideas, but it's possible with the ID technique as the particles don't need to store much info per pixel at all.  I did some little tests in Photoshop with 5 color dithering (equivalent to alpha-to-coverage output with 4x MSAA), and it is feasible.  A good dither plus a slight post process blur results in a little quality loss, but not too much.  The challenges are in the dithering, post-process depth-order blur, soft particle z-output, and of course performance loss due to a highly randomized z-buffer.  But if feasible, it would be great to have fully lit/shadowed soft particles follow the same path as everything else.  Volumetric lighting and godrays could come out almost for free.
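The "5 color" equivalence comes from the fact that 4 samples can be covered 0, 1, 2, 3, or 4 at a time.  Here is a minimal sketch of quantizing alpha to a sample count with an ordered dither.  The 2x2 threshold matrix and the function name are assumptions for illustration; actual hardware alpha-to-coverage chooses its own dither and sample masks.

```python
# Ordered-dither thresholds in [0, 1), indexed by [y % 2][x % 2] (assumed pattern).
BAYER_2X2 = ((0.0, 0.5),
             (0.75, 0.25))


def coverage_count(alpha, px, py, samples=4):
    """Number of MSAA samples to cover for this alpha at pixel (px, py)."""
    threshold = BAYER_2X2[py % 2][px % 2]
    # Dither before truncating so neighboring pixels round differently.
    return min(samples, int(alpha * samples + threshold))


# alpha = 0.125 is below one sample's worth (0.25), so a single pixel cannot
# represent it, but a 2x2 block of dithered pixels can on average.
block = [coverage_count(0.125, x, y) for y in range(2) for x in range(2)]
average_alpha = sum(block) / (len(block) * 4)
```

Here two of the four pixels in the block cover one sample each, so the block averages back to 0.125, which is what the slight post process blur then smooths out.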
