Tuesday, June 9, 2009

Single Pass MSAA Deferred Rendering - Dithered Deferred Rendering Idea

So what we'd like to have is a deferred shading technique that renders in a single pass, with MSAA, but without much additional memory and bandwidth.  In my previous post I described the MSAA Z-prepass idea, which adds MSAA on top of deferred shading by using only a little extra memory for an MSAA z-buffer (8-16 megs) and a z-only prepass (fairly fast), plus a little up-sampling resolve work at the very end.  I think this is pretty good, and is probably overall better than the inferred lighting idea currently pursued by Volition.  As a quick side note, I would also accumulate lighting buffers at a separate, lower resolution than the screen, just as in the inferred lighting technique, and would not necessarily even render the main DS buffer at full 720p (ideally it could even vary dynamically).  Only the initial z-prepass itself needs full screen resolution with MSAA, as it determines coverage.

The key concept here is one of spatial frequency optimization.  The numerous sub-components of the shading and lighting contributions do not contribute equally across all frequency bands of the image - and their computation costs vary greatly - so there is a huge potential gain in separating out sub-components of the shading and lighting computations and evaluating them at various reductions of the full image resolution.  This is a huge speedup if we can quickly up-sample and combine them together later - which is a motivation for a fast bilateral/depth-aware filter.  A typical example is evaluating SSAO at reduced resolution, but we can apply it more generally to everything.
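To make the depth-aware up-sample concrete, here's a toy sketch in Python/numpy (the function names, the 2x scale factor, and the Gaussian depth weight are my own illustration, not production shader code): each full-resolution pixel gathers its 2x2 low-resolution neighborhood, weighting each sample by how closely its depth matches the pixel's depth, so values don't bleed across depth edges.

```python
import numpy as np

def depth_aware_upsample(low_res, low_depth, high_depth, sigma_d=0.5):
    """Up-sample a half-resolution buffer (e.g. SSAO) to full resolution.
    Each low-res sample is weighted by a Gaussian of its depth difference
    against the full-res pixel's depth - a crude bilateral up-sample."""
    H, W = high_depth.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            ly, lx = y // 2, x // 2
            wsum, vsum = 0.0, 0.0
            for dy in (0, 1):          # 2x2 low-res neighborhood
                for dx in (0, 1):
                    ny = min(ly + dy, low_depth.shape[0] - 1)
                    nx = min(lx + dx, low_depth.shape[1] - 1)
                    dz = low_depth[ny, nx] - high_depth[y, x]
                    w = np.exp(-dz * dz / (2.0 * sigma_d * sigma_d))
                    wsum += w
                    vsum += w * low_res[ny, nx]
            out[y, x] = vsum / max(wsum, 1e-8)
    return out
```

On a scene with a depth edge, low-res samples from the wrong side of the edge get near-zero weight, so the edge stays sharp after the up-sample.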

For deferred shading, we need to output numerous shader inputs in the geometry pass, but we'd like to reduce the memory footprint down to reasonable levels.  What we'd really like to do is render directly into a compressed format.  This is exactly what dithered rendering accomplishes.  The core of the dithered deferring idea is inspired by MPEG/JPEG's mosaic YUV decomposition, which separates luminance and chrominance, storing luminance at full resolution and the two chrominance values at reduced resolution on an offset grid.  So the trick is to break up all the individual scalar components and output them on different interlaced grids, allocating space so that just a few of the most important components get full resolution, and the rest are rendered at reduced resolution with different dithering patterns.
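Here's a toy sketch of the interlaced-grid idea (the particular channel allocation and helper names are my own illustration): specular level keeps full resolution in one channel, while specular power and emissive share a second channel on a checkerboard grid, each recovered later at half the sample density with a crude neighbor fill.

```python
import numpy as np

def pack_dithered(spec_level, spec_power, emissive):
    """Pack three G-buffer scalars into two channels per pixel.
    spec_level gets full resolution; spec_power and emissive share
    the second channel on alternating checkerboard cells."""
    H, W = spec_level.shape
    parity = np.add.outer(np.arange(H), np.arange(W)) % 2
    chan_a = spec_level.copy()
    chan_b = np.where(parity == 0, spec_power, emissive)
    return chan_a, chan_b

def unpack_dithered(chan_a, chan_b):
    """Recover spec_power/emissive at half density by filling each
    missing checkerboard cell from its horizontal neighbor (which
    always holds the other component)."""
    H, W = chan_b.shape
    parity = np.add.outer(np.arange(H), np.arange(W)) % 2
    neighbor = np.roll(chan_b, 1, axis=1)  # crude gap fill
    spec_power = np.where(parity == 0, chan_b, neighbor)
    emissive = np.where(parity == 0, neighbor, chan_b)
    return chan_a, spec_power, emissive
```

A real implementation would do the gap fill with a depth-aware gather in the lighting shader rather than a blind neighbor copy, but the storage layout is the point here.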

With deferred rendering, we typically have per-pixel storage requirements of depth, 3 albedo color components, 2-3 normal components, emissive, 2-6 specular components, and perhaps a few extras.  Depth needs sub-pixel resolution (as it represents MSAA coverage), luminance and perhaps one of the normal's components need full pixel resolution, and the rest need less.  Next in importance is probably specular scale and then chrominance - specular power, emissive, and specular chrominance (if present) are usually much lower frequency.  Now we could probably pack most of this into a single 32 bits with dithering, but with 2x MSAA we actually have 64 bits to work with per pixel when rendering to a 32-bit buffer.
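As a quick illustration of squeezing components into one 32-bit word (the bit budget here - 10/8/8/6 - is made up for the example; a real layout would be tuned per game and done in the shader):

```python
def pack_gbuffer32(lum, norm_x, norm_y, spec):
    """Pack four [0,1] values into one 32-bit word:
    luminance 10 bits, two normal components 8 bits each, specular 6."""
    q = lambda v, bits: min(int(v * ((1 << bits) - 1) + 0.5), (1 << bits) - 1)
    return (q(lum, 10) << 22) | (q(norm_x, 8) << 14) | (q(norm_y, 8) << 6) | q(spec, 6)

def unpack_gbuffer32(word):
    """Inverse of pack_gbuffer32: extract and renormalize each field."""
    d = lambda v, bits: v / ((1 << bits) - 1)
    return (d((word >> 22) & 0x3FF, 10),
            d((word >> 14) & 0xFF, 8),
            d((word >> 6) & 0xFF, 8),
            d(word & 0x3F, 6))
```

The roundtrip error is bounded by one quantization step per field, which is where the dithering earns its keep - it turns that banding into noise.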

Unfortunately, I'm not aware of a single-pass method to output separate values to the different MSAA samples.  Maybe there is some low-level hackery that could accomplish this?  But regardless, that's actually ok, because not all materials actually use all of the parameters listed above.  So the idea is to isolate the more 'rare' components and render them in a 2nd pass that touches just 1 of the MSAA samples.  Albedo, normal, specular level, and perhaps specular power could fit in the 1st pass, and then emissive and any remaining specular in the 2nd pass.

The dithering would be undone and unpacked during or before lighting into non-MSAA (and lower res) textures, hopefully reusing some temp memory as it's only needed during the lighting stage.  It may also be worthwhile to still do lighting with a light buffer at lower resolution and then up-sample and combine with albedo to get better lighting fillrate.

Why this is potentially cool:
- single CPU-geometry pass for performance-limited stuff like foliage, terrain, anything with a more basic shader
- extra pass for just the objects with fancy shaders with a lot of inputs (typically not many objects)
- low memory/bandwidth cost (potentially lowest of all schemes)
- full MSAA w/ fat deferred buffers.  Even 4x MSAA could be feasible, or perhaps 2X MSAA 1080p
- could output a separate stencil in the 2nd pass, giving 16 bits of stencil, more with additional passes for special objects

With a little extra work to store ID bits in the stencil or in one of the channels, you could also get this to work with alpha-to-coverage MSAA.  You need some sort of ID in that case because alpha-to-coverage needs to use all the MSAA samples at once - so a sample's meaning can't just be determined by its grid position.

On the alpha-to-coverage MSAA note, I've been wondering of late if 4x MSAA with dithering is enough to render full order-independent translucency with DS for things like soft particles.  There are some issues combining the ideas, but it's possible with the ID technique, as the particles don't need to store much info per pixel at all.  I did some little tests in Photoshop with 5-level dithering (equivalent to alpha-to-coverage output with 4x MSAA), and it is feasible.  A good dither plus a slight post-process blur results in a little quality loss, but not too much.  The challenges are in the dithering, post-process depth-order blur, soft particle z-output, and of course performance loss due to a highly randomized z-buffer.  But if feasible, it would be great to have fully lit/shadowed soft particles follow the same path as everything else.  Volumetric lighting and godrays could come out almost for free.
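The 5-level test is easy to reproduce numerically.  A toy sketch (the 2x2 Bayer matrix and the function name are my own illustration): quantize alpha to the five resolved shades that 4x MSAA alpha-to-coverage can produce, using an ordered dither so neighboring pixels land on different levels and average back toward the true alpha over a small area.

```python
import numpy as np

BAYER2 = np.array([[0.125, 0.625],
                   [0.875, 0.375]])  # 2x2 ordered-dither thresholds

def alpha_to_coverage_levels(alpha):
    """Quantize an alpha image to the 5 resolved levels of 4x MSAA
    alpha-to-coverage (0, 1/4, 2/4, 3/4, 1), with an ordered dither
    so the quantization error averages out spatially."""
    H, W = alpha.shape
    thresh = np.tile(BAYER2, (H // 2 + 1, W // 2 + 1))[:H, :W]
    return np.clip(np.round((alpha + (thresh - 0.5) / 4) * 4) / 4, 0.0, 1.0)
```

A slight blur over the result is what recovers the in-between shades, at the cost of the quality loss mentioned above.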

Deferred Rendering w/ MSAA - MSAA Z Prepass Idea

Deferred rendering presents something of a challenge to combine with MSAA on current console hardware because of the memory/bandwidth overhead of storing multiple render targets with multiple samples.  The extra memory hurts both PS3 and 360 equally, but the bandwidth effect differs on the two platforms.  On PS3, there is a straightforward additional bandwidth cost proportional to the per-pixel output memory touched.  This can really hurt for overdraw-intensive items, such as foliage.  Killzone 2 shipped with 2x quincunx MSAA and a fat deferred shading buffer - 64 megs, if I recall correctly!  However, it was PS3-only, and Killzone 2 essentially has no foliage - the world is desolate & urban.  They certainly aren't rendering jungles.

On 360, the memory situation is even worse because of EDRAM.  Even though the bandwidth while rendering is nearly free, the geometry overhead adds cost and all the extra resolves eat up time.  The tile padding ends up using even more total bandwidth.  And during resolve, the rest of the system is essentially idle, destroying parallelism, so it's actually somewhat worse than on PS3.

Newer GPUs can do MSAA compression while rendering, using simple block-based schemes that store a couple of real samples per pixel (like the min/max of DXTC), and then a large number of 'coverage samples' per pixel, which are simply a few bits to select an exemplar sample.  This compression takes advantage of the fact that the high-frequency information we really care about is coverage, and it's somewhat wasteful to store all of our buffers at multi-sample resolution.

So based on that, I have an MSAA deferred shading idea that uses 2 passes.  Let's call this technique MSAA Z-prepass.  One pass is rendered with z-only, 2x or even 4x MSAA, and no other buffers active - an MSAA z-prepass, essentially.  The second pass is rendered with your typical DS buffers, but no MSAA.  You then perform lighting/shading as normal, resulting in a pre-final non-MSAA framebuffer.  As a final step, you use a bilateral filter on depth to fill in the missing information and up-sample to MSAA resolution, which can then be resolved back down for the final buffer - naturally this can all be combined into one fast step.  I'm assuming familiarity with bilateral filtering - but basically here I'm using it as a depth-sensitive up-sample.  The results should be very similar to full MSAA on all the buffers, but without the memory/bandwidth cost - as it uses the same compression principle the newer coverage-based MSAA techniques use.
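The final resolve step can be sketched in miniature (this is my own 1D scanline toy with 2 depth samples per pixel, not the author's shader): each MSAA depth sample fetches the shaded color of whichever nearby pixel best matches its depth, and the samples then average down, so silhouette pixels blend the two sides of the edge just like a real MSAA resolve.

```python
import numpy as np

def msaa_depth_resolve(color, depth, msaa_depth):
    """1D toy of the MSAA Z-prepass resolve.  'color' and 'depth' are
    non-MSAA shading results; 'msaa_depth' holds 2 depth samples per
    pixel from the z-prepass.  Each sample takes the color of the
    candidate pixel whose shaded depth best matches it, then the
    samples average down to the final value."""
    n = len(color)
    out = np.zeros(n)
    for x in range(n):
        acc = 0.0
        for s in range(2):
            zs = msaa_depth[x, s]
            cands = [c for c in (x - 1, x, x + 1) if 0 <= c < n]
            best = min(cands, key=lambda c: abs(depth[c] - zs))
            acc += color[best]
        out[x] = acc / 2.0
    return out
```

A pixel whose two depth samples straddle an object edge ends up with a 50/50 blend of the two surfaces' shading - exactly what 2x MSAA would have produced.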

With a careful z-downsample/resolve pass, you can probably use the 1st Z pass to populate the Hi-Z for the 2nd pass and speed up rendering.  This still requires 2 render passes, which is suboptimal as I described in a previous post.

This is still on my wishlist, not something I've had the time to implement, but there was a recent paper by some guys at Volition that uses the same principles to combine MSAA with 2-pass deferred lighting.  They decided this warranted a whole new name, dubbing it inferred lighting.

They modify deferred lighting to render at different resolutions in the two passes.  Specifically, they use a reduced frame buffer (40% or so) for the 1st depth/normal pass, and then a full-size MSAA buffer for the 2nd pass.  The light buffers are up-sampled using a bilateral technique in the shader of the 2nd pass.  By extending this up-sampling technique with some dither knowledge, they also do stippled alpha rendering to get some level of order-independent translucency with deferred shading.  Not enough that you could render a full particle system with that path, but enough for a few layers of glass windows or whatnot.

Their solution is interesting, but it worsens the performance problems with alpha-tested stuff I mentioned in my earlier post on Deferred Shading vs Deferred Lighting - as the 2nd pass has now gotten considerably more expensive due to the bilateral up-sampling filter.  And worse, since the buffers mismatch, they can not easily use the 1st pass output to prime the z-buffer for the 2nd pass.  (It's probably possible to do a conservative screen pass just to populate the Hi-Z, but I'm not sure if they are doing that.)  Notably, Red Faction takes place on Mars, so their engine doesn't have to deal with foliage.

Accumulating lighting at lower resolution (or, better, at multiple resolutions) is something I've been thinking about for a while, and it's already well tested at least for AO; they are using it to great effect in Red Faction to get lots of lights per pixel at speed.

But anyway, this motivated me to try and improve my MSAA Z prepass idea to get it down to a single pass with deferred shading, and also to find a fast method of bilateral or depth-sensitive filtering.

Deferring Techniques

At this point in time, some form of deferred rendering is becoming the standard rendering technique in games.  I've long been a fan of deferred shading, and was quite pleased with the results after converting our forward renderer to deferred on a 360 project a little over a year ago.  More recently, moving to a different project and engine, we went through the forward->deferred transition again, but our lead programmer tried a variation of the idea called deferred lighting.  From the beginning, I wasn't a fan of the technique, for a variety of reasons.  For a longer summary of the idea, and a more in-depth comparison of deferred shading vs deferred lighting, check out this extensive post on gameangst.  At this point I am going to assume you are familiar with the techniques.  I mainly agree with Adrian's points, but there are a few issues I think he left out.

Deferred lighting is usually marketed as a more flexible alternative to 'traditional' deferred shading, with these proposed advantages:
- similar performance, perhaps better in terms of light/pixel overdraw cost
- no compromise in terms of material/shader flexibility

In short, I think the first claim is dubious at best, and the 2nd claim actually turns out to be false.  Adrian has an exposition on why the 2nd material pass in deferred lighting actually gives you much less flexibility than you would think.  The simple answer is that fancy material shaders either modify the BRDF inputs or alter the BRDF itself.  Flexibility in terms of modifying the BRDF inputs (multi-layer textures, procedural or animated textures, etc.) can easily be accommodated in traditional deferred shading, so there is no advantage there.  Deferred lighting is quite limited in how it can modify the BRDF because it must use a common function for the incoming light (irradiance) at each surface point, for all materials.  It only has flexibility in the 2nd half of the BRDF, the exit light (radiance) on the eye path.  Materials with fancy specular (like skin) are difficult to even fake with control only over exit radiance.
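The constraint is easy to see in code.  A toy sketch (single light, simplified non-physical specular, my own function names): the light pass runs one shared function for every material, and the 2nd-pass material shader can only combine its accumulated outputs - it can't, say, wrap the N.L term for a skin shader, because N.L was already consumed in the shared pass.

```python
def accumulate_irradiance(n, l, light, spec_power):
    """Shared light-pass function in deferred lighting: every material
    gets the same diffuse and specular irradiance terms (toy 1-light,
    non-physical specular for brevity)."""
    ndotl = max(n[0] * l[0] + n[1] * l[1] + n[2] * l[2], 0.0)
    return light * ndotl, light * ndotl ** spec_power

def material_pass(diffuse_irr, spec_irr, albedo, spec_color):
    """2nd-pass material shader: free to tint and combine the two
    accumulated terms, but the incoming-light math is already fixed."""
    return albedo * diffuse_irr + spec_color * spec_irr
```

In deferred shading, by contrast, the lighting shader itself can branch (or be stencil-selected) per material, so both halves of the BRDF stay open.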

Now, there is a solution using stencil techniques that allows multiple shader paths during light accumulation, but traditional deferred shading techniques can use this too to get full BRDF flexibility.  So Deferred Lighting has no advantage in BRDF flexibility.  (more on the stencil techniques in another post)

But the real problem with deferred lighting is performance - it's not similar to deferred shading at all.  The 1st problem is that, all else being equal, two full render passes are just always going to be slower.  The extra CPU draw call cost and geometry processing can be significant, especially if you are trying to push the geometry detail limits of the hardware (and shouldn't you?).  The geometry processing could only be 'free' if there was significant pixel shader work to load balance against, and the load balancing was efficient.  On PS3, the load balancing is not efficient, and more importantly, there is not much significant pixel shader work.  Most of the significant pixel shader work is in the light accumulation, which is moved out of any geometry pass in both techniques - so they easily become geometry limited.  This is the prime disadvantage of any deferred technique right now vs traditional forward shading.  With forward shading, it's much easier to really push the geometry limits of the hardware, as all pixel shading is done in one heavy pass.

Furthermore, the overdraw performance of the two systems is not comparable, and for high-overdraw objects, such as foliage, deferred shading has a large advantage.  Foliage objects are typically rendered with alpha-test, and because of this they receive only a partial benefit from the hardware's Hi-Z occlusion.  In our engine, the 1st pass in the two techniques for simple foliage is similar: both sample a single texture for albedo/alpha.  The only difference is that in DS the 1st pass outputs albedo and normal vs just the normal for DL.  The 2nd pass, unique to DL, must read that same diffuse/albedo texture again, as well as the lighting information, which is often in one or two 64-bit textures.  So it's easily 3 times the work per pixel touched.

As a side note: the problems with Hi-Z and alpha-test are manifold.  With 2-pass rendering, you would think the fully populated z-buffer and Hi-Z from the 1st pass would limit overdraw in the 2nd pass to a little over 1.0.  This is largely true for reasonable polygon scenes without alpha-test.  The problem with alpha-test is that it creates a large number of depth edges and wide z-variation within each Hi-Z tile.  Now, this wouldn't be such a problem if the Hi-Z tiles stored a min/max z range, because then you could do fast rejection on the 2nd pass with z-equal compares.  But they store a single z-value, either the min or the max, useful only for a greater-equal or less-equal compare test.  Thus, when rendering triangles with alpha-test in the second pass, you get a lot of false overdraw for pixels with zero alpha that still pass the Hi-Z test.
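A tiny toy makes the false-overdraw mechanism visible (4-pixel tile, made-up depths, my own function name): the coarse test compares every pixel against the tile's single stored z, so a foliage triangle re-rendered over a tile that is mostly alpha-killed background passes the coarse test everywhere, even though only one pixel would survive a per-pixel z-equal compare.

```python
def hiz_false_overdraw(tile_depths, incoming_depth):
    """Toy 4-pixel Hi-Z tile.  Hardware stores one z per tile (here the
    max) and does a less-equal coarse test for the whole tile, so every
    pixel proceeds whenever incoming_depth <= tile_max; a z-equal compare
    per pixel would have rejected most of them.  Returns
    (coarse_pass_count, exact_equal_count)."""
    tile_max = max(tile_depths)
    # coarse tile test: one compare decides for all pixels in the tile
    coarse_pass = sum(1 for _ in tile_depths if incoming_depth <= tile_max)
    # per-pixel z-equal test against the pass-1 depth buffer
    exact_pass = sum(1 for z in tile_depths if incoming_depth == z)
    return coarse_pass, exact_pass
```

The gap between the two counts is the wasted 2nd-pass work on zero-alpha pixels.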

The DS vs DL debate gets a little more complicated when you try to do MSAA, but that's another story.