Sunday, July 12, 2009

Understanding the Efficiency of Ray Traversal on GPUs

I just found this nice little paper by Timo Aila linked on Atom, Timothy Farrar's blog, who incidentally found and plugged my blog recently (how nice).

They have a great analysis of several variations of traversal methods using a standard BVH/triangle ray intersector code, along with simulator results for some potential new instructions that could enable dynamic scheduling. They find that in general, traversal efficiency is limited mainly by SIMD efficiency or branch divergence, not memory coherency - something I've discovered is also still quite true for voxel tracers.

They have a relatively simple scheme to pull blocks of threads from a global store using atomic instructions. I had thought of this but believed that my 8800 GT didn't support the atomics, and that I would have to wait until I upgraded to a GT200 type card. I was mistaken though: it's the 8800 GTX which is only CUDA compute capability 1.0, while my 8800 GT is compute capability 1.1, so it should be good to go with atomics.
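Here is a minimal sketch of the "pull work from a global store" idea, written in Python rather than CUDA so it runs anywhere. The lock-protected counter stands in for a hardware atomic add on a global counter. The names `WorkPool`, `fetch`, and `run`, and the batch size of 32, are my own illustrative choices and are not taken from the paper.

```python
import threading

class WorkPool:
    """Global pool of ray indices handed out in batches.

    `fetch` emulates an atomic add on a shared counter: each caller
    reserves the next `batch_size` ray indices in one indivisible step.
    """
    def __init__(self, total_rays, batch_size):
        self.total_rays = total_rays
        self.batch_size = batch_size
        self._next = 0
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch(self):
        with self._lock:
            start = self._next
            self._next += self.batch_size
        if start >= self.total_rays:
            return None  # pool is drained
        return range(start, min(start + self.batch_size, self.total_rays))

def worker(pool, out):
    # A persistent worker: keep pulling batches until none are left.
    while True:
        batch = pool.fetch()
        if batch is None:
            break
        out.extend(batch)  # "trace" the rays; here we only record them

def run(total_rays=1000, batch_size=32, num_workers=4):
    pool = WorkPool(total_rays, batch_size)
    results = [[] for _ in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(pool, results[i]))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each ray index is handed out exactly once, and a worker that draws cheap rays simply comes back for more sooner, which is the whole point of the scheme.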

I have implemented a simple scheduling idea based on a deterministic up-front allocation of pixels to threads. I also use only a fixed thread pool, but have each thread loop through pixel-rays according to a tiling scheme. This got maybe a 25% improvement or so, but in their system they were seeing closer to a 90-100% improvement, so I could probably improve this further. However, they are scheduling entire blocks of (I think 32x3) pixel-rays at once, while I had each thread loop through pixel-rays independently. I thought having each thread immediately move on to a new pixel-ray would be better as it results in fewer empty SIMD lanes, but it also causes another point of divergence in the inner loop for the ray initialization step. Right now I handle that by amortizing it - simply doing 3 or so ray iterations and then an iffed ray init, but perhaps their block scheduling approach is even faster.
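To make the fixed-pool scheme concrete, here is a rough Python sketch of one thread's loop. It is an illustration under assumptions, not my actual kernel: `tile_order`, `pixels_for_thread`, and `trace_thread` are hypothetical names, the 8x8 tile is arbitrary, and `steps_needed` is a stand-in for however many traversal iterations a given ray takes.

```python
def tile_order(width, height, tile=8):
    """Yield pixel coordinates tile by tile, so consecutive indices stay close on screen."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield (x, y)

def pixels_for_thread(thread_id, num_threads, width, height, tile=8):
    """Deterministic up-front allocation: thread i takes every num_threads-th pixel in tile order."""
    return list(tile_order(width, height, tile))[thread_id::num_threads]

def trace_thread(pixels, steps_needed, iters_per_init=3):
    """Loop one thread over its pixel list with an amortized ray init.

    The inner loop runs `iters_per_init` traversal iterations with no init
    branch, then does a single conditional init. Returns the finished pixels
    and the number of iterations spent idle waiting for the next init point.
    """
    queue = iter(pixels)
    current, remaining = None, 0
    finished, idle = [], 0
    while True:
        if remaining == 0:                      # the "iffed" ray init
            current = next(queue, None)
            if current is None:
                break
            remaining = max(1, steps_needed(current))
        for _ in range(iters_per_init):         # traversal iterations
            if remaining > 0:
                remaining -= 1
                if remaining == 0:
                    finished.append(current)
            else:
                idle += 1                       # lane has no ray until next init
    return finished, idle
```

The `idle` count shows the trade-off: a larger `iters_per_init` means fewer init branches in the inner loop, but more iterations wasted after a ray terminates.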

The particular performance problem which I see involves high glancing angle rays that skirt the edge of a long surface, such as a flat ground plane. For a small set of pixel-rays, there is a disproportionately huge number of voxel intersections, resulting in a few long problem rays that take forever and stall blocks. Ideally you'd want to adaptively reconfigure the thread-pixel/ray mapping to prevent low occupancy blocks from slowing you down.
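A quick way to put a number on this stall is to compute lane utilization for a SIMD group in which every lane must wait for the slowest ray. This is a back-of-envelope sketch with made-up step counts; `simd_efficiency` is a hypothetical helper, not a measurement from my tracer.

```python
def simd_efficiency(steps_per_lane):
    """Fraction of lane-iterations doing useful work when all lanes wait for the slowest one."""
    longest = max(steps_per_lane)
    if longest == 0:
        return 1.0
    return sum(steps_per_lane) / (len(steps_per_lane) * longest)

# Illustrative numbers only: 31 ordinary rays plus one glancing-angle ray.
typical = [20] * 31
problem = typical + [400]
print(simd_efficiency(problem))
```

With these example numbers a single 400-step ray among 31 rays of 20 steps drags utilization down to roughly 8%, which is why remapping or rescheduling those rays matters so much.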




Saturday, July 11, 2009

Voxel Tracing

I'm pretty convinced at this point that the ideal rendering architecture for the next hardware generation is going to be some variation of voxel cone tracing. Furthermore, I suspect that voxel tracing will probably be the final rendering architecture. There is and still will be much low-level work on efficient implementation of this architecture, especially in terms of animation, but I highly doubt there will be any more significant high level rendering paradigm shifts or breakthroughs past that. The hardware is already quite sufficient in high end PC GPUs, the research is pretty much there as well, and most of the remaining work is actually building full engines around it, which won't happen at scale for a little while yet as the industry is still heavily focused on the current hardware generation. I actually think a voxel engine is almost feasible on current consoles (yes, really), at least in theory, but it would take a huge effort and probably require a significant CPU commitment to help the GPU. But PC hardware is already several times more powerful.

So why do I see this as the final rendering paradigm? It comes down to quality, scalability, generality, and (relative) simplicity. On the quality front, a voxel tracer can hit 'photoreal' quality with sufficient voxel resolution, sampling, and adequate secondary tracing for illumination. Traditional polygon rasterizers can approach this quality level, but only asymptotically. Still, this by itself isn't a huge win. Crysis looks pretty damn good - getting very close to being truly photoreal. On a side note, I think photoreal is an important, objective goal. You have hit photoreal when you can digitally reproduce a real world scene such that human observers cannot determine which images were computer generated and which were not. Crysis actually built a lot of its world based on real-world scenes and comes close to this goal.

But even if polygon techniques can approach photoreal for carefully crafted scenes, it's much more difficult to scale this up to a large world using polygon techniques. Level of detail is trivially inherent and near perfect in a voxel system, and this is its principal performance advantage.

Much more importantly, a voxelization pipeline can and will eventually be built around direct 3D photography, and this will dramatically change our art production pipelines. With sufficiently high resolution 3D cameras, you can capture massive voxel databases of real world scenes as they actually are. This raw data can then be processed as image data: unlit, assigned material and physical properties, and then packaged into libraries in a way similar to how we currently deal with 2D image content. Compared to the current techniques of polygon modelling, LOD production, texture mapping, and so on, this will be a dramatically faster production pipeline. And in the end, that's what matters most.

In terms of tracing vs rasterization or splatting, which you could simplify down to scatter vs gather, scatter techniques are something of a special case optimization for an aligned frustum. For high end illumination effects the queries of interest require cones or frustums down to pixel size, so scatter and gather actually become the same thing. So in the limit, rasterization/scatter becomes indistinguishable from tracing/gather.

Ray tracing research is a pretty active topic right now, and there are several different paths being explored. The first branch is voxels vs triangles. Triangles are still receiving most of the attention, which I think is unwarranted. At the scalability limit (which is all we should care about now), storing and accessing data on a simple regular grid is more efficient in both time and space. It's simpler and faster to sample correctly, mip-map, compress, and so on. Triangles really are a special case optimization for smooth surfaces, and are less efficient for general data sets that break that assumption. Once voxels work at performance, they work for everything with one structure, from surface geometry to foliage to translucent clouds.

Once settled on voxels, there are several choices for what type of acceleration structure to use. I think the most feasible is to use an octree of MxM bricks, as in the GigaVoxels work of Cyril Crassin. I've been investigating and doing some prototyping with these types of structures off and on for about a year, and see it as the most promising path now. Another option is to forgo bricks and trace into a deeper octree that stores single voxels in nodes, as in Jon Olick's work. Even though Olick's technique seemed surprisingly fast, it's much more difficult to filter correctly (as evident in his video). The brick tracing allows simple 3D hardware filtering, which simultaneously solves numerous problems. It allows you to do fast approximate cone tracing by sampling sphere steps. This is vastly more efficient than sampling dozens of rays - giving you anisotropic filtering, anti-aliasing, translucency, soft shadows, soft GI effects, depth of field, and so on all 'for free' so to speak. I found Crassin's papers after I had started working on this, and it was simultaneously invigorating but also slightly depressing, as he is a little ahead of me. I started in CUDA with the ambition of tackling dynamic octree generation on the GPU, which it looks like he has moved to more recently.
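The "sphere steps" idea can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the GigaVoxels algorithm or my prototype: `sample_density(pos, lod)` is a hypothetical stand-in for a hardware-filtered 3D texture fetch at a given mip level, opacity is accumulated front to back, and no correction is applied for the varying step length.

```python
import math

def cone_trace(sample_density, origin, direction, half_angle, max_dist, voxel_size=1.0):
    """Approximate a cone by marching spheres whose radius grows with distance.

    Returns (accumulated opacity, number of samples taken).
    """
    t = voxel_size                 # start one voxel out from the origin
    transmittance = 1.0
    tan_a = math.tan(half_angle)
    steps = 0
    while t < max_dist and transmittance > 0.01:
        # Sphere radius follows the cone footprint, never below half a voxel.
        radius = max(t * tan_a, 0.5 * voxel_size)
        # Pick the mip level whose voxels are about one sphere diameter wide.
        lod = math.log2(2.0 * radius / voxel_size)
        pos = tuple(o + d * t for o, d in zip(origin, direction))
        alpha = sample_density(pos, lod)       # one pre-filtered sample per sphere
        transmittance *= (1.0 - alpha)
        t += 2.0 * radius                      # advance by one sphere diameter
        steps += 1
    return 1.0 - transmittance, steps
```

Because the step length grows with the cone radius, a wide cone (soft shadow, glossy reflection, GI gather) takes far fewer samples than a narrow one over the same distance, which is where the 'for free' effects come from.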

There are a number of challenges getting something like this working at speed, which could be its own post. Minimizing divergence is very important, as is reducing octree walking time. With the M-brick technique, there are two main paths in the inner loop, stepping through the octree and sampling within bricks. Branch divergence could easily cost half or more performance because of this. The other divergence problem is the highly variable step length to ray termination. I think dynamic ray scheduling is going to be a big win, probably using the shared memory store. I've done a little precursor to this by scheduling threads to work on lists of pixels instead of one-thread per pixel, as is typical, and this was already a win. I've also come up with a nifty faster method of traversing the octree itself, but more on that some other time.

Dynamic updating is a challenge, but in theory pretty feasible. The technique I am investigating is based on combining dynamic data generation with streaming (treating them as the same problem) with a single unique octree. The key is that the memory caching management scheme also limits the dynamic data that needs to be generated per frame. It should be a subset of the working set, which in turn is a multiple of the screen resolution. Here a large M value (big bricks) is a disadvantage as it means more memory waste and generation time.
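One way to picture "streaming and generation as the same problem" is a brick cache where a miss triggers generation instead of a disk read. The sketch below is only an assumption about how such a cache could look: `BrickCache`, the LRU eviction policy, and the explicit per-frame generation budget are hypothetical choices of mine, not a description of a finished system.

```python
from collections import OrderedDict

class BrickCache:
    """LRU cache of voxel bricks; misses are filled by generating the brick on demand."""
    def __init__(self, capacity, max_generate_per_frame):
        self.capacity = capacity
        self.budget = max_generate_per_frame   # assumed per-frame cap on new bricks
        self.bricks = OrderedDict()            # brick_id -> brick data, oldest first
        self.generated = 0

    def begin_frame(self):
        self.generated = 0

    def request(self, brick_id, generate):
        if brick_id in self.bricks:
            self.bricks.move_to_end(brick_id)  # mark as recently used
            return self.bricks[brick_id]
        if self.generated >= self.budget:
            return None                        # caller falls back to a coarser parent brick
        brick = generate(brick_id)             # streaming or procedural, same code path
        self.generated += 1
        self.bricks[brick_id] = brick
        if len(self.bricks) > self.capacity:
            self.bricks.popitem(last=False)    # evict the least recently used brick
        return brick
```

Only bricks that the tracer actually asks for and that are not already resident get generated, so the per-frame generation cost is bounded by the cache misses rather than by the size of the world.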

The other approach is to instance directly, which is what it looks like Cyril is working on more recently. I look forward to his next paper and seeing how that worked out, but my gut reaction now is that having a two level structure (kd tree or bv tree on top of octree) is going to significantly complicate and slow down tracing. I suspect it will actually be faster to use the secondary, indexed structures only for generating voxel bricks, keeping the primary octree you trace from fully unique. With high end illumination effects, you still want numerous cone traces per pixel, so tracing will dominate the workload and it's better to minimize the tracing time and keep that structure as fast as possible.










Tuesday, June 9, 2009

Single Pass MSAA Deferred Rendering - Dithered Deferred Rendering Idea

So what we'd like to have is a deferred shading technique that renders in a single pass, with MSAA, but without much additional memory and bandwidth.  In my previous post I describe the MSAA Z Prepass idea, which adds MSAA on top of deferred shading by using only a little extra memory for a MSAA z-buffer (8-16 megs), and a z-only prepass (fairly fast), plus a little up-sampling resolve work at the very end.  I think this is pretty good, and is probably overall better than the inferred lighting idea currently pursued by Volition.  As a quick side note, I would also accumulate lighting buffers at a separate, lower resolution than the screen, just as in the inferred lighting technique, and would not necessarily even render the main DS buffer at full 720p.  (ideally it could even vary dynamically)  Only the initial Z-prepass itself needs full screen resolution with MSAA, as it determines coverage.

The key concept here is one of spatial frequency optimization.  The numerous sub-components of the shading and lighting contributions do not have equal image contributions at all frequency bands - and their cost of computation varies greatly, so there is a huge potential gain by separating out sub-components of shading and lighting computations and evaluating them at numerous reductions of the full image resolution.  This is a huge speedup if we can quickly up-sample and combine them together later - which is a motivation for a fast bilateral/depth context filter.  A typical example is evaluating SSAO at reduced resolution, but we can apply it more generally to everything.

For Deferred Shading, we need to output numerous shader inputs in the geometry pass, but we'd like to reduce the memory footprint down to reasonable levels.  What we'd really like to do is render directly into a compressed format.  This is exactly what dithered rendering accomplishes.  The key of the dithered deferring idea is inspired by MPEG/JPEG's mosaic YUV decomposition, which separates luminance and chrominance, storing luminance at full resolution and the two chrominance values at reduced resolution on an offset grid.  So the trick is to break up all the individual scalar components and output them on different interlaced grids, allocating space so that just a few of the most important components get full resolution, and the rest are rendered at reduced resolution with different dithering patterns.
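As a toy illustration of the interlaced-grid idea, the sketch below stores luminance at every pixel and alternates the two chrominance values on a checkerboard, then reconstructs the missing chroma from neighbours. It is only one possible layout under my own assumptions: the function names are hypothetical, the checkerboard is the simplest pattern I could pick, and the reconstruction is a plain average rather than the depth-aware filter a real renderer would want.

```python
def pack_dithered(luma, cb, cr):
    """Return per-pixel (luma, chroma) pairs: Cb on even checker cells, Cr on odd ones."""
    h, w = len(luma), len(luma[0])
    return [[(luma[y][x], cb[y][x] if (x + y) % 2 == 0 else cr[y][x])
             for x in range(w)] for y in range(h)]

def unpack_dithered(packed):
    """Rebuild full-resolution luma, Cb, Cr from the checkerboard layout.

    The 4-connected neighbours of a cell always hold the other chroma channel,
    so the missing value is estimated as their average.
    """
    h, w = len(packed), len(packed[0])
    luma = [[packed[y][x][0] for x in range(w)] for y in range(h)]
    cb = [[0.0] * w for _ in range(h)]
    cr = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            own = packed[y][x][1]
            neigh = [packed[ny][nx][1]
                     for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                     if 0 <= nx < w and 0 <= ny < h]
            other = sum(neigh) / len(neigh) if neigh else own
            if (x + y) % 2 == 0:
                cb[y][x], cr[y][x] = own, other
            else:
                cb[y][x], cr[y][x] = other, own
    return luma, cb, cr
```

Three scalar components fit in two channels per pixel here; the same trick extends to more components by using sparser grids for the lower frequency ones.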

With deferred rendering, we typically have per-pixel storage requirements of depth, 3 albedo color components, 2-3 normal components, emissive, 2-6 specular components, and perhaps a few extra.  Depth needs sub-pixel resolution (as it represents MSAA coverage), luminance and perhaps one of the normal's components need full pixel resolution, and the rest need less.  Next in importance is probably specular scale and then chrominance - specular power, emissive, and specular chrominance (if present) are usually much lower frequency.  Now we could probably actually pack most of this into a single 32 bits with dithering, but with 2X MSAA, we actually have 64 bits to work with per pixel, rendering to a 32-bit buffer.

Unfortunately, I'm not aware of a single pass method to output separate values to the different MSAA samples.  Maybe there is some low-level hackery that could accomplish this?  But regardless, that's actually ok, because not all materials actually use all of the parameters listed above.  So, the idea is to isolate out the more 'rare' components and render them in a 2nd pass that touches just 1 of the MSAA samples.  Albedo, Normal, specular level, and perhaps specular power could fit in the 1st pass, and then emissive and any remaining specular in the 2nd pass.

The dithering would be undone and unpacked during or before lighting into non MSAA (and lower res) textures, hopefully reusing some temp memory as it's only needed during the lighting stage.  It may also be worthwhile to still do lighting with a light buffer at lower resolution and then up-sample and combine with albedo to get better lighting fillrate.

Why this is potentially cool:
- single CPU-geometry pass for performance limited stuff like foliage, terrain, anything with a more basic shader
- extra pass for just the objects with fancy shaders with a lot of inputs (typically not many objects)
- low memory/bandwidth cost (potentially lowest of all schemes)
- full MSAA w/ fat deferred buffers.  Even 4x MSAA could be feasible, or perhaps 2X MSAA 1080p
- could output separate stencil in the 2nd pass, so 16 bits of stencil, more with additional passes for special objects

With a little extra work to store ID bits in the stencil or in one of the channels, you could also get this to work with alpha-to-coverage MSAA.  You need some sort of ID in that case as the alpha-to-coverage needs to use all the MSAA samples at once - so the sample's meaning can't just be determined by its grid position.

On the alpha-coverage MSAA note, I've been wondering of late if 4x MSAA with dithering is enough to render full order-independent translucency with DS for things like soft particles.  There are some issues combining the ideas, but it's possible with the ID technique as the particles don't need to store much info per pixel at all.  I did some little tests in photoshop with 5 color dithering (equivalent to alpha to coverage output with 4x MSAA), and it is feasible.  A good dither plus a slight post process blur results in a little quality loss, but not too much.  The challenges are in the dithering, post-process depth-order blur, soft particle z-output, and of course performance loss due to a highly randomized z-buffer.  But if feasible, it would be great to have fully lit/shadowed soft particles follow the same path as everything else.  Volumetric lighting and godrays could come out almost for free.
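To show what "5 color dithering" means in coverage terms: with 4 samples a pixel can only cover 0, 1, 2, 3, or 4 of them, so alpha has to be quantized to five levels and the rounding error spread across neighbouring pixels. The sketch below uses a 2x2 ordered (Bayer style) threshold pattern, which is my own choice for illustration and not what the hardware or my photoshop test necessarily used.

```python
# 2x2 ordered-dither thresholds, indexed as [y % 2][x % 2].
BAYER_2X2 = ((0.125, 0.625),
             (0.875, 0.375))

def coverage_samples(alpha, x, y, samples=4):
    """Number of MSAA samples (0..samples) a fragment with this alpha covers at pixel (x, y)."""
    threshold = BAYER_2X2[y % 2][x % 2]
    n = int(alpha * samples + threshold)   # quantize, with a position-dependent bias
    return min(max(n, 0), samples)

def block_average(alpha, samples=4):
    """Average coverage over one 2x2 pixel block, i.e. what a slight blur would recover."""
    total = sum(coverage_samples(alpha, x, y, samples) for y in range(2) for x in range(2))
    return total / (4 * samples)
```

For an alpha of 0.3, three pixels of the block get one covered sample and one pixel gets two, so the block averages to 0.3125, close to the requested value once a small blur is applied.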


Deferred Rendering w/ MSAA - MSAA Z Prepass Idea

Deferred rendering presents something of a challenge to combine with MSAA on current console hardware because of the memory/bandwidth overhead of storing multiple render targets with multiple samples.  The extra memory hurts both PS3 and 360 equally, but the bandwidth effect differs on the two platforms.  On PS3, there is a straightforward additional bandwidth cost proportional to the per-pixel output memory touched.  This can really hurt for overdraw intensive items, such as foliage.  Killzone 2 shipped with 2x quincunx MSAA and a fat deferred shading buffer (64 megs, if I recall correctly)!  However, it was PS3 only, and Killzone 2 essentially has no foliage - the world is desolate & urban.  They certainly aren't rendering jungles.

On 360, the memory situation is even worse because of EDRAM.  Even though the bandwidth while rendering is nearly free, the geometry overhead adds cost and all the extra resolves eat up time.  The tile padding ends up using up even more total bandwidth.  And during resolve, the rest of the system is essentially idle, destroying parallelism, so it's actually somewhat worse than on PS3.

Newer GPUs can do MSAA compression while rendering, using simple block based schemes that store a couple of real samples per pixel (like the min/max of DXTC), and then a large number of 'coverage samples' per pixel which are simply a few bits to select an exemplar sample.  This compression takes advantage of the fact that really the high frequency information we care about is coverage, and it's somewhat wasteful to store all of our buffers at multi-sample resolution.

So based on that, I have a MSAA deferred shading idea that uses 2 passes.  Let's call this technique MSAA Z-prepass.  One pass is rendered with z-only, 2x or even 4x MSAA, and no other buffers active - an MSAA z-pre pass essentially.  The second pass is rendered with your typical DS buffers, but no MSAA.  You then perform lighting/shading as normal, resulting in a pre-final non-MSAA framebuffer.  As a final step, you use a bilateral filter of depth to fill in the missing information and up-sample to MSAA resolution, which can then be resolved back down for the final buffer - naturally this can all be combined into one fast step.  I'm assuming familiarity with bilateral filtering - but basically here I'm using it as a depth-sensitive up-sample.  The results should be very similar to full MSAA on all the buffers, but without the memory/bandwidth cost - as it uses the same compression principle the newer coverage based MSAA techniques use.
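For readers who want the final step spelled out, here is a rough CPU-side sketch of the combined depth-sensitive up-sample and resolve. It is an untested-on-hardware illustration under my own assumptions: a 3x3 neighbourhood, a Gaussian weight on depth difference with an arbitrary `sigma`, scalar colors, and the hypothetical name `upsample_and_resolve`.

```python
import math

def upsample_and_resolve(color, depth, msaa_depth, sigma=0.05):
    """Combine a non-MSAA shaded image with an MSAA depth buffer.

    color[y][x]      : shaded value from the non-MSAA pass
    depth[y][x]      : depth from the non-MSAA pass
    msaa_depth[y][x] : list of per-sample depths from the MSAA z-prepass
    Each MSAA sample takes a depth-weighted average of nearby shaded pixels,
    and the samples of a pixel are then averaged (the resolve).
    """
    h, w = len(color), len(color[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for d_sample in msaa_depth[y][x]:
                num = den = 0.0
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        diff = d_sample - depth[ny][nx]
                        wgt = math.exp(-(diff * diff) / (2.0 * sigma * sigma))
                        num += wgt * color[ny][nx]
                        den += wgt
                # Fall back to the pixel's own color if no neighbour matches in depth.
                total += num / den if den > 0.0 else color[y][x]
            out[y][x] = total / len(msaa_depth[y][x])
    return out
```

On an edge pixel whose two samples see different surfaces, each sample pulls its color from the neighbour at matching depth, so the resolved pixel ends up as a blend of the two surfaces, which is what full MSAA on all the buffers would have produced.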

With a careful z-downsample/resolve pass, you can probably use the 1st Z pass to populate the Hi-Z for the 2nd pass and speed up rendering.  It still requires 2 render passes, which is suboptimal as I described in a previous post.

This is still on my wishlist, not something I've had the time to implement, but there was a recent paper by some guys at Volition that uses the same principles to combine MSAA with 2 pass deferred lighting.  They decided this warranted a whole new name, dubbing it inferred lighting.

They modify deferred lighting to render at different resolutions in the two passes.  Specifically, they use a reduced frame buffer (40% or so) for the 1st depth/normal pass, and then use a full size MSAA buffer for the 2nd pass.  The light buffers are up-sampled using a bilateral technique in the shader of the 2nd pass.  By extending this up-sampling technique with some dither knowledge, they also do stippled alpha rendering to get some level of order-independent translucency with deferred shading.  Not enough that you could render a full particle system with that path, but enough for a few layers of glass windows or what not.

Their solution is interesting, but it even worsens the performance problems with alpha tested stuff I mentioned in my earlier post on Deferred Shading vs Deferred Lighting - as the 2nd pass has now gotten considerably more expensive due to the bilateral up-sampling filter.  And worse, since the buffers mis-match, they cannot easily use the 1st pass output to prime the z-buffer for the 2nd pass.  (It's probably possible to do a conservative screen pass just to populate the Hi-Z, but I'm not sure if they are doing that.)  Notably, Red Faction takes place on Mars, so their engine doesn't have to deal with foliage.

Accumulating lighting at lower resolution (or actually better - multiple resolutions) is something I've been thinking about for a while, and is already well tested at least for AO, and they are using this to great effect in Red Faction to get lots of lights per pixel at speed.

But anyway, this motivated me to try and improve my MSAA Z prepass idea to get it down to a single pass with deferred shading, and also to find a fast method of bilateral or depth-sensitive filtering.



Deferring Techniques

At this point in time, some form of deferred rendering is becoming the standard rendering technique in games.  I've long been a fan of deferred shading, and was quite pleased with the results after converting our forward renderer to deferred on a 360 project a little over a year ago.  More recently, moving to a different project and engine, we went through the forward->deferred transition again, but our lead programmer tried a variation of the idea called deferred lighting.  From the beginning, I wasn't a fan of the technique, for a variety of reasons.  For a longer summary of the idea, and a more in depth comparison of deferred shading vs deferred lighting, check out this extensive post on gameangst.  At this point I am going to assume you are familiar with the techniques.  I mainly agree with Adrian's points, but there's a few issues I think he left out.

Deferred lighting is usually marketed as a more flexible alternative to 'traditional' deferred shading which has the proposed advantages:
- similar performance, perhaps better in terms of light/pixel overdraw cost
- no compromise in terms of material/shader flexibility

In short, I think the first claim is dubious at best, and the 2nd claim actually turns out to be false.  Adrian has an exposition on why the 2nd material pass in deferred lighting actually gives you much less flexibility than you would think.  The simple answer is that material flexibility (or fancy shaders) either modifies the BRDF inputs or alters the BRDF itself.  Flexibility in terms of modifying the BRDF inputs (multi-layer textures, procedural, animated textures, etc.) can easily be accounted for in traditional deferred shading, so there is no advantage there.  Deferred lighting is quite limited in how it can modify the BRDF because it must use a common function for the incoming light  (irradiance) at each surface point, for all materials.  It only has flexibility for the 2nd half of the BRDF, the exit light (radiance) on the eye path.  Materials with fancy specular (like skin) are difficult to even fake with control only over exit radiance.

Now, there is a solution using stencil techniques that allows multiple shader paths during light accumulation, but traditional deferred shading techniques can use this too to get full BRDF flexibility.  So Deferred Lighting has no advantage in BRDF flexibility.  (more on the stencil techniques in another post)

But the real problem with deferred lighting is in performance - it's not similar to deferred shading at all.  The 1st problem is that all else being equal, two full render passes are just always going to be slower.  The extra CPU draw call cost and geometry processing can be significant, especially if you are trying to push the geometry detail limits of the hardware (and shouldn't you?).  The geometry processing could only be 'free' if there was significant pixel shader work to load balance against, and the load balancing was efficient.  On PS3, the load balancing is not efficient, and more importantly, there is not much significant pixel shader work.  Most of the significant pixel shader work is in the light accumulation, which is moved out of any geometry pass in both techniques - so they easily will be geometry limited.  This is the prime disadvantage of any deferred technique right now vs traditional forward shading.  With forward shading, it's much easier to really push the geometry limits of the hardware, as all pixel shading is done in one heavy pass.

Furthermore, the overdraw performance of the two systems is not comparable, and for high overdraw objects, such as foliage, deferred shading has a large advantage.  Foliage objects are typically rendered with alpha-test, and because of this they receive only a partial benefit from the hardware's Hi-Z occlusion.  In our engine, the 1st pass in the two techniques for simple foliage is similar, both sample a single texture for albedo/alpha.  The only difference is in DS the 1st pass outputs albedo and normal vs just the normal for DL.  The 2nd pass, unique to DL, must read that same diffuse/albedo texture again, as well as the lighting information, which is often in 1 or 2 64-bit texture(s).  So it's easily 3 times the work per pixel touched.

As a side note:  the problems with Hi-Z and alpha test are manifold.  With 2 pass rendering, you would think the fully populated z-buffer and Hi-Z from the 1st pass will limit overdraw in the 2nd pass to a little over 1.0.  This is largely true for reasonable polygon scenes without alpha test.  The problem with alpha-test is that it creates a large number of depth edges and wide z-variation within each Hi-Z tile.  Now, this wouldn't be such a problem if the Hi-Z tiles stored a min/max z range, because then you could do fast rejection on the 2nd pass with z-equal compares.  But they store a single z-value, either the min or the max, useful only for a greater-equal or less-equal compare test.  Thus, when rendering triangles with alpha-test in the second pass, you get a lot of false overdraw for pixels with zero-alpha that still pass the Hi-Z test.

The DS vs DL debate gets a little more complicated when you try to do MSAA, but that's another story.




Thursday, April 2, 2009

OnLive, OToy, and why the future of gaming is high in the cloud


For the last six months or so, I have been researching the idea of cloud computing for games, the technical and economic challenges, and the video compression system required to pull it off.

So of course I was shocked and elated with the big OnLive announcement at GDC.

If OnLive or something like it works and has a successful launch, the impact on the industry over the years ahead could be transformative. It would be the end of the console, or the last console. Almost everyone has something to gain out of this change. Consumers gain the freedom and luxury of instant on demand access to ultimately all of the world's games, and finally the ability to try before you buy or rent. Publishers get to cut out the retailer middle-man, and avoid the banes of piracy and used game resales.

But the biggest benefit ultimately will be for developers and consumers in terms of the eventual game development cost reduction and quality increase enabled by the technological leap cloud computing makes possible. Finally developing for one common, relatively open platform (server-side PC) will significantly reduce the complexity in developing a AAA title. But going farther into the future, once we actually start developing game engines specifically for the cloud, we enter a whole new technological era. It's mind-boggling for me to think of what can be done with a massive server farm consisting of thousands or even tens of thousands of densely networked GPUs with shared massive RAID storage. Engines developed for this system will look far beyond anything on the market and will easily support massively multiplayer networking, without any of the usual constraints in physics or simulation complexity. Game development costs could be cut in half, and the quality bar for some AAA titles will eventually approach movie quality, while reducing technical & content costs (but that is the subject for another day).

But can it work? And if so, how well? The main arguments against, as expressed by skeptics such as Richard Leadbetter, boil down to latency, bandwidth/compression, and server economics. Some have also doubted the true value added for the end user: even if it can work technically and economically, how many gamers really want this?

Latency

The internet is far from a guaranteed delivery system, and at first the idea of sending players inputs across the internet, computing a frame on a server, and sending it back across the internet to the user sounds fantastical.
But to assess how feasible this is, we first have to look at the concept of delay from a psychological/neurological perspective. You press the fire button on a controller, and some amount of time later, the proper audio-visual response is presented in the form of a gunshot. If the firing event and the response event occur close enough in time, the brain processes them as a simultaneous event. Beyond some threshold, the two events desynchronize and are processed distinctly: the user notices the delay. A large amount of research on this subject has determined that the delay threshold is around 100-150ms. It's a fuzzy number obviously, but as a rule of thumb, a delay of under 120ms is essentially not noticeable to humans. This is a simple result of how the brain's parallel neural processing architecture works. It has a massive number of neurons and connections (billions and trillions respectively), but signals propagate across the brain very slowly compared to the speed of light. For more reference I highly recommend "Consciousness Explained" by Daniel C. Dennett. Here are some interesting timescale factoids from his book:

saying, "one, Mississippi" 1000msec
unmyelinated fiber, fingertip to brain 500msec
speaking a syllable 200msec
starting and stopping a stopwatch 175msec
a frame of television (30fps) 33msec
fast (myelinated) fiber, fingertip to brain 20msec
basic cycle time of a neuron 10msec
basic cycle time of a CPU(2009) .000001msec

So the minimum delay window of 120ms fits very nicely into these stats. There are some strange and interesting consequences of these timings. In the time it takes the 'press-fire' signal to travel from the brain down to the finger muscle, internet packets can travel roughly 4,000 km through fiber! (light moves about 200,000 km/s through fiber, or 200 km/ms * 20 ms) This is about the distance from Los Angeles to New York. Another remarkable fact is that the minimum delay window means that the brain processes the fire event and the response event in only a few dozen neural computation steps.

What really happens is something like this: some neural circuits in the user's brain "make the decision" to press the fire button (although at this moment most of the brain isn't conscious of it), the signal travels down through the fingers to the controller then on to the computer, which then starts processing the response frame. Meanwhile, in the user's brain, the 'button press' event is propagating through the brain, and more neural circuits are becoming aware of the 'button press' event. Remember, each neural tick takes 10ms. Some time later, the computer displays the audio/visual response of the gunshot, and this information hits the retina/cochlea and starts propagating up into the brain. These events connect, and if they are separated by only a few dozen neural computation steps (120 ms), they are connected and perceived as a single, simultaneous event in time. In other words, there is a minimum time window of around a dozen neural firing cycles where events are propagating around the brain's neural circuits - even though it already happened, it takes time for all of the brain's circuits to become aware of the event. Given the slow speed of neurons, it's simply remarkable that humans can make any kind of decisions on sub second timescales, and the 120 ms delay window makes perfect sense.

In the world of computers and networks, 120 ms is actually a long amount of time. Each component of a game system (input connection, processing, output display connection) adds a certain amount of delay, and the total delay must add up to around 120ms or less for good gameplay. Up to 150ms is sometimes acceptable, and beyond 200ms we get quickly into rapid, problematic breakdown in the user experience as every action has noticeable delay.

But how much delay do current games have? Gamasutra has a great article on this. They measure the actual delay of real world games using a high speed digital camera. Of interest for us, they find a "raw response time for GTAIV of 166 ms (200 ms on flat panel TVs)". This is relatively high, beyond the acceptable range, and GTA has received some criticism for sluggish response. And yet this is the grand blockbuster of video games, so it certainly shows that some games can get away with 150-200ms responses and the users simply don't notice or care. Keep in mind this delay time isn't when playing the game over OnLive or anything of that sort: this is just the natural delay for that game with a typical home setup.

If we break it down, the controller might add 5-20ms, the TV can add 10-50ms, but the bulk of the delay comes from the game console itself. Like all modern console games, the GTA engine buffers multiple frames of data for a variety of reasons, and running at 30fps, every frame buffered costs a whopping 33ms of delay. From my home DSL internet in LA, I can get pings of 10-30ms to LA locations, and 30-50ms pings to locations in San Jose. So now you can imagine that lengthening the input and video connections out across the internet is not as ridiculous as it first seems. It adds additional delay, which you simply need to compensate for somewhere else.

How does OnLive compensate for this delay? The answer for existing games is deceptively simple: you just run the game at a much higher FPS than the console, and/or you reduce internal frame buffering. If the PC version of a console game runs at 120 FPS, and it still keeps 4 frames of internal buffering, you get a delay of only about 33 ms. If you reduce the internal buffering to 2 frames, you get a delay of just 17 ms! If you combine that with a very low latency controller and a newer low latency TV, suddenly it becomes realistic for me to play a game in LA from a server residing in San Jose. Not only is it realistic, but the gameplay experience could actually be better! In fact, with a fiber FIOS connection and good home equipment, you could conceivably play from almost anywhere in the US, in theory. The key reason is that many console games have already used up the entire acceptable delay budget (when running on the console), and modern GPUs are many times faster.
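To make the trade-off concrete, here is a minimal sketch of the delay budget described above. The function names and the specific controller, display, and ping values are illustrative assumptions picked from the ranges quoted earlier, not measurements, and the sketch ignores video encode/decode time.

```python
def buffering_delay_ms(fps, buffered_frames):
    """Delay added by holding `buffered_frames` frames at a given frame rate."""
    return buffered_frames * 1000.0 / fps


def total_delay_ms(controller_ms, display_ms, network_rtt_ms, fps, buffered_frames):
    """Sum of the per-component delays; all inputs are assumed values."""
    return (controller_ms + display_ms + network_rtt_ms
            + buffering_delay_ms(fps, buffered_frames))


# Hypothetical local console: 30 fps, 4 buffered frames, no network hop.
console = total_delay_ms(controller_ms=10, display_ms=30, network_rtt_ms=0,
                         fps=30, buffered_frames=4)

# Hypothetical remote server: 120 fps, 2 buffered frames, 40 ms ping.
remote = total_delay_ms(controller_ms=10, display_ms=30, network_rtt_ms=40,
                        fps=120, buffered_frames=2)

print(f"console: {console:.0f} ms, remote: {remote:.0f} ms")
```

Under these assumed numbers the remote setup comes out ahead even after paying for the network round trip, because the frame buffering term shrinks so much at the higher frame rate.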

Video Compression/Bandwidth

So we can see that in principle, from purely a latency standpoint, the OnLive idea is not only possible, but practical. However, OnLive can not send a raw, uncompressed frame buffer directly to the user (at least, not at any acceptable resolution on today's broadband). For this to work, they need to squeeze those frame buffers down to acceptably tiny sizes, and more importantly, they need to do this rapidly or near instantly. So is this possible? What is the state of the art in video compression?

For a simple, dumb solution, you can just send raw JPEGs, or better yet, wavelet compressed frames, and perhaps get acceptable 720p images down to 1 Mbit per frame, or even 500 Kbit per frame for more advanced wavelets, using more or less off the shelf algorithms. With a wavelet approach, this would allow you to get 10fps with a 5Mbit connection. But of course we can do much better using a true video codec like H.264, which can squeeze 720p60fps video down to 5Mbit easily, or even considerably less, especially if we are willing to lower the fps and/or the quality in some places.
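The arithmetic behind those figures is simple enough to sketch. The helper names are made up for illustration, and the sketch assumes 1 Mbit = 1000 Kbit and a constant frame size.

```python
def achievable_fps(link_mbit_per_s, frame_kbit):
    """Frames per second a link can carry if every frame is sent independently."""
    return link_mbit_per_s * 1000.0 / frame_kbit


def per_frame_budget_kbit(stream_mbit_per_s, fps):
    """Average number of kilobits available per frame at a given stream rate."""
    return stream_mbit_per_s * 1000.0 / fps


# Independent 500 Kbit wavelet frames over a 5 Mbit link.
print(achievable_fps(5, 500))

# Average per-frame budget for a 5 Mbit, 60 fps video stream.
print(round(per_frame_budget_kbit(5, 60), 1))
```

The second number shows why a true video codec is needed: at 60fps the average frame has to fit in roughly a sixth of the size of one independently compressed frame.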

H.264 and other modern video codecs work by sending improved JPEG-like key frames, and then sending motion vectors which allow predicted frames to be delta-encoded in far fewer bits, getting a 10-30X improvement over sending raw JPEGs, depending on the motion. But unfortunately, motion compensation means spikes in the bitrate - scene cuts or frames with rapid motion receive little benefit from motion compensation.

But H.264 encoders typically buffer up multiple frames of video to get good compression. OnLive has much less leeway here. Ideally, you would like a zero-latency encoder. H.264 and its predecessors have been designed to be used in video tele-conferencing systems, which demand low latency. So there is already a precedent, and a modified version of the algorithm that avoids sending complete JPEG-like key frame images. Instead, in this low latency mode, small blocks of the image are periodically refreshed, but it never sends a complete key frame down the pipe, as this would take too long - creating multiple frames of delay.

There are in fact some new, interesting off the shelf H.264 hardware solutions which have near zero delay (1ms or so), and are relatively cheap (in cost and power) - perhaps practical for OnLive. In particular, there is the PureVu family of video processors, from Cavium Networks. I have not seen them in action, but I imagine that with 720p60 at 5 Mbps, you are going to see some artifacts and glitches, especially with fast motion. But at least we are getting close, with off the shelf solutions.

But of course, OnLive is not using an off the shelf system (they have special encoding hardware and a plugin decoder), and improved video compression specific to the demands of remote video gaming is their central tech, so you can expect they have created an advancement here. It doesn't have to be revolutionary though, as the off the shelf stuff is already close.

So the big problem is the variation in bitrate/compressibility from one frame to the next. If the user rapidly spins around, or teleports, you simply can not do better than sending a complete frame. So you either send these 'key' frames at lower quality, and/or you spend a little longer on them, introducing some extra delay. In practice some combination of the two is probably ideal. With a wavelet codec or a specialized H.264 variant, key frames can simply be sent at lower resolution, and then the following frames will use motion compensation to start adding detail to the image. The appearance would be a blurred image for the first frame or so when you rapidly spin the camera, which would then quickly up-res into full detail over the next several frames. With this technique, and some trade off of lowering the frame rate or adding a bit of delay on fast motion, I think 5Mbps is not only achievable, but beatable using state of the art compression coming out of research right now.

The other problem with compression is the CPU cost for compression itself. But again, if the PureVu processor is indicative, off the shelf hardware solutions are possible right now with H.264 at very low power, encoding multiple H.264 streams with near zero latency.

But here is where the special nature of game video or computer generated graphics allows us to make some huge efficiency gains over natural video. The most complex CPU task in video encoding is motion vector search - finding the matching image regions from previous frames that allow the encoder to send motion vectors and do efficient delta compression. But for a video stream rendered with a game engine, we can output the exact motion vectors directly. This is a potential problem in that not all games necessarily have motion vectors available, which may require modifying the game's graphics engine. However, motion blur is very common now in game engines (everybody's doing it, you know), and the motion blur image filter computes motion vectors (very cheaply). Motion blur gives an additional benefit for video compression: fast-motion frames are the worst case for the encoder, and motion blur makes exactly those frames blurrier and thus easier to compress.

So if I was doing this, I would require the game to use motion blur, and output the motion vector buffer to my (specialized, not off the shelf) video encoder.

Some interesting factoids: it apparently takes roughly 2 weeks to modify the game for OnLive, and at least 2 of the 16 announced titles (Burnout and Crysis) are particularly known for their beautiful motion blur - and all of them, with the exception of World of Goo - are recent action or racing games that probably use motion blur.

There is, however, an interesting and damning problem that I am glossing over. The motion vectors are really only valid for the opaque frame buffer. What does this mean? The automatic 'free' motion vectors are valid for the solid geometry, not for all the alpha-blended or translucent effects, such as water, fire, smoke, etc. So these become problem areas. It's interesting that several of the GDC commenters pointed out ugly compression artifacts when fire or smoke effects were prominent in BioShock running on OnLive.

However, many games already render their translucent effects at lower resolution (SD and even lower in modern console engines), so it would make sense perhaps to simply send these regions at lower resolution/quality, or blur them out (which a good video encoder would probably do anyway).

But in short, the video compression is the central core tech problem, but they haven't pulled off a miracle here - at best they have some good new tech which exploits some of the special properties of game video. And furthermore, I can even see a competitor with a 2x better compression system coming along and trying to muscle them out.

There's one other little issue worth mentioning, which is packet loss. The internet is not perfect, and sometimes packets are lost or late. I didn't mention this earlier because it has well known and relatively simple technical solutions for real time systems. Late packets are treated as dropped, and dropped packets and errors are corrected through bit level redundancy. You send small packet streams in groups, with redundant bits computed across the group such that any piece of lost data can be recovered, at the cost of some extra bandwidth. For example, you send 10 packets worth of data using 11 packets, and any single lost packet can be fully reconstructed. More advanced schemes adaptively adjust the redundancy based on measured packet loss. This tech is already standard, it's just not always used or understood. Good game networking engines already employ these packet loss mitigation techniques, and work fine today over real networks.
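The "10 packets sent as 11" example can be illustrated with the simplest scheme of this kind, a single XOR parity packet per group. This is only a sketch under stated assumptions (equal-length packets, at most one loss per group, hypothetical function names), not a claim about what OnLive or any particular networking engine actually uses.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def add_parity(packets):
    """Return the data packets followed by one XOR parity packet."""
    return list(packets) + [reduce(xor_bytes, packets)]


def recover(received):
    """Rebuild the data packets from a group in which a lost packet is None.

    A single parity packet can repair at most one loss per group.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost packet")
    group = list(received)
    if missing:
        present = [p for p in group if p is not None]
        # XOR of all surviving packets equals the missing one.
        group[missing[0]] = reduce(xor_bytes, present)
    return group[:-1]  # drop the parity packet


data = [bytes([i] * 8) for i in range(10)]  # 10 data packets
group = add_parity(data)                    # 11 packets on the wire
group[3] = None                             # simulate one dropped packet
assert recover(group) == data
```

The cost here is 10% extra bandwidth to survive one loss in every eleven packets; the adaptive schemes mentioned above would vary that ratio with the measured loss rate.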

The worst case is simply a dropped connection, which you just can't do anything about - OnLive's video stream would immediately break and notify you of a connection problem. Of course, the cool thing about OnLive is that it could potentially keep you in the game or reconnect you once you get your connection back.

Server Economics

So if OnLive is at least possible from a technical perspective (which it clearly is), the real question comes down to one of economics. What is the market for this service in terms of the required customer bandwidth? How expensive are these data centers going to be, and how much revenue can they generate?

Here is where I begin to speculate a little beyond my areas of expertise, but I'll use whatever data I've been able to gather from the web.

A few Google searches will show you that US 'broadband' penetration is around 80-90%, and the average US broadband bandwidth is somewhere around 2-3 Mbps. This average is somewhat misleading, because US broadband is roughly split between cable (25 million subscribers) and DSL (20 million subscribers), with outliers like fiber (2-3 million subscribers currently), and the DSL users often have several times lower bandwidth than the cable users. At this point in time, the great majority of American gamers already have at least 1.5 Mbps, perhaps half have over 5 Mbps, and almost all have a 5 Mbps option in their neighborhood, if they want it. So OnLive will in theory have a large potential market, and it really comes down to cost. How many gamers already have the required bandwidth? And for those who don't, how cheap is OnLive when you factor in the extra $ users may have to pay to upgrade? Note that the upgrade really will be for the HD option, as the great majority of gamers already have 1.5 Mbps or more.

Bandwidth Caps

There's also the looming threat of American telcos moving towards bandwidth caps. As of now, Time Warner is the only American telco experimenting with caps low enough to affect OnLive (40 Gigs/Month for their highest tier). Remember that using the HD option, 5 Mbps is the peak bandwidth, and the average usage is half that or less, according to OnLive. So Comcast's cap of 250 Gigs/Month isn't really relevant. Time Warner is currently still testing its new policy in only a few areas, so the future is uncertain. However, there is one interesting fact to throw into the mix: Warner Bros, the Time Warner subsidiary, is OnLive's principal investor (the other two are Autodesk and Maverick Capital). Now consider that Time Warner Cable is planning some sort of internet video system for television based on a new wireless cable modem, and consider that Perlman's other company was Digeo, the creator of Moxi. I think there will be more OnLive surprises this year, but suffice to say, I doubt OnLive will have to worry about bandwidth caps from Time Warner. I suspect Time Warner's caps really are more about a grand plot to control all digital services in the home, by either directly providing them or charging excess usage fees that will kill enemy services. But OnLive is definitely not their enemy. In the larger picture, the fate of OnLive is intertwined with the larger battle for net neutrality and control over the last mile pipes.


Bandwidth Cost

OnLive is going to have to partner with backbones and telcos, just like the big boys such as Akamai, Google and YouTube do, in what are called either transit or peering arrangements. A transit arrangement is basically bandwidth wholesale, and we'll start with that assumption. A little Google searching reveals that wholesale mass transit bandwidth can be had for around or under $10 per Mbps per month (comparable to end broadband customer cost, actually). Further searching suggests that in some places like LA it can be had for under $5 per Mbps per month. This is for a dedicated connection or peak usage charge.

Now we need some general model assumptions. The exact subscriber numbers don't really matter. What critically matters are a couple of stats: how many hours a month each subscriber plays, and more directly, what the typical peak fraction of users online at a given time is. The data I've found suggests that 10 hours per week is a rough gamer average, or 20 hours per week for an MMO; 10% peak occupancy is typical for regular games and 20% peak occupancy is typical for some MMOs. Using the 20% peak occupancy means that you need to provide enough peak bandwidth for 20% of your user base to be online at a time - a worst case. In a potential worst case scenario, every user wants HD at 5 Mbps and the peak occupancy is 20%, so you need essentially a dedicated 1 Mbps for each user, or $10/month per user in bandwidth cost alone. Assuming a perhaps more realistic scenario, the average user bandwidth is 3 Mbps (not everyone can have or wants HD), peak occupancy is 10%, and you get $3 per month in bandwidth cost per user.
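As a quick check on those two scenarios, here is the model written out. The function name is made up for illustration, and the $10 per Mbps per month transit price is the rough figure quoted above rather than a real contract rate.

```python
def bandwidth_cost_per_user(stream_mbps, peak_occupancy, dollars_per_mbps_month):
    """Monthly transit cost per subscriber.

    Bandwidth is provisioned for the peak fraction of subscribers online at once,
    so each subscriber needs stream_mbps * peak_occupancy of dedicated capacity.
    """
    return stream_mbps * peak_occupancy * dollars_per_mbps_month


worst_case = bandwidth_cost_per_user(5.0, 0.20, 10.0)  # everyone on HD, MMO-like peak
realistic = bandwidth_cost_per_user(3.0, 0.10, 10.0)   # mixed resolutions, regular games

print(f"worst case: ${worst_case:.2f}/user/month, realistic: ${realistic:.2f}/user/month")
```

Because the cost is a plain product of the three inputs, halving any one of them (for example, using the cheaper LA transit price) halves the per-user cost.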

Remember, in rare peak moments, OnLive can gracefully and slowly degrade video quality - so the service will never fail if they are smart. The worst case at terrible peak times is just a little lower image quality or resolution.

So roughly, we can estimate bandwidth will cost anywhere from $3-10 per month per user with transit arrangements. What's also possible, and more complex, are peering arrangements. If OnLive partners directly with providers near its data centers, it can get substantially reduced rates (or even free) if the traffic stays with just that provider. So realistically, I think $5 per month in bandwidth per user is a reasonable upper limit on OnLive's bandwidth charges based on today's economic climate - and this will only go down. But 1080p would be significantly more expensive, and it would make sense to charge customers extra. I wouldn't be surprised if they have a tiered charge based on resolution - as most of their fixed costs scale linearly with resolution.

Dataroom Expense

The main expense is probably not the bandwidth, but the per server cost to run a game - a far more demanding task than what most servers do. Let's start with the worst case and assume that OnLive needs at least one decent CPU/GPU combination per logged on user. OnLive is not stupid, so they are not going to use typical high end, expensive big iron, but nor are they going to use off the shelf PCs. Instead I predict that, following in the footsteps of Google, they will use midrange, cheaper, power efficient components, and get significant bulk discounts. Let's start with the basic cost of a CPU/motherboard/RAM/GPU combo. You don't need a monitor and the storage system can be shared between a very large number of servers - as they are all running the same library of installed games.

So let's take a quick look on Pricewatch:
Core 2 Quad Q6600 Cpu fan + - 4GB RAM DDR2 $260
GeForce GTX280 1 GB 512-Bit DDR3 602/2214 Fansink HDCP Video Card $260

These components are actually high end, far more than sufficient to run the PC versions of most existing games at 90-150fps at 720p, and yes, even Crysis at near 60fps at 720p.

If we consider that they may have researched a little longer and undoubtedly get bulk discounts, we can take $500 per server unit as a safe upper limit. Amortize this over 2 years and you get $20 per month. Factor in the 20% peak demand occupancy, and we get a server cost of $4 per user per month.

This finally leaves us with power/cooling requirements. Let's make an over-assumption of 600 watt continuous power draw. With power at about $0.10 per kilowatt-hour, and 720 hours in a month, we get roughly $40 a month per server in power draw. Factor in the 20% peak demand occupancy, and we get $8 per user per month. However, this is an over-assumption because the servers are not constantly using power. The 20% peak demand figure means they need enough servers for 20% of their users to be logged in at once - but most of the time not all of the servers are active. The power required would scale with the average demand, not the peak, so it's closer to $4 per user per month in this example (assuming a high average 10% occupancy). Cooling cost is harder to estimate, but some Google searching reveals it's roughly equivalent to the power cost, assuming modern datacenter design (and they are building brand new ones). So this leaves us with around $12 per user per month as an upper limit in server, power, and cooling cost.
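Here is a small sketch that reproduces the server, power, and cooling estimate from the stated assumptions ($500 server over 2 years, 600 watts, $0.10 per kilowatt-hour, 20% peak and 10% average occupancy, cooling equal to power). The function names are hypothetical, and the unrounded result lands slightly above the rounded $12 figure.

```python
HOURS_PER_MONTH = 720


def hardware_cost_per_user(server_price, amortize_months, peak_occupancy):
    """Servers are provisioned for peak demand, so hardware scales with peak occupancy."""
    return server_price / amortize_months * peak_occupancy


def power_cost_per_user(watts, dollars_per_kwh, avg_occupancy):
    """Power is only drawn while a server is in use, so it scales with average occupancy."""
    kwh_per_month = watts / 1000.0 * HOURS_PER_MONTH
    return kwh_per_month * dollars_per_kwh * avg_occupancy


hardware = hardware_cost_per_user(500.0, 24, 0.20)
power = power_cost_per_user(600.0, 0.10, 0.10)
cooling = power  # assumption from the text: cooling costs about as much as power

total = hardware + power + cooling
print(f"hardware ${hardware:.2f} + power ${power:.2f} + cooling ${cooling:.2f} "
      f"= ${total:.2f} per user per month")
```

Splitting hardware (peak) from power (average) makes it easy to see that lower-wattage parts cut two of the three terms at once, since cooling is tied to power.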


However, OnLive is probably more efficient than this. My power/cooling numbers are high because OnLive probably spends a little extra on more expensive but power efficient GPUs that save power/cooling cost to hit the right overall sweet spot. For example, Nvidia's more powerful GTX 295 is essentially two GTX 280 class GPUs on a single card. It's almost twice as expensive, but provides twice the performance (so similar performance per $) and draws only a little more power (so it's nearly twice as power efficient). Another interesting development is that Nvidia (OnLive's hardware partner) recently announced virtualization support so that multi-GPU systems can fully support multiple concurrent program instances. So what it really comes down to is how many CPU cores and/or GPU cores you need to run games at well over 60fps. Based on what I can see from recent benchmarks, two modern Intel cores and a single GPU are more than sufficient (most console games only have enough threads to push 2 CPU cores). Nvidia's server line of GPUs are more efficient and only draw 100-150 watts per GPU, so 600 watts is a high over-estimate of the power required per connected user.

But remember, you need a high FPS to defeat the internet latency - or you need to change the game to reduce internal buffering. There are many trade offs here - and I imagine OnLive picked low-delay games for their launch titles. Apparently OnLive is targeting 60fps, but that probably means most games usually get even higher average fps to reduce delay.

Overall, I think it's reasonable, using the right combination of components (typically 2 Intel CPU cores and one modern Nvidia GPU, possibly as half of a single motherboard system using virtualization) to have the per user power draw down to something more like 200 watts to drive a game at 60-120fps (remember, almost every game today is designed primarily to run at 30fps on the Xbox 360 at 720p, and a single modern Nvidia GPU is almost 4 times as powerful). Some really demanding games (Crysis) get the whole system - 4 CPU cores and 2 GPUs - at 400 watts. This is what I think OnLive is doing.

So adding it all up, I think $10 per month per user is a safe upper limit for OnLive's expenses, and it's perhaps as low as $5 per month or less, assuming they typically need two modern Intel CPU cores and one Nvidia GPU per user logged on, adequate bandwidth and servers for a peak occupancy of 20%, and power/cooling for an average occupancy of 10%.

Clearly, all of the numbers scale with the occupancy rates. I think this is why OnLive is at least initially not going for MMOs - they are too addictive and have very high occupancy. More ideal would be single player games and casual games that are played less often. Current data suggests the average gamer plays 10 hours a week, and the average MMO player plays 20 hours per week. The average non-MMO player is thus probably playing less than 10 hours per week. This works out to something more like 5% typical occupancy, but we are interested more in peak occupancy, so my 10%/20% numbers are a reasonable over-estimate of average/peak. Again, you need enough hardware & bandwidth for peak occupancy, but the power & cooling cost is determined by average occupancy.

$10 per month may seem like a high upper limit in monthly expense per user, but even at these expense rates OnLive could be profitable, because this is still less than the cost to the user of running comparable hardware at home.

Here's the simple way of looking at it. That same $600 server rig would cost $1000-1500 for an end user, because they need extra components like a hard drive, monitor, etc. which OnLive avoids or gets cheaper, and OnLive buys in bulk. But most importantly, the OnLive hardware is amortized and shared over a number of users. The user's high end gaming rig sits idle most of the time. So the end user's cost to play at home on even a cheap $600 machine amortized over 2 years is still $25 per month, two and a half times the worst case per user expense of OnLive. And that doesn't even factor in extra power expense for gaming at home. OnLive's total expense is probably more comparable to that of the Xbox 360. A $500 machine (including necessary peripherals) amortized over 5 years is a little under $10 per month. And then Xbox Live Gold service is another $5 a month on top of that. OnLive can thus easily cover its costs and still be less expensive than the 360 and PS3, and considerably less expensive than PC gaming.


The game industry post Cloud

In reality, I think that OnLive's costs will be considerably less than $10 per user per month, and will be increasingly less over time. Just like the console makers periodically update their hardware to make the components cheaper, OnLive will be constantly expanding its server farms and always buying the current sweet spot combination of CPUs and GPUs. But Nvidia and Intel refresh their lineups at least twice a year, so OnLive can really ride Moore's law continuously. Every year OnLive will become more economical and/or provide higher FPS and less delay and/or support more powerful games.

So it seems possible, even inevitable, that OnLive can be economically viable charging a relatively low subscription fee to cover their fixed costs - comparable to Xbox Live's subscription fee (about $5/month for Xbox Live Gold). Then they make their real profit by taking a console/distributor-like cut of each game sale or rental. For highly anticipated releases, they could even use a pay to play model initially, followed up by traditional purchase or rental later on, just like the movie industry does. Remember the madness that surrounded the Warcraft3 Beta, and think how many people would pay to play Starcraft2 multiplayer ahead of time. I know I would.

If you scale OnLive's investment requirements to support the entire US gaming population, you get a ridiculous hardware investment cost of billions of dollars, but this is no different than a new console launch, which is exactly what OnLive must be viewed as. The Wii has sold 22 million units in the Americas, the 360 is close behind at 17 million. I think these numbers represent majority penetration of the console market in the Americas. To scale to that user base, OnLive will need several million (virtual) servers, which may cost a billion dollars or more, but the investment will pay for itself as it goes - just as it did for Sony and Microsoft. Or they simply will be bought up by some big deep pocket entity which will provide the money, such as Google, or Verizon, or Microsoft.




The size and quantity of the datarooms OnLive will have to build to support even just the US gaming population is quite staggering. We are talking about perhaps millions of servers in perhaps a dozen different data center locations, drawing the combined power output of an entire large power plant. And that's just for the US. However, we already have a very successful example of a company that has built up a massive distributed network of roughly 500,000 servers in over 40 data centers.

Yes, that company is Google.

To succeed, OnLive will have to build an even bigger and more massive supercomputer system. But I imagine Google makes less money per month for each of its servers than OnLive will eventually make for each of its gaming servers. Just how much money can OnLive eventually make? If OnLive could completely conquer the gaming market, then it stands to completely replace both the current console manufacturers AND the retailers. Combined, these entities take perhaps 40-50% of the retail price of a game. Even assuming OnLive only takes a 30% cut, it could thus eventually take in almost 30% of the game industry - estimated at around $20 billion per year in the US alone, and $60 billion world-wide - eventually turning it into another Google.

Another point to consider is that most high end PC sales are mainly used for gaming, and thus the total real gaming market (in terms of total money people spend for gaming) is even larger, perhaps as large as 100 billion worldwide, and OnLive stands to rake a chunk of this in and change the whole industry - further reducing the end consumer PC market and shifting that money into OnLive subscriptions, game charges, etc. part of which in turn covers the centralized hardware cost. NVIDIA and ATI will still get a cut, but perhaps less than they do now. In other words, in the brave new world of OnLive, gamers will only ever need a super-cheap microconsole or netbook to play games, so saving money on consoles and rigs will allow them to buy more games, and all this money gets sucked into OnLive.

Now consider that the game market has consistently grown 20% per year for many years and you can understand why investors have funnelled hundreds of millions into OnLive in order to make it work. And eventually, OnLive can find new ways to 'monetize' gaming (using Google's term), such as ads and so on. Eventually, it should make as much or more per user hour as television does.

Now this is the fantasy of course, but I doubt OnLive will grow to become a Google any time soon, mainly because Nintendo, Sony, Microsoft, and the like aren't going to suddenly disappear, bringing me to my final point.



But What about the games?

In the end people use a console to play games and thus the actual titles are all that really matters. In one sense part of the pitch of OnLive - 'run high end PC games on your netbook' - is a false premise. Most of OnLive's lineup is current gen console games, and even though OnLive will probably run them at higher fps, this is mainly to compensate for latency. Video compression and all the other factors discussed above will result in an end user experience no better, and often worse, than simply playing the console version (especially if you are far from the data center). OnLive's one high end PC title - Crysis - is probably twice as expensive for them to run, and will be seen as somewhat inferior by gamers who have high end rigs and have played the game locally. It will be more like the console version of Crysis. But unfortunately, Crytek's already working on that.

This is really the main obstacle that I think could hold OnLive back - 16 titles at launch is fine, but they are already available on other platforms. Nintendo dominated this current console generation because of its cheap, innovative hardware and a lineup of unique titles that exploit it. I think Nintendo of America's president Reggie Fils-Aime was right on the money:

Based on what I’ve seen so far, their opportunity may make a lot of sense for the PC game industry where piracy is an issue. But as far as the home console market goes, I’m not sure there is anything they have shown that solves a consumer need

What does OnLive really offer the consumer? Brag Clips? The ability to spectate any player? Try before you buy? Rent? These are nice (especially the latter two), but can they amount to a system seller? It's a little cheaper, but is that really important considering most gamers already have a system? It seems that PC games could be where OnLive has more potential, but how much can it currently add over Steam? If OnLive's offerings expanded to include almost all current games, then it truly could achieve a high market penetration, as the successor of Steam (with the ultimate advantage of free trial and rental - which Steam can never do). But Valve does have the significant advantage of having a variety of exclusive games built on the Source Engine, which all together (Left 4 Dead, Counter-Strike, Team Fortress 2, Day of Defeat, etc) make up a good chunk of the PC multiplayer segment.

The real opportunity with OnLive is to have exclusive titles, which take advantage of OnLive's unique super-computer power to create a beyond next gen experience. This is the other direction in which the game industry expands, by slowly moving into the blockbuster story experiences of movies. And this expansion is heavily tech driven.

If such a mega-hit was made, such as a beyond next gen Halo, or GTA, it could rapidly drive OnLive's expansion, because OnLive requires very little user investment to play. At the very least, everyone would be able to try or play the game on some sort of PC they already have, and the microconsole to play on your TV will probably only cost as much as a game itself. So this market is a very different beast than the traditional consoles, where the market for your game is determined by the number of users who own the console. Once OnLive expands its datacenter capacity sufficiently, the market for an exclusive OnLive game is essentially any gamer. So does OnLive have an exclusive in the works? That would be the true game changer.

This is also where OnLive's less flashy competitor, OToy & LivePlace, may be going in a better direction. Instead of building the cloud and a business based first on existing games, you build the cloud and a new cloud engine for a totally new, unique product, which is specifically designed to harness the cloud's super resources and has no similar competitor.

Without either exclusives or a vast, retail competitive game lineup, OnLive won't take over the industry.










Saturday, December 13, 2008

Forward Reprojection for future hardware

The schemes described in my previous post are well suited for ps3/360 level hardware and a mix of CPU/GPU work. Namely, octree-screen tile intersection and point splatting done on CPU threads, and everything else done on the GPU.

But for future hardware, point splatting is less ideal than direct cone tracing. Why? The two actually end up being almost the same. The critical scene traversal loop, which I think is best done on the CPU at coarse resolution for the current gen, would be better done at high resolution to support higher quality illumination. If you want accurate reflections and GI effects, you need to be able to 'rasterize' very small frustums the size of a few pixel blocks - so it amounts to almost the same thing. Whether splatting or tracing (scatter or gather), you are doing fine-grained intersections of frustums (pixels/tiles) with a 3D tree structure (octree or what have you).

For CUDA-based hardware, I think cone tracing into a new type of adaptive distance field structure is the way to go, and some initial prototyping done at home has been very promising. Combine this with the forward projection system, and you can get away with tracing 30k or so pixels per frame on average for primary visibility! For dynamic lighting update effects, you want additional rays per pixel, but with the magic of fast cone tracing (which is far more efficient than rays for soft-area queries) combined with the magic of frame-coherent reprojection, I think we can hit uber quality with far less than 30 million cone traces per second, which my prototype can already exceed on an 8800GT.

However, one area which can be improved is the forward projection. Since tracing is relatively expensive, and incoherent tracing is vastly more so, it's worth it to really get fine-grained, accurate projection and improve the coherence.

One simple way to improve the tracing coherence is to reorder the frame buffer after the projection pass. This can be done on the GPU, resulting in a reordered frame buffer with nice coherent blocks of pixels which need to be traced. This doesn't necessarily help for memory coherence, as these rays can still be scattered in space, especially after reflections, but it helps immensely with branch performance, which is critical.
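The reordering step can be illustrated with a small stream compaction. This is only a minimal sketch, written as sequential Python rather than CUDA, and it assumes the projection pass leaves behind a per-pixel flag saying whether that pixel still needs a trace. The function name `compact_pixels_to_trace` and the flag layout are hypothetical, not taken from the prototype described here.

```python
def compact_pixels_to_trace(needs_trace):
    """Pack the indices of pixels that still need tracing into a dense list.

    needs_trace: list of bools, one per pixel (assumed output of the
    projection pass; True means reprojection did not cover the pixel).
    """
    # Exclusive prefix sum: offsets[i] = number of flagged pixels before i.
    # On a GPU this would be a parallel scan.
    offsets = []
    running = 0
    for flag in needs_trace:
        offsets.append(running)
        running += 1 if flag else 0

    # Scatter: every flagged pixel writes its index to its own slot, so there
    # are no write conflicts (one thread per pixel on a GPU).
    packed = [0] * running
    for pixel, flag in enumerate(needs_trace):
        if flag:
            packed[offsets[pixel]] = pixel
    return packed
```

Consecutive threads can then each take one entry of the packed list, so every SIMD lane in a block has a ray to trace instead of idling on pixels that were already filled by reprojection.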

The coherence can also be improved by splatting at finer granularity. Ideally we would want to actually splat individual samples as point splats, but actually using point primitives is way too slow on even new GPUs, as it goes through the terrible polygon rasterizer hardware bottleneck, and doing a few million per frame, although possible on modern GPUs, would eat up a big chunk of the frame time.

We can do better in CUDA, which got me thinking about how to do this properly and in parallel. In short, I think a hierarchical tile sorting approach is the way to go. First, you project all the points and save out the 2D positions - easy and fast. In the next step, the points are again broken up and assigned to threads, and each thread then builds up its own per-tile list of points hitting that tile - essentially sorting its subset of the points onto all the tiles. This results in NumThreads separate point lists per tile. Then in the next pass, each thread gets a tile, and it merges all the lists for that tile from the first pass, resulting in a single list of points for each tile. There are a few details that I'm skipping over, such as parallel memory allocation to build up the lists - but this just requires a separate counting/histogram pass. The tiles can't be too small, as it would eat up too much memory for all the lists. Nor can they be too big, as you need adequate threads.
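One level of this tile sort can be sketched as follows. This is a hedged, sequential Python illustration rather than the CUDA version: the "threads" are simulated with loops, and the names `bin_points_to_tiles`, `tile_size`, `tiles_x`, `tiles_y` and `num_threads` are all assumptions made for the example. It uses one particular allocation layout (tile-major, thread-minor) in which the per-thread lists for a tile end up adjacent in memory, so the merge pass reduces to reading one contiguous range per tile.

```python
def bin_points_to_tiles(points, tile_size, tiles_x, tiles_y, num_threads):
    """Sort projected 2D points into per-tile lists.

    points: list of (x, y) screen positions (already projected).
    Returns (tile_start, sorted_ids): the points of tile k are
    sorted_ids[tile_start[k]:tile_start[k + 1]].
    """
    num_tiles = tiles_x * tiles_y
    chunk = (len(points) + num_threads - 1) // num_threads  # points per thread

    def tile_of(point):
        # Clamp so points on the screen border still land in a valid tile.
        tx = min(max(int(point[0] // tile_size), 0), tiles_x - 1)
        ty = min(max(int(point[1] // tile_size), 0), tiles_y - 1)
        return ty * tiles_x + tx

    def thread_range(t):
        n = len(points)
        return range(min(t * chunk, n), min((t + 1) * chunk, n))

    # Counting/histogram pass: how many points each thread sends to each tile.
    counts = [[0] * num_tiles for _ in range(num_threads)]
    for t in range(num_threads):
        for i in thread_range(t):
            counts[t][tile_of(points[i])] += 1

    # Allocation: exclusive prefix sum in tile-major, thread-minor order.
    cursor = [[0] * num_tiles for _ in range(num_threads)]
    tile_start = [0] * (num_tiles + 1)
    running = 0
    for tile in range(num_tiles):
        tile_start[tile] = running
        for t in range(num_threads):
            cursor[t][tile] = running
            running += counts[t][tile]
    tile_start[num_tiles] = running

    # Scatter pass: each thread writes only into its own reserved segments.
    sorted_ids = [0] * len(points)
    for t in range(num_threads):
        for i in thread_range(t):
            tile = tile_of(points[i])
            sorted_ids[cursor[t][tile]] = i
            cursor[t][tile] += 1
    return tile_start, sorted_ids
```

The memory trade-off mentioned above shows up directly in `counts` and `cursor`, which hold `num_threads * num_tiles` entries each: smaller tiles or more threads both grow those tables.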

After the particles are thus sorted into per-tile lists, you can then subdivide and repeat, until you get down to fine-grained tiles (perhaps it only takes a couple of iterations). The fine-grained tiles with their particle lists can then be rasterized, with each thread getting one such tile, so the whole operation is parallel and scalable, without any memory conflicts. You could store the final microtiles in local memory actually, so that z-buffering can be done without a lot of extra memory read-writes.
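The final per-microtile rasterization might look roughly like the sketch below. It is an assumption-laden Python illustration, not the CUDA implementation: the function name `splat_tile`, the `(x, y, depth, color)` point layout, and the single-pixel splat footprint are all hypothetical. The small `depth` and `color` arrays stand in for the tile-sized buffer a thread would keep in fast local storage.

```python
def splat_tile(points, tile_x0, tile_y0, tile_size):
    """Z-buffer one tile's point list into a small private buffer.

    points: list of (x, y, depth, color) tuples already sorted into this tile.
    tile_x0, tile_y0: screen position of the tile's top-left pixel.
    Returns (depth, color) as tile_size x tile_size nested lists.
    """
    inf = float("inf")
    depth = [[inf] * tile_size for _ in range(tile_size)]
    color = [[None] * tile_size for _ in range(tile_size)]
    for x, y, z, c in points:
        px = int(x) - tile_x0
        py = int(y) - tile_y0
        # Skip stray points outside the tile, then keep the nearest sample.
        if 0 <= px < tile_size and 0 <= py < tile_size and z < depth[py][px]:
            depth[py][px] = z
            color[py][px] = c
    return depth, color
```

Because each tile is owned by exactly one thread, the depth test and write need no atomics, and the finished tile can be written out to the frame buffer once at the end.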

This is probably similar or related to how they intend to do parallel polygon rasterization in Larrabee, but I didn't read all of that paper yet. Polygon rasterization doesn't really interest me much anymore, for that matter.

Hmm, come to think of it, this multi-stage hierarchy sorting can be generalized for any data vs tree structure intersection problem, of which finding point-quad intersections is just one example.

Why is all this useful again? Because hopefully it could be much faster than using the triangle setup engine to render point primitives, and because reprojecting at pixel granularity could better handle the difficult cases with fewer errors than reprojecting larger tiles, especially for some difficult scenes that have lots of z-edges (like foliage).

Point splatting with CUDA in this way is also a potential alternative to tracing, although my current feeling is that it's not as well suited to fine-granularity searches for the small frusta generated for reflections and advanced illumination. However, it is much better suited to handle animated objects, and may have an advantage there vs dynamic octree construction.
