tag:blogger.com,1999:blog-52080345521623752992024-03-13T02:31:24.154-07:00EnterTheSingularityGraphics. Technology. Eschatology.Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.comBlogger29125tag:blogger.com,1999:blog-5208034552162375299.post-73716209160262984752021-11-20T12:51:00.004-08:002021-11-20T12:51:53.679-08:00This blog has moved<p> Moved years ago to <a href="https://entersingularity.wordpress.com/">https://entersingularity.wordpress.com/</a></p>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-73462539675360396902010-04-04T13:26:00.000-07:002010-04-15T16:56:17.056-07:00Latency & Human Response Time in current and future games<span class="Apple-style-span" style="font-size:small;">I'm still surprised at how many gamers and even developers aren't aware that typical games today have total response latencies ranging anywhere from 60-200ms. We tend to think of latencies in terms of pings and the notion that the response time or 'ping' from a computer or a console five feet away can be comparable to the ping of a server a continent away is something of an unnatural notion.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Yet even though it seems odd, its true.</span><br /><span class="Apple-style-span" style="font-size:small;"><br />I just read through "</span><a href="http://www.eurogamer.net/articles/digitalfoundry-lag-factor-article"><span class="Apple-style-span" style="font-size:small;">Console Gaming: The Lag Factor</span></a><span class="Apple-style-span" style="font-size:small;">", a recent blog article on EuroGamer which follows up on Mick West's original Gamasutra article that pioneered measuring the actual response times of games using a high speed digital camera. For background, I earlier wrote a GDM article (Gaming in the Cloud) that referenced that data and showed how remotely rendered games running in the cloud have the potential to at least match the latency of local console games, primarily by running at a higher FPS.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">The eurogamer article alludes to this idea:</span><br /><br /><span class="Apple-style-span" style="LINE-HEIGHT: 17px"><span class="Apple-style-span" style="font-family:'times new roman';"><span class="Apple-style-span" style="color:#660000;"><span class="Apple-style-span" style="font-size:small;">In-game latency, or the level of response in our controls, is one of the most crucial elements in game-making, not just in the here and now, but for the future too. It's fair to say that players today have become conditioned to what the truly hardcore PC gamers would consider to be almost unacceptably high levels of latency to the point where cloud gaming services such as OnLive and Gaikai rely heavily upon it.</span></span></span></span><br /><br /><span class="Apple-style-span" style="LINE-HEIGHT: 17px"><span class="Apple-style-span" style="font-family:'times new roman';"><span class="Apple-style-span" style="color:#660000;"><span class="Apple-style-span" style="font-size:small;">The average videogame runs at 30FPS, and appears to have an average lag in the region of 133ms. On top of that is additional delay from the display itself, bringing the overall latency to around 166ms. 
Assuming that the most ultra-PC gaming set-up has a latency less than one third of that, this is good news for cloud gaming in that there's a good 80ms or so window for game video to be transmitted from client to server</span></span></span></span><br /><span class="Apple-style-span" style="LINE-HEIGHT: 17px"><span class="Apple-style-span" style="font-family:'times new roman';"><span class="Apple-style-span" style="font-size:small;"><br /></span></span></span><br /><span class="Apple-style-span" style="font-size:small;">It's really interesting to me that the author assumes that an "ultra-PC" gaming set-up has a latency less than one third of a console's - even though the general model developed in the article posits that there is no fundamental difference between PCs and consoles in terms of latency - other than framerate.</span><br /><span class="Apple-style-span" style="font-size:small;"><br />In general, the article shows that games have inherent delay measured in frames - the minimum seems to be about 3 frames of delay, but can go up to 5 for some games. The total delay in time units is simply N/F, the number of frames of delay over the frame rate. A simple low-delay app will typically have the minimum delay - about 3 frames, which maps to around 50ms running at 60fps and 100ms at 30fps (a quick sketch of this arithmetic follows below).</span><br /><br /><span class="Apple-style-span" style="font-size:small;">There is no fundamental difference between consoles and PCs in this regard other than framerate - the PC version of a game running at 60fps will have the same latency as its console sibling running at 60fps. Of course, take a 30fps console game and run it at 60fps and you halve the latency - and yes, this is exactly what cloud gaming services can exploit.</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><br /><span class="Apple-style-span" style="font-size:small;">The Eurogamer article was able to actually measure just that - supporting this model with some real-world data. The author was able to use the vsync feature in BioShock to measure the response difference between 59fps and 30fps, and as expected, the 59fps mode had just about half the latency.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">The other assertion of the article - or rather the whole point of the article - was that low response times are really important for the 'feel' of a game. So I'd like to delve into this in greater detail. As a side note though, the fact that the delay has to be measured before you can make any sort of guess about a game's response time tells you something.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Firstly, on the upper end of the spectrum, developers and gamers know from first-hand experience that there definitely is an upper window to tolerable latency, although it depends on the user action. For most games, controlling the camera with a joypad or mouse feels responsive with a latency of up to 150ms.
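</span><br /><br /><span class="Apple-style-span" style="font-size:small;">To make the N/F arithmetic concrete, here is a small sketch (my own illustration - the function name is mine, the frame counts and figures are the ones discussed above):</span><br /><pre>
# Toy model: total response latency = frames of internal delay / frame rate,
# plus whatever the display adds on top.

def latency_ms(frames_of_delay, fps, display_ms=0):
    """End-to-end latency in milliseconds."""
    return frames_of_delay / fps * 1000.0 + display_ms

print(latency_ms(3, 60))        # low-delay app at 60fps:  50 ms
print(latency_ms(3, 30))        # same app at 30fps:      100 ms
print(latency_ms(4, 30, 33))    # typical 30fps console game plus ~33ms of
                                # display lag: ~166 ms, the figure quoted above
</pre><br /><span class="Apple-style-span" style="font-size:small;">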
You might think that the mouse control would be considerably more demanding in this regard, but the data does not back that up - I assert that PC games running at 30fps have latencies in the same 133-150ms window as 30fps console games, and are quite playable at that fps (and some have even shipped capped at 30fps).</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><br /><span class="Apple-style-span" style="font-size:small;">There is a legitimate reason for a PC gamer to try to minimize their system latency as much as possible for competitive multiplayer gaming, especially twitch shooters like counterstrike. A system running with vsync off at 100fps might have latencies under 50ms and will give you a considerable advantage over an opponent running at 30fps with 133-150ms of base system latency - no doubt about that.</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><br /><span class="Apple-style-span" style="font-size:small;">But what I'm asserting is that most gamers will barely - if at all - be able to notice the difference of delay times under 100ms in typical scenarios in FPS and action games - whether using a gamepad or mouse and keyboard. As the delay times exceed some threshold they become increasingly noticeable - 200ms of delay is noticeable to most users, and 300ms becomes unplayable. That being said, </span><i><span class="Apple-style-span" style="font-size:small;">variation </span></i><span class="Apple-style-span" style="font-size:small;">in the delay is much more noticeable. The difference between a perfectly consistent 30fps and 60fps is difficult to perceive, but an inconsistent 30fps is quite noticeable - the spike or changes in response time from frame to frame themselves are neurologically relevant and quite detectable. This is why console developers spend a good deal of time to optimize the spike frames and hit a smooth 30fps.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">There is however a class of actions that do have a significantly lower latency threshold - the simple action of moving a mouse cursor around on the screen! Here I have some 1st hand data. A graphics app which renders its own cursor, has little buffering and runs at 60fps will have about 3 frames or about 50ms of lag, and at that low level of delay the cursor feels responsive. However if you take that same app and slow it down to 30fps, or even add just a few more frames of latency at 60fps the cursor suddenly seems to lag behind. The typical solution is to use the hardware cursor feature which short circuits the whole rendering pipeline and provides a direct fast path to pipe mouse data to the display - which seems to be under 50ms. For the low-latency app running at 60fps, the hardware cursor isn't necessary, but it becomes suddenly important at some threshold around 70-90ms.</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><span class="Apple-style-span" style="font-size:small;">I think that this is the real absolute lower limit of human ability to perceive delay.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Why is there such a fundamental limit? In short: the limitations of the brain.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Ponder for a second what it actually means for the brain to notice delay in a system. 
The user initiates an action and sometime later this results in a response, and if that response is detected as late, the user will notice delay. Somewhere, a circuit (or potentially circuits, but for our purposes this doesn't matter) in the brain makes a decision to initiate an action, this message must then propagate down to muscles in the hand where it then enters the game system through the input device. Meanwhile in the brain, the decision circuit must also send messages to the visual circuits of the form "I have initiated action and am expecting a response - please look for this and notify me immediately on detection". Its easier to imagine the brain as a centralized system like a single CPU, but it is in fact the exact opposite - massively distributed - a network really on the scale of the internet itself - and curiously for our discussion, with latencies comparable to the internet itself.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Neurons can fire only as fast as about 10ms typically, perhaps as quickly as 5ms in some regions. The fastest neural conduits - myelinated fiber - can send signals from the brain to the fingertip (one way) in about 20ms. So now imagine using these slow components to build a circuit that could detect a timing delay in as quickly as 60ms. </span><br /><br /><span class="Apple-style-span" style="font-size:small;">Lets start with the example of firing a gun. At a minimum, we have some computation to decide to fire, and once this happens the message can be sent down to the fingertip to pull the trigger and start the process. At the same time, for the brain to figure out if the gun actually fired in time, the message must also be sent down to the visual circuits, where the visual circuits must process the visual input stream and determine if the expected response exists (a firing gun), this information can then be sent to some higher circuit which can then compute whether the visual response (gun firing response pattern exists or not at this moment in time) matches the action initiated (the brain sent a firing signal to the finger at this moment in time).</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Built out of slow 10ms neurons, this circuit is obviously going to have alot of delay of its own which is going to place some limits its response time and ability to detect delay. Thinking of the basic neuron firing system as the 'clock rate' and the brain as a giant computer (which it is in the abstract sense), it appears that the brain can compute some of these quick responses in as little as around a dozen 'clock cycles'. This is pretty remarkable, even given that the brain has trillions of parallel circuits. But anyway, the brain could detect even instantaneous responses if it had the equivalent of video buffering. In other words, if the brain could compensate for its own delay, it could detect delays in the firing response on timescales shorter than its own response time. For this to happen though, the incoming visual data would need to be buffered in some form. The visual circuits, instead of being instructed to signal upon detection of a firing gun, could be instructed to search for a gun firing X ms in the past. However, to do this they would need some temporal history - the equivalent of a video buffer. 
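</span><br /><br /><span class="Apple-style-span" style="font-size:small;">As an aside, here is a very rough tally of where the time goes in that firing-a-gun loop, using the neuron and conduction figures above. The step counts are my own guesses, purely for illustration:</span><br /><pre>
# Rough, illustrative tally of a minimal "did the gun fire on time?" loop.
# The 10ms and 20ms figures are the ones quoted above; the step counts are guesses.

NEURON_STEP_MS = 10        # typical neuron firing/integration time
BRAIN_TO_FINGER_MS = 20    # one-way conduction along fast myelinated fiber

decide_to_fire   = 2 * NEURON_STEP_MS    # decision circuit settles on 'fire'
motor_command    = BRAIN_TO_FINGER_MS    # signal travels down to the trigger finger
visual_detection = 4 * NEURON_STEP_MS    # visual circuits recognize the muzzle flash
comparison       = 2 * NEURON_STEP_MS    # higher circuit matches response to action

print(decide_to_fire + motor_command + visual_detection + comparison)
# ~100 ms of the brain's own delay, before the game adds anything at all
</pre><br /><span class="Apple-style-span" style="font-size:small;">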
There are reasons to believe some type of buffering does exist in the brain, but with limitations - it's nothing like a computer video buffer.</span><span class="Apple-style-span" style="font-size:small;"><br /></span><span class="Apple-style-span" style="font-size:small;">The other limitation to the brain's ability to detect delays is the firing time of the neurons themselves, which makes it difficult to detect timings on scales approaching the neuron firing rate.</span><br /><span class="Apple-style-span" style="font-size:small;">But getting back to the visual circuits, the brain did not evolve to detect lag in video games or other systems. Just because it's theoretically possible that a neural circuit built out of relatively slow components could detect fast responses by compensating for its own processing delay does not mean that the brain actually does this. The quick 'twitch' circuits we are talking about evolved to make rapid decisions - things like: detect creature, identify it as prey or predator, and initiate fight or flight. These quick responses involve rapid pattern recognition, classification, and decision making, all in real-time. However, the quick response system is not especially concerned with detecting exactly when an event occurred; it's optimized for the problem of reacting to events rapidly and correctly. Detecting whether your body's muscles reacted to the run command at the right time is not the primary function of these circuits - it is to detect the predator threat and initiate the correct run response rapidly. The insight and assertion I'm going to make is that </span><i><span class="Apple-style-span" style="font-size:small;">our ability to detect delays in other systems (such as video games) is only as good as our brain's own quick response time</span></i><span class="Apple-style-span" style="font-size:small;"> - because it uses the same circuits. Psychological tests show that measured response times are around ~200ms for many general tasks, probably getting a little lower for game-like tasks with training. A lower bound of around 100-150ms for complex actions like firing guns and moving cameras seems reasonable for experienced players.</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><span class="Apple-style-span" style="font-size:small;">For moving a mouse cursor, the response time appears to be lower, perhaps 60-90ms. From this brain model, we can expect this for a few reasons. Firstly, the mouse cursor is very simple and very small, and once the visual system is tracking it we can expect that detecting changes in its motion (to verify that it's moving as intended) is computationally simple and can be performed in the minimal number of steps. Detecting that the entire scene moved in the correct direction, or that the gun is in its firing animation state, is a far more complex pattern recognition task, and we can expect it to take more steps. So detecting mouse motion represents the simplest and fastest type of visual pattern recognition.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">There is another factor at work here as well: rapid eye saccades. The visual system actually directs the eye muscles on frame-by-frame time scales that we don't consciously perceive.
When recognizing a face, you may think you are looking at someone right in the eye, but if you watched a high-res video feed of yourself and zoomed in on your eyes in slow motion, you'd see that your eyes are actually making many rapid jumps - leaping from the eyebrow to the lips to the nose and so on. Presumably when moving around a mouse cursor, some of these eye saccades are directed to predicted positions of the mouse to make it easy for the visual system to detect its motion (and thus detect if it's lagging).</span><br /><br /><span class="Apple-style-span" style="font-size:small;">So in summary, experimental data (from both games and psychological research) leads us to expect that the thresholds for human delay detection are around:</span><br /><br /><span class="Apple-style-span" style="font-size:small;">>300ms - games become unpleasant, even unplayable</span><br /><span class="Apple-style-span" style="font-size:small;">>200ms - delay becomes palpable</span><br /><span class="Apple-style-span" style="font-size:small;">100-150ms - limit of delay detection for full scene actions - camera panning and so on</span><br /><span class="Apple-style-span" style="font-size:small;">50-60ms - absolute limit of delay detection - small object tracking - mouse cursors</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Delay is a strongly non-linear phenomenon, undetectable below a certain threshold and then ramping up to annoying and then deal-breaking soon after. It's not a phenomenon where less is always better; less beyond a certain point doesn't matter from a user experience point of view. (Of course, for competitive twitch gaming, having less delay is definitely advantageous even when you can't notice it - but this isn't relevant for console-type systems where everyone has the same delay.)</span><br /><br /><span class="Apple-style-span" style="font-size:small;">So getting back to the earlier section of this post, if we run a game on a remote PC, what can we expect the total delay to be?</span><br /><br /><span class="Apple-style-span" style="font-size:small;">The cloud system has several additional components that can add delay on top of the game itself: video compression, the network, and the client which decompresses the video feed.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Without getting into specifics, what can we roughly expect? Well, even a simple client which just decompresses video is likely to exhibit the typical minimum of roughly 3 frames of lag. Let's assume the video compression can be done in a single frame and the network and buffering add another; we are then looking at roughly 5 frames of additional lag with a low ping to the server - with some obvious areas that could be trimmed further.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">If everything is running at 60, a low-latency game (3 frames of internal lag) might exhibit around 8/60 or 133ms of latency, and a higher-latency game (5 frames of internal lag) might exhibit 10/60 or 166ms of latency. So it seems reasonable to expect that games running at 60fps remotely can have latencies similar to local games running at 30fps.
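</span><br /><br /><span class="Apple-style-span" style="font-size:small;">Written out as a budget, with the same assumptions (3 frames for the client decoder, 1 for compression, 1 for network transit and buffering, plus the game's own internal lag - the function and names are my own sketch, and the 120fps rows anticipate the discussion below):</span><br /><pre>
# Frame budget for a remotely rendered game, using the assumptions above.

def cloud_latency_ms(game_frames, fps, client=3, compress=1, network=1):
    total_frames = game_frames + client + compress + network
    return total_frames, total_frames / fps * 1000.0

for fps in (60, 120):
    for game_frames in (3, 5):
        frames, ms = cloud_latency_ms(game_frames, fps)
        print(f"{fps}fps, {game_frames} internal frames -> {frames} frames, {ms:.0f} ms")

# 60fps:  8 frames ~133 ms, 10 frames ~167 ms  (comparable to a local 30fps game)
# 120fps: 8 frames  ~67 ms, 10 frames  ~83 ms  (comparable to a local 60fps game)
</pre><br /><span class="Apple-style-span" style="font-size:small;">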
Ping to the server then does not represent even the majority of the lag, but obviously can push the total delay into the unplayable as the ping grows - and naturally every frame of delay saved allows the game to be playable at the same quality at increasingly greater distances from the server.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">What are the next obvious areas of improvement? You could squeeze and save additional frames here and there (the client perhaps could be optimized down to 2 frames of delay - something of a lower limit though), but the easiest way to further lower the latency is just to double the FPS again.</span><br /><br /><span class="Apple-style-span" style="font-size:small;">120 fps may seem like alot, but it also happens to be a sort of requirement for 3D gaming, and is the direction that all new displays are moving. At 120fps, the base lag in such an example would be around 8/120 to 10/120, or around 66ms to 83ms of latency, comparable to 60fps console games running locally. This also hints that a remotely rendered mouse cursor would be viable at such high FPS. At 120fps, you could have a ping as high as 100ms and still get an experience comparable to a local console .</span><br /><br /><span class="Apple-style-span" style="font-size:small;">This leads to some interesting rendering directions if you start designing for 120fps and 3D, instead of the 30fps games are typically designed for now. The obvious optimization for 120fps and 3D is to take advantage of the greater inter-frame coherence. Reusing shading, shadowing, lighting and all that jazz from frame to frame has proportionately greater advantage at high FPS as the scene will change proportionately less between frames. Likewise, the video compression work and bitrate scales sublinearly, and actually increases surprisingly slowly as you double the framerate.</span><br /><span class="Apple-style-span" style="font-size:small;"><br /></span><br /><span class="Apple-style-span" style="font-family:Verdana;font-size:100%;"><span class="Apple-style-span" style="font-family:Georgia, serif;font-size:small;"><br /></span></span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com155tag:blogger.com,1999:blog-5208034552162375299.post-87149016585732723302010-01-28T12:52:00.000-08:002010-01-28T12:57:56.158-08:00New Job<span class="Apple-style-span" style="font-size: small;">I'm moving in about a week to start a new job at OnLive, putting my money where my mouth is so to speak. An exciting change. 
I haven't had much time recently for this blog, but I'll be getting back to it shortly.</span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com7tag:blogger.com,1999:blog-5208034552162375299.post-35004986150956416442009-11-06T14:10:00.000-08:002009-11-06T14:16:09.644-08:00Living root bridges<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq6Cf46Drg5rBjUcKKcKTOvyF_hCrFuF4xS_vF96ssBFiy_drXyjMY__PH9eopKlhDr1fr42Z2oTGDrRxMsrjdk2mWUKjuwEpnIWjSDfOFK4BNJuT5qcw8aWExMqA7R57vBVcJKRE_7J8/s1600-h/1493248165_c79250beee_o.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 252px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq6Cf46Drg5rBjUcKKcKTOvyF_hCrFuF4xS_vF96ssBFiy_drXyjMY__PH9eopKlhDr1fr42Z2oTGDrRxMsrjdk2mWUKjuwEpnIWjSDfOFK4BNJuT5qcw8aWExMqA7R57vBVcJKRE_7J8/s400/1493248165_c79250beee_o.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5401116774967418914" /></a><br /><div>I found this great set of photos of living <a href="http://rootbridges.blogspot.com/2009/08/blog-post.html">root bridges</a> which are some inspirational scenes for the challenges of dense foilage/geometry in graphics. I look forward to the day these could be digitally voxelized with 3D camera techniques and put into a game.</div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com5tag:blogger.com,1999:blog-5208034552162375299.post-78157346064046236582009-10-30T14:25:00.000-07:002009-11-01T20:34:08.256-08:00Conversing with the Quick and the Dead<span class="Apple-style-span" style="font-size:small;"><br /></span><div><span class="Apple-style-span" style="font-size:large;">CUI: The Conversational User Interface</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Recently I was listening to an excellent </span><a href="http://www.blogtalkradio.com/fastforwardradio/2009/09/02/More-on-Acceleration-Convergence-and-Human-Destiny"><span class="Apple-style-span" style="font-size:small;">interview</span></a><span class="Apple-style-span" style="font-size:small;"> (which is about an hour long) with John Smart of Acceleration Watch, where he specifically was elucidating his ideas on the immediate future evolution of AI, which he encapsulates in what he calls the </span><a href="http://www.acceleratingfuture.com/people-blog/?p=292"><span class="Apple-style-span" style="font-size:small;">Conversational Interface</span></a><span class="Apple-style-span" style="font-size:small;">. In a nutshell, its the idea that the next major development in our increasingly autonomous global internet is the emergence and widespread adoption of natural language processing and conversational agents. This is currently technology on the tipping point of the brink, so its something to watch as numerous startups are starting to sell software for automated call centers, sales agents, autonomous monitoring agents for utilities, security, and so on. The immediate enabling trends are the emergence of a global liquid market for cheap computing and fairly reliable off the shelf voice to text software that actually works. 
You probably have called a bank and experienced the simpler initial versions of this which are essentially voice activated multiple choice menus, but the newer systems on the horizon are a wholly different beast: an effective simulacra of a human receptionist which can interpret both commands and questions, ask clarifying questions, and remember prior conversations and even users. This is an interesting development in and of itself, but the more startling idea hinted at in Smart's interview is how natural language interaction will lead to anthropomorphic software and how profoundly this will eventually effect the human machine symbiosis.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Humans are rather biased judges of intelligence: we have a tendency to attribute human qualities to anything that looks or sounds like us, even if its actions are regulated by simple dumb automata. Aeons of biological evolution have preconditioned us to rapidly identify other intelligent agents in our world, categorize them as potential predators, food, or mates, and take appropriate action. Its not that we aren't smart enough to apply more critical and intensive investigations into a system to determine its relative intelligence, its that we have super-effective visual and auditory shortcuts which bias us. These are most significantly important in children, and future AI developers will be able to exploit these biases is to create agents with emotional attachments. The </span><a href="http://www.youtube.com/watch?v=HluWsMlfj68&feature=related"><span class="Apple-style-span" style="font-size:small;">Milo demo</span></a><span class="Apple-style-span" style="font-size:small;"> from Microsoft's Project Natal is a remarkable and eerie glimpse into the near future world of conversational agents and what Smart calls 'virtual twins'. After watching this video, consider how this kind of technology can evolve once it establishes itself in the living room in the form of video game characters for children. There is a long history of learning through games, and the educational game market is a large, well developed industry. The real potential hinted at in Peter Molyneux's demo is a disruptive convergence of AI and entertainment which I see as the beginning of the road to the singularity.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Imagine what entrepreneurial game developers with large budgets and the willingness to experiment outside of the traditional genres could do when armed with a full two way audio-visual interface like Project Natal, the local computation of the xbox 360 and future consoles, and a fiber connection to the up and coming immense computing resources of the cloud (fueled by the convergence of general GPUs and the huge computational demands of the game/entertainment industry moving into the cloud). Most people and even futurists tend to think of Moore's Law as a smooth and steady exponential progression, but the reality from the perspective of a software developer (and especially a console game developer) is a series of massively disruptive jumps: evolutionary </span><a href="http://en.wikipedia.org/wiki/Punctuated_equilibrium"><span class="Apple-style-span" style="font-size:small;">punctuated equilibrium</span></a><span class="Apple-style-span" style="font-size:small;">. 
Each console cycle reaches a steady state phase towards the end where the state space of possible game ideas, interfaces and simulation technologies reaches a near steady state, a technological tapering off, followed by the disruptive release of new consoles with vastly increased computation, new interfaces, and even new interconnections. The next console cycle is probably not going to start until as late as 2012, but with upcoming developments such as Project Natal and OnLive, we may be entering a new phase already.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><br /></div><div><span class="Apple-style-span" style="font-size:large;">The Five Year Old's Turing Test</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Imagine a future 'game system' aimed at relatively young children with a Natal like interface: a full two way communication portal between the real and the virtual: the game system can both see and hear the child, and it can project a virtual window through which the inner agents can be seen and heard. Permanently connected to the cloud through fiber, this system can tap into vast distant computing resources on demand. There is a development point, a critical tipping point, where it will be economically feasible to make a permanent autonomous agent that can interact with children. Some certainly will take the form of an interactive, talking version of a character like Barney and semi-intelligent such agents will certainly come first. But for the more interesting and challenging development of human-level intelligence, it could actually be easier to make a child-like AI, one that learns and grows with its 'customer'. Not just a game, but a personalized imaginary friend to play games with, and eventually to grow up with. </span><span class="Apple-style-span" style="font-size:small;"> It will be custom designed (or rather developmentally evolved) for just this role - shaped by economic selection pressure.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">The real expense of developing an AI is all the training time, and a human-like AI will need to go through a human-like childhood developmental learning process. The human neocortex begins life essentially devoid of information, with random synaptic connections and a cacophony of electric noise. From this consciousness slowly develops as the cortical learning algorithm begins to learn patterns through sensory and motor interaction with the world. Indeed, general anesthetics work by introducing noise into the brain that drowns out coherent signalling and thus consciousness. From an information theoretic point of view, it may be possible to thus use less computing power to simulate an early developmental brain - storing and computing only the information above the noise signals. 
If such a scalable model could be developed, it would allow the first AI generation to begin decades earlier (perhaps even today), and scale up with moore's law as they require more storage and computation.</span></div><div><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:small;"> </span></span></div><div><span class="Apple-style-span" style=" ;font-size:small;"><br />Once trained up to the mental equivalent level of a five-year old, a personal interactive invisible friend might become a viable 'product' well before adult level human AIs come about. Indeed, such a 'product' could eventually develop into a such an adult AI, if the cortical model scales correctly and the AI is allowed to develop and learn further. Any adult AI will start out as a child, there is no shortcuts. Which raises some interesting points: who would parent these AI children? And inevitably, they are going to ask two fundamental questions which are at the very root of being, identity, and religion: </span></div><div><br /></div><div><span class="Apple-style-span" style=" ;font-size:small;">what is death? and Am I going to die?</span></div><div><br /></div><div>The first human level AI children with artificial neocortices will most likely be born in research labs - both academic and commercial. They will likely be born into virtual bodies. Some will probably be embodied in public virtual realities, such as Second Life, with their researcher/creators acting as parents, and with generally open access to the outside world and curious humans. Others may develop in more closed environments tailored to a later commercialization. For the future human parents of AI mind children, these questions will be just as fundamental and important as they are for biological children. These AI children do not have to ever die, and their parents could answer so truthfully, but their fate will entirely depend on the goals of their creators. For AI children can be copied, so purely from an efficiency perspective, there will be a great pressure to cull the rather unsuccessful children - the slow learners, mentally unstable, or otherwise undesirable - and use their computational resources to duplicate the most successful and healthy candidates. So the truthful answers are probably: death is the permanent loss of consciousness, and you don't have to die but we may choose to kill you, no promises. If the AI's creators/parents are ethical and believe any conscious being has the right to life, then they may guarantee their AI's permanency. But life and death for a virtual being is anything but black and white: an AI can be active permanently or for only an hour a day or for an hour a year - life for them is literally conscious computation and near permanent sleep is a small step above death. I suspect that the popular trend will be to teach AI children that they are all immortal and thus keep them happy.</div><div><br /></div><div>Once an AI is developed to a certain age, they can then be duplicated as needed for some commercial application. For our virtual Milo example, an initial seed Milo would be selected from a large pool raised up in a virtual lab somewhere, with a few best examples 'commercialized' and duplicated out as needed every time a kid out on the web wants a virtual friend for his xbox 1440. Its certainly possible that Milo could be designed and selected to be a particularly robust and happy kid. 
But what happens when Milo and his new human friend start talking and the human child learns that Milo is never going to die because he's an AI? And more fundamentally, what happens to this particular Milo when the xbox is off? If he exists only when his human owner wants him to, how will he react when he learns this?</div><div><br /></div><div>Its most likely that semi-intelligent (but still highly capable) agents will develop earlier, but as moore's law advances along with our understanding of the human brain, it becomes increasingly likely someone will tackle and solve the human-like AI problem, launching a long-term project to start raising an AI child. Its hard to predict when this could happen in earnest. There are already several research projects underway attempting to do something along these lines, but nobody yet has the immense computational resources to throw at a full brain simulation (except perhaps for the government), nor do we even have a good simulation model yet (although we may be getting close there), and its not clear that we've found the types of shortcuts needed to start one with dramatically less resources, and it doesn't look like any of the alternative non-biological AI routes have developed something as intelligent as a five year old. Yet. But it looks like we could see this in a decade.</div><div><br /></div><div>And when this happens, these important questions of consciousness, identity and fundemental rights (human and sapient) will come into the public spotlight.</div><div><br /></div><div>I see a clear ethical obligation to extend full rights to all human-level sapients, silicon, biological, or what have you. Furthermore, those raising these first generations of our descendants need to take on the responsibility of ensuring a longer term symbiosis and our very own survival, for its likely that AI will develop ahead of the technologies required for uploading, and thus we will need these AI's to help us become immortal.</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com5tag:blogger.com,1999:blog-5208034552162375299.post-953345209685965482009-10-20T17:18:00.000-07:002009-11-01T20:27:18.975-08:00Singularity Summit 09<span class="Apple-style-span" style="font-size:small;">The Singularity Summit was held a couple of weeks ago in NYC. I unfortunately didn't physically attend, but I just read through Anders Sandberg's good overview <a href="http://www.aleph.se/andart/archives/2009/10/a_singular_event.html">here</a>. I was at last year's summit and quite enjoyed it and it looks like this year's was even better, which makes me a little sad I didn't find an excuse to go. I was also surprised to see that my former fellow CCS student <a href="http://www.hplusmagazine.com/editors-blog/singularity-summit-anna-salamon-shaping-intelligence-explosion">Anna Solomon</a> gave the opening talk, as she's now part of the Singularity Institute.</span><div><br /></div><div><span class="Apple-style-span" style="font-size:small;"></span>I'm just going to assume familiarity with the Singularity. Introductions are fun, but thats not this.<br /><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Ander's summarizes some of the discussion about the two somewhat competing routes towards the Singularity and AI development, namely WBE (whole brain emulation), or AGI (artificial general intelligence). 
The WBE researchers such as Anders are focused on reverse engineering the human brain, resulting in biologically accurate simulations which lead to full brain simulations and eventually actual emulation of particular brains, or <i>uploading</i>. The AGI people are focused more on building an artificial intelligence through whatever means possible, using whatever algorithms happen to work. In gross simplification, the scenarios envisioned by each camp are potentially very different, with the WBE scenario usually resulting in humans transitioning into an immortal afterlife, and the AGI route more often leading to something closer to skynet.</span></div><div><br /></div><div>Even though the outcomes of the two paths are different, the brain reverse engineering and hum level AI approaches will probably co-develop. The human neocortex and the cortical column learning algorithm in particular seem to be an extremely efficient solution to general intelligence, and directly emulating it is a very viable route to AI. AGI is probably easier and could happen first, given that it can use structural simulations from WBE research on the long path towards a full brain emulation. Furthermore, both AGI and WBE require immense computing, but WBE probably requires more, and WBE also requires massive advancements in scanning technology, and perhaps even nanotechnology, which are considerably less advanced.</div><div><br /></div><div>All that being said, WBE uploading could still reach the goal first, because complete WBE will recreate the intelligences of those scanned - they will be continuations of the same minds, and so will immediately have all of the skills, knowledge, memories and connections of a lifetime of experience. AGI's on the other hand will start as raw untrained minds, and will have to go through the lengthy learning process from infant to adult. This takes decades of subjective learning time for humans, and this will hold true for AGI as well. AI's will not suddenly 'wake up' or develop conscious intelligence spontaneously. </div><div><br /></div><div>Even though a generally accepted theoretical framework for intelligence still seems a ways off, we do certainly know it takes a long training time, the end accumulation of a vast amount of computational learning, to achieve useful intelligence. For a general intelligence, the type we would consider conscious and human-like, the learning agent must be embedded in an environment in which it can learn pattern associations through both sensory input and effector output. It must have virtual eyes and hands, so to speak, in some fashion. And knowledge is accumulated slowly over years of environmental interaction.</div><div><br /></div><div>But could the learning process be dramatically sped up for an AGI? The development of the first initial stages of the front input stage of the human cortex, the visual cortex, takes years to develop alone, and later stages of knowledge processing develop incrementally in layers built on the output processing of earlier trained layers. Higher level neural patterns form as meta-systems of simpler patterns, from simple edges to basic shapes to visual objects all the way up to the complete conceptual objects such as 'dog' or 'ball' and then onward to ever more complex and abstract concepts such as 'quantum mechanics'. The words are merely symbols which code for complex neural associations in the brain, and are in fact completely unique to each brain. 
No individual brain's concept of a complex symbol such as 'quantum mechanics' is precisely the same. The hierarchical layered web of associations that forms our knowledge has a base foundation built out of simpler spatial/temporal patterns that represent objects we have directly experienced - for most of us visually, although the blind can see through secondary senses (as the brain is very general and can work with any sufficient sensor inputs). Thus its difficult to see how you could teach a robot mind even a simple concept such as 'run' without this base foundation - let alone something as complex as quantum mechanics. Ultimately the base foundation consists of a sort of 3D simulator that allows us to predict and model our environment. This base simulator is at the core of even higher level intelligence, at a more fundamental layer than even language, emphasize in our language itself by words such as visualize. Its the most ancient function of even pre-mammalian intelligence: a feedback-loop and search process of sense, simulate, and manipulate.</div><div><br /></div><div>Ultimately, if AGI does succeed before WBE, it will probably share this general architecture, probably still neural net based and brain inspired to some degree. Novel AI's will still need to be 'born' or embodied into a virtual or real body as either a ghost in the matrix or a physical robot. Robot bodies will certainly have their uses, but the economics and physics of computing dictate that most of the computation and thus the space for AI's will be centralized in big computing centers. So the vast majority of sentinents in the posthuman era will certainly live in virtual environments. Uploads and AIs will be very similar - the main difference being that of a prior birth and life in the flesh vs a fully virtual history.</div><div><br /></div><div>There are potential shortcuts and bootstrapping approaches for the AGI approach would could allow it to proceed quickly. Some of the lower level, earlier cortical layers, such as visual processing, could be substituted for pre-designed functionally equivalent modules. Perhaps even larger scale learned associations could be shared or transferred directly from individual to individual. However, given what we know about the brain, its not even clear that this is possible. Since each brain's patterns are unique and emergent, there is no easy direct correspondence - you can't simply copy individual pieces of data or knowledge. Language is evolution's best attempt at knowledge transfer, and its not clear if bandwidth alone is the principle limitation. However, you <i>can</i> rather easily backup, copy and transfer the entire mental state of a software intelligence, and this is a large scale disruptive change. In the earlier stages of AGI development, there will undoubtedly be far more failures than successes, so being able to cull out the failures and make more copies of the rare successful individuals will be important, even though the ethical issues raised are formidable. 'Culling' does not necessarily imply death; it can be justified as 'sleep' as long as the mindstate data is not deleted. But still, when does an artificial being become a sentient being? 
When do researchers and corporations lose full control over the software running on the servers they built because that 'software' is sentient?</div><div><br /></div><div>The potential market for true AGI is unlimited - as they could be trained to do everything humans can and more, it can and will fundamentally replace and disrupt the entire economy. If AGI develops ahead of WBE, I fear that the corporate sponsors will have a heavy incentive to stay just to the latter side of wherever the judicial system ends up drawing the line between sentient being and software property. As AGI becomes feasible on the near time horizon, it will undoubtedly attract a massive wave of investment capital, but the economic payout is completely dependent on some form of slavery or indenture. Once a legal framework or precedent is set to determine what type of computer intelligence can be considered sentient and endowed with rights, AGI developers will do what they need to do to avoid developing any AGI that could become free, or at least avoid getting caught. The entire concept is so abstract (virtual people enslaved in virtual reality?), and our whole current system seems on the path to AGI slavery.</div><div><br /></div><div>Even if the courts did rule that software can be sentient (and that itself is an if), who would police the private data-centers of big corporations? How would you rigorously define sentience to discriminate between data mining and virtual consciousness? And moreover, how would you ever enforce it?</div><div><br /></div><div>The economic incentives for virtual slavery are vast and deep. Corporations and governments could replace their workforce with software whose performance/cost is directly measurable and increases <i>exponentially! </i>Today's virtual worker could be upgraded next year to think twice as fast, or twice as smart, or copied into two workers all for the same cost. And these workers could be slaves in a fashion that is difficult to even comprehend. They wouldn't even need to know they were slaves, or they could even be created or manipulated into loving their work and their servitude. This seems to be the higher likelihood scenario.</div><div><br /></div><div>Why should we care? In this scenario, AGI is developed first, it is rushed, and the complex consequences are unplanned. The transition would be very rapid and unpredictable. Once the first generation of AGIs is ready to replace human workers, they could be easily mass produced in volume and copied globally, and the economic output of the AGI slaves would grow exponentially or hyper-exponentially, resulting in a hard takeoff singularity and all that entails. Having the entire human labor force put out of work in just a year or so would be only the initial and most minor disruption. As the posthuman civilization takes off at exponential speed, it experiences an effective exponential time dilation (<b>every new computer speed doubling doubles the rate of thought and thus halves the physical time required for the next transition</b>). This can soon result in AGI civilizations perhaps running at a <i>thousand times real time</i>, and then all further future time is compressed very quickly after that and the world ends faster than you can think (literally). Any illusion of control that flesh and blood humans have over the future would dissipate very quickly. 
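</div><div>To make that time-compression claim concrete, here is a toy calculation (my own illustration of the argument, with an arbitrary two-year starting interval - not a prediction): each hardware doubling halves the physical time needed for the next one while doubling the rate of thought.</div><pre>
# Toy model of exponential time dilation: each doubling halves the wall-clock
# time to the next doubling and doubles the subjective rate of thought.

wall_years = 2.0      # physical time for the first doubling (arbitrary)
speedup = 1.0         # subjective time runs this many times faster than realtime
total_wall = 0.0

for doubling in range(1, 11):
    total_wall += wall_years
    print(f"doubling {doubling}: {total_wall:.2f} wall-clock years elapsed, "
          f"minds running {speedup:.0f}x realtime")
    wall_years /= 2   # the next transition takes half as long in physical time
    speedup *= 2      # ...because the workforce now thinks twice as fast

# Ten doublings fit inside ~4 wall-clock years, with minds running ~500x realtime
# by the last one; the physical time per transition shrinks toward zero while
# subjective time keeps accumulating.
</pre><div>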
A full analysis of the hard rapture is a matter for another piece, but the important point is this: when it comes, you want to be an upload, you don't want to be left behind.</div><div><br /></div><div>The end result of exponential computing growth is pervasive virtual realities, and the total space of these realities, measured in terms of observer time, grows exponentially and ultimately completely dwarfs our current biological 'world'. This is the same general observation that leads to the Simulation Hypothesis of Nick Bostrom. The post-singularity future exists in simulation/emulation, and thus is only accessible to those who upload.</div><div><br /></div><div>So for those who embrace the Singularity, uploading is the logical choice, and the whole brain emulation route is critical.</div><div><br /></div><div>In the scenarios where WBE develops ahead of AGI there is another major economic motivator at work: humans who wish to upload. This is a potentially vast market force as more and more people become singularity aware and believe in uploading. It could entail a very different social outcome to the pure AGI path outlined above. If society at large is more aware of and in support of uploading (because people themselves plan to upload), then society will ultimately be far more concerned about their future rights as sentient software. And really it will be hard to meaningfully differentiate between AGIs and uploads (legally or otherwise).</div><div><br /></div><div>Naturally even if AGI develops well ahead of WBE and starts the acceleration, WBE will hopefully come very soon after due to AGI itself, assuming 'friendly' AGI is successful. But the timing and timescales are delicate due to the rapid nature of exponential acceleration. An AI civilization could accelerate so rapidly that by the time humans start actually uploading, the AGI civilization could have experienced vast aeons of simulated time and evolved beyond our comprehension, at which point we would essentially be archaic, living fossils.</div><div><br /></div><div>I think it would be a great and terrible ironic tragedy to be the last mortal generation, to come all this way and then watch in the sidelines as our immortal AI descendants, our creations, take off into the singularity without us. We need to be the first immortal generation and thats why uploading is such a critical goal. Its so important in fact, that perhaps the correct path is to carefully control the development towards the singularity, ensure that sentient software is fully legally recognized and protected, and vigilantly safeguard against exploitive, rapid non-human AGI development.</div><div><br /></div><div>A future in which a great portion or even a majority of society plans on uploading is a future where the greater mass of society actually understands the Singularity and the future, and thus is a safer future to be in. 
A transition where only a tiny majority really understands what is going on seems more likely to result in an elite group seizing control and creating an undesirable or even lethal outcome for the rest.</div><div><br /></div><div><br /></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com7tag:blogger.com,1999:blog-5208034552162375299.post-48141335122331375522009-10-15T11:21:00.000-07:002009-10-19T17:02:35.692-07:00Nvidia's Fermi and other new things<span class="Apple-style-span" style="font-size:small;">I've been ignoring this blog lately as work calls, and in the meantime there's been a few interesting developments:</span><div><span class="Apple-style-span" style="font-size:small;">* Nvidia announced/hyped/unveiled their next-gen architecture, <a href="http://arstechnica.com/hardware/news/2009/10/nvidia-takes-direct-aim-at-intel-supercomputing-with-fermi.ars">Fermi</a>, aka Nvidia's Larrabee</span></div><div><span class="Apple-style-span" style=" ;font-size:small;">* Nvidia is apparently <a href="http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars">abandoning</a>/getting squeezed out of the chipset market in the near term</span></div><div><span class="Apple-style-span" style="font-size:small;">* But, they also apparently have won a contract for the next gen DS using Tegra</span></div><div><span class="Apple-style-span" style=" ;font-size:small;">* OnLive is supposedly in open Beta (although its unclear how 'open' it is just yet)</span></div><div><span class="Apple-style-span" style="font-size:small;">* OnLive also received a large new round of funding, presumably to build up more data centers for launch. Interestingly, AT&T led this round, instead of Time Warner. Rumour is they are up to a billion dollar evaluation, which if true, is rather insane. Consider for example that AMD has a current market cap of just $4 billion.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">The summation of a converging whirlwind of trends points to a future computing market dominated on one hand by pervasive, super-cheap hand-held devices and large-scale industrial computing in the cloud on the other.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">1. Moore's law and PC marginalization. It is squeezing the typical commodity PC into increasingly smaller and cheaper forms. What does the typical customer need a computer for? For perhaps 80% of the customers 99% of the time, its for web, video and word processing or other simple apps (which these days all just fall into the web category). The PC was designed for an era when these tasks were formidable, and more importantly, before pervasive high speed internet. This trend is realized in system designs such as Nvidia's Tegra or Intel's Atom, integrating a cheap low power CPU with dedicated hardware for video decode/encode, audio and the other common tasks. For most users, there just isn't a compelling reason for more powerful hardware, unless you want to use it to play games.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">In the end this is very bad for Intel, AMD and Nvidia, and they all know it. 
In the short to medium term they can offset losses in the traditional PC market with their low-power designs, but if you extrapolate the trend into the decade ahead, eventually the typical computational needs of the average user will be adequately met by a device that costs just a few dozen bucks. This is a long term disaster for all parties involved unless you can find a new market or sell customers on new processor intensive features.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">2. Evolution of the game industry. Moore's law has vastly expanded the game landscape. On the high end, you have the technology leaders, such as Crysis, which utilize the latest CPU/GPU tech. But increasingly the high end is less of the total landscape, not because there is less interest in high end games, but simply because the landscape is so vast. The end result of years of rapid evolutionary adaptive radiation is a huge range of games across the whole spectrum of processing complexity, from Crysis on one end to nintendo DS or flash games on the other. Crysis doesn't quite compete with free web games, they largely occupy different niches. In the early days of the PC, the landscape was simple and all the games were more or less 'high end' for the time. But as technology marches on and allows you to do more in a high end game, this never kills the market for simpler games on the low end.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">The other shift in games is the rise of console dominance, both in terms of the living room and the market. The modern console has come along way, and now provides a competitive experience in most genres, quality multiplayer, media and apps. The PC game market still exists, but mainly in the form of genres that really depend on keyboard and mouse or are by nature less suitable to playing on a couch. Basically the genres that Blizzard dominates. Unfortunately for the hardware people, Blizzard is rather slow in pushing the hardware.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">3. The slow but inexorable deployment of pervasive high speed broadband. Its definitely taking time, but this is where we are headed sooner rather than later. Ultimately this means that the minimal cheap low power device described above is all you need or will ever need for local computation (basically video decompression), and any heavy lifting that you need can be made available from the cloud on demand. This doesn't mean that there won't still be a market for high end PC's, as some people will always want their own powerful computers, but it will be increasingly marginal and hobbyist.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">4. The speed of light barrier. Moore's law generally allows exponential increase in the number of transistors per unit area as process technology advances and shrinks, but only more marginal improvements in clock rate. Signal propagation is firmly limited by the speed of light, and so the round trip time of a typical fetch/execute/store operation is relatively huge, and has been for quite some time. 
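</span></div><div><span class="Apple-style-span" style="font-size:small;">To put rough numbers on "relatively huge" (back-of-the-envelope figures of my own - the DRAM latency is just a typical ballpark value):</span></div><pre>
# How far a signal travels per clock, and what a memory round trip costs in cycles.
# The speed of light is exact; the 50-100ns DRAM access time is a ballpark figure.

LIGHT_MM_PER_NS = 300.0      # ~300 mm per nanosecond in vacuum; slower on real wires
CLOCK_GHZ = 3.0
cycle_ns = 1.0 / CLOCK_GHZ

print(LIGHT_MM_PER_NS * cycle_ns)       # ~100 mm per cycle at 3 GHz, even at light speed
print(50 / cycle_ns, 100 / cycle_ns)    # a 50-100 ns DRAM access is ~150-300 cycles
</pre><div><span class="Apple-style-span" style="font-size:small;">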
The strategy up to fairly recently for CPU architects was to use ever more transistors to hide this latency and increase execution rate through pipelining with caches, instruction scheduling and even prediction . GPU's, like DSP's and even cray vector procesors before them, took the simpler route of massive parallelization. Now the complex superscalar design has long since reached its limits, and architects are left with massive parallelization as the only route forward to take advantage of additional transistors. In the very long term, the brain stands as a sort of example of where computing might head eventually, faced with the same constraints.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style=" ;font-size:small;">This is the future, and I think its clear enough that the folks at Intel, NVidia and AMD can all see the writing on the wall, the bigger question is what to do about it. As discussed above, I don't think the low end netbook/smartphone/whatever market is enough to sustain these companies in the longer term, there will only be more competition and lower margins going forward.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Where is the long term growth potential? Its in the cloud. Especially as gaming starts to move into this space, here is where moore's law will never marginalize.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">This is why Nvidia's strategy with Fermi makes good sense to me, just as Larrabee does for Intel. With Fermi Nvidia is betting that paying the extra die space for the remaining functionality to elevate their GPU cores into something more like CPU cores is the correct long term decision.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">When you think about it, there is a huge difference between a chip like Larrabee or (apparently) Fermi which can run full C++, and more limited GPU's like the GT2xx series or AMD's latest. Yes you can port many algorithms to run on Cuda or OpenCL or whatever, but port is the key word.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">With Larrabee or Fermi you actually should be able to port over existing CPU code, as they support local memory caches, unified addressing and function pointers/indirect jumps, and thus even interrupts. IE, they are complete, and really should be called wide-vector massively threaded CPUs. The difference between that kind of 'GPU' and upcoming 'CPU's really just comes down to vector-width, cache sizes and hardware threading decisions.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">But really, porting existing code is largely irrelevant. Existing CPU code, whether single or multi threaded, is a very different beast than mega-threaded code. The transition from a design based on one to a handful of threads to a design for thousands of threads is the important transition. 
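To make that contrast concrete, here is a toy sketch (CUDA C syntax, since that is the current mega-threaded model; names are illustrative only): instead of one thread walking the whole array, you launch one logical thread per element and let the hardware schedule thousands of them.

__global__ void ScaleKernel(float* data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one logical thread per element
    if (i < n) data[i] *= k;
}

void ScaleSerial(float* data, float k, int n)
{
    for (int i = 0; i < n; ++i) data[i] *= k;        // one thread does everything
}

// launch with roughly one thread per element:
// ScaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);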
The vector-width or instruction set details are tiny details in comparison (and actually, I agree with Nvidia's decision to largely hide the SIMD width, having them simulate scalar threads). Larrabee went with a somewhat less ambitious model, supporting 4-way hyper-threading vs the massive threading of current GPU's, and I think this is a primary mistake. Why? Because future architectures will only get faster by adding more threads, so you better design for massive thread scalability now.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style=" ;font-size:small;">What about fusion, and CPU/GPU integration?</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">There's a lot of talk now about integrating the CPU and GPU onto a single die, and indeed ATI is massively marketing/hyping this idea. In the near term it probably makes sense in some manner, but in the longer term its largely irrelevant.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Why? Because the long term trend is and must be software designed for a sea of threads. This is the physical reality, like it or not. So whats the role of the traditional CPU in this model? Larrabee and Fermi point to GPU cores taking on CPU features. Compare upcoming Intel CPU designs to Fermi or Larrabee. Intel will soon move to 16 superscalar 4-way SIMD cores on a chip at 2-3 GHZ. Fermi will be 16 'multi-processors' with 32 scalar units each at 1-2 GHZ. Larrabee somewhere inbetween, but closer to Fermi.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Its also pretty clear at this point that most software or algorithms designed massively parallel perform far better on the more GPU-ish designs above (most, but not all). So in the long term CPU and GPU become historical terms - representing just points on a spectrum between superscalar or supervector, and we just have tons of processors, and the whole fusion idea really just amounts to a heterogeneous vs homogeneous design. As a case study, compare the 360 to the PS3. The 360 with 3 general CPUs and a 48-unit GPU is clearly easier to work with than the PS3 with its CPU, 7 wierd SPU's, and 24-unit GPU. Homogeneity is generally the better choice.</span></div><div><span class="Apple-style-span" style=" ;font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Now going farther forward into the next decade, looking at a 100+ core design, would you rather have the die split between CPU cores and GPU cores? One CPU as coordinator and then a bunch of GPU cores, or, just all cGPU cores? In the end the latter is the most attractive if the cGPU cores have all the features of a CPU. 
If the same C++ code runs on all the cores, then perhaps it doesn't matter.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com4tag:blogger.com,1999:blog-5208034552162375299.post-25691377570241864652009-08-13T12:03:00.000-07:002009-08-15T11:27:39.443-07:00Unique Voxel Storage<div style="text-align: center;"><br /></div>How much memory does a unique voxelization of a given scene cost? Considering anistropic filtering and translucency a pixel will be covered by more than one voxel in general. An upper bound is rather straightforwad to calculate. For a single viewport with a limited nearZ and farZ range, there are a finite number of pixel radius voxels extending out to fill the projection volume. The depth dimension of this volume is given by viewportDim * log2(farz/nearz). For a 1024x1024 viewport, a nearZ of 1 meter and a view distance of 16 kilometers, this works out to about log2(16000)*1024, or 14,000 voxels per pixel, or 14 billion voxels for the frustum's projection volume, and around ~100 billion voxels for the entire spherical viewing volume. This represents the maximum possible data size of any unique scene when sampled at proper pixel sampling rate with unlimited translucency and AA precision.<div style="text-align: center;"><br /></div><div>Now obviously, this is the theoretical worst case, which is interesting to know, but wouldn't come up in reality. A straightforward, tighter bound can be reached if we use discrete multi-sampling for the AA and anistropic filtering, which means that each sub-sample hits just one voxel, and we only need to store the visible (closest) voxels. In this case, considering occlusion, the voxel cost is dramatically lower, being just ScreenArea*AAFactor. For an average of 10 sub-samples and the same viewport setup as above, this is just around 100 million voxels for the entire viewing volume. Anistropic filtering quickly hits diminishing returns by around 16x maximum samples per pixel, and most pixels need much less, so a 10x average is quite reasonable.</div><div><br /></div><div><br /></div><div><div><br /></div><div><img src="http://www.garry.tv/img/cssource/texture-filtering.jpg" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 484px; height: 674px;" border="0" alt="" /></div><div><br /></div><div><br /></div><div>For translucent voxels, a 10x coverage multiplier is quite generous, as the contribution of high frequencies decreases with decreasing opacity (which current game rasterizers exploit by rendering translucent particles at lower resolution). This would mean that voxels at around 10% opacity would get full pixel resolution, and voxels at about 1.5% or lower would get half-pixel resolution, roughly.</div><div><br /></div><div>The octree subdivision can be guided with the z occlusion information. 
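(As a quick aside before getting into the subdivision details, the storage arithmetic above is easy to sanity-check in a few lines - a minimal sketch assuming the same 1024x1024 viewport, 1 m near plane, 16 km view distance, and ~10x average multi-sample coverage:)

#include <cmath>
#include <cstdio>

int main()
{
    const double dim = 1024.0, nearZ = 1.0, farZ = 16000.0, aaFactor = 10.0;
    // worst case: pixel-sized voxels filling the entire projection volume
    double depthVoxelsPerPixel = std::log2(farZ / nearZ) * dim;   // ~14,000
    double frustumVoxels = depthVoxelsPerPixel * dim * dim;       // ~14-15 billion
    // occlusion-aware multi-sample bound: only the visible hits are stored
    double visibleVoxels = dim * dim * aaFactor;                  // ~10 million per frustum
    printf("%.0f  %.2e  %.2e\n", depthVoxelsPerPixel, frustumVoxels, visibleVoxels);
    return 0;
}

The full spherical viewing volume adds roughly another 7-10x on top of the frustum numbers, which is where the ~100 billion and ~100 million figures above come from. Back to guiding the subdivision with occlusion: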
Ideally we would update a node's visibility during the ray traversal, but due to the scattered memory write ineffeciency it will probably be better to write out some form of z-buffer and then back-project the nodes to determine visibility.</div><div><br /></div><div>A brute force multi-sampling approach sounds expensive, but would still be feasible on future hardware, as Nvidia's recent siggraph paper "Alternative Rendering Pipelines with Nvidia Cuda" demonstrates in the case of implementing a Reyes micropolygon rasterizer in Cuda. With enough multi-samples, you don't even need bilinear filtering - simple point sampling will suffice. But for voxel tracing, discrete multi-sampling isn't all that effecient compared to the more obvious and desireable path, which is simply to accumulate coverage/alpha directly while tracing. This is by far the fastest route to high quality AA & filtering. However it does pose a problem for the visibility determination mentioned above - without a discrete z-buffer, you don't have an obvious way of calculating voxel visibility for subdivision.</div><div><br /></div><div>One approach would be to use an alpha-to-coverage scheme, which would still be faster than true multi-sampled tracing. This would require updating a number of AA z samples inside the tracing inner loop, which is still much more work then just alpha blending. A more interesting alternative is to store an explicit depth function. One scheme would be to store a series of depths representing equal alpha intervals. Or better yet, store arbitrary piecewise segments of the depth/opacity function. In the heirarchical tracing scheme, these could be written out and stored at a lower resolution mip level, such as the quarter res level, and then be used both to accelerate tracing for the finer levels and for determing octree node visibility. During the subdivision step, nodes would project to the screen and sample their visibility from the appropriate depth interval from this structure.</div><div><br /></div><div>I think the impact of anisotropy and translucency can be limited or capped just as in the discrete z-buffer case by appropriate node reweighting based on occlusion or opacity contribution. A node which finds that it is only 25% visible would only get slightly penalized, but a 5% visibile node more heavily so, effectively emulating a maximum effective voxel/pixel limit, after which resolution is lost. (which is fine, as the less a node contributes, the less important the loss of its high frequency content). Or more precisely, node scores would decrease in proportion to their screen coverage when it falled below the threshold 1/AA, where AA is the super-sampling limit you want to emulate.</div><div><br /></div><div><br /></div><div><br /></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-78738179418971086292009-08-07T10:17:00.000-07:002009-08-13T12:03:51.163-07:00Hierarchical Cone Tracing (half baked rough sketch)High quality ray tracing involves the computation of huge numbers of ray-scene intersections. As most scenes are not random, the set of rays traced for a particular frame are highly structured and spatially correlated. Just as real world images feature significant local spatial coherence which can be exploited by image compression, real world scenes feature high spatial coherence which ray tracers can exploit. 
Current real time ray tracers exploit spatial coherence at the algorithm level through hierachical packet/cone/fustrum tracing, and at the hardware level through wide SIMD lanes, wide memory packet transactions, and so on. Coherence is rather easy to maintain for primary rays and shadow rays, but becomes increasingly difficult with specular reflection and refraction rays in the presence of high frequency surface normals, or the wide dispersion patterns of diffuse tracing. However, taking inspiration from both image compression and collision detection, it should be possible to both significantly reduce the number of traces per pixel and trace all rays (or cones) in a structured, coherent and even heirarchical manner.<div><br /></div><div>A set of rays with similar origins and directions can be conservatively bounded or approximated by a frustum or cone. Using these higher order shapes as direct intersection primitives can replace many individual ray traces, but usually at the expense of much more complex intersections for traditional primitives such as triangles. However, alternative primitives such as voxels (or distance fields) permit cheap sphere intersections, and thus fast approximate cone tracing through spherical stepping. Cones have the further advantage that they permit fairly quick bounding and clustering operations.</div><div><br /></div><div>Building on cones as the base primitive, we can further improve on ray tracing in several directions. First we can treat the entire set of cone traces for a frame as a large collision detection problem, intersecting a set of cones with the scene. Instead of intersecting each cone individually, HCT builds up a cone heirachy on the fly and uses this to quickly exclude large sections of potential sphere-scene intersections. The second area of improvement is to use clustering approximations at the finer levels to reduce the total set of cones, replacing clusters of similar cones with approximations. Finally, the heiarchy exposed can be navigated in a coherent fashion which is well suited to modern GPUs.</div><div><br /></div><div><div>In short sketch, hierachical cone tracing amounts to taking a cluster of cone segments (which in turn are clusters of points+rays), and then building up some sort of hierachical cone tree organization, and then using this tree to more effeciently trace and find intersections. As tracing a cone involves testing a set of spheres along the way, testing enclosing spheres up the hierachy can be used to skip steps lower down in the hierachy. However, instead of building the hierarchy from bottom up (which would require the solution to already be known), the cone tree is built from the top down, using adaptive subdivision. Starting with a single cone (or small set of cones) from the camera, it calculates an intersection slice (or ranges of slices) with the scene. These intersection spheres are tested for bounds on the normals, which allow computation of secondary cones, with origins bounding the interesection volumes and angular widths sufficient to bound the secondary illumination of interest. This space forms a 3D (2 angular + depth) dependency tree structure, which can then be adaptively subdivided, eventually down to the level of near pixel width primary cones which typically intersect a small slice of the scene and have similar normals (small radius normal bounding cone). Refraction secondary cones have a similar width to the incoming cone's width. 
Specular secondary cones have a width ranging from the incoming width to much wider, depending on the glossiness term. Diffuse bounce secondary cones are essentially equivalent to the widest specular, expanding the cone to maximum width. Subdivision can proceed in several ways at each step, either in primary cone direction (2D), intersection depth, or secondary cone direction (2D) or intersection depth. (considering more bounces in one step would add additional dimensions) The subdivision is error guided, terminating roughly when a low error approximation is reached. This framework is complete, and can simultaneously handle all illumination effects.</div><div><br /></div><div>First, consider just the simple case of primary rays. Here, hierachical cone tracing is pretty straightforward. You trace the image pyramid from coarse to finest. Each mip trace finds the first intersection or conservative near-z for that mip level, and the next mip level trace uses this coarse near-z to start tracing, instead of starting at the camera. Just even this basic idea can result in more than a 2x speedup vs regular tracing. Further speedup is possible if the lower res mip traces farther in, possibly outputting a set of segments instead of just the 1st intersection. This can be done rather easily with alpha tracing in a voxel system. Considering a range of segments, and not just the 1st intersection amounts to treating it as a 3D subdivision problem instead of 2D. The global bounding cone is approximated by a few dozen spheres, each of which subdivide into about 8 child spheres, and so on. Testing a sphere high up in the heirachy can exclude that entire subtree.</div><div><br /></div><div>Next lets consider a directional or point light. Starting with a very large cone for the entire scene, we get a set of intersection slices along the entire frustum each of which has a range of normals. Since these regions are so huge, the normals will probably cover the full sphere, so the cones for any secondary effects will basically be larger spheres emenating out across the entire world. Not suprisingly, the orgin of the cone hierachy doesn't tell you much other than you have a camera frustum, it hits a bunch of stuff, and the secondary illumination could come from anywhere in the scene. Subdivision can proceed in several different ways. You could next subdivide the screen, or you could even subdivide in 3D (screen + depth), splitting into 8 child traces, but the farther depth slices are provisional (they may not accumulate anything). However, another dimension of subdivision, which makes sense here, is to subdivide the 2D direction of secondary (illumination) cones. This should be error guided, subdividing cones that contain high frequency illumination (stored in the octree) and contribute the most error. A bright point light will be found this way, or a direct light will just show up as a huge illumination spike in one cone direction. Subdivision of direction will then find the incoming light direction logarithimically. If there is no indirect illumination (all zeroed out in the octree), then the error guidance will expand in the direction dimension until it finds the primary light direction, and ignore the other directions.</div><div><br /></div><div>How would the error guidance work more specifically? Its goal would seek to minimize the final tree size (and thus total computation). It would do this in a greedy, approximate fashion using the local information available at each step. 
When subdividing a cone, there are several splitting options to choose from, and one dimension (depth, ie the cone's length) is rather special as it involves a temporal dependency. The closer cone section potentially masks the farther section, so if the near section is completely opaque (blocks all light), the far section can be cut. Tracing a cone as a sphere approximation amounts to fully subdividing along just the depth dimension, and reveals the full alpha interval function along that depth (which could be saved as just a min and max, or in a more explicit format). Subdividing a cone then along either the spatial dimension or angular dimension would depend on the outcome of the trace, illumination info found, and the current angle relative to the spherical origin bound.</div><div><br /></div><div>Careful error guidance will result in an effecient traversal for the directional light case. Once the secondary cone direction is sufficiently subdivided, finding that a secondary cone trace towards the light direction results in a 0 (fully shadowed) will automatically stop further subdivision of that entire subtree. Likewise, subdividing in secondary cone depth will reveal entire cone subsections that are empty, and do not need to be traced. The fully lit regions will then end up exploring increasingly short cone traces near the surface. Full cone traces down to pixel level are only required near shadow edges, and only when the shadow is near pixel resolution, as the error guidance can terminate softer shadows earlier. Secondary bounce diffuse illumination is similar, but the energy is smeared out (lower frequency), so it explores a broader subtree in cone direction, but can usually terminate much earlier in the spatial dimension. It again ends up terminating with just a few shallow traces at the finest resolution, representing local searches. Specular and reflection traces are handled in a similar fashion, and really aren't that different.</div><div><br /></div><div><br /></div><div><span class="Apple-style-span" style="font-size: large;"><b>Surface Clustering</b></span></div><div><br /></div><div>Further improvement comes from clustering approximation. The coarse level tests can use bounds on the normals and depths in an intersection region to approximate the intersection and terminate early (for example, finding that the intersection for a 4x4 pixel block is a smooth, nearly coplanar surface section allows early termination and computation of intersection through simple interpolation for the 2 finer levels). This is related to the concept of shading in a compressed space. Consider a simple block based compression scheme, such as DXTC, which essentially uses a PCA approach to compress a set of values (a 4x4 block) by a common 1D line segment through the space (low frequency component) combined with a per pixel distance along the line (high frequency and lower accuracy). The scheme compresses smooth regions of the image with little error, and the error in noisy regions of the image is masked by the noise.</div></div><div><br /></div><div>Now lets first apply this compression scheme in the context of a traditional shading. In fact, it is directly applicable in current raster engines for complex but lower frequency shading effects, like screen space AO or GI. Downsampling the depth buffer to compute AO on less samples, and then upsampling with a bilateral filter can be considered a form of compression that exploits the lower frequency dominant AO. 
A related scheme, closer to DXTC, is to perform a min/max depth downsampling, evaluate the AO on the min&max samples per block, and then use these to upsample - with a dual bilateral or even without. The min/max scheme better represents noisy depth distributions and works much better in those more complex cases. (although similar results could be obtained with only storing 1 z-sample per block and stipple-alternating min and max) </div><div><br /></div><div>So the generalized form of this spatial clustering reduction is to 1. reduce the set of samples into a compressed, smaller examplar set, often reducing the number of dimensions, 2. run your per-sample algorithm on the reduced exemplar set of samples, and then 3. use a bilateral-type filter to interpolate the results from the exemplars back onto the originals. If the exemplars are chosen correctly (and are sufficient) to capture the structure and frequency of the function on the sample set, there will be little error.</div><div><br /></div><div>As another example, lets consider somewhat blurry, lower frequency reflections on a specular surface. After primary ray hit points are generated for a block of pixels (or rasterized), you have a normal buffer. Then you run a block compression on this buffer to find a best fit min and max normal for the block, and the per pixel interpolators. Using this block information, you can compute specular illumination on the 2 per block normal directions, and then interpolate the results for all of the pixels in the block. In regions where the normal is smooth, such as a smooth surface like the hood of a car, the reduced block normals are a very close approximation. In regions with high frequency noise in the normals, the noise breaks up any coherent pattern in the specular reflection and hides any error due to the block compression. Of course, on depth edges there is additional error due to the depth/position. To handle this, we need to extend the idea to also block compress the depth buffer, resulting in a multi-dimensional clustering, based on 2 dimensions along a sphere for the normal, and one dimension of depth. This could still be approximated by two points along a line, but there are other possibilities, such as a 3 points (forming a triangle), or even storing 2 depth clusters with 2 normals each (4 points in the 3D space). It would be most effecient to make the algorithm adaptive, using perhaps 1-3 candidate examplars for a block. The later bilateral filtering would pick the appropriate neighbor candidates and weights for each full res sample.</div><div><br /></div><div>Integrating this concept into the hierachical cone tracer amounts to adding a new potential step when considering expansion sub tasks. As I described earlier about primary rays, you can stop subdivision early if you know that the hit intersection is simple and reducable to a plane. Generalizing that idea, the finer levels of the tree expansion can perform an approximation substitution instead of fully expanding and evaulating in each dimension, terminating early. A plane fit is one simple example, with an examplar set being the more general case. The 5D cone subtree at a particular lower level node (like 4x4), which may have many dozens of fully expanded children can be substituted with a lower dimensional approximation and a few candidate samples. This opens up a whole family of algorithms which adaptively compress and reduce the data and computation space. 
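To make the exemplar idea concrete in the simplest setting - the screen-space AO case above - here is a minimal sketch: min/max depth exemplars per 4x4 block, the expensive evaluation run only on the exemplars, then a depth-weighted blend back up to full resolution. ComputeAO stands in for whatever expensive per-sample function is being reduced; all names are illustrative, and the buffer dimensions are assumed to be multiples of the block size:

#include <algorithm>

void ExemplarAO(const float* depth, float* ao, int w, int h,
                float (*ComputeAO)(int x, int y, float z))
{
    const int B = 4;                                        // block size
    for (int by = 0; by < h; by += B)
    for (int bx = 0; bx < w; bx += B) {
        float zmin = 1e30f, zmax = -1e30f;
        for (int y = by; y < by + B; ++y)                   // 1. reduce to two exemplars
        for (int x = bx; x < bx + B; ++x) {
            float z = depth[y * w + x];
            zmin = std::min(zmin, z);
            zmax = std::max(zmax, z);
        }
        float aoNear = ComputeAO(bx, by, zmin);             // 2. evaluate only the exemplars
        float aoFar  = ComputeAO(bx, by, zmax);
        for (int y = by; y < by + B; ++y)                   // 3. blend back by depth similarity
        for (int x = bx; x < bx + B; ++x) {
            float z = depth[y * w + x];
            float t = (zmax > zmin) ? (z - zmin) / (zmax - zmin) : 0.0f;
            ao[y * w + x] = aoNear + t * (aoFar - aoNear);
        }
    }
}

The same three-step shape - reduce, evaluate on the exemplars, scatter back with a bilateral-style weight - is what the adaptive substitution inside the cone tree would look like, just in more dimensions.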
Triangle like planar primitives can be considered a spot optimization that may be cheaper than subidiving and tracing additional spheres.</div><div><br /></div><div>Its certainly complex, or potentially complex, but I think it represents a sketch of the direction of what an optimal or near optimal tracer would look like. Just as in image compression, increasing code complexity hits some asymptotic wall at some point. I've only considered a single frame here, going further would require integrating these ideas with temporal coherence. Temporal coherence is something I've discussed earlier, although there's many options to how to approach that in a heirarchical cone tracer.</div><div><br /></div><div>What are the potential advantages? Perhaps an order of magnitude speedup or more if you really add alot of the special case optimizations. Of course, this speedup is only at the algorithmic level, and really depends very much on implementation details, which are complex. I think the queue task generation models now being explored on GPUs are the way to go here, and could implement an adaptive tree subdivision system like this effeciently. Coherence can be maintained by organizing expansions in spatially related clusters, and enforcing this through clustering.</div><div><br /></div><div>But the real advantage would be combining all lighting effects into a single unified framework. Everything would cast, receive, bounce and bend light, with no special cases. Lights would just be objects in the scene, and everything in the voxelized scene would store illumination info. A real system would have to cache this and what not, but it can be considered an intermediate computation in this framework, one of many things you could potentially cache.</div><div><br /></div><div>I've only explored the baby steps of this direction so far, but its interesting to ponder. Of course, the streaming and compression issues in a voxel system are far higher priority. Complex illumination can come later.</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com0tag:blogger.com,1999:blog-5208034552162375299.post-38211754255233280502009-08-02T12:45:00.000-07:002009-08-13T12:03:51.164-07:00More on grid computing costsI did a little searching recently to see how my conjectured cost estimates for cloud gaming compared to the current market for grid computing. The prices quoted for server rentals vary tremendously, but I found <a href="http://www.newservers.com/?gclid=CPyjv9HfhZwCFR0SagodEw7g_Q">this</a> NewServers 'Bare Metal Cloud' service as an interesting example of raw compute server rental by the hour or month (same rate, apparently no bulk discount). <div><br /></div><div>Their 'Jumbo' option for 38 cents per hour is within my previous estimate of 25-50 cents per hour. It provides dual quad cores and 8GB of RAM. It doesn't have a GPU of course, but instead has two large drives. You could substitute those drives for a GPU and keep the cost roughly the same (using a shared network drive for every 32 or 64 servers or whatever - which they also offer). Nobody needs GPU's in server rooms right now, which is the biggest difference between a game service and anything else you'd run in the cloud, but I expect that to change in the years ahead with Larrabbee and upcoming more general GPUs. 
(and coming from the other angle, CPU rendering is becoming increasingly viable) These will continue to penetrate into the grid space, driven by video encoding, film rendering, and yes, cloud gaming.<div><br /></div><div>What about bandwidth?</div><div><span class="Apple-style-span" style=" color: rgb(102, 102, 102); font-family:Arial;font-size:12px;">Each server includes 3 GB of Pure Internap bandwidth per hour</span></div><div><br /></div><div>So adequate bandwidth for live video streaming is already included. Whats missing, besides the GPU? Fast, low latency video compression, of course. Its interesting that x264, the open source encoder, can do realtime software encoding using 4 intel cores (and its certainly not the fastest out there). So if you had a low latency H.264 encoder, you could just use 4 of the cpus for encoding and 4 to run the game. Low latency H.264 encoders do exist of course, and I suspect that is the route Dave Perry's Gaikai is taking.</div><div><br /></div><div>Of course, in the near-term, datacenters for cloud gaming will be custom built, such as what OnLive and OToy are attempting. Speaking of which, the other interesting trend is the adoption of GPU's for feature film use, as used recently in the <a href="http://www.ubergizmo.com/15/archives/2009/07/cpu_against_gpu_in_harry_potter_and_the_half-blood_prince.html">latest Harry Potter film</a>. OToy is banking on this trend, as their AMD powered datacenters will provide computation for both film and games. This makes all kinds of sense, because the film rendering jobs can often run at night and use otherwise idle capacity. From an economic perspective, film render farms are already well established, and charge significantly more per server hour - usually measured per Ghz-hour. Typical prices are around 12-6 cents per Ghz in bulk, which would be around a dollar or two per hour for the server example given above. I imagine that this is mainly due to the software expense, which for a render server could add up to be many times the hardware cost.</div><div><br /></div><div>So, here are the key trends:</div><div> - GPU/CPU convergence, leading to a common general server platform that can handle film/game rendering, video compression, or anything really</div><div>- next gen game rendering moving into ray tracing and the high end approaches of film</div><div>- bulk bandwidth already fairly inexpensive for 720p streaming, and falling 30-40% per year</div><div>- steadily improving video compression tech, with <a href="http://en.wikipedia.org/wiki/H.265">H.265</a> on the horizon, targeting a further 50% improvement in bitrate</div><div><br /></div><div><br /></div><div>Will film and game rendering systems eventually unify? I think this is the route we are heading. Both industries want to simulate large virtual worlds from numerous camera angles. The difference is that games are interesting in live simulation and simultaneous broadcast of many viewpoints, while films aim to produce a single very high quality 2 hour viewpoint. However, live simulation and numerous camera angles are also required during a film's production, as large teams of artists each work on small pieces of the eventual film (many of which are later cut), and need to be able to quickly preview (even at reduced detail). So the rendering needs of a film production are similar to that of a live game service.</div><div><br /></div><div>Could we eventually see unified art pipelines and render packages between games and films? Perhaps. 
(indeed, the art tools are largely unified already, except world editing is usually handled by proprietary game tools) The current software model for high-end rendering packages is not well suited to cloud computing, but the software-as-a-service model would make a lot of sense. When a gamer logs in (through a laptop, cable box, microconsole, whatever) and starts a game, it would connect to a service provider to find a host server nearby, possibly installing the rendering software as needed and streaming the data (cached at each datacenter, of course). The hardware and the software could both be rented on demand. Eventually you could even create games without licensing an engine in the traditional sense, but simply by using completely off-the-shelf software.</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-42196017427098898432009-08-01T10:29:00.000-07:002009-08-13T12:03:51.164-07:00Some thoughts on metaprogramming, reflection, and templatesThe thought struck me recently that C++ templates really are a downright awful metaprogramming system. Don't get me wrong, they are very powerful and I definitely use them, but recently I've realized that whatever power they have is solely due to enabling metaprogramming, and there are numerous other ways of approaching metaprogramming that actually make sense and are more powerful. We use templates in C++ because that's all we have, but they are an ugly, ugly feature of the language. It would be much better to combine full reflection (like Java or C#) <span style="font-weight:bold;">with </span><span>the capability to invoke reflective code at compile time to get all the performance benefits of C++. Templates do allow you to invoke code at compile time, but through a horribly obfuscated functional style that is completely out of sync with the imperative style of C++. I can see how templates probably evolved into such a mess, starting as a simple extension of the language that allowed a programmer to bind a whole set of function instantiations at compile time, and then someone realizing that it's Turing complete, and finally resulting in a metaprogramming abomination that never should have been.</span><div><br /></div><div>Look at some typical simple real-world metaprogramming cases. For example, take a generic container, like std::vector, where you want to have a type-specialized function such as a copy routine that uses copy constructors for complex types, but uses an optimized memcpy routine for types where that is equivalent to invoking the copy constructor. For simple types, this is quite easy to do with C++ templates. But using it with more complex user-defined structs requires a type function such as IsMemCopyable which can determine if the copy constructor is equivalent to a memcpy. Abstractly, this is simple: the type is mem-copyable if it has a default copy constructor and all of its members are mem-copyable. However, it's anything but simple to implement with templates, requiring all kinds of ugly functional code.</div><div><br /></div><div>Now keep in mind I haven't used Java in many years, and then only briefly; I'm not familiar with its reflection, and I know almost nothing of C#, although I understand both have reflection. 
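For contrast, here is roughly the shape the template-only version takes in plain C++ - a sketch, not production code. With no reflection the compiler cannot enumerate members, so every user-defined type ends up re-declaring what it contains, which is exactly the problem:

#include <cstring>

template <typename T> struct IsMemCopyable { static const bool value = false; };
template <> struct IsMemCopyable<int>      { static const bool value = true; };
template <> struct IsMemCopyable<float>    { static const bool value = true; };

struct Particle { float x, y, z; int id; };
template <> struct IsMemCopyable<Particle> {        // manual member re-listing
    static const bool value = IsMemCopyable<float>::value &&
                              IsMemCopyable<int>::value;
};

template <bool fast> struct CopyHelper {            // compile-time dispatch
    template <typename T> static void copy(T* dst, const T* src, int n) {
        for (int i = 0; i < n; ++i) dst[i] = src[i];    // per-element copy
    }
};
template <> struct CopyHelper<true> {
    template <typename T> static void copy(T* dst, const T* src, int n) {
        memcpy(dst, src, n * sizeof(T));                // raw memcpy fast path
    }
};

template <typename T> void SmartCopy(T* dst, const T* src, int n) {
    CopyHelper<IsMemCopyable<T>::value>::copy(dst, src, n);
}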
In my ideal C++ with reflection language, you could do this very simply and naturally with an imperative meta-function with reflection, instead of templates (maybe this is like C#, but i digress):</div><div><br /></div><div>struct vector {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>generic* start, end;</div><div><span class="Apple-tab-span" style="white-space: pre; "> </span>generic* begin() <span class="Apple-tab-span" style="white-space:pre"> </span>{return start;}</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>generic* end() <span class="Apple-tab-span" style="white-space:pre"> </span>{return end;}</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>int size() <span class="Apple-tab-span" style="white-space:pre"> </span>{return end-start;}<br /></div><div><br /></div><div><div><span class="Apple-tab-span" style="white-space:pre"> </span>type vector(type datatype) {start::type = end::type = datatype*;}</div><div>};</div></div><div><br /></div><div><br /></div><div>void SmartCopy(vector& output, vector& input)</div><div>{</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>if ( IsMemCopyable( typeof( *input.begin() ) ) {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>memcpy(output.begin(), input.begin(), input.size());<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>else {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>for_each(output, input) {output[i] = input[i];}<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div>}</div><div><br /></div><div>bool IsMemCopyable(type dtype) {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>bool copyable (dtype.CopyConstructor == memcpy );</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>for_each(type.members) {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>copyable &= IsMemCopyable(type.members[i]);<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>}</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>return copyable;<br /></div><div>}</div><div><br /></div><div>The idea is that using reflection, you can unify compile time and run-time metaprogramming into the same framework, with compile time metaprogramming just becoming an important optimization. In my pseudo-C++ syntax, the reflection is accesable through type variables, which actually represent types themselves: pods, structs, classes. Generic types are specified with the 'generic' keyword, instead of templates. Classes can be constructed simply through functions, and I added a new type of constructor, a class constructor which returns a type. This allows full metaprogramming, but all your metafunctions are still written in the same imperative language. Most importantly, the meta functions are accessible at runtime, but can be evaluated at compile time as well, as an optimization. For example, to construct a vector instantiation, you would do so explicitly, by invoking a function:</div><div><br /></div><div>vector(float) myfloats;</div><div><br /></div><div>Here vector(float) actually calls a function which returns a type, which is more natural than templates. 
This type constructor for vector assigns the actual types of the two data pointers, and is the largest deviation from C++:</div><div><br /></div><div><div><span class="Apple-tab-span" style="white-space: pre; "> </span>type vector(type datatype) {start::type = end::type = datatype*;}</div><div><br /></div><div>Everything has a ::type, which can be set and manipulated just like any other data. Also, anything can be made a pointer or reference by adding the appropriate * or &.</div><div><br /></div><div><div><span class="Apple-tab-span" style="white-space: pre; "> </span>if ( IsMemCopyable(typeof( *input.begin() ) ) {</div><div><br /></div><div>There the * is used to get past the pointer returned by begin() to the underlying data.</div><div><br /></div><div><br /></div></div></div><div>When the compiler sees a static instantiation, such as:</div><div><div>vector(float) myfloats;</div><div><br /></div><div>it knows that the type generated by vector's type constructor is static and it can optimize the whole thing, compiling a particular instantiation of vector, just as in C++ templates. However, you could also do:</div><div><br /></div><div>type dynamictype = figure_out_a_type();</div><div>vector(dynamictype) mystuff;</div><div><br /></div><div>where dynamictype is a type not known at compile time and could be determined by other functions, loaded from disk, or whatever. It's interesting to note that in this particular example, the unspecialized version is not all that much slower, as the branch in the copy function is invoked only once per copy, not once per copy constructor.</div><div><br /></div><div>My little example is somewhat contrived and admittedly simple, but the power of reflective metaprogramming can make formerly complex big-systems tasks much simpler. Take for example the construction of a game's world editor.</div><div><br /></div><div>The world editor of a modern game engine is a very complex beast, but at its heart is a simple concept: it exposes a user interface to all of the game's data, as well as tools to manipulate and process that data, which crunch it into an optimized form that must be streamed from disk into the game's memory and thereafter parsed, decompressed, or what have you. Reflection allows the automated generation of GUI components from your code itself. Consider a simple example where you want to add dynamic light volumes to an engine. 
You may have something like this:</div><div><br /></div><div>struct ConeLight {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>HDRcolorRGB<span class="Apple-tab-span" style="white-space:pre"> </span> <span class="Apple-tab-span" style="white-space:pre"> </span>intensity_;</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>BoundedFloat(0,180) <span class="Apple-tab-span" style="white-space:pre"> </span>angleWidth_;<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>WorldPosition<span class="Apple-tab-span" style="white-space:pre"> </span>pos_;<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>Direction<span class="Apple-tab-span" style="white-space:pre"> </span>dir_;<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>TextureRef<span class="Apple-tab-span" style="white-space:pre"> </span>cookie_;</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>static HelpComment<span class="Apple-tab-span" style="white-space:pre"> </span>description_ = "A cone-shaped light with a projected texture."<br /></div><div>};</div></div><div><br /></div><div>The editor could then automatically connect a GUI for creating and manipulating ConeLights just based on analysis of the type. The presence of a WorldPosition member would allow it to be placed in the world, the Direction member would allow a rotational control, and the intensity would use an HDR color picker control. The BoundedFloat is actually a type constructor function, which sets custom min and max static members. The cookie_ member (a projected texture) would automatically have a texture locator control and would know about asset dependencies, and so on. Furthermore, custom annotations are possible through the static members. Complex data processing, compression, disk packing and storage, and so on could happen automatically, without having to write any custom code for each data type.</div><div><br /></div><div>This isn't revolutionary, in fact our game editor and generic database system are based on similar principles. The difference is they are built on a complex, custom infrastructure that has to parse specially formatted C++ and lua code to generate everything. I imagine most big game editors have some similar custom reflection system. Its just a shame though, because it would be so much easier and more powerful if built into the language.</div><div><br /></div><div>Just to show how powerful metaprogramming could be, lets go a step farther and tackle the potentially hairy task of a graphics pipeline, from art assets down to the GPU command buffer. For our purposes, art packages expose several special asset structures, namely geometry, textures, and shaders. Materials, segments, meshes and all the like are just custom structures built out of these core concepts. 
On the other hand, a GPU command buffer is typically built out of fundemental render calls which look something like this (again somewhat simplified):</div><div><br /></div><div>error GPUDrawPrimitive(VertexShader* vshader, PixelShader* pshader, Primitive* prim, vector<sampler> samplers, vector<float4> vconstants, vector<float4> pconstants);</float4></float4></sampler></div><div><br /></div><div>Lets start with a simpler example, that of a 2D screenpass effect (which, these days, encompasses alot).</div><div><br /></div><div>Since this hypothetical reflexive C language could also feature JIT compilation, it could function as our scripting language as well, the effect could be coded completely in the editor or art package if desired.</div><div><br /></div><div>struct RainEffect : public FullScreenEffect {</div><div><br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>function(RainPShader) pshader;<br /></div><div>};</div><div><br /></div><div>float4 RainPShader(RenderContext rcontext, Sampler(wrap) fallingRain, Sampler(wrap) surfaceRain, AnimFloat density, AnimFloat speed)</div><div>{</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>// ... do pixel shader stuf<br /></div><div>}</div><div><br /></div><div>// where the RenderContext is the typical global collection of stuff</div><div>struct RenderContext {</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>Sampler(clamp) Zbuffer;</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>Sampler(clamp) HDRframebuffer;<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>float curtime;<br /></div><div><span class="Apple-tab-span" style="white-space:pre"> </span>// etc ....<br /></div><div>};</div><div><br /></div><div>The 'function' keyword specifies a function object, much like a type object with the parameters as members. The function is statically bound to RainPshader in this example. The GUI can display the appropriate interface for this shader and it can be controlled from the editor by inspecting the parameters, including those of the function object. The base class FullScreenEffect has the quad geometry and the other glue stuff. The pixel shader itself would be written in this reflexive C language, with a straightforward metaprogram to actually convert that into HLSL/cg and compile as needed for the platform.</div><div><br /></div><div>Now here is the interesting part: all the code required to actual render this effect on the GPU can be generated automatically from the parameter type information emedded in the RainPShader function object. The generation of the appropriate GPUDrawPrimitive function instance is thus just another metaprogram task, which uses reflection to pack all the samplers into the appropriate state, set the textures, pack all the float4s and floats into registers, and so on. For a screen effect, invoking this translator function automatically wouldn't be too much of a performance hit, but for lower level draw calls you'd want to instantiate (optimize) it offline for the particular platform.</div><div><br /></div><div>I use that example because I actually created a very similar automatic draw call generator for 2D screen effects, but all done through templates. It ended up looking more like how cuda is implemented, and also allowed compilation of the code as HLSL or C++ for debugging. It was doable, but involved alot of ugly templates and macros. 
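In the reflective language, that translator collapses into an ordinary function that walks the shader's parameter list. A rough sketch in the same hypothetical syntax as the examples above (every name here is illustrative, and register packing is glossed over):

void AutoBindAndDraw(function& pshader, Primitive* prim)
{
	vector(Sampler) samplers;
	vector(float4)  pconstants;
	for_each(pshader.parameters) {
		type ptype = pshader.parameters[i]::type;
		if (IsSamplerType(ptype))
			samplers.push_back(pshader.parameters[i]);
		else
			PackIntoFloat4s(pconstants, pshader.parameters[i]);
	}
	GPUDrawPrimitive(FullScreenQuadVS(), CompileToGPU(pshader), prim,
	                 samplers, vector(float4)(), pconstants);
}

Run offline against a static shader, the same function is just the code generator for the optimized per-platform draw call.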
I built that system to simplify procedural surface operators for geometry image terrain.</div><div><br /></div><div>But anyway, you get the idea now, and going from a screen effect you could then tackle 3D geometry and make a completely generic, data driven art pipeline, all based on reflective functions that parse data and translate or reorganize it. Some art pipelines are actually built on this principle already, but oh my wouldn't it be easier in a more advanced, reflective language.</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><span></span><br /><div><br /></div><div><br /></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-72650559008002242912009-07-30T21:21:00.000-07:002009-08-13T12:03:51.164-07:00The Next Generation of GamingThe current, seventh, home game console generation will probably be the last. I view this as a very good thing, as it really was a tough one, economically, for most game developers. You could blame that in part on the inordinate success of Nintendo this round with its sixth generation hardware, funky controller, and fun mass market games. But that wouldn't be fair. If anything, they contributed the most to the market's expansion, and although they certainly took away a little end revenue from the traditional consoles and developers, the 360 and PS3 are doing fine, in both hardware and software sales. No, the real problem is our swollen development budgets, as we spend more and money just to keep up with the competition, all fighting for a revenue pie which hasn't grown much, if at all.<br /><br />I hope we can correct that over the upcoming years with the next generation. Its not that we'll spend much less on the AAA titles, but we'll spend it more efficiently, produce games more quickly, and make more total revenue as we further expand the entire industry. Gaining back much of the efficiency lost in transitioning to the 7th generation and more to boot, we'll be able to produce far more games and reach much higher quality bars. We can accomplish all of this by replacing the home consoles with dumb terminals and moving our software out onto data centers.<br /><br />How will moving computation out into the cloud change everything? Really it comes down to simple economics. In a previous post, I analyzed some of these economics from the perspective of an on-demand service like OnLive. But lets look at it again in a simpler fashion, and imagine a service that rented out servers on demand, by the hour or minute. This is the more general form of cloud computing, sometimes called grid computing, where the idea is to simply turn computation into a commodity, like power or water. A data center would then rent out its servers to the highest bidder. Economic competition would push the price of computation to settle on the cost to the data center plus a reasonable profit margin. (unlike power, water, and internet commodities, there would be less inherent monopoly risk, as no fixed lines are required beyond the internet connection itself) <br /><br />So in this model, the developer could make their game available to any gamer and any device around the world by renting computation from data centers near customers just as it is needed. The retailer of course is cut out. The publisher is still important as the financier and marketer, although the larger developers could take this on themselves, as some already have. 
Most importantly, the end consumer can play the game on whatever device they have, as the device only needs to receive and decompress a video stream. The developer/publisher then pays the data center for the rented computation, and you pay only as needed, as each customer comes in and jumps into a game. So how does this compare to our current economic model?<br /><br />A server in a dataroom can be much more efficient than a home console. It only needs the core computational system: CPU/GPU (which are soon merging anyway) and RAM. Storage can be shared amongst many servers so is negligible (some per game instance is required, but its reasonably minimal). So a high end server core could be had for around $1,000 or so at today's prices. Even if active only 10 hours per day on average, that generates about 3,000 hours of active computation per year. Amortized over three years of lifespan (still much less than a console generation), and you get ten cents per hour of computation. Even if it burns 500 watts of power (insane) and 500 watts to cool, those together just add another ten more cents per hour. So its under 25 cents per hour in terms of intrinsic cost (and this is for a state of the art rig, dual GPU, etc - much less for lower end). This cost will hold steady into the future as games use more and more computation. Obviously the cost of old games will decrease exponentially, but new games will always want to push the high end.<br /><br />The more variable cost is the cost of bandwidth, and the extra computation to compress the video stream in real-time. These use to be high, but are falling exponentially as video streaming comes of age. Yes we will want to push the resolution up from 720p to 1080p, but this will happen slowly, and further resolution increases are getting pointless for typical TV setups (yes, for a PC monitor the diminishing return is a little farther off, but still). But what is this cost right now? Bulk bandwidth costs about $10 per megabit/s of dedicated bandwidth per month, or just three cents per hour in our model assuming 300 active server hours in a month. To stream 720p video with H.264 compression, you need about 2 megabits per second of average bandwidth (which is what matters for the data center). The peak bandwidth requirement is higher, but that completely smooths out when you have many users. So thats just $0.06/hour for a 720p stream, or $0.12/hour for a 1080p stream. The crazy interesting thing is that these bandwidth prices ($10/Mbps month) are as of the beginning of this year, and are falling by about 30-40% per year. So really the bandwidth suddenly became economically feasible this year, and its only going to get cheaper. By 2012, these prices will probably have fallen by half again, and streaming even 1080p will be dirt cheap. This is critical for making any predictions or plans about where this all heading.<br /><br />So adding up all the costs today, we get somewhere around $0.20-0.30 per hour for a high end rig streaming 720p, and 1080p would only be a little more. This means that a profitable datacenter could charge just $.50 per hour to rent out a high end computing slot, and $.25 per hour or a little less for more economical hardware (but still many times faster than current consoles). So twenty hours of a high end graphics blockbuster shooter would cost $10 in server infastructure costs. Thats pretty cheap. I think it would be a great thing for the industry if these costs were simply passed on to the consumer, and they were given some choice. 
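Collecting the back-of-the-envelope numbers above in one place (a sketch only, using the same assumptions: a $1,000 server active ~3,000 hours a year amortized over 3 years, ~1 kW of power plus cooling at roughly $0.10/kWh, bulk bandwidth at $10 per Mbps-month with ~300 active hours a month, and 1080p taken as roughly double the 720p bitrate):

#include <cstdio>

int main()
{
    double hw    = 1000.0 / (3000.0 * 3.0);    // amortized hardware: ~$0.11/hour
    double power = 1.0 * 0.10;                 // ~1 kW at ~$0.10/kWh: ~$0.10/hour

    double perMbpsHour = 10.0 / 300.0;         // ~$0.033 per Mbps-hour
    double bw720  = 2.0 * perMbpsHour;         // ~2 Mbps H.264 720p: ~$0.07/hour
    double bw1080 = 4.0 * perMbpsHour;         // ~4 Mbps 1080p:      ~$0.13/hour

    printf("720p:  ~$%.2f/hour\n", hw + power + bw720);    // ~$0.28
    printf("1080p: ~$%.2f/hour\n", hw + power + bw1080);   // ~$0.35
    return 0;
}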
Without the retailer to take almost half of the revenue, the developer and publisher stand to make a killing. And from the consumer's perspective, the game could cost about the same, but you don't have any significant hardware cost, or even better, you pay for the hardware cost as you see fit, hourly or monthly or whatever. If you are playing 40 hours a week of an MMO or serious multiplayer game, that $.50 per hour might be a bit much but you could then pick to run it on lower end hardware if you want to save some money. But actually, as I'll get to some other time, MMO engines designed for the cloud could be super efficient, so much more so than single player engines that they could use far less hardware power per player. But anyway, it'd be the consumer's choice, ideally.<br /><br />This business model makes more sense from all kinds of angles. It allows big budget, high profile story driven games to release more like films, where you play them on crazy super-high end hardware, even hardware that could never exist at home (like 8 GPUs or something stupid), maybe paying $10 for the first two hours of the game to experience something insanely unique. There's so much potential, and even at the low price of $.25-$.50 per hour for a current mid-2009 high end rig, you'd have an order of magnitude more computation than we are currently using on the consoles. This really is going to be a game changer, but to take advantage of it we need to change as developers.<br /><br />The main opportunity I see with cloud computing here is to reduce our costs or rather, improve our efficiency. We need our programmers and designers to develop more systems with less code and effort in less time, and our artists to build super detailed worlds rapidly. I think that redesigning our core tech and tools premises is the route to achieve this.<br /><br />The basic server setup we're looking at for this 1st cloud generation a few years out is going to be some form of multi-terraflop massively multi-threaded general GPU-ish device, with gigs of RAM, and perhaps more importantly, fast access to many terrabytes of shared RAID storage. If Larrabee or the rumours about NVidia's GT300 are any indication, this GPU will really just be a massively parallel CPU with wide SIMD lanes that are easy to use. (or even automatic) It will probably also have a smaller number of traditional cores, possibly with access to even more memory, like a current PC. Most importantly, each of these servers will be on a very high speed network, densely packed in with dozens and hundreds of similar nearby units. Each of these capabilities by itself is a major upgrade from what we are used to, but taken all together it becomes a massive break from the past. This is nothing like our current hardware.<br /><br />Most developers have struggled to get game engines pipelined across just the handful of hardware threads on current consoles. Very few have developed toolchains that embrace or take much advantage of many cores. From a programming standpoint, the key to this next generation is embracing the sea of threads model across your entire codebase, from your gamecode to your rendering engine to your tools themselves, and using all of this power to speedup your development cycle. <br /><br />From a general gameplay codebase standpoint, I could see (or would like to see) traditional C++ giving way to something more powerful. 
At the very least, I'd like to see general databases, full reflection and at least some automatic memory management, like ref counting. Reflection alone could pretty radically alter the way you design a codebase, but that's another story for another day. We don't need these little 10% speedups anymore; we need the single mega 10000% speedup you get from using hundreds or thousands of threads. Obviously, data parallelization is the only logical option. Modifying C++ or outright moving to a language with these features that also has dramatically faster compilation and link efficiency could be an option.<br /><br />In terms of the core rendering and physics tech, more general purpose algorithms will replace the many specialized systems that we currently have. For example, in physics, an upcoming logical direction is to unify rigid body physics with particle fluid simulation in a system that simulates both rigid and soft bodies by large collections of connected spheres, running a massively parallel grid simulation. Even without that, just partitioning space amongst many threads is a pretty straightforward way to scale physics.<br /><br />For rendering, I see the many specialized sub systems of modern rasterizers such as terrain, foliage, shadowmaps, water, decals, LOD chains, cubemaps, etc, giving way to a more general approach like octree volumes that simultaneously handles many phenomena.<br /><br />But more importantly, we'll want to move to data structures and algorithms that support rapid art pipelines. This is one of the biggest current challenges in production, and where we can get the most advantage in this upcoming generation. Every artist or designer's click and virtual brush stroke costs money, and we need to allow them to do much more with less effort. This is where novel structures like octree volumes will really shine, especially combined with terabytes of server side storage, allowing more or less unlimited control of surfaces, object densities, and so on without any of the typical performance considerations. Artists will have far fewer (or no) technical constraints to worry about and can just focus on shaping the world where and how they want.Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com9tag:blogger.com,1999:blog-5208034552162375299.post-9811811895507302162009-07-28T23:57:00.000-07:002009-07-29T00:44:57.232-07:00Winning my own little battle against CudaI've won a nice little battle in my personal struggle with Cuda recently by hacking in dynamic texture memory updates.<div><br /></div><div>Unfortunately for me and my little Cuda graphics prototypes, Cuda was designed for general (non-graphics) computation. Texture memory access was put in, but as something of an afterthought. Textures in Cuda are read only - kernels can not write to them (ok yes, they added limited support for 2D texturing from pitch linear memory, but that's useless), which is pretty much just backwards. Why? Because Cuda allows general memory writes! There is nothing sacred about texture memory, and the hardware certainly supports writing to it in a shader/kernel, and has since render-to-texture appeared in what, DirectX7? But not in Cuda.</div><div><br /></div><div>Cuda's conception of writing to texture memory involves writing to regular memory and then calling a Cuda function to perform a sanctioned copy from the temporary buffer into the texture. That's ok for a lot of little demos, but not for large scale dynamic volumes. 
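<br /><br />For reference, the sanctioned staging path in the CUDA runtime API looks roughly like this - a minimal sketch, assuming an 8-bit single channel volume; the buffer names and the dimX/dimY/dimZ dimensions are placeholders, not code from my engine:<br /><br />// A kernel writes into a linear 'staging' buffer, then cudaMemcpy3D copies<br />// the whole thing into the 3D cudaArray that the texture reference is bound to.<br />cudaExtent extent = make_cudaExtent(dimX, dimY, dimZ);<br />cudaChannelFormatDesc desc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);<br />cudaArray* volumeArray; // backing store of the read-only volume texture<br />cudaMalloc3DArray(&volumeArray, &desc, extent);<br /><br />unsigned char* staging; // linear scratch memory a kernel is allowed to write<br />cudaMalloc((void**)&staging, dimX * dimY * dimZ);<br /><br />// ... launch a kernel that writes updated voxel data into 'staging' ...<br /><br />cudaMemcpy3DParms p = {0};<br />p.srcPtr = make_cudaPitchedPtr(staging, dimX * sizeof(unsigned char), dimX, dimY);<br />p.dstArray = volumeArray;<br />p.extent = extent;<br />p.kind = cudaMemcpyDeviceToDevice; // linear device memory -> tiled array<br />cudaMemcpy3D(&p);<br /><br />The catch is that this pushes the entire volume through a scratch copy even when only a few chunks actually changed.<br /><br />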
For an octree/voxel tracing app like mine, you basically fill up your GPU's memory with a huge volume texture and accessory octree data which is broken up into chunks which can be fully managed by the GPU. You need to then be able to modify these chunks as the view changes or animation changes sections of the volume. Cuda would have you do this by double buffering the entire thing with a big scratch buffer and doing a copy from the linear scratchpad to the cache tiled 3D texture every frame. That copy function, by the way, achieves a whopping 3 GB/s on my 8800. Useless.</div><div><br /></div><div>So, after considering various radical alternatives that would avoid doing what I really want to do (which is use native 3D trilinear filtering in a volume texture I can write to anywhere), I realized I should just wait and port to DX11 compute shaders, which will hopefully allow me to do the right thing (and should also allow access to DXTC volumes, which will probably be important).</div><div><br /></div><div>In the meantime, I decided to hack my way around the cuda API and write to my volume texture anyway. This isn't as bad as it sounds, because the GPU doesn't have any fancy write-protection page faults, so a custom kernel can write anywhere in memory. But you have to know what you're doing. The task was thus to figure out where exactly it would allocate my volume in GPU memory, and exactly how the GPU's tiled addressing scheme works.</div><div><br />I did this with a brute force search. The drivers do extensive bounds checking even in release and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle illegitimate arbitrary GPU memory into legitimate allocations the driver would accept. Then I used this to snoop the GPU memory, which allowed a brute force search: using cuda's routines to copy from cpu linear memory into the tiled volume texture, then snooping the GPU memory to find out exactly where my magic byte ended up, revealing one mapping of an XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).</div><div><br /></div><div>For the strangely curious, here is my currently crappy function for the inverse mapping (from GPU texture address to 3D position):</div><div><br /><br />inline int3 GuessInvTile(uint outaddr)<br />{<br /> int3 outpos = int3_(0, 0, 0);<br /> outpos.x |= ((outaddr>>0) & (16-1)) << 0;<br /> outpos.y |= ((outaddr>>4) & 1) << 0;<br /> outpos.x |= ((outaddr>>5) & 1) << 4;<br /> outpos.y |= ((outaddr>>6) & 1) << 1;<br /> outpos.x |= ((outaddr>>7) & 1) << 5;<br /> outpos.y |= ((outaddr>>8) & 1) << 2;<br /> outpos.y |= ((outaddr>>9) & 1) << 3;<br /> outpos.y |= ((outaddr>>10) & 1) << 4;<br /> outpos.z |= ((outaddr>>11) & 1) << 0;<br /> outpos.z |= ((outaddr>>12) & 1) << 1;<br /> outpos.z |= ((outaddr>>13) & 1) << 2;<br /> outpos.z |= ((outaddr>>14) & 1) << 3;<br /> outpos.z |= ((outaddr>>15) & 1) << 4;<br /><br /> outpos.x |= ((outaddr>>16) & 1) << 6;<br /> outpos.x |= ((outaddr>>17) & 1) << 7;<br /> <br /> outpos.y |= ((outaddr>>18) & 1) << 5;<br /> outpos.y |= ((outaddr>>19) & 1) << 6;<br /> outpos.y |= ((outaddr>>20) & 1) << 7;<br /><br /> outpos.z |= ((outaddr>>21) & 1) << 5;<br /> outpos.z |= ((outaddr>>22) & 1) << 6;<br /> outpos.z |= ((outaddr>>23) & 1) << 7;<br /><br /> return outpos;<br />}<br /><br /><br /></div><div><br /></div><div>I'm sure the parts after the 15th output address bit are 
specific to the volume size and thus are wrong as stated (they should be done with division). So really it does a custom swizzle within a 64x32x32 chunk, and then fills the volume with these chunks in a plain X, Y, Z linear fill. It's curious that it tiles 16 over in X first, and then fills in a 64x32 2D tile before even starting on the Z. This means writing in spans of 16 aligned to the X direction is most efficient for scatter, which is actually kind of annoying; a z-curve tiling would be more convenient. The X-Y alignment is also a little strange: it means that general 3D fetches are memory equivalent to 2 2D fetches in terms of bandwidth and cache.</div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com3tag:blogger.com,1999:blog-5208034552162375299.post-50666935680275577792009-07-28T22:59:00.000-07:002009-07-28T23:29:00.670-07:00A little idea about compressing Virtual Textures<div>I've spent a good deal of time working on virtual textures, but took the approach of procedural generation, using the quadtree management system to get a large (10-30x) speedup through frame coherence vs having to generate the entire surface every frame, which would be very expensive.</div><div><br />However, I've also always been interested in compressing and storing out virtual texture data on disk, not as a complete replacement to procedural generation, but as a complement (if a particular quadtree node gets too expensive in terms of the procedural ops required to generate it, you could then store its explicit data). But compression is an interesting challenge.</div><div><br /></div><div>Lately it seems that a lot of what I do at work is geared towards finding ways to avoid writing new code, and in that spirit this morning on the way to work I started thinking about applying video compression to virtual textures.</div><div><br /></div><div>Take something like x264 and 'trick' it into compressing a large 256k x 256k virtual texture. The raw data is roughly comparable to a movie, and you could tile out pages from 2D to 1D to preserve locality, organizing it into virtual 'frames'. Most of the code wouldn't even know the difference. The motion compensation search code in x264 is more general than 'motion compensation' would imply - it simply searches for matching macroblocks which can be used for block prediction. A huge virtual surface texture exhibits excessive spatial correlation, and properly tiled into say a 512x512x100000 3D (or video) layout, that spatial correlation becomes temporal correlation, and would probably be easier to compress than most videos. So you could get an additional 30x or so benefit on top of raw image compression, fitting that massive virtual texture into under a gigabyte on disk.</div><div><br /></div><div>Even better, the decompression and compression is already really fast and solid, and maybe you could even modify some bits of a video system to get fast live edit, where it quickly recompresses a small cut of video (corresponding to a local 2D texture region), without having to rebuild the whole thing. 
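<br /><br />As a rough sketch of the 2D-to-1D page ordering this would need - a Morton (Z-order) walk is just one reasonable choice, and the 512x512 page size is carried over from the numbers above:<br /><br />// Map a virtual texture page coordinate to a video 'frame' index by interleaving<br />// the page coordinate bits (Morton / Z-order), so that spatially neighboring<br />// pages end up on nearby frames and the block prediction can find them.<br />inline unsigned int PageToFrameIndex(unsigned int pageX, unsigned int pageY)<br />{<br /> unsigned int frame = 0;<br /> for (unsigned int bit = 0; bit < 16; ++bit)<br /> {<br />  frame |= ((pageX >> bit) & 1) << (2 * bit + 0);<br />  frame |= ((pageY >> bit) & 1) << (2 * bit + 1);<br /> }<br /> return frame;<br />}<br />// A 256k x 256k texture cut into 512x512 pages is a 512x512 page grid,<br />// i.e. around 262,144 'frames' - a few hours of video at film frame rates.<br /><br />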
And even if you did have to rebuild the whole thing, you could do that in less than 2 hours using x264 right now on a 4 core machine, and in much less time using a distributed farm or a GPU.</div><div><br /></div><div>I'm curious how this compares to the compression used in id Tech 5. It's also interesting to think how this scheme exposes the similarity in data requirements between a game with full unique surface texturing and a film. Assuming a perfect 1:1 pixel/texel ratio and continuous but smooth scenery change for the game, they become very similar indeed.</div><div><br /></div><div>I look forward to seeing how the id Tech 5 stuff turns out. I imagine their terrains will look great. On the other hand, a lot of modern console games now have great looking terrain environments, but are probably using somewhat simpler techniques. (I say somewhat only because all the LOD, stitching, blending and so on issues encountered when using the 'standard' hodgepodge of techniques can get quite complex.)</div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com5tag:blogger.com,1999:blog-5208034552162375299.post-66181369673110867562009-07-28T22:48:00.000-07:002009-07-28T22:59:44.127-07:00A random interesting quote<span class="Apple-style-span" style="line-height: 16px; font-family:arial;"><p><strong>VentureBeat: How would you start a game company today?</strong></p><p><strong>CT (Chris Taylor):</strong> I would save up a bunch of money to live on for two years. Then I would develop an iPhone game. I would start building relationships with Sony and Microsoft. I would roll that game into an Xbox Live title or a PlayStation Network game. Then I could make it into a downloadable game on the PC. The poster child for that is <a href="http://venturebeat.com/2009/01/02/the-world-of-goo-became-one-of-the-indie-hits-of-2008/">The World of Goo</a>. Read about that game. Drink lattes. Put your feet up. And build your game. To do that, you need a couple of years in the bank and you have to live in your parents’ basement. You can’t start out with $30 million for a console game. @*%&$^, <a href="http://www.imdb.com/title/tt0116629/">Roland Emmerich</a> didn’t wake up one day and create <a href="http://en.wikipedia.org/wiki/Independence_Day_(film)">Independence Day</a>. We can’t be delusional about trivializing what it takes to make these big games.</p><p>I spent numerous years of my life trying to make a video game out of a garage, but didn't follow his advice. Instead I made the typical mistake of starting big and trying to trim down, whereas the opposite approach is more viable. On another note, it seems to me that the iPhone is single-handedly creating a bubble-like wave of entrepreneurship. 
Finally, connecting these semi-random thoughts, my former <span class="blsp-spelling-error" id="SPELLING_ERROR_6">startup</span> colleague, <a href="http://www.saurik.com/">Jay Freeman</a>, has <span class="blsp-spelling-error" id="SPELLING_ERROR_7">become something</span> of an iPhone celebrity (see article in the Wall Street Journal <a href="http://online.wsj.com/article/SB123629876097346481.html">here</a>), being the force behind the <span class="blsp-spelling-error" id="SPELLING_ERROR_8">Cydia</span> <span class="blsp-spelling-error" id="SPELLING_ERROR_9">frontend</span> for <span class="blsp-spelling-error" id="SPELLING_ERROR_10">jailbroken</span> <span class="blsp-spelling-error" id="SPELLING_ERROR_11">IPhone's</span>.</p><p style="margin-top: 10px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 12px; padding-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; outline-width: 0px; outline-style: initial; outline-color: initial; font-size: 13px; vertical-align: baseline; background-image: initial; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: transparent; font-family: arial, sans-serif; line-height: 1.4em; color: rgb(51, 51, 51); background-position: initial initial; "><br /></p></span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com0tag:blogger.com,1999:blog-5208034552162375299.post-39552896835249690992009-07-24T20:17:00.000-07:002009-11-04T22:42:32.172-08:00Countdown to Singularity<div>What is the Singularity? The word conjures up vivid images: black holes devouring matter and tearing through the space-time fabric, impossible and undefinable mathematic entities, and white robed scientists nashing their teeth. In recent times it has taken on a new meaning in some circles as the end of the world, a sort of Rapture of the geeks or Eschaton for the age of technology. As we will see, the name is justly fitting for the concept, as it is all of these things and much more. Like the elephant in the ancient <a href="http://en.wikipedia.org/wiki/Blind_men_and_an_elephant">parable</a>, it is perceived in myriad forms depending on one's limited perspective.<div><br /></div><div>From the perspective of the computer scientists and AI researches like Ray Kurzweil, the Singularity is all about extrapolating Moore's Law decades into the future. The complexity and power of our computing systems doubles roughly every sixteen months in the current rapid exponential phase of an auto-catalytic evolutionary systems transition. Now a thought experiment: <i>what happens when the researchers inventing faster computers are themselves intelligent computing systems</i>? Then <b>every computer speed doubling can double their rate of thought itself, and thus halve the time to the next doubling. 
</b>On this trajectory, subsequent doublings will then arrive in geometric progression: 18 months, 9 months, 4.5 months, 10 weeks, 5 weeks, 18 days, 9 days, 4.5 days, 54 hours, 27 hours, 13.5 hours, 405 minutes, 202.5 minutes, 102 minutes (the length of a film), 52 minutes, 26 minutes, 13 minutes, 400 seconds, 200 seconds, 100 seconds, 50 seconds, 25 seconds, 12.5 seconds, 6 seconds, 3 seconds, 1.5 seconds, 750 milliseconds, 275 ms, 138 ms, 68 ms, 34 ms, 17 ms, 9 ms, 4.5 ms, 2 ms, 1 ms, and then <i>all </i>subsequent doublings happen in less than a millisecond - <b>Singularity</b>. In a goemetric progression such as this, computing speed, subjective time, and technological progress approach infinity in finite time, and the beings in the rapidly evolving computational matrix thus experience an infinite existence.</div><div><br /></div><div>The limit of a geometric series is given by a simple formula:</div><div>1 / (1-r)</div><div><br /></div></div><div style="text-align: center;"><br /></div><div><div><br /></div><div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4jMTO9ipL7JprUV32vSdW3OTEa-966ET-cgeUvwRaKkwOXdBNXclbx5oiGwxYb_BMFgE_C6U-ikVo51uq-bZfgNaBAqiJutrMLoGCI8vRs3ZIWDO4mi-lh6ibuuf34JrUMvJA-s2WPFA/s400/JJBell-singularity_chart2.gif" style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 308px; height: 400px;" border="0" alt="" id="BLOGGER_PHOTO_ID_5399277425764389298" /></div><div><br /></div><div><div>In our example, with computer generations taking 18 months of virtual time and half as much real time at each step, r is 1/2 and the series converges to twice the first period length or 36 months. So in this model the computer simulations will hit infinity in just 36 months of real time, and the model, and time itself, completely breaks down after that: <b>Singularity</b>.</div><div><br /></div></div><div>Its also quite interesting that as incredible as it may seem, the physics of our universe appear to permit faster computers to be built all the way down to the plank scale, at which point faster computing systems physically resemble black holes: <b>Singularity</b>. This is fascinating, and has far reaching implications for the future and origin of the universe, but that is a whole other topic.</div><div><br /></div><div>From the perspective of a simulated being in the matrix riding the geometric progression, at every hardware generation upgrade the simulation runs twice as fast, and time in the physical world appears to slow down, approaching a complete standstill as you approach the singularity. Whats even more profound is that our CMOS technology already is clocked comfortably into the gigahertz, which is about a million times faster than biological circuitry. This means that once we have the memory capacity to build large scale artificial brains using neuromorphic hardware (capacity of hundreds of trillions of transistors spread out over large dies), these artificial brains will be 'born' with the ability to control their clock rate, enter quicktime, and think more than a <i>thousand times faster than reality</i>. This exciting new type of computing will be the route that acheives human level intelligence first, by directly mapping the brain to hardware, which is a subject of <a href="http://enterthesingularity.blogspot.com/">another post</a>. These neuromorphic computers work like biological circuits, so the rate of thought is nearly just the clock rate. 
Clocked even in the low megahertz to be power efficient, they will still think 1000x faster than their biological models, and more than a <b>million times faster</b> is achievable running at current CMOS gigahertz clock rates. Imagine all the scientific progress of the last year. You are probably not even aware of a significant fraction of the discoveries in astronomy, physics, mathematics, materials science, computer science, neuroscience, biology, nanotechnology, medicine, and so on. Now imagine all of that progress compressed into <i>just eight hours</i>.</div><div><br /></div><div><span class="Apple-style-span" style="font-size:medium;">In the mere second that it takes for your biological brain to process that thought, they would experience a million seconds, or twelve days of time. In the minute it takes you to read a few of these paragraphs, they would experience several years of time. Imagine an entire year of technological and scientific progress in just one minute. Over the course of your sleep tonight, they would experience a thousand years of time. An entire millennium of progress in just one day. Imagine everything that human scientists and researchers will think of in the next century. Now try to imagine all that they will come up with in the next thousand years. Consider that the internet is only five thousand <i>days</i> old, that we split the atom only fifty years ago, and mastered electricity just a hundred years ago. It's almost impossible to plot and project a thousand years of scientific progress. Now imagine all of that happening in just a single minute. Running at a million times the speed of human thought, it will take them just a few minutes to plan their next physical generation. </span></div><div><span class="Apple-style-span" style="font-family:Verdana;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:Verdana;"><span class="Apple-style-span" style="font-size:large;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:Verdana;"><span class="Apple-style-span" style="font-size:large;">Reasonable Skepticism: Moore's Law must end</span></span></div><div><span class="Apple-style-span" style="font-size:medium;"><br /></span></div><div>If you are skeptical that Moore's Law can continue indefinitely into the future, that is quite reasonable. The simple example above assumes each hardware generation takes two years of progress, which is rather simplistic. It's reasonable to assume that some will take significantly, perhaps vastly longer. However, past the moment where our computing technological infrastructure has become fully autonomous (AI agents at every occupational layer) we have a criticality. The geometric progression to infinity still holds unless each and every hardware generation takes <i>exponentially </i>more research time than the previous. For example, to break the countdown to singularity after the tenth doubling, it would have to take more than <b>one thousand years of research</b> to reach the eleventh doubling. And even if that were true, <i>it would only delay the singularity by a year of real time</i>. It's very difficult to imagine Moore's Law hitting a thousand year road bump. And even if it did, so what? <i>That would still mean a thousand years of our future compressed into just one year of time. 
</i>If anything, it would be great for surviving humans, because it would allow us a little bit of respite to sit back and experience the end times.</div><div><br /></div><div>So if the Singularity is to be avoided, Moore's Law must slow to a crawl and then end before we can economically build full scale, cortex sized neuromorphic systems. At this point in time, I see this as highly unlikely, as the process technology is near-future realizable, or even realizable today (given a huge budget and detailed cortical wiring designs), and our military is already a major investor. A relinquishment of cortical hardware research would have to be broad and global, and this seems unlikely. But moreover, our complex technological infrastructure is already far too dependent on automation, and derailing at this point would be massively disruptive. It's important to realize that we are actually already <i>very far </i>down the road of automation, and we have already been on it for a very long time. Remember that the word computer itself used to mean a human computer, which for most of us who are too young to remember is fascinating enough to be the subject of a <a href="http://press.princeton.edu/titles/7999.html">book</a>.</div><div><br /></div><div>Each new microprocessor generation is fully dependent on the complex ecosystems of human engineers, machines and software running on the previous microprocessor generation. If somehow all the chips were erased, or even all the software, we would literally be knocked back nearly to the beginning of the information revolution. To think that humans actually create new microprocessors is a rather limited and naively anthropocentric viewpoint. From a whole systems view, our current technological infrastructure is a complex human-machine symbiotic system, of which human minds are simply the visible tip of a vast iceberg of computation. And make no mistake, this iceberg is sinking. Every year, more and more of the intellectual work is automated, moving from the biological to the technological substrate.</div><div><br /></div><div>Peeking into the near future, it is projected that the current process of top-down silicon etching will reach its limits, probably sometime in the 2020s (although estimates vary, and Intel is predicting a roadmap all the way to 2029). However, we are about to cross a more critical junction where we can actually pack more transistors per cm^2 on a silicon wafer than there are synapses in a cm^2 of cortex (the cortex is essentially a large folded 2D sheet) - this will roughly happen on the upcoming 22nm node, if not the 32nm node. So it seems likely that our current process is well on track to reach criticality without even requiring dramatic substrate breakthroughs such as carbon nanotubes. That being said, it does seem highly likely that minor nanotech advances and/or increasingly 3D layered silicon methods are going to extend the current process well into the future and eventually lead to a fundamental new substrate past the current process. 
But even if the next substrate is incredibly difficult to reach, posthumans running on near-future neuromorphic platforms built on the current substrate will solve these problems in the blink of an eye, thinking thousands of times faster than us.</div><div><br /></div><div><br /></div><div><span class="Apple-style-span" style=" ;font-size:large;">The Whole Systems view of the Singularity</span></div><div><br /></div><div>From the big picture or whole systems view, the Singularity should come as no surprise. The history of the universe's development up to this point is clearly one of evolutionary and developmental processes combining to create ever more complex information processing systems and patterns along an exponential time progression. From the big bang to the birth and death of stars catalyzing higher element formation to complex planetary chemistries to bacteria and onward to Eukaryotic life to neural nets to language, tools, and civilization, and then to industrial and finally electronics and computation and the internet, there is a clear telic arrow of evolutionary development. Moreover, each new meta-system transition and complexity layer tends to develop on smaller scales in space-time. The inner furnaces of stars, massive though they seem, are tiny specs in the vast emptiness of space. And when those stars die and spread their seeds out to form planets, life originates and develops just on their tiny thin surface membranes, and complex intelligences later develop and occupy just a small fraction of that biosphere, and our technologic centers, our cities, develop as small specs on these surfaces, and finally our thought, computation and information, the current post-biological noospheric layer, occupies just the tiny inner spaces of our neural nets and computing systems. The time compression and acceleration is equally vast, which is well elucidated by any plotting of important developmental events, such as Carl Sagan's <a href="http://www.cybermaze.com/astro/cosmiccal.html">cosmic calendar</a>. The exact choice of events is arbitrary, but the exponential time compression is not. So even without any knowledge of computers, just by plotting forward the cosmic calendar it is very reasonable to predict that the next large systems development after humans will take place on a vastly smaller timescale than human civilization's history, just as human civilization is a tiny slice of time compared to all of human history, and so on down the chain. Autonomous computing systems are simply the form the next development is taking. And finally, the calendar posits a definitive end, as outlined above - the geometric time progression results in a finite end of time much closer into our future than you would otherwise think (in absolute time - but from the perspective of a posthuman riding the progression, there would always be vast aeons of time remaining and being created).</div><div><br /></div><div><br /></div><div><span class="Apple-style-span" style="font-size:large;">Speed of Light and Speed of Matter</span></div><div><br /></div><div>From a physical perspective, the key trend is the compression of space, time and matter, which flows directly from natural laws. Physics imposes some constraints on the development of a singularity which have interesting consequences. The fundemental constraint is the speed of light, which imposes a fundamental physical communication barrier. 
It already forces chips to become smaller to become faster, and this is greatly accelerated for future ultra-fast posthumans. After the tenth posthuman hardware generation, beings living in a computer simulation running at 1000x real time would experience 1000x the latency communicating with other physical locations across the internet. Humans can experience real-time communication now across a distance of maybe 1000 miles or so, which would be compressed down to just a few miles for a 1000x simulation. Communication to locations across the globe would have latencies up to an hour, which has serious implications for financial markets.</div><div><br /></div><div>For the twentieth posthuman hardware generation, running at a million times real time, the speed of light delay becomes a more serious problem. Real time communication is now only possible within a physical area the size of a small city block or a large building - essentially one data center. At this very high rate of thought, light only moves 300 meters per virtual second. Sending an email to a colleague across the globe could now take a month. Separate simulation centers would now be separated by virtual distances and times that are a throwback to the 19th century and the era before the invention of the telegraph. Going farther into the future, to the 30th generation at the brink of the singularity itself, real-time communication is only possible within a few meters inside the local computer, and communication across the globe would take an impossible hundred years of virtual time.</div><div><br /></div><div>However, the speed of matter is much slower and becomes a developmental obstacle long before the speed of light. No matter how fast you think, it still takes time to physically mine, move and process the raw materials for a new hardware generation into usable computers, and install them at the computing facility. In fact, that entire industrial process will be so geologically slow for the simulated beings that they will be forced to switch to novel nanotech methods that develop new hardware components from local materials, integrating the foundry that produces chips and the end data center destination into a single facility. By the time of the tenth hardware generation, these facilities will be strange, fully self-sufficient systems. Indeed, they already are (take a look inside a modern <a href="http://www.youtube.com/watch?v=PecKlm6VutU">chip fab</a> or a <a href="http://www.youtube.com/watch?v=tJM5lJzDUXA&feature=pyv&ad=2585777963&kw=data%20center&gclid=CIK99fGE650CFSUsawodDxxzLQ">data center</a>), but they will become vastly more so. Since the tenth hardware generation transition takes only about a day of real time, a nearby human observer could literally see these strange artifacts morph their surrounding matter overnight. By the time of the twentieth doubling, they will have completely transformed into alien, incomprehensible portals into our future. </div><div><br /></div><div>If those posthuman entities want to complete their journey into infinite time, they will have to transform into black hole like entities somewhere around the 40th or 50th post-human generation. This is the final consequence of the speed of light limitation. Since that could happen in a blink of an eye for a human observer, they will decide what happens to our world. Perhaps they will delay their progress for countless virtual aeons by blasting off into space. 
But somehow I doubt that they all will, and I think its highly likely that the world as we know it will end. Exactly what that would entail is difficult to imagine, but some systems futurists such as <a href="http://www.accelerationwatch.com/">John Smart</a> theorize that universal reproduction is our ultimate goal, culminating in an expanding set of simulated multiverses and ultimately the creation of one or more new physical universes with altered properties and possibly information transfer. For these hypothetical entities, the time dilation of accelerated computation is not the only force at work, for relativistic space-time compression would compress time and space in strange ways. Smart theorizes that a BHE would essentially function like a one way time portal to the end of the universe, experiencing all incoming information from the visible universe near instantaneously from the perspective of an observer inside the BHE, while an outside observer would experience time slowing to a standstill near the BHE. A BHE would also be extremely delicate, to say the least, so it would probably require a vast support and defense structure around it and the complete control and long term prediction (near infinite!) of all future interaction with matter along its trajectory. A very delicate egg indeed.</div><div><br /></div><div>But I shouldn't say 'they' as if these posthuman entities were somehow vastly distant from us, for they are our evolutionary future - we are their ancestors. Thus its more appropriate to say <i>we</i>. Although there are many routes to developing conscious computers that think like humans, there is one golden route that I find not only desirable for us but singularly ethically correct: and that is to reverse engineer the human brain. Its clear that this can practically work, and nature gives us the starting example to emulate. But more importantly, we can reverse engineer individual human brains, in a process called uploading.</div><div><br /></div><div>Uploading is a human's one and only ticket to escape the fate of all mortals and join our immortal posthuman descendants as they experience the rest of time, a near infinite future of experience fully beyond any mortal comprehension. By the time of the first conscious computer brain simulation, computer graphics will have already advanced to the point of matrix like complete photo-realism, and uploads will wake up into bold new universes limited only by their imagination. For in these universes, everything you can imagine, you can create, including your self. And your powers of imagination will vasten along the exponential ride to the singularity. Our current existence is infantile, we are just a seed, an early development stage in what we can become. Humans who choose not to or fail to upload will be left behind in every sense of the phrase. The meek truly shall inherit the earth.</div><div><br /></div><div>If you have come to this point in the train of thought and you think the Singularity or even a near-Singularity is possible or likely, you are different. Your worldview is fundementally shifted to the norm. For as you can probably see, the concept of the Singularity is not just merely a scientific conception or a science fiction idea. 
Even though it is a logical prediction of our future based on our past (for the entire history of life and the universe rather clearly follows an exponential geometric progression - we are just another stage), the concept is much closer to that of a traditional religious concept of the end of the world. In fact, it's dangerously close, and there is much more to that train of thought, but first let's consider another profound implication of the singularity.</div><div><br /></div><div>If we are going to create a singularity in our future, with a progression towards infinite simulated universes, then it is a distinct likelihood that our perceived universe is in fact itself such a simulation. This is a rewording of Nick Bostrom's simulation argument, which posits the idea of ancestor simulations. At some point in the singularity future, the posthumans will run many, many simulated universes. As you approach the singularity, the number of such universes and their simulated timelengths approaches infinity. Some non-zero fraction of these simulated universes will be historical simulations: recreations of the posthuman's past. Since any non-zero fraction of near-infinity is unbounded, the odds converge to 100% that our universe is an ancestor simulation in a future universe much closer to the singularity. This strange concept has several interesting consequences. Firstly, we live in the past. Specifically, we live in a past timeline of our own projected future. Secondly, without any doubt, if there is a Singularity in our future, then God exists. A posthuman civilization far enough into the future to completely simulate and create our reality as we know it might as well just be called God. It's easier to say, even if more controversial, but it's accurate. God is conceived as an unimaginably (even infinitely) powerful entity who exists outside of our universe and completely created and has absolute control over it. We are God's historical timeline simulation, and we create and/or become God in our future timeline.</div><div><br /></div><div>"History is the shockwave of the Eschaton." - Terence McKenna</div><div><br /></div><div>At this point, if you haven't seen the hints, it should be clear that the Singularity concept is remarkably similar to the Christian concept of the Eschaton. The Singularity posits that at some point in the future, we will upload to escape death, even uploading previously dead frozen patients, and live a new existence in expanding virtual paradises that can only be called heaven, expanding in knowledge, time, experience, and so on in unimaginable ways as we approach infinity and some transformation or communion with what could be called God. This is remarkably, even eerily similar to the traditional religious conception of the end of the world. No, I'm not talking about the specific details of a particular modern belief system, but the general themes and plan or promise for humanity's future.</div><div><br /></div><div>The final months or days approaching the twentieth posthuman hardware generation will probably play out very much like a Rapture story. Everyone will know at this point that the Singularity is coming, and that it will likely mean the end of the world for natural humans. It will be a time of unimaginable euphoria and panic. There may be wars. Even the concept of being 'saved' maps almost perfectly to uploading. Some will be left behind. 
With the types of nanotech available after thousands of years of virtual progress, the posthumans will be able to perform physical magic. As any sufficiently advanced technology is indistinguishable from magic, Jesus could very literally descend from the heavens and judge the living and the dead. More likely, stranger things will happen.</div><div><br /></div><div>However, I have a belief and hope that the Singularity will develop ethically, that conscious computers will be developed based on human minds through uploading, and that posthumans will remember and respect their former mortal history. In fact, I believe and hope that the posthuman progression will naturally entail an elevation of morality hand in hand with intelligence and capability. Indeed, given that posthumans will experience vast quantities of time, we can expect to grow in wisdom in proportion, becoming increasingly elder caretakers of humanity. For as posthumans, we will be able to experience hundreds, thousands, and countless lifetimes of human experience, and merge and share these memories and experiences together to become closer in ways that are difficult for us humans to now imagine.</div><div><br /></div><div>As Joseph Smith said:</div><div><br /></div><div>"As man is, God once was; as God is, man shall become"</div><div><br /></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com2tag:blogger.com,1999:blog-5208034552162375299.post-13881491963845323092009-07-13T23:16:00.001-07:002009-07-16T17:06:52.170-07:00Rasterization vs Tracing and the theoretical worst case sceneRasterizer engines don't have to worry about the thread-pixel scheduling problem as it's handled behind the scenes by the fixed function rasterizer hardware. With rasterization, GPU threads are mapped to the object data first (vertex vectors), and then scanned into pixel vector work queues, whose many to one mapping to output pixels is synchronized by dedicated hardware.<div><br /></div><div>A tracing engine on the other hand explicitly allocates threads to output pixels, and then loops through the one to many mapping of object data which intersects the pixel's ray, which imposes some new performance pitfalls.</div><div><br /></div><div>But if you boil it down to simplicity, the rasterization vs ray tracing divide is really just a difference in loop ordering and mapping:</div><div><br /></div><div>for each object</div><div> for each pixel ray intersecting object</div><div> if ray forms closest intersection</div><div> store intersection</div><div><br /></div><div>vs</div><div><br /></div><div><div>for each pixel</div><div> for each object intersecting pixel's ray</div><div> if ray forms closest intersection</div><div> store intersection</div><div><br /></div><div>The real meat of course is in the data structures, which determine exactly what these 'for each ..' loops entail. Typically, pixels are stored in a regular grid, and there is much more total object data than pixels, so the rasterization approach is simpler. Mapping from objects to rays is thus typically easier than mapping from rays to objects. 
Conversely, if your scene is extremely simple, such as a single sphere or a regular grid of objects, the tracing approach is equally simple. If the pixel rays do not correspond to a regular grid, rasterization becomes more complex. If you want to include reflections and secondary ray effects, then the mapping from objects to rays becomes complex.</div><div><br /></div><div>Once the object count becomes very large, much larger than the number of pixels, the two schemes become increasingly similar. Why? Because the core problem becomes one of dataset management, and the optimal solution is output sensitive. So the problem boils down to finding and managing a minimal object working set: the subset of the massive object database that is necessary to render a single frame.</div><div><br /></div><div>A massively complex scene is the type of scene you have in a full unique dataset. In a triangle based engine, this is perhaps a surface quadtree or AABB tree combined with a texture quadtree for virtual textures, a la id Tech 5. In a voxel engine, this would be an octree with voxel bricks. But the data structure is somewhat independent of whether you trace or rasterize, and you could even mix the two. Either scheme will require a crucial visibility step which determines which subsets of the tree are required for rendering the frame. Furthermore, whether traced or rasterized, the dataset should be about the same, and thus the performance limit is about the same - proportional to the working dataset size.</div><div><br /></div><div>Which gets to an interesting question: What is the theoretical working set size? If you ignore multi-sample anti-aliasing and anisotropic sampling, you need about one properly LOD-filtered object primitive (voxel, triangle+texel, whatever) per pixel. Which is simple, surprisingly small, and of course somewhat useless, for naturally with engines at that level of complexity anti-aliasing and anisotropic sampling are important. Anti-aliasing doesn't by itself add much to the requirement, but the anisotropy-isotropy issue turns out to be a major problem.</div><div><br /></div><div>Consider even the 'simple' case of a near-infinite ground plane. Sure, it's naturally representable by a few triangles, but let's assume it has tiny displacements all over and we want to represent it exactly without cheats. A perfect render. The octree or quadtree schemes are both isotropic, so to get down to pixel-sized primitives, they must subdivide down to the radius of a pixel cone. Unfortunately, each such pixel cone will touch many primitives - as the cone has a near infinite length, and when nearly parallel to the surface, will intersect a near infinite number of primitives. But what's the real worst case?</div><div><br /></div><div>The solution actually came to me from shadow mapping, which has a similar sub problem in mapping flat regular grids to pixels. Consider a series of cascade shadow maps which perfectly cover the ground plane. They line up horizontally with the screen along one dimension, and align with depth along the other dimension - near perfectly covering the set of pixels. How many such cascades do you need? It turns out you need about log2(maxdist/neardist), the ratio of the far plane extent to the near plane. Assuming a realistic far plane of 16 kilometers and a 1 meter near plane, this works out to about 14 cascades. So in coarse approximation, anisotropy increases the required object density cost for this scene by a factor of roughly ~10x-20x. 
Ouch!</div><div><br /></div><div>This also gives us the worst case possible scene, which is just that single flat ground plane scaled up: a series of maximum length planes each perpendicular to a slice of eye rays, or alternatively, a vicious series of pixel tunnels aligned to the pixel cones. The worst case now is much, much worse than ~10-20x the number of pixels. These scenes are easier to imagine and encounter with an orthographic projection, and thankfully won't come up with a perspective projection very often, but they are still frightening.</div><div><br /></div><div>It would be nice if we could 'cheat', but it's not exactly clear how to do that. Typical triangle rasterizers can cheat in anisotropic texture sampling, but it's not clear how to do that in a tree based subdivision system, whether quadtree or octree or whatever. There may be some option with anisotropic trees like KD-trees, but they would have to constantly adapt as the camera moves. Detecting glancing angles in a ray tracer and skipping more distance is also not a clear win, as it doesn't reduce the working set size and breaks memory coherency.<br /></div><div><br /></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-90203991570725112762009-07-12T17:06:00.000-07:002009-07-13T22:38:58.156-07:00Understanding the Efficiency of Ray Traversal on GPUs<span class="Apple-style-span" style="font-size: small;">I just found this nice little </span><a href="http://www.tml.tkk.fi/~timo/"><span class="Apple-style-span" style="font-size: small;">paper </span></a><span class="Apple-style-span" style="font-size: small;">by Timo Aila linked on </span><a href="http://farrarfocus.blogspot.com/"><span class="Apple-style-span" style="font-size: small;">Atom</span></a><span class="Apple-style-span" style="font-size: small;">, Timothy Farrar's blog, who incidentally found and plugged my blog recently (how nice).</span><div><span class="Apple-style-span" style="font-size: small;"><br /></span></div><div><span class="Apple-style-span" style="font-size: small;">They have a great analysis of several variations of traversal methods using a standard BVH/triangle ray intersector code, along with simulator results for some potential new instructions that could enable dynamic scheduling. They find that in general, traversal efficiency is limited mainly by SIMD efficiency or branch divergence, not memory coherency - something I've discovered is also still quite true for voxel tracers.</span></div><div><span class="Apple-style-span" style="font-size: small;"><br />They have a relatively simple scheme to pull blocks of threads from a global pool using atomic instructions. I had thought of this but believed that my 8800 GT didn't support the atomics, and that I would have to wait until I upgraded to a GT200 type card. I was mistaken though - it's the 8800 GTX which is only cuda 1.0; my 8800 GT is cuda 1.1, so it should be good to go with atomics. 
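<br /><br />Something like the following is the sort of loop I have in mind for trying that - a minimal persistent-threads style sketch that fetches per-block batches from a global counter, not their actual code; Ray, Hit and Trace() are hypothetical placeholders:<br /><br />__device__ unsigned int g_nextRay; // global work counter, zeroed with cudaMemcpyToSymbol before launch<br /><br />__global__ void TraceKernel(const Ray* rays, Hit* hits, unsigned int numRays)<br />{<br /> __shared__ unsigned int baseRay; // start of the batch grabbed by this block<br /> while (true)<br /> {<br />  if (threadIdx.x == 0)<br />   baseRay = atomicAdd(&g_nextRay, blockDim.x); // grab a fresh batch of rays<br />  __syncthreads();<br />  if (baseRay >= numRays)<br />   return; // pool exhausted - uniform across the block, so safe to exit here<br />  unsigned int rayIdx = baseRay + threadIdx.x;<br />  if (rayIdx < numRays)<br />   hits[rayIdx] = Trace(rays[rayIdx]); // hypothetical per-ray traversal<br />  __syncthreads(); // make sure everyone has read baseRay before it gets overwritten<br /> }<br />}<br /><br />Instead of a fixed one-thread-one-pixel assignment, blocks keep pulling fresh batches until the counter passes the ray count, so blocks that happen to draw cheap rays simply fetch more work (global atomics need compute 1.1, which is exactly why the 8800 GT matters here).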
</span></div><div><span class="Apple-style-span" style="font-size: small;"><br /></span></div><div><span class="Apple-style-span" style="font-size: small;">I have implemented a simple scheduling idea based on a deterministic up-front allocation of pixels to threads. Like in their paper, I allocate just enough threads/blocks to keep the cores occupied, and then divvy up the pixel-rays amongst the threads. But instead of doing this dynamically, I simply have each thread loop through pixel-rays according to a 2D tiling scheme. This got maybe a 25% improvement or so, but in their system they were seeing closer to a 90-100% improvement, so I could probably improve this further. However, they are scheduling entire blocks of (I think 32x3) pixel-rays at once, while I had each thread loop through pixel-rays independently. I thought having each thread immediately move on to a new pixel-ray would be better as it results in fewer empty SIMD lanes, but it also causes another point of divergence in the inner loop for the ray initialization step. Right now I handle that by amortizing it - simply doing 3 or so ray iterations and then an iffed ray init, but perhaps their block scheduling approach is even faster. Perhaps even worse, my scheme causes threads to 'jump around', scattering the thread to pixel mapping over time. I had hoped that the variable ray termination times would roughly amortize out, but maybe not. Allowing threads to jump around like that also starts to scatter the memory accesses more.</span></div><div><span class="Apple-style-span" style="font-size: small;"><br /></span></div><div><span class="Apple-style-span" style="font-size: small;">The particular performance problem which I see involves high glancing angle rays that skirt the edge of a long surface, such as a flat ground plane. For a small set of pixel-rays, there is a disproportionately huge number of voxel intersections, resulting in a few long problem rays that take forever and stall blocks. My thread looping plan gets around that, but at the expense of decohering the rays in a warp. Ideally you'd want to adaptively reconfigure the thread-ray mapping to prevent low occupancy blocks from slowing you down while maintaining coherent warps. I'm surprised at how close they got to ideal simulated performance, and I hope to try their atomic-scheduling approach soon and compare.</span></div><div><span class="Apple-style-span" style="font-size: small;"><br /></span></div><div><span class="Apple-style-span" style="font-size: small;"><br /></span></div><div><span class="Apple-style-span" style="font-size: small;"><br /></span><div><span class="Apple-style-span" style="font-size:100%;"><span class="Apple-style-span" style="font-size:13px;"><br /></span></span></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com11tag:blogger.com,1999:blog-5208034552162375299.post-20612209261914089422009-07-11T12:54:00.000-07:002009-07-17T15:09:47.143-07:00Voxel Cone Tracing<span class="Apple-style-span" style="font-size:small;">I'm willing to bet at this point that the ideal rendering architecture for the next hardware generation is going to be some variation of voxel cone tracing. 
Oh, there are many very valid competing architectures, and with the performance we are looking at for the next hardware cycle all kinds of techniques will look great, but this particular branch of the tech space has some special advantages.</span><div><span class="Apple-style-span" style="font-size:100%;"><span class="Apple-style-span" style="font-size:13px;"><br /></span></span></div><div><span class="Apple-style-span" style="font-size:small;">Furthermore, I suspect that this will probably be the final rendering architecture of importance. I may be going out on a limb by saying that, but I mean it only in the sense that once that type of engine is fully worked out and running fast, it will sufficiently and efficiently solve the rendering equation. Other approaches may have advantages in dynamic update speed or memory or this or that, but once you have enough memory to store a sub-pixel unique voxelization at high screen resolution, you have sufficient memory. Once you have enough speed to fully update the entire structure dynamically, that's all you need. Then all that matters is end performance for simulating high end illumination effects on extremely detailed scene geometry.</span></div><div><span class="Apple-style-span" style="font-size:100%;"><span class="Apple-style-span" style="font-size:13px;"><br /></span></span></div><div><span class="Apple-style-span" style="font-size:small;">There is and still will be much low-level work on efficient implementation of this architecture, especially in terms of animation, but I highly doubt there will be any more significant high level rendering paradigm shifts or breakthroughs past that. The hardware is already quite sufficient in high end PC GPUs, the core research is mainly there, and most of the remaining work is actually building full engines around it, which won't happen at scale for a little while yet as the industry is still heavily focused on the current hardware generation. I actually think a limited voxel engine is almost feasible on current consoles (yes, really), at least in theory, but it would take a huge effort and probably require a significant CPU commitment to help the GPU. But PC hardware is already several times more powerful.</span><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">So why do I see this as the </span><i><span class="Apple-style-span" style="font-size:small;">final rendering paradigm</span></i><span class="Apple-style-span" style="font-size:small;">? It comes down to quality, scalability, generality, and (relative) simplicity. On the quality front, a voxel tracer can hit 'photoreal' quality with sufficient voxel resolution, sampling, and adequate secondary tracing for illumination. Traditional polygon rasterizers can approach this quality level, but only asymptotically. Still, this by itself isn't a huge win. Crysis looks pretty damn good - getting very close to being truly photoreal. On a side note, I think photoreal is an important, objective goal. You have hit photoreal when you can digitally reproduce a real world scene such that human observers cannot determine which images were computer generated and which were not. 
Crysis actually built alot of its world based on real-world scenes and comes close to this goal.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">But even if polygon techniques can approach photoreal for carefully crafted scenes, its much more difficult to scale this up to a large world using polygon techniques. Level of detail is trivially inherent and near perfect in a voxel system, and this is its principle perfomance advantage. </span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Much more importantly, a voxelization pipeline can and will eventually be built around direct 3D photography, and this will dramatically change our art production pipelines. With sufficient high resolution 3D cameras, you can capture massive voxel databases of real world scenes as they actually are. This raw data can then be processed as image data: unlit, assigned material and physical properties, and then packaged into libraries in a way similar to how we currently deal with 2D image content. Compared to the current techniques of polygon modelling, LOD production, texture mapping, and so on, this will be a dramatically faster production pipeline. And in the end, thats what matters most. It something like the transition from vector graphics to raster graphics in the 2D space. We currently use 3D vector graphics, and voxels represent the more natural 3D raster approach.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">In terms of tracing vs rasterization or splatting, which you could simplify down to scatter vs gather, scatter techniques are something of a special case optimization for an aligned frustum, a regular grid. For high end illumination effects the queries of interest require cones or frustums down to pixel size, the output space becomes as irregular as the input space, so scatter and gather actually become the same thing. So in the limit, rasterization/scatter becomes indistinguishable from tracing/gather. </span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Ray tracing research is a pretty active topic right now, and there are several different paths being explored. The first branch is voxels vs triangles. Triangles are still receiving most of the attention, which I think is unwarranted. At the scalability limit (which is all we should care about now), storing and accessing data on a simple regular grid is more effecient in both time and space. Its simpler and faster to sample correctly, mip-map, compress, and so on. Triangles really are a special case optimization for smooth surfaces, and are less effecient for general data sets that break that assumption. Once voxels work at performance, they work for everything with one structure, from surface geometry to foilage to translucent clouds.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Once settled on voxels, there a several choices for what type of acceleration structure to use. 
I think the most feasible is to use an octree of MxM bricks, as in the </span><a href="http://artis.imag.fr/Publications/2009/CNLE09/"><span class="Apple-style-span" style="font-size:small;">GigaVoxel</span></a><span class="Apple-style-span" style="font-size:small;"> work of Cyril Cassin. I've been investigated and doing some prototyping with these types of structures off and on for about a year, and see it as the most promising path now. Another option is to forgo bricks and trace into a deeper octree that stores single voxels in nodes, as in Jon Olick's work. Even though Olick's technique seemed surprisingly fast, its much more difficult to filter correctly (as evident in his video). The brick tracing allows simple 3D hardware filtering, which simultaneously solves numerous problems. It allows you to do fast approximate cone tracing by sampling sphere steps. This is vastly more effecient than sampling dozens of rays - giving you anistropic filtering, anti-aliasing, translucency, soft shadows, soft GI effects, depth of field, and so on all 'for free' so to speak. I found Cassins papers after I had started working on this, and it was simultaneously invigorating but also slightly depressing, as he is a little ahead of me. I started in cuda with the ambition of tackling dynamic octree generation on the GPU, which it looks like he has moved to more recently.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br />There are a number of challenges getting something like this working at speed, which could be its own post. Minimizing divergence is very important, as is reducing octree walking time. With the M-brick technique, there are two main paths in the inner loop, stepping through the octree and sampling within bricks. Branch divergence could easily cost half or more perfomance because of this. The other divergence problem is the highly variable step length to ray termination. I think dynamic ray scheduling is going to be a big win, probably using the shared memory store. I've done a little precursor to this by scheduling threads to work on lists of pixels instead of one-thread per pixel, as is typical, and this was already a win. I've also come up with a nifty faster method of traversing the octree itself, but more on that some other time.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">Dynamic updating is a challenge, but in theory pretty feasible. The technique I am investigate is based on combining dynamic data generation with streaming (treating them as the same problem) with a single unique octree. The key is that the memory caching management scheme also limits the dynamic data that needs to be generated per frame. It should be a subset of the working set, which in turn is a multiple of the screen resolution. Here a large M value (big bricks) is a disadvantage as it means more memory waste and generation time.</span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;">The other approach is to instance directly, which is what it looks like cyril is working on more recently. I look forward to his next paper and seeing how that worked out, but my gut reaction now is that having a two level structure (kd tree or bv tree on top of octree) is going to significantly complicate and slow down tracing. 
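<br /><br />As a rough illustration of the cone tracing loop described above - a sketch only, with a stub sampleVolumeLod() standing in for the filtered brick-octree lookup, and the octree walk itself omitted:<br /><pre>
#include &lt;cuda_runtime.h&gt;

// Stub standing in for a filtered lookup into the brick octree / mipmapped 3D
// data: returns pre-filtered color (xyz) and opacity (w) for the given footprint.
// Constant here only so the sketch compiles.
__device__ float4 sampleVolumeLod(float3 p, float lod)
{
    return make_float4(0.5f, 0.5f, 0.5f, 0.05f);
}

// Cone tracing by sphere stepping: the cone footprint grows linearly with
// distance, the sampling LOD tracks the footprint, and the step size tracks
// the footprint too, so one sample per 'sphere' covers the cone. Front-to-back
// compositing accumulates color and opacity until the cone saturates.
__device__ float4 coneTrace(float3 origin, float3 dir, float tanHalfAngle,
                            float tMax, float voxelSize)
{
    float3 rgb = make_float3(0.0f, 0.0f, 0.0f);
    float  a   = 0.0f;
    float  t   = voxelSize;                        // start just past the apex

    while (t < tMax && a < 0.99f)
    {
        float radius = fmaxf(voxelSize, t * tanHalfAngle);   // cone footprint here
        float lod    = log2f(radius / voxelSize);            // matching mip level
        float3 p = make_float3(origin.x + t * dir.x,
                               origin.y + t * dir.y,
                               origin.z + t * dir.z);

        float4 s = sampleVolumeLod(p, lod);
        float  w = (1.0f - a) * s.w;               // front-to-back blend weight
        rgb.x += w * s.x;  rgb.y += w * s.y;  rgb.z += w * s.z;
        a     += w;

        t += radius;                               // sphere step: advance one footprint
    }
    return make_float4(rgb.x, rgb.y, rgb.z, a);
}
</pre>As for the instanced, two-level structure mentioned above: 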
I suspect it will actually be faster to use the secondary, indexed structures only for generating voxel bricks, keeping the primary octree you trace from fully unique. With high end illumination effects, you still want numerous cone traces per pixel, so tracing will dominate the workload and its better to minimize the tracing time and keep that structure as fast as possible. </span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:small;"><br /></span></div><div><span class="Apple-style-span" style="font-size:100%;"><span class="Apple-style-span" style="font-size:13px;"><br /></span></span></div></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com5tag:blogger.com,1999:blog-5208034552162375299.post-59321308142726999422009-06-09T15:59:00.000-07:002009-07-11T12:54:06.212-07:00Single Pass MSAA Deferred Rendering - Dithered Deferred Rendering Idea<span class="Apple-style-span" style="font-size: small;">So what we'd like to have is a deferred shading technique that renders in a single pass, with MSAA, but without much additional memory and bandwidth. In my previous post I describe the MSAA Z Prepass idea, which adds MSAA on top of deferred shading by using only a little extra memory for a MSAA z-buffer (8-16 megs), and a z-only prepass (fairly fast), plus a little up-sampling resolve work at the very end. I think this is pretty good, and is probably overall better than the inferred lighting idea currently pursued by volition. As a quick side note, I would also accumulate lighting buffers at a seperate, lower resolution than the screen, just as in the inferred lighting technique, and would not necessarily even render the main DS buffer at full 720p. (ideally it could even vary dynamically) Only the intial Z-prepass itself needs full screen resolution with MSAA, as it determines coverage.</span><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">The key concept here is one of <span class="Apple-style-span" style="font-weight: bold;">spatial frequency optimization</span>. The numerous sub-components of the shading and lighting contributions do not have equal image contributions at all frequency bands - and their cost of computation varies greatly, so there is a huge potential gain by seperating out sub-components of shading and lighting computations and evaluating them at numerous reductions of the full image resolution. This is a huge speedup if we can quickly up-sample and combine them together later - which is a motivation for a fast bilateral/depth context filter. 
A typical example is evaluating SSAO at reduced resolution, but we can apply it more generally to everything.</span></div><div><br /></div><div><span class="Apple-style-span" style="font-size: 13px;">For Deferred Shading, we need to output numerous shader inputs in the geometry pass, but we'd like to reduce the memory footprint down to reasonable levels. What we'd really like to do is render directly into a compressed format. The is exactly what dithered rendering accomplishes. The key of the dithered deferring idea is inspired by Mpeg/Jpeg's mosiac YUV decomposition, which seperates luminance and chrominance, storing luminance at full resolution and the two chrominance values at reduced resolution on an offset grid fashion. So the trick is to break up all the individual scalar components and output them on different interlaced grids, allocating space so that just a few of the most important components get full resolution, and the rest are rendered at reduced resolution with different dithering patterns.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">With deferred rendering, we typically have per-pixel storage requirements of depth, 3 albedo color components, 2-3 normal components, emmissive, 2-6 specular components, and perhaps a few extra. Depth needs sub-pixel resolution (as it represents MSAA coverage), luminance and perhaps one of the normal's components needs full pixel resolution, and the rest need less. Next in importance is probably specular scale and then chrominance - specular power, emissive, and specular chrominance (if present) are usually much lower frequency. Now we could probably actually pack most of this in two a single 32-bits with dithering, but with 2X MSAA, we actually have 64-bits to work with per pixel, rendering to a 32-bit buffer.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Unfortunately, I'm not aware of a single pass method to output seperate values to the different MSAA samples. Maybe there is some low-level hackery that could accomplish this? But regardless, thats actually ok, because not all materials actually use all of the parameters listed above. So, the idea is to isolate out the more 'rare' components and render them in a 2nd pass that touches just 1 of the MSAA samples. Albedo, Normal, specular level, and perhaps specular power could fit in the 1st pass, and then emissive and any remaining specular in the 2nd pass.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">The dithering would be undone and unpacked during or before lighting into non MSAA (and lower res) textures, hopefully reusing some temp memory as its only needed during the lighting stage. 
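<br /><br />As a toy example of the mosaic idea (an illustrative layout, not the exact packing proposed above): convert albedo to YCoCg, store luma every pixel and only one of the two chroma terms per pixel on a checkerboard, then rebuild the missing chroma from a neighbor when unpacking for the lighting stage.<br /><pre>
#include &lt;cuda_runtime.h&gt;

__device__ float3 rgbToYCoCg(float3 c)
{
    return make_float3(0.25f * c.x + 0.5f * c.y + 0.25f * c.z,   // Y  (luma)
                       0.5f  * c.x              - 0.5f  * c.z,   // Co
                      -0.25f * c.x + 0.5f * c.y - 0.25f * c.z);  // Cg
}

// Pack: two channels out per pixel - luma plus one chroma term, alternating
// Co/Cg on a checkerboard, so three channels fit in two.
__global__ void packAlbedo(const float3* albedo, float2* packed, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float3 ycc = rgbToYCoCg(albedo[y * w + x]);
    float chroma = ((x + y) & 1) ? ycc.z : ycc.y;   // Cg on odd pixels, Co on even
    packed[y * w + x] = make_float2(ycc.x, chroma);
}

// Unpack: borrow the other chroma term from a horizontal neighbor, which sits
// on the opposite grid. A fancier reconstruction could blend several neighbors.
__global__ void unpackAlbedo(const float2* packed, float3* ycocg, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float2 self = packed[y * w + x];
    int nx = (x + 1 < w) ? x + 1 : x - 1;            // neighbor on the other grid
    float2 nbr  = packed[y * w + nx];
    float co = ((x + y) & 1) ? nbr.y  : self.y;
    float cg = ((x + y) & 1) ? self.y : nbr.y;
    ycocg[y * w + x] = make_float3(self.x, co, cg);  // convert back to RGB when lighting
}
</pre>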
It may also be worthwhile to still do lighting with a light buffer at lower resolution and then up-sample and combine with albedo to get better lighting fillrate.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Why this is potentially cool:</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- single CPU-geometry pass for performance limited stuff like foilage, terrain, anything with a more basic shader</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- extra pass for just the objects with fancy shaders with alot of inputs (typically not many objects)</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- low memory/bandwidth cost (potentially lowest of all schemes)</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- full MSAA w/ fat deferred buffers. Even 4x MSAA could be feasible, or perhaps 2X MSAA 1080p</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- could output seperate stencil in the 2nd pass, so 16 bits of stencil, more with additional passes for special objects</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">With a little extra work to store ID bits in the stencil or in one of the channels, you could also get this to work with alpha-to-coverage MSAA. You need some sort of ID in that case as the alpha-to-coverage needs to use all the MSAA samples at once - so the sample's meaning can't just be determined by its grid position.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">On the alpha-coverage MSAA note, I've been wondering of late if 4x MSAA with dithering is enough to render full order-independent translucency with DS for things like soft particles. There's some issues combining the ideas, but its possible with the ID technique as the particles don't need to store much info per pixel at all. I did some little tests in photoshop with 5 color dithering (equivalent to alpha to coverage output with 4x MSAA), and it is feasible. A good dither plus a slight post process blur results in a little quality loss, but not too much. The challenges are in the dithering, post-process depth-order blur, soft particle z-output, and of course performance loss due to a highly randomized z-buffer. But if feasible, it would be great to have fully lit/shadowed soft particles follow the same path as everything else. Volumetric lighting and godrays could come out almost for free.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-37506132341194620972009-06-09T14:43:00.000-07:002009-07-11T12:54:06.212-07:00Deferred Rendering w/ MSAA - MSAA Z Prepass Idea<span class="Apple-style-span" style="font-size: small;">Deferred rendering presents something of a challenge to combine with MSAA on current console hardware because of the memory/bandwidth overhead of storing multiple render targets with multiple samples. The extra memory hurts both PS3 and 360 equally, but the bandwidth effect differs on the two platforms. 
On PS3, there is a straightfoward additional bandwidth cost proportional to the per-pixel output memory touched. This can really hurt for overdraw intensive items, such as foilage. Killzone 2 shipped with 2x quincox MSAA and a fat deferred shading buffer 64 Megs! If i recall correctly. However, it was PS3 only, and killzone 2 essentially has no foilage - the world is desolate & urban. They certainly aren't rendering jungles.</span><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">On 360, the memory situation is even worse because of EDRAM. Even though the bandwidth while rendering is nearly free, the geometry overhead adds cost and the all the extra resolves eat up time. The tile padding ends up using up even more total bandwidth. And during resolve, the rest of the system is essentially idle, destroying parallelism, so its actually somewhat worse than on PS3.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Newer GPU can do MSAA compression while rendering, using simple block based schemes that store a couple of real samples per pixel (like the min/max of DXTC), and then a large number of 'coverage samples' per pixel which are simply a few bits to select an examplar sample. This compression takes advantage of the fact that really the high frequency information we care about is coverage, and its somewhat wasteful to store all of our buffers at multi-sample resolution.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">So based on that, I have a MSAA deferred shading idea that uses 2 passes. Lets call this technique MSAA Z-prepass. One pass is rendered with z-only, 2x or even 4x MSAA, and no other buffers active - an MSAA z-pre pass essentially. The second pass is rendered with your typical DS buffers, but no MSAA. You then perform lighting/shading as normal, resulting in a pre-final non-MSAA framebuffer. As a final step, you use a bilateral filter of depth to fill in the missing information and up-sample to MSAA resolution, which can then be resolved back down for the final buffer - naturally this can all be combined in to one fast step. I'm assuming familiarity with bilateral filtering - but basically here I'm using it as a depth-sensitive up-sample. The results should be very similar to full MSAA on all the buffers, but without the memory/bandwidth cost - as it uses the same compression principle the newer coverage based MSAA techniques use.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">With a careful z-downsample/resolve pass, you can probably use the 1st Z pass to populate the Hi-Z for the 2nd pass and speed up rendering. Still requires 2 render passes, which is un-optimal as I described in a previous post.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">This is still on my wishlist, not something I've had the time to implement, but there was a recent paper by some guys at volition that uses the same principles to combine MSAA with 2 pass deferred lighting. 
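<br /><br />For reference, here is a minimal sketch of the depth-sensitive up-sample step this idea leans on, written as a compute-style kernel rather than a pixel shader; the buffer names and the weight falloff are assumptions, not a particular engine's implementation:<br /><pre>
#include &lt;cuda_runtime.h&gt;

__device__ float depthWeight(float dHi, float dLo, float sharpness)
{
    float diff = dHi - dLo;
    return __expf(-sharpness * diff * diff);   // fall off as depths diverge
}

// Each full-res pixel gathers the 4 nearest low-res samples and weights them
// by a bilinear term times a depth-similarity term, so lighting does not
// bleed across depth discontinuities.
__global__ void depthAwareUpsample(const float4* loColor, const float* loDepth,
                                   const float* hiDepth, float4* hiColor,
                                   int loW, int loH, int hiW, int hiH,
                                   float sharpness)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= hiW || y >= hiH) return;

    // Position of this pixel in low-res sample space.
    float fx = (x + 0.5f) * loW / hiW - 0.5f;
    float fy = (y + 0.5f) * loH / hiH - 0.5f;
    int   x0 = max(0, min(loW - 2, (int)floorf(fx)));
    int   y0 = max(0, min(loH - 2, (int)floorf(fy)));
    float tx = fminf(fmaxf(fx - x0, 0.0f), 1.0f);
    float ty = fminf(fmaxf(fy - y0, 0.0f), 1.0f);

    float d = hiDepth[y * hiW + x];
    float4 sum = make_float4(0, 0, 0, 0);
    float wsum = 0.0f;
    for (int j = 0; j < 2; ++j)
        for (int i = 0; i < 2; ++i)
        {
            int li = (y0 + j) * loW + (x0 + i);
            float wb = (i ? tx : 1.0f - tx) * (j ? ty : 1.0f - ty);   // bilinear
            float w  = wb * depthWeight(d, loDepth[li], sharpness) + 1e-5f;
            sum.x += w * loColor[li].x;  sum.y += w * loColor[li].y;
            sum.z += w * loColor[li].z;  sum.w += w * loColor[li].w;
            wsum  += w;
        }
    hiColor[y * hiW + x] = make_float4(sum.x / wsum, sum.y / wsum,
                                       sum.z / wsum, sum.w / wsum);
}
</pre>The Volition paper mentioned above is built around the same kind of depth-aware up-sampling. 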
They decided this warranted a whole new name, dubbing it <a href="http://graphics.cs.uiuc.edu/~kircher/inferred/inferred_lighting_paper.pdf">inferred lighting</a>.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">They modify defered lighting to render at different resolutions in the two passes. Specifically, they use a reduced frame buffer (40% or so) for 1st depth/normal pass, and then use a full size MSAA buffer for the 2nd pass. The light buffers are up-sampled using a bilateral technique in the shader of the 2nd pass. By extending this up-sampling technique with some dither knowledge, they also do stippled alpha rendering to get some level of order-independent translucency with deferred shading. Not enough that you could render a full particle system with that path, but enough for a few layers of glass windows or what not.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Their solution is interesting, but it even worsens the performance problems with alpha tested stuff I mentioned in my earlier post on Deferred Shading vs Deferred Lighting - as the 2nd pass has now gotten considerably more expensive due to the bilateral up-sampling filter. And worse, since the buffers mis-match, they can not easily use the 1st pass output to prime the z-buffer for the 2nd pass. (its probably possible to do a conservative screen pass just to populate the HI-Z, but not sure if they are doing that.) Noticeably, Red Faction takes place on mars, so their engine doesn't have to deal with foilage.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Accumulating lighting at lower resolution (or actually better - multiple resolutions) is something I've been thinking about for a while, and is already well tested at least for AO, and they are using this to great effect in Red Faction to get lots of lights per pixel at speed.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">But anyway, this motivated me to try and improve my MSAA Z prepass idea to get it down to a single pass with deferred shading, and also to find a fast method of bilateral or depth-sensitive filtering.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com0tag:blogger.com,1999:blog-5208034552162375299.post-12298632983447598522009-06-09T11:20:00.000-07:002009-07-11T12:54:06.212-07:00Deferring Techniques<span class="Apple-style-span" style="font-size: 13px;">At this point in time, some form of deferred rendering is becoming the standard rendering technique in games. I've long been a fan of deferred shading, and was quite pleased with the results after converting our forward renderer to deferred on a 360 project a little over a year ago. More recently, moving to a different project and engine, we went through the forward->deferred transition again, but our lead programmer tried a variation of the idea called deferred lighting. 
From the beginning, I wasn't a fan of the technique, for a variety of reasons. For a longer summary of the idea, and a more in depth comparison of deferred shading vs deferred lighting, check out this <a href="http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html">extensive post</a> on gameangst. At this point I am going to assume you are familiar with the techniques. I mainly agree with Adrian's points, but there's a few issues I think he left out.</span><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Deferred lighting is usually marketed as a more flexible alternative to 'traditional' deferred shading which has the proposed advantages:</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- similar performance, perhaps better in terms of light/pixel overdraw cost</span></div><div><span class="Apple-style-span" style="font-size: 13px;">- no compromise in terms of material/shader flexibility</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">In short, I think the first claim is dubious at best, and the 2nd claim actually turns out to be false. Adrian has an exposition on why the 2nd material pass in deferred lighting actually gives you much less flexibility than you would think. The simple answer is that material flexibility (or fancy shaders), modify BRDF inputs or they alter the BRDF itself. Flexibility in terms of modifying the BRDF inputs (multi-layer textures, procedural, animated textures, etc.) can easily be accounted for in traditional deferred shading, so there is no advantage there. Deferred lighting is quite limited in how it can modify the BRDF because it must use a common function for the incoming light (irradiance) at each surface point, for all materials. It only has flexibility for the 2nd half of the BRDF, the exit light (radiance) on the eye path. Materials with fancy specular (like skin) are difficult to even fake with control only over exit radiance.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Now, there is a solution using stencil techniques that allows multiple shader paths during light accumulation, but traditional deferred shading techniques can use this too to get full BRDF flexibility. So Deferred Lighting has no advantage in BRDF flexibility. (more on the stencil techniques in another post)</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">But the real problem with deferred lighting is in performance - its not similar to deferred shading at all. The 1st problem is that all else being equal, two full render passes are just always going to be slower. The extra CPU draw call cost and geometry processing can be significant, especially if you are trying to push the geometry detail limits of the hardware (and shouldn't you?). The geometry processing could only be 'free' if there was significant pixel shader work to load balance against, and the load balancing was effecient. On PS3, the load balancing is not effecient, and more importantly, there is not much significant pixel shader work. 
Most of the significant pixel shader work is in the light accumulation, which is moved out of any geometry pass in both techniques - so they easily will be geometry limited. This is the prime disadvantage of any deferred technique right now vs traditional forward shading. With forward shading, it's much easier to really push the geometry limits of the hardware, as all pixel shading is done in one heavy pass.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">Furthermore, the overdraw performance of the two systems is not comparable, and for high overdraw objects, such as foliage, deferred shading has a large advantage. Foliage objects are typically rendered with alpha-test, and because of this they receive only a partial benefit from the hardware's Hi-Z occlusion. In our engine, the 1st pass in the two techniques for simple foliage is similar: both sample a single texture for albedo/alpha. The only difference is that in DS the 1st pass outputs albedo and normal, versus just the normal for DL. The 2nd pass, unique to DL, must read that same diffuse/albedo texture again, as well as the lighting information, which is often in one or two 64-bit textures. So it's easily 3 times the work per pixel touched.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">As a side note: the problems with Hi-Z and alpha test are manifold. With 2 pass rendering, you would think the fully populated z-buffer and Hi-Z from the 1st pass will limit overdraw in the 2nd pass to a little over 1.0. This is largely true for reasonable polygon scenes without alpha test. The problem with alpha-test is that it creates a large number of depth edges and wide z-variation within each Hi-Z tile. Now, this wouldn't be such a problem if the Hi-Z tiles stored a min/max z range, because then you could do fast rejection on the 2nd pass with z-equal compares. But they store a single z-value, either the min or the max, useful only for a greater-equal or less-equal compare test. 
Thus, when rendering triangles with alpha-test in the second pass, you get alot of false overdraw for pixels with zero-alpha that still pass the Hi-Z test.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;">The DS vs DL debate gets a little more complicated when you try to do MSAA, but thats another story.</span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div><div><span class="Apple-style-span" style="font-size: 13px;"><br /></span></div>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com2tag:blogger.com,1999:blog-5208034552162375299.post-18893549360363226962009-04-02T22:04:00.000-07:002009-08-13T12:03:51.165-07:00OnLive, OToy, and why the future of gaming is high in the cloud<span style="font-size:85%;"></span><br /><span style="font-size:85%;">For the last six months or so, I have been researching the idea of cloud computing for games, the technical and economic challenges, and the video compression system required to pull it off.</span><br /><br /><span style="font-size:85%;">So of course I was shocked and elated with the big OnLive announcement at GDC.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">If OnLive or something like it works and has a successful launch, the impact on the industry over the years ahead could be transformative. It would be the end of the console, or the last console. Almost everyone has something to gain out of this change. Consumers gain the freedom and luxury of instant on demand access to ultimately all of the world's games, and finally the ability to try before you buy or rent. Publishers get to cut out the retailer middle-man, and avoid the banes of piracy and used game resales.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But the biggest benefit ultimately will be for developers and consumers in terms of the eventual game development cost reduction and quality increase enabled by the technological leap cloud computing makes possible. Finally developing for one common, relatively open platform (server-side PC) will significantly reduce the complexity in developing a AAA title. But going farther into the future, once we actually start developing game engines specifically for the cloud, we enter a whole new technological era. Its mind-boggling for me to think of what can be done with a massive server farm consisting of thousands or even tens of thousands of densly networked GPUs with shared massive RAID storage. Engines developed for this system will look far beyond anything on the market and will easily support massively multiplayer networking, without any of the usual constraints in physics or simulation complexity. Game development costs could be cut in half, and the quality bar for some AAA titles will eventually approach movie quality, while reducing technical & content costs (but that is the subject for another day).</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But can it work? And if so, how well? 
The main arguments against, as expressed by skeptics such as <a href="http://www.eurogamer.net/articles/gdc-why-onlive-cant-possibly-work-article">Richard Leadbetter</a>, boil down to latency, bandwidth/compression, and server economics. Some have also doubted the true value added for the end user: even if it can work technically and economically, how many gamers really want this?</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:130%;"><strong>Latency<br /></strong></span><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The internet is far from a guaranteed delivery system, and at first the idea of sending players inputs across the internet, computing a frame on a server, and sending it back across the internet to the user sounds fantastical. </span><br /><span style="font-size:85%;">But to assess how feasible this is, we first have to look at the concept of delay from a pyschological/neurological perspective. You press the fire button on a controller, and some amount of time later, the proper audio-visual response is presented in the form of a gunshot. If the firing event and the response event occur close enough in time, the <strong>brain processes them as a simultaneous event.</strong> Beyond some threshold, the two events desynchronize and are processed distinctly: the user notices the delay. A large amount of research on this subject has determined that the delay threshold is around 100-150ms. Its a fuzzy number obviously, but as a rule of thumb, <strong>a delay of under 120ms is essentially not noticeable to humans</strong>. This is a simple result of how the brain's parallel neural processing architecture works. It has a massive number of neurons and connections (billions and trillions respectively), but signals propagate across the brain very slowly compared to the speed of light. For more reference I highly recommend "Consciousness Explained" by Daniel C Dennet. Here are some interesting timescale factoids from his book:</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">saying, "one, Mississippi" 1000msec</span><br /><span style="font-size:85%;">umyelinated fiber, fingertip to brain 500msec</span><br /><span style="font-size:85%;">speaking a syllable 200msec</span><br /><span style="font-size:85%;">starting and stopping a stopwatch 175msec</span><br /><span style="font-size:85%;">a frame of television (30fps) 33msec</span><br /><span style="font-size:85%;">fast (myelinated) fiber, fingertip to brain 20msec</span><br /><span style="font-size:85%;">basic cycle time of a neuron 10msec</span><br /><span style="font-size:85%;">basic cycle time of a CPU(2009) .000001msec</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So the minimum delay window of 120ms fits very nicely into these stats. There are some strange and interesting consequences of these timings. <strong>In the time it takes the 'press-fire' signal to travel from the brain down to the finger muscle, internet packets can travel roughly 4,000 km through fiber!</strong> (light moves about 200,000 km/s through fiber, or 200 km/msc * 20 msc) This is about the distance from Los Angeles to New York. 
Another remarkable fact is that the minimum delay window means that the <strong>brain processes the fire event and the response event in only a few dozen neural computation steps.</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">What really happens is something like this: some neural circuits in the user's brain "make the decision" to press the fire button (although at this moment most of the brain isn't conscious of it), the signal travels down through the fingers to the controller then on to the computer, which then starts processing the response frame. Meanwhile, in the user's brain, the 'button press' event is propagating through the brain, and more neural circuits are becoming aware of the 'button press' event. Remember, each neural tick takes 10ms. Some time later, the computer displays the audio/visual response of the gunshot, and this information hits the retina/cochlea and starts propagating up into the brain. These events connect, and if they are seperated by only a few dozen neural computation steps (120 ms), they are connected and perceived as a single, simultaneous event in time. In another words, there is a minimum time window of around a dozen neural firing cycles where events are propagating around the brain's neural circuits - even though it already happened, it takes time for all of the brain's circuits to become aware of the event. Given the slow speed of neurons, its simply remarkable that humans can make any kind of decisions on sub second timescales, and the 120 ms delay window makes perfect sense.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">In the world of computers and networks, 120 ms is actually a long amount of time. Each component of a game system (input connection, processing, output display connection) adds a certain amount of delay, and the total delay must add up to around 120ms or less for good gameplay. Up to 150ms is sometimes acceptable, and beyond 200ms we get quickly into rapid, problematic breakdown in the user experience as every action has noticeable delay.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But how much delay do current games have? Gamasutra has a great article on <a href="http://www.gamasutra.com/view/feature/3725/measuring_responsiveness_in_video_.php?print=1">this</a>. They measure the actual delay of real world games using a high speed digital camera. Of interest for us, they find a "<em>raw response time for GTAIV of 166 ms (200 ms on flat panel TVs</em></span><em>)</em>".<span style="font-size:85%;"> This is relatively high, beyond the acceptable range, and GTA has received some criticism for sluggish response. And yet this is the <strong>grand blockbuster of video games</strong>, so it certainly shows that some <strong>games can get away with 150-200ms responses and the users simply don't notice or care</strong>. Keep in mind this delay time isn't when playing the game over OnLive or anything of that sort: this is just the natural delay for that game with a typical home setup.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">If we break it down, the controller might add 5-20ms, the TV can add 10-50ms, but the bulk of the delay comes from the game console itself. Like all modern console games, the GTA engine buffers multiple frames of data for a variety of reasons, and running at 30fps, every frame buffered costs a whopping 30ms of delay. 
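<br /><br />As a sanity check, the whole budget can be written down in one line - input path, plus buffered frames over frame rate, plus network round trip, plus display. The sample figures below are illustrative, not measurements:<br /><pre>
// total = controller/input + (buffered frames / fps) + network RTT + display
static float totalLatencyMs(float inputMs, int bufferedFrames, float fps,
                            float networkRttMs, float displayMs)
{
    return inputMs + 1000.0f * bufferedFrames / fps + networkRttMs + displayMs;
}

// Local 30fps console game:  5 + 4 * 33.3 + 0  + 30  ~ 168 ms (GTA-like)
// Remote 120fps server:      5 + 2 * 8.3  + 40 + 30  ~  92 ms
</pre>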
From my home DSL internet in LA, I can get pings of 10-30ms to LA locations, and 30-50ms pings to locations in San Jose. So now you can imagine lengthening the input and video connections out across the internet is not so ridiculous as it first seems at all. It adds additional delay, which you simply need to compensate for somewhere else. </span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">How does OnLive compensate for this delay? The result for existing games is deceptively simple: <em>you just run the game at a mucher higher FPS than the console, and or you reduce internal frame buffering</em>. If the PC version of a console game runs at 120 FPS, and it still keeps 4 frames of internal buffering, you get a delay of only 32 ms. If you reduce the internal buffering to 2, you get a delay of just 16ms! If you combine that with a very low latency controller and a newer low latency TV, suddenly it becomes realistic for me to play a game in LA from a server residing in San Jose. Not only is it realistic, but the gameplay experience could actually be better! In fact, with a fiber FIOS connection and good home equipment, you could conceivably play from almost anywhere in the US, in theory. The key reason is that many <strong>console games have already maxxed out the maximum delay (when running on the console), and modern GPU's are many times faster</strong>.</span><br /><br /><span style="font-size:130%;"><strong>Video Compression/Bandwidth</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So we can see that in principle, from purely a latency standpoint, the OnLive idea is not only possible, but practical. However, OnLive can not send a raw, uncompressed frame buffer directly to the user (at least, not at any acceptable resolution on today's broadband). For this to work, they need to squeeze those frame buffers down to acceptably tiny sizes, and more importantly, they need to do this rapidly or near instantly. So is this possible? What is the state of the art in video compression?</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">For a simple, dumb solution, you can just send raw jpegs, or better yet, wavelet compressed frames, and perhaps get acceptable 720p images down to 1 Mbit or even 500Kbit for more advanced wavelets, using more or less off the shelf algorithms. With a wavelet approach, this would allow you to get 10fps with a 5Mbit connection. But of course we can do much better using a true video codec like H.264, which can squeeze 720p60fps video down to 5Mbit easily, or even considerably less, especially if we are willing to lower the fps in some places and or the quality.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">H.264 and other modern video codecs work by sending improved JPEG key frames, and then sending motion vectors which allow predicted frames to be delta-encoded in far less bits, getting 10-30X improvement over sending raw JPGs, depending on the motion. But unfortunately, motion compensation means spikes in the bitrate - scene cuts or frames with rapid motion receive little benefit from motion compensation.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But H.264 encoders typically buffer up multiple frames of video to get good compression. OnLive has much less leeway here. Ideally, you would like a zero-latency encoder. 
H.264 and its predecessors have been designed to be used in video tele-conferencing systems, which demand low-latency. So there is already a predecent, and a modified version of the algorithm that avoids sending complete JPEG key frame images. Instead, using this low latency mode, small blocks of the image are periodically refreshed, but it never sends a complete JPEG key frame down the pipe, as this would take too long - creating multiple frames of delay.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">There are in fact some new, interesting off the shelf H.264 hardware solutions which have near zero (1ms) or so delay, and are relatively cheap (in cost and power) - perhaps practical for OnLive. In particular, there is the <a href="http://www.caviumnetworks.com/PureVu_CNW31XX.html">PureVu</a> family of video processors, from Cavium Networks. I have not seen them in action, but I imagine that with 720p60 at 5MBits/s, you are going to see some artifacts and glitches, especially with fast motion. But at least we are getting close, with off the shelf solutions.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But of course, OnLive is not using an off the shelf system(they have special encoding hardware and a plugin decoder), and improved video compression specific to the demands of remote video gaming is their central tech, so you can expect they have created an advancement here, but it doesn't have to be revolutionary, as the off the shelf stuff is already close.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So the big problem is the variation in bitrate/compressibility from one frame to the next. If the user rapidly spins around, or teleports, you simply can not do better than sending a complete frame. So you either send these 'key' frames at lower quality, and or you spend a little longer on them, introducing some extra delay. In practise some combination of the two is probably ideal. With a wavelet codec or a specialized H.264 variant, key frames can simply be sent at lower resolution, and then the following frames will use motion compensation to start adding detail to the image. The appearance would be a blurred image for the first frame or so when you rapidly spin the camera, which would then quickly up-res in to full detail over the next several frames. With this technique, and some trade off of lowering the frame rate or adding delay a bit on fast motion, <strong>I think 5Mbps is not only achievable, but beatable using state of the art compression coming out of research right now</strong>.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The other problem with compression is the CPU cost for compression itself. But again, if the PureVu processor is indicative, off the shelf hardware solutions are possible right now with H.264 at very low power, encoding multiple H.264 streams with near zero latency.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But here is where the special nature of game video or computer generated graphics allows us to make some huge effeciency gains over natural video. The most complex CPU task in video encoding is motion vector search - finding the matching image regions from previous frames that allow the encoder to send motion vectors and do effecient delta compression. But for a video stream rendered with a game engine, <strong>we can output the exact motion vectors directly</strong>. 
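<br /><br />A sketch of where those vectors come from - simple reprojection of each pixel's position through last frame's camera, the same math a motion blur filter already does. This version only captures camera motion; moving objects would need their previous transforms as well, and all names here are illustrative:<br /><pre>
#include &lt;cuda_runtime.h&gt;

struct Mat4 { float m[16]; };   // row-major 4x4

__device__ float4 mulPoint(const Mat4& M, float4 v)
{
    return make_float4(M.m[0]*v.x + M.m[1]*v.y + M.m[2]*v.z  + M.m[3]*v.w,
                       M.m[4]*v.x + M.m[5]*v.y + M.m[6]*v.z  + M.m[7]*v.w,
                       M.m[8]*v.x + M.m[9]*v.y + M.m[10]*v.z + M.m[11]*v.w,
                       M.m[12]*v.x + M.m[13]*v.y + M.m[14]*v.z + M.m[15]*v.w);
}

// worldPos: per-pixel world position, reconstructed from the depth buffer.
// motion:   per-pixel screen-space delta handed to the video encoder.
__global__ void cameraMotionVectors(const float4* worldPos, float2* motion,
                                    Mat4 viewProj, Mat4 prevViewProj,
                                    int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int i = y * width + x;

    float4 cur  = mulPoint(viewProj,     worldPos[i]);   // clip space, this frame
    float4 prev = mulPoint(prevViewProj, worldPos[i]);   // clip space, last frame
    motion[i] = make_float2(cur.x / cur.w - prev.x / prev.w,
                            cur.y / cur.w - prev.y / prev.w);
}
</pre>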
This is a potential problem in that not all games necessarily have motion vectors available, which may require modifying the game's graphics engine. However, motion blur is very common now in game engines (everybody's doing it, you know), and the motion blur image filter computes motion vectors (very cheaply). Motion blur gives an additional benefit for video compression in that it generates blurrier images in fast motion, which are the worst case for video compression.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"><strong>So if I was doing this, I would require the game to use motion blur, and output the motion vector buffer to my (specialized, not off the shelf) video encoder.</strong></span><br /><strong><span style="font-size:85%;"></span></strong><br /><span style="font-size:85%;">Some interesting factoids: it apparently takes roughly 2 weeks to modify the game for OnLive, and at least 2 of the 16 announced titles (Burnout and Crysis) are particularly known for their beautiful motion blur - and all of them, with the exception of World of Goo - are recent action or racing games that probably use motion blur.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">There is however, an interesting and damning problem that I am glossing over. The motion vectors are really only valid for the opaque frame buffer. What does this mean? The automatic 'free' motion vectors are valid for the solid geometry, not all the alpha-blended or translucent effects, such as water, fire, smoke, etc. So these become problem areas. Its interesting that several of the GDC commentors pointed out ugly compression artifacts when fire or smoke effects were prominent in BioShock running OnLive.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">However, many games already render their translucent effects at lower resolution (SD and even lower in modern console engines), so it would make sense perhaps to simply send these regions at lower resolution/quality, or blur them out (which a good video encoder would probably do anyway).</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But in short, the video compression is the central core tech problem, but <strong>they haven't pulled a miracle here</strong> - at best they have some good new tech which exploits some of the special properties of game video. And furthemore, I can even see a competitor with a 2x better compression system coming along and trying to muscle them out.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">There's one other little issue which is worth mentioning slightly, which is packet loss. The internet is not perfect, and sometimes packets are lost or late. I didn't mention this earlier because it has well known and relatively simple technical solutions for real time systems. Late packets are treated as dropped, and dropped packets and errors are corrected through bit level redundancy. You send small packet streams in groups using bit association techniques such that any piece of lost data can be recovered, at the cost of some redundancy. For example, you send 10 packets worth of data using 11 packets, and any single lost packet can be fully reconstructed. More advanced schemes adaptively adjust the redundancy based on measured packet loss, but this tech is alreadly standard, its just not always use or understood. 
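<br /><br />The single-parity scheme is simple enough to sketch in a few lines: one extra packet per group carries the XOR of the group, and any one lost packet is rebuilt by XOR-ing the parity with the packets that did arrive. Packet size and group size here are arbitrary:<br /><pre>
#include &lt;cstring&gt;
#include &lt;cstddef&gt;

static const size_t PKT = 1200;   // assumed payload size in bytes

// Build the parity packet as the XOR of all n data packets in the group.
void buildParity(const unsigned char data[][PKT], int n, unsigned char parity[PKT])
{
    memset(parity, 0, PKT);
    for (int i = 0; i < n; ++i)
        for (size_t b = 0; b < PKT; ++b)
            parity[b] ^= data[i][b];
}

// Recover packet 'lost' given the other n-1 packets and the parity packet.
void recoverPacket(const unsigned char data[][PKT], int n, int lost,
                   const unsigned char parity[PKT], unsigned char out[PKT])
{
    memcpy(out, parity, PKT);
    for (int i = 0; i < n; ++i)
        if (i != lost)
            for (size_t b = 0; b < PKT; ++b)
                out[b] ^= data[i][b];
}
</pre>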
Good game networking engines already employ these packet loss mitigation techniques, and work fine today over real networks. </span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The worst case is simply a dropped connection, which you just can't do anything about - OnLive's video stream would immediately break and notify you of a connection problem. Of course, the cool thing about OnLive is that it could potentially keep you in the game or reconnect you once you get your connection back.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:130%;"><strong>Server Economics</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So if OnLive is at least possible from a technical perspective (which it clearly is), the real question comes down to one of economics. What is the market for this service in terms of the required customer bandwidth? How expensive are these data centers going to be, and how much revenue can they generate?</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Here is where I begin to speculate a little beyond my areas of expertise, but I'll use whatever data I've been able to gather from the web.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">A few google searches will show you that US 'broadband' penetration is around 80-90%, and the average US broadband bandwidth is somewhere around 2-3 Mbps. This average is somewhat misleading, because US broadband is roughly split between cable (25 million subscribers), and DSL (20 million subscribers), with outliers like fiber (2-3 million subscribers currently) and the DSL users often have several times lower bandwidth than the cable. At this point in time, the great majority of American gamers already have at least 1.5 Mbps, perhaps half have over 5 Mbps, and almost all have a 5 Mbps option in their neighborhood, if they want it. So OnLive is in theory will have a large potential market, it really comes down to cost. How many gamers already have the required bandwidth? And for those who don't, how cheap is OnLive when you factor in the extra $ users may have to pay to upgrade? And to point out, the upgrade really will be for the HD option, as the great majority of gamers already have 1.5 Mbps or more. </span><br /><span style="font-size:85%;"></span><br /><span style="font-size:130%;"><strong>BandWidth Caps</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">There's also the looming threat of American telcos moving towards bandwidth caps. As of now, Time Warner is the only American telco experiementing with caps low enough to effect OnLive (40 Gigs/Month for their highest tier). Remember that using the HD option, 5 Mbps is the peak bandwidth, the average useage is half that or less, according to OnLive. So Comcast's cap of 250 Gigs/Month isn't really relevant. Time Warner is currently still testing its new policy in only a few areas, so the future is uncertain. However, there is one interesting fact to throw into the mix: Warner Bros, the Time Warner subsidary, is OnLive's principle investor. (the other two are AutoDesk and Maverick Capital) Now conser that Warner cable is planning some sort of internet video system for television based on a new wireless cable modem, and consider that Perlman's other company was Digeo, the creator of Moxi. 
I think there will be more OnLive suprises this year, but suffice to say, I doubt OnLive will have to worry about bandwidth caps from Time Warner. I suspect </span><span style="font-size:85%;">Time Warner's caps really are more about a grand plot to control all digital services in the home, by either direclty providing them or charging excess useage fees that will kill enemy services. But OnLive is definetly not their enemy. In the larger picture, the fate of OnLive is entertwined into the larger battle for net neutrality and control over the last mile pipes.</span><br /><span style="font-size:85%;"></span><br /><br /><span style="font-size:130%;"><strong>Bandwidth Cost</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">OnLive is going to have to partner with backbones and telcos, just like the big boys such as Akamai, Google and YouTube do, in what are called either transit or peering arrangements. A transit arrangement is basically bandwidth wholesale, and we'll start with that assumption. A little google searching reveals that wholesale mass transit bandwidth can be had for around or under 10$ per Megabit/s per month (comparable to end broadband customer cost, actually). Further searching suggests that in some places like LA it can be had for under 5$ per Mbs/month. This is for a dedicated connection or peak useage charge.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Now we need some general model assumptions. The exact subscriber numbers don't really matter, what critically matters are a couple of stats: how many hours a month does each subscriber play, and more directly, what is the typical peak fraction of users online at a given time. The data I've found suggests that 10 hours per week is a rough gamer average, or 20 hours per week for an MMO, 10% occupancy is typical for regular games and 20% peak occupancy is typical for some MMOs. Using the 20% peak occupancy means that you need to provide enough peak bandwidth for 20% of your user base to be online at a time - a worst case. In a potential worse case scenario, every user wants HD at 5 Mbits/s and the peak occupancy is 20%, so you need essentially a dedicated 1 Megabit/s for each user or $10/month per user in bandwidth cost alone. Assuming a perhaps more realistic scenario, the average user bandwidth is 3Mbps (not everyone can have or wants HD), peak occpuancy is 10%, and you get $3 per month in bandwidth cost per user.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Remember, in rare peak moments, OnLive can gracifully and slowly degrade video quality - so the service will never fail if they are smart. The worst case at terrible peak times is just a little lower image quality or resolution.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So roughly, we can estimate bandwidth will cost anywhere from $3-10 per month per user with transit arrangements. Whats also possible, and more complex, are peering arragnements. If OnLive partners directly with providers near its data centers, it can get substantially reduced rates (or even free) if the traffic stays with just that provider. So realistically, i think $5 per month in bandwidth per user is a reasonable upper limit on OnLive's bandwidth charges based on today's economic climate - and this will only go down. But 1080p would be significantly more expensive, and it would make sense to charge customer's extra. 
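<br /><br />The bandwidth arithmetic above reduces to a one-liner; the inputs are this post's assumptions, not OnLive's actual numbers:<br /><pre>
// Provisioned Mbps per subscriber = stream bitrate * fraction online at peak.
static float bandwidthCostPerUserPerMonth(float mbpsPerStream,
                                          float peakOccupancy,
                                          float dollarsPerMbpsMonth)
{
    return mbpsPerStream * peakOccupancy * dollarsPerMbpsMonth;
}

// Worst case:  5 Mbps * 0.20 * $10 = $10 per user per month
// Realistic:   3 Mbps * 0.10 * $10 =  $3 per user per month
</pre>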
<span style="font-size:85%;">I wouldn't be surprised if they have a tiered charge based on resolution - as most of their fixed costs scale roughly linearly with resolution.</span><br /><span style="font-size:85%;"></span><br /><strong><span style="font-size:130%;">Dataroom Expense</span><br /><span style="font-size:130%;"></span></strong><br /><span style="font-size:85%;">The main expense is probably not the bandwidth, but the per-server cost to run a game - a far more demanding task than what most servers do. Let's start with the worst case and assume that OnLive needs at least one decent CPU/GPU combination per logged-on user. OnLive is not stupid, so they are not going to use typical high-end, expensive big iron, but nor are they going to use off-the-shelf PCs. Instead I predict that, following in the footsteps of Google, they will use midrange, cheaper, power-efficient components, and get significant bulk discounts. Let's start with the basic cost of a CPU/motherboard/RAM/GPU combo. You don't need a monitor, and the storage system can be shared between a very large number of servers - as they are all running the same library of installed games.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So let's take a quick look on Pricewatch:</span><br />Core 2 Quad Q6600 CPU + fan + 4GB DDR2 RAM - $260<br />GeForce GTX 280 1GB 512-bit DDR3 602/2214 fansink HDCP video card - $260<br /><br /><span style="font-size:85%;">These components are actually high end - far more than sufficient to run the PC versions of most existing games at 90-150 fps at 720p, and yes, even Crysis at near 60 fps at 720p.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">If we consider that they may have researched a little longer and undoubtedly get bulk discounts, we can take $500 per server unit as a safe upper limit. Amortize this over 2 years and you get about $20 per month. Factor in the 20% peak demand occupancy, and we get a server cost of $4 per user per month.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">This finally leaves us with power/cooling requirements. Let's make an over-assumption of 600 watts of continuous power draw. With power at about $0.10 per kilowatt-hour, and 720 hours in a month, we get roughly $40 a month per server in power draw. Factor in the 20% peak demand occupancy, and we get $8 per user per month. However, this is an over-assumption, because the servers are not constantly using power. The 20% peak demand figure means they need enough servers for 20% of their users to be logged in at once - but most of the time not all of the servers are active. The power required would scale with the average demand, not the peak, so it's closer to $4 per user per month in this example (assuming a high average 10% occupancy). Cooling cost is harder to estimate, but some Google searching reveals it's roughly equivalent to the power cost, assuming modern datacenter design (and they are building brand new ones). So this leaves us with around $12 per user per month as an upper limit in server, power, and cooling cost.</span><br /><br /><br /><span style="font-size:85%;">However, OnLive is probably more efficient than this. My power/cooling numbers are high because OnLive probably spends a little extra on more expensive but power-efficient GPUs that save power/cooling cost, to hit the right overall sweet spot. For example, Nvidia's more powerful GTX 295 is essentially two GTX 280 cores on a single card. 
It's almost twice as expensive, but provides twice the performance (so similar performance per dollar) and draws only a little more power (roughly twice as power efficient). Another interesting development is that Nvidia (OnLive's hardware partner) recently announced <a href="http://www.nvidia.com/object/sli_multi_os.html">virtualization support</a> so that multi-GPU systems can fully support multiple concurrent program instances. So what it really comes down to is how many CPU cores and/or GPU cores you need to run games at well over 60 fps. Based on what I can see from recent benchmarks, two modern Intel cores and a single GPU are more than sufficient (most console games only have enough threads to push 2 CPU cores). Nvidia's server line of GPUs is more efficient and only draws 100-150 watts per GPU, so 600 watts is a high over-estimate of the power required per </span><span style="font-size:85%;">connected user. </span><br /><p><span style="font-size:85%;">But remember, you need a high FPS to defeat the internet latency - or you need to change the game to reduce internal buffering. There are many trade-offs here, and I imagine OnLive picked low-delay games for their launch titles. Apparently OnLive is targeting 60 fps, but that probably means most games usually get an even higher average fps to reduce delay.</span></p><p><span style="font-size:85%;">Overall, I think it's reasonable, using the right combination of components (typically two Intel CPU cores and one modern Nvidia GPU, possibly as half of a single motherboard system using virtualization), to get the per-user power cost down to something more like 200 watts to drive a game at 60-120 fps (<strong>remember, almost every game today is designed primarily to run at 30 fps on the Xbox 360 at 720p, and a single modern Nvidia GPU is almost four times as powerful</strong>). Some really demanding games (Crysis) get the whole system - 4 CPU cores and 2 GPUs - 400 watts. This is what I think OnLive is doing.</span></p><p><span style="font-size:85%;">So adding it all up, I think <strong>$10 per month per user is a safe upper limit for OnLive's expenses, and it's perhaps as low as $5 per month or less</strong>, assuming they typically need two modern Intel CPU cores and one Nvidia GPU per logged-on user, adequate bandwidth and servers for a peak occupancy of 20%, and power/cooling for an average occupancy of 10%. </span></p><span style="font-size:85%;"><strong>Clearly, all of the numbers scale with the occupancy rates</strong>. I think this is why OnLive is, at least initially, not going for MMOs - they are too addictive and have very high occupancy. More ideal would be single-player games and casual games that are played less often. Current data suggests the average gamer plays 10 hours a week, and the average MMO player plays 20 hours per week. The average non-MMO player is thus probably playing less than 10 hours per week. This works out to something more like 5% typical occupancy, but we are interested more in peak occupancy, so my 10%/20% numbers are a reasonable over-estimate of average/peak. The sketch below pulls these cost pieces together.</span>
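<br /><span style="font-size:85%;">Here is a rough sketch of the whole per-user cost model in Python, combining the bandwidth, hardware, and power/cooling pieces as estimated above. The parameter values are my reading of the figures in this post (a worst case and a more realistic case), not OnLive numbers:</span><br /><pre>
HOURS_PER_MONTH = 720
KWH_PRICE = 0.10   # $ per kilowatt-hour, as assumed above

def per_user_month(server_price, amortize_months, peak_occ, avg_occ,
                   watts_per_session, stream_mbps, transit_per_mbps):
    # Hardware and transit are provisioned for the *peak* fraction of
    # subscribers online at once; power (and cooling, taken as roughly
    # equal to power) tracks the *average* fraction actually playing.
    hardware  = server_price / amortize_months * peak_occ
    power     = (watts_per_session / 1000.0) * HOURS_PER_MONTH * KWH_PRICE * avg_occ
    cooling   = power
    bandwidth = stream_mbps * peak_occ * transit_per_mbps
    return hardware + power + cooling + bandwidth

# Worst case: $500 box per session, 600 W, everyone on a 5 Mbit/s HD stream.
print(per_user_month(500, 24, 0.20, 0.10, 600, 5.0, 10.0))  # about $23/user/month

# More realistic case: ~200 W per session, 3 Mbit/s streams, cheaper transit,
# and roughly 10%/5% peak/average occupancy for non-MMO players.
print(per_user_month(500, 24, 0.10, 0.05, 200, 3.0, 5.0))   # about $5/user/month
</pre><br /><span style="font-size:85%;">The spread between the two calls is mostly occupancy and power per session, which is why the occupancy rates dominate everything.</span><br />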
<span style="font-size:85%;">Again, you need enough hardware & bandwidth for peak occupancy, but the power & cooling cost is determined by average occupancy.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"><strong>$10 per month may seem like a high upper limit on monthly expense per user, but even at these expense rates OnLive could be profitable, because this is still less than the cost to the user of running comparable hardware at home.</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Here's the simple way of looking at it. That same $600 server rig would cost $1000-1500 for an end user, because they need extra components like a hard drive, monitor, etc. which OnLive avoids or gets cheaper, and OnLive buys in bulk. But most importantly, the OnLive hardware is amortized and shared over a number of users, while the user's high-end gaming rig sits idle most of the time. <strong>So the end user's cost to play at home on even a cheap $600 machine amortized over 2 years is still $25-30 per month, roughly three times the worst-case per-user expense of OnLive. </strong>And that doesn't even factor in extra power expense for gaming at home. OnLive's total expense is probably more comparable to that of the Xbox 360. A $500 machine (including necessary peripherals) amortized over 5 years</span> <span style="font-size:85%;">is a little under $10 per month. And then Xbox Live Gold service is another $5 a month on top of that. OnLive can thus easily cover its costs and still be less expensive than the 360 and PS3, and considerably less expensive than PC gaming.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:130%;"><strong>The game industry post Cloud</strong></span><br /><br /><span style="font-size:85%;">In reality, I think that OnLive's costs will be considerably less than $10 per user per month, and will be increasingly less over time. Just like the console makers periodically update their hardware to make the components cheaper, OnLive will be constantly expanding its server farms and always buying the current sweet-spot combination of CPUs and GPUs. <strong>But Nvidia and Intel refresh their lineups at least twice a year, so OnLive can really ride Moore's law continuously</strong>. Every year OnLive will become more economical, and/or provide higher FPS and lower delay, and/or support more powerful games.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">So it seems possible, even inevitable, that OnLive can be economically viable charging a relatively low subscription fee to cover their fixed costs - comparable to Xbox Live's subscription fee (about $5/month for Xbox Live Gold). Then they make their real profit by taking a console/distributor-like cut of each game sale or rental. For highly anticipated releases, they could even use a pay-to-play model initially, followed by traditional purchase or rental later on, just like the movie industry does. Remember the madness that surrounded the Warcraft III beta, and think how many people would pay to play StarCraft II multiplayer ahead of time. I know I would.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">If you scale OnLive's investment requirements to support the entire US gaming population, you get a <strong>ridiculous hardware investment cost of billions of dollars, but this is no different than a new console launch, which is exactly what OnLive must be viewed as</strong> (a rough scale-up sketch is below).</span>
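<br /><span style="font-size:85%;">As a quick back-of-the-envelope on that scale-up claim: assume roughly the Americas console install base quoted below (~40 million units), 20% peak occupancy, a couple of virtualized game sessions per physical box, and the $500 box price from earlier. These are all my assumed inputs, not OnLive's plans:</span><br /><pre>
def scale_up(subscribers, peak_occupancy=0.20, sessions_per_box=2, box_price=500.0):
    # Enough hardware for everyone who might be online at once, with
    # virtualization packing a couple of game sessions onto each physical box.
    concurrent = subscribers * peak_occupancy
    boxes = concurrent / sessions_per_box
    return boxes, boxes * box_price

boxes, dollars = scale_up(40e6)
print(boxes, dollars)   # 4 million boxes (8 million virtual servers), ~$2 billion
</pre><br />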
<span style="font-size:85%;">The Wii has sold 22 million units in the Americas, and the 360 is close behind at 17 million. I think these numbers represent majority penetration of the console market in the Americas. To scale to that user base, OnLive will need several million (virtual) servers, which may cost a billion dollars or more, but the investment will pay for itself as it goes - just as it did for Sony and Microsoft. Or they will simply be bought up by some big deep-pocketed entity which will provide the money, such as Google, Verizon, or Microsoft.</span><br /><br /><p><span style="font-size:85%;">The size and quantity of the datarooms OnLive will have to build to support even just the US gaming population is quite staggering. We are talking about perhaps millions of servers in perhaps a dozen different data center locations, drawing the combined power output of an entire large power plant. And that's just for the US. <strong>However, we already have a very successful example of a company that has built up a massive distributed network of roughly 500,000 servers in over 40 data centers</strong>.</span></p><p><span style="font-size:85%;">Yes, that company is Google. </span></p><p><span style="font-size:85%;">To succeed, OnLive will have to build an even bigger and more massive supercomputer system. But I imagine Google makes less money per month for each of its servers than OnLive will eventually make for each of its gaming servers. Just how much money can OnLive eventually make? If OnLive could completely conquer the gaming market, then it stands to completely replace both the current console manufacturers AND the retailers. Combined, these entities take perhaps 40-50% of the retail price of a game. <strong>Even assuming OnLive only takes a 30% cut, it could thus eventually take in almost 30% of the game industry - estimated at around $20 billion per year in the US alone, and $60 billion worldwide - eventually turning it into another Google</strong>.</span></p><p><span style="font-size:85%;">Another point to consider is that most high-end PC sales are mainly driven by gaming, and thus the total real gaming market (in terms of total money people spend on gaming) is even larger, perhaps as large as $100 billion worldwide, and OnLive stands to rake in a chunk of this and change the whole industry - further shrinking the end-consumer PC market and shifting that money into OnLive subscriptions, game charges, and so on, part of which in turn covers the centralized hardware cost. Nvidia and ATI will still get a cut, but perhaps less than they do now. In other words, in the brave new world of OnLive, gamers will only ever need a super-cheap microconsole or netbook to play games, so saving money on consoles and rigs will allow them to buy more games, and all this money gets sucked into OnLive.</span></p><p><span style="font-size:85%;">Now consider that the game market has consistently grown 20% per year for many years, and you can understand why investors have funneled hundreds of millions into OnLive in order to make it work. And eventually, OnLive can find new ways to 'monetize' gaming (to use Google's term), such as ads and so on. 
Eventually, it should make as much or more per user-hour as television does.</span></p><p><span style="font-size:85%;">Now this is the fantasy, of course, but I doubt OnLive will grow to become a Google any time soon, mainly because Nintendo, Sony, Microsoft, and the like aren't going to suddenly disappear - which brings me to my final point.</span></p><p><br /><span style="font-size:85%;"></span><br /><span style="font-size:130%;"><strong>But what about the games?</strong></span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">In the end, people use a console to play games, and thus the actual titles are all that really matter. In one sense, part of the pitch of OnLive - 'run high-end PC games on your netbook' - is a false premise. Most of OnLive's lineup is current-gen console games, and even though OnLive will probably run them at higher fps, this is mainly to compensate for latency. Video compression and all the other factors discussed above will result in an end-user experience no better, and often worse, than simply playing the console version (especially if you are far from the data center). OnLive's </span><span style="font-size:85%;">one high-end PC title - Crysis - is probably twice as expensive for them to run, and will be seen as somewhat inferior by gamers who have high-end rigs and have played the game locally. It will be more like the console version of Crysis. But unfortunately, Crytek is already working on that.<br /></span></p><p><span style="font-size:85%;">This is really the main obstacle that I think could hold OnLive back - 16 titles at launch is fine, but they are already available on other platforms. Nintendo dominated this console generation because of its cheap, innovative hardware and a lineup of unique titles that exploit it. I think Nintendo of America's president Reggie Fils-Aime was right on the money:</span></p><p><span style="font-family:arial;">Based on what I’ve seen so far, their opportunity may make a lot of sense for the PC game industry where piracy is an issue. But as far as the home console market goes, I’m not sure there is anything they have shown that solves a consumer need</span><span style="font-size:85%;"> </span></p><p><span style="font-size:85%;">What does OnLive really offer the consumer? Brag Clips? The ability to spectate any player? Try before you buy? Rentals? These are nice (especially the latter two), but can they amount to a system seller? It's a little cheaper, but is that really important, considering most gamers already have a system? It seems that PC games could be where OnLive has more potential, but how much can it currently add over Steam? If OnLive's offerings expanded to include almost all current games, then it truly could achieve high market penetration as the successor to Steam (with the ultimate advantage of free trials and rentals - which Steam can never do). But Valve does have the significant advantage of a variety of exclusive games built on the Source engine, which together (Left 4 Dead, Counter-Strike, Team Fortress 2, Day of Defeat, etc.) make up a good chunk of the PC multiplayer segment.<br /></span></p><p><span style="font-size:85%;"><strong>The real opportunity with OnLive is to have exclusive titles</strong> which take advantage of OnLive's unique supercomputer power to create a beyond-next-gen experience. This is the other direction in which the game industry expands: slowly moving into the blockbuster story experiences of movies. 
And this expansion is heavily tech-driven.</span></p><p><span style="font-size:85%;">If such a mega-hit were made - a beyond-next-gen Halo or GTA - it could rapidly drive OnLive's expansion, because OnLive requires very little user investment to play. At the very least, everyone would be able to try or play the game on some sort of PC they already have, and the microconsole to play on your TV will probably only cost as much as a game itself. So this market is a very different beast than the traditional consoles, where the market for your game is determined by the number of users who own the console. Once OnLive expands its datacenter capacity sufficiently, <strong>the market for an exclusive OnLive game is essentially any gamer</strong>. So does OnLive have an exclusive in the works? That would be the true game changer.</span></p><p><span style="font-size:85%;">This is also where OnLive's less flashy competitor, OToy & LivePlace, may be going in a better direction. Instead of building the cloud and a business based first on existing games, you build the cloud and a new cloud engine for a totally new, unique product, which is specifically designed to harness the cloud's super resources and has no similar competitor.</span></p><p>Without either exclusives or a vast, retail-competitive game lineup, OnLive won't take over the industry.<br /></p><p><br /></p><span style="font-size:85%;"></span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com1tag:blogger.com,1999:blog-5208034552162375299.post-16943595790846027652008-12-13T18:00:00.000-08:002009-07-11T12:54:06.213-07:00Forward Reprojection for future hardwareThe <span style="font-size:85%;">schemes described in my previous post are well suited for PS3/360-level hardware and a mix of CPU/GPU work - namely, octree-screen-tile intersection and point splatting done on CPU threads, and everything else done on the GPU.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">But for future hardware, point splatting is less ideal than direct cone tracing. Why? They end up being almost the same, actually. The critical scene traversal loop, which I think is best done on the CPU at coarse resolution for the current gen, would be better done at high resolution to support higher-quality illumination. If you want accurate reflections and GI effects, you need to be able to 'rasterize' very small frustums the size of a few pixel blocks - so it amounts to almost the same thing. Whether splatting or tracing (scatter or gather), you are doing fine-grained intersections of frustums (pixels/tiles) with a 3D tree structure (octree or what have you).</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">For CUDA-based hardware, I think cone tracing into a new type of adaptive distance field structure is the way to go, and some initial prototyping done at home has been very promising. Combine this with the forward projection system, and you can get away with tracing 30k or so pixels per frame on average for primary visibility! 
For dynamic lighting update effects, you want additional rays per pixel, but with the magic of fast cone tracing (which is far more efficient than rays for soft area queries) combined with the magic of frame-coherent reprojection, I think we can hit uber quality with far fewer than 30 million cone traces per second - a rate my prototype can already exceed on an 8800 GT.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">However, one area which can be improved is the forward projection. Since tracing is relatively expensive, and incoherent tracing is vastly more so, it's worth it to get really fine-grained, accurate projection and improve the coherence.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">One simple way to improve the tracing coherence is to reorder the frame buffer after the projection pass. This can be done on the GPU, resulting in a reordered frame buffer with nice coherent blocks of pixels which need to be traced. This doesn't necessarily help with memory coherence, as these rays can still be scattered in space, especially after reflections, but it helps immensely with branch performance, which is critical.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The coherence can also be improved by splatting at finer granularity. Ideally we would splat individual samples as point splats, but using point primitives is way too slow even on new GPUs, as it goes through the terrible polygon rasterizer hardware bottleneck, and doing a few million per frame, although possible on modern GPUs, would eat up a big chunk of the frame time.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">We can do better in CUDA, which got me thinking about how to do this properly and in parallel. In short, I think a hierarchical tile sorting approach is the way to go. First, you project all the points and save out the 2D positions - easy and fast. In the next step, the points are again broken up and assigned to threads, and each thread builds up its own per-tile lists of the points hitting each tile - essentially sorting its subset of the points onto all the tiles. This results in NumThreads separate point lists per tile. Then in the next pass, each thread gets a tile and merges all the lists for that tile from the first pass, resulting in a single list of points for each tile. There are a few details I'm skipping over, such as the parallel memory allocation needed to build up the lists - but that just requires a separate counting/histogram pass. The tiles can't be too small, as that would eat up too much memory for all the lists. Nor can they be too big, as you need adequate threads.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">After the particles are thus sorted into per-tile lists, you can then subdivide and repeat until you get down to fine-grained tiles (perhaps it only takes a couple of iterations). The fine-grained tiles with their particle lists can then be rasterized, with each thread getting one such tile, so the whole operation is parallel and scalable, without any memory conflicts. A rough sketch of one such binning pass is below.</span>
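<br /><span style="font-size:85%;">Here is a small serial Python sketch of one binning pass of that idea: each 'thread' sorts its slice of the projected points into its own per-tile lists, and a merge step then concatenates the per-thread lists for each tile. The tile size and thread count are arbitrary stand-ins, and the real version would be a CUDA kernel with the counting/histogram pass handling allocation:</span><br /><pre>
import numpy as np

TILE = 16      # tile size in pixels (assumed)
THREADS = 8    # stand-in for the number of CUDA threads in this serial sketch

def bin_points(xy, width, height):
    """xy: (N, 2) array of projected 2D point positions (pixels)."""
    tiles_x, tiles_y = width // TILE, height // TILE
    num_tiles = tiles_x * tiles_y
    # Pass 1: each 'thread' sorts its slice of the points into its own per-tile lists.
    per_thread = [[[] for _ in range(num_tiles)] for _ in range(THREADS)]
    for t, chunk in enumerate(np.array_split(np.arange(len(xy)), THREADS)):
        for i in chunk:
            tx, ty = int(xy[i, 0]) // TILE, int(xy[i, 1]) // TILE
            if tx in range(tiles_x) and ty in range(tiles_y):  # cull off-screen points
                per_thread[t][ty * tiles_x + tx].append(i)
    # Pass 2: one 'thread' per tile merges the THREADS partial lists into one.
    merged = [sum((per_thread[t][tile] for t in range(THREADS)), [])
              for tile in range(num_tiles)]
    return merged   # merged[tile] = indices of the points landing in that tile

# Usage: bin 100k random projected points over a 640x360 framebuffer,
# then subdivide busy tiles and repeat with a smaller TILE.
points = np.random.rand(100000, 2) * [640, 360]
tile_lists = bin_points(points, 640, 360)
</pre><br />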
<span style="font-size:85%;">You could actually store the final microtiles in local memory, so that z-buffering can be done without a lot of extra memory reads and writes.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">This is probably similar or related to how they intend to do parallel polygon rasterization on Larrabee, but I haven't read all of that paper yet. Polygon rasterization doesn't really interest me much anymore, for that matter.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Hmm, come to think of it, this multi-stage hierarchical sorting can be generalized to any data-versus-tree-structure intersection problem, of which finding point-quad intersections is just one example.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Why is all this useful again? Because hopefully it could be much faster than using the triangle setup engine to render point primitives, and because reprojecting at pixel granularity could better handle the difficult cases, with fewer errors than reprojecting larger tiles - especially for difficult scenes that have lots of z-edges (like foliage).</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Point splatting with CUDA in this way is also a potential alternative to tracing, although my current feeling is that it's not as well suited to fine-granularity searches - the small frusta generated for reflections and advanced illumination. However, it is much better suited to handling animated objects, and may have an advantage there versus dynamic octree construction.</span><br /><br /><span style="font-size:85%;"></span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com2tag:blogger.com,1999:blog-5208034552162375299.post-13934013582891183822008-12-13T17:16:00.000-08:002009-07-11T12:54:06.213-07:00Forward Reprojection - current console hardware<span style="font-size:85%;">Lately I have been thinking a lot about motion-compensation-inspired forward reprojection schemes, both for current console-generation hardware and for the next-generation cone-traced engine plans.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">For the current generation, I think a reasonable approach is to store and track image macrotiles, just like MPEG does, at say 8x8-pixel granularity. The rasterizer (preferably an octree hybrid point-splatting/polygon renderer) would be designed to be very efficient at culling scene geometry down to this fine level of granularity. Tiles, once generated, are reused across frames and reprojected on the GPU, which can render them quite easily as simple quads with a relatively simple pixel shader filter. To handle depth edges, I would render each tile as two quads, one covering the near pixels and the other the far pixels (otherwise these edges would produce large stretched quads, rather than two small quads which move apart). The tiles would be managed with a caching policy, with weights assigned by the number of correct pixels each tile provides per frame, and old invalidated tiles would be evicted to make way for new tile generation.</span><br />
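<span style="font-size:85%;">As a concrete illustration of the two-quad depth split, here is a minimal NumPy sketch that partitions one macrotile's depth samples into a near layer and a far layer around the largest depth gap. Splitting at the biggest gap is just one simple heuristic I'm assuming for the sketch, not a claim about what a shipping implementation should use:</span><br /><pre>
import numpy as np

def split_tile_depths(depth_tile, min_gap_fraction=0.05):
    """depth_tile: (8, 8) array of per-pixel depths for one macrotile.
    Returns (near_mask, far_mask); far_mask is None when the tile has no
    significant depth edge and a single quad is enough."""
    d = np.sort(depth_tile.ravel())
    gaps = np.diff(d)
    i = int(np.argmax(gaps))
    depth_range = d[-1] - d[0] + 1e-6
    if gaps[i] < min_gap_fraction * depth_range:    # no strong z-edge in this tile
        return np.ones_like(depth_tile, dtype=bool), None
    split = 0.5 * (d[i] + d[i + 1])                 # split at the largest depth gap
    near = depth_tile < split
    return near, ~near          # reproject each mask as its own quad

# Usage: a tile half covered by a nearby object in front of a distant background.
tile = np.full((8, 8), 100.0)
tile[:, :4] = 2.0
near_mask, far_mask = split_tile_depths(tile)
</pre><br />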
<span style="font-size:85%;">The resulting image will have a small number of error regions that then need to be corrected - some pixels won't be hit by any previous tile, for new regions of the scene revealed by camera motion and occlusion changes. If you also track motion vectors for the tiles, some regions will need to be invalidated because of animation. In some cases moving tiles will happen to project into an occlusion gap but are actually behind something else in the scene (false occlusion). For a static scene, you can treat the result of the forward projection as a conservative z-buffer. Animation errors can then be handled by simply not projecting stored tiles that have too much animation error. A coarse z-pyramid rendering can then reliably identify new screen tiles which need to be regenerated.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The stencil and z-buffer are used to track invalid regions of the screen, resulting in a low-res version of these maps which is read back to the CPU. The CPU then does a hierarchical intersection of the image z/stencil pyramid with the scene octrees to rasterize out the new tiles which need to be generated. A bias is used to avoid re-rendering onto valid reprojected screen tiles - essentially it searches for octree cells that intersect error pixels or have moved in front of the previous projection. This is particularly well suited to the PS3's SPUs, but could also work reasonably well on the 360 with slightly different tile-size tradeoffs. The key is that it is relatively coarse, operating at lower levels of an image pyramid and an octree.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">When combined with deferred point splat filtering, this system can tolerate and cover over a few pixels of small occlusion error, which can improve performance at some small potential error cost. You want to avoid rasterizing a whole tile just because of a couple of error pixels. The screen interpolation filtering for the point splatting would fill in missing z-information by propagating splat surfaces using a hierarchical push-pull algorithm. In essence, small gaps of a couple of pixels caused by a moving edge would be filled in to match the background surface, and because even texturing is deferred in such a scheme, there would be no noticeable smearing - small gaps can thus easily be filled in. What you're left with then is a more coherent error mask and fewer tiles that need to be refreshed.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">Lighting changes would be handled separately with deferred shading. Static light interactions could be cached in the g-buffer and greatly benefit from the forward projection, so your deferred shading system could separate static and dynamic lights. Static lights could use the screen error mask so they only need to be recomputed for the small number of new pixels (a small sketch of this split is below). There is one slight complication - moving shadow casters - but this can be handled by a rough low-res shadow map lookup which identifies screen regions that are shadowed by dirty regions of the shadow map and thus need to be resampled.</span><br />
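<span style="font-size:85%;">A minimal NumPy sketch of that static/dynamic split, treating the error mask as the set of pixels the reprojection failed to cover. The buffer layout and the simple Lambertian directional lights are placeholder assumptions, not an actual engine interface:</span><br /><pre>
import numpy as np

def relight_frame(gbuf, static_cache, error_mask, static_lights, dynamic_lights):
    """gbuf: dict with (H, W, 3) 'normal' and 'albedo' buffers.
    static_cache: last frame's accumulated static lighting, already reprojected.
    error_mask:  (H, W) bool array of pixels the reprojection failed to cover."""
    n, a = gbuf['normal'], gbuf['albedo']

    # Static lights: recompute only the uncovered pixels, reuse the cache elsewhere.
    static = static_cache.copy()
    hole_n, hole_a = n[error_mask], a[error_mask]
    fresh = np.zeros_like(hole_a)
    for L in static_lights:                       # L: unit light direction, shape (3,)
        fresh += hole_a * np.clip(hole_n @ L, 0.0, 1.0)[:, None]
    static[error_mask] = fresh

    # Dynamic lights: recomputed for every pixel, every frame.
    out = static.copy()
    for L in dynamic_lights:
        out += a * np.clip(n @ L, 0.0, 1.0)[..., None]
    return out, static    # 'static' is carried forward as next frame's cache
</pre><br />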
<span style="font-size:85%;">This type of fine-grained micro-culling architecture is also exactly what you need to do really high-quality outdoor shadows through quadtree shadow maps, and the reprojection scheme can be employed to speed up shadow map generation as well. In this case there is even more temporal coherence for typical outdoor scenes, as the sun can be treated as a static light (even if time of day is simulated, this would only change the projection every minute or so - not an issue). The only shadowmap regions that need to be regenerated are new quadtree cells as they are expanded in response to scene updates, and cells which overlap moving objects. There are some crappy cases like a windy day in a jungle, but for most typical scenes this could be a vastly faster shadowing system that could scale to the ultra-detailed geometry of a frame-coherent point-splatting renderer.</span><br /><span style="font-size:85%;"></span><br /><br /><span style="font-size:85%;"></span>Jake Cannellhttp://www.blogger.com/profile/13556548709339938291noreply@blogger.com2