Thursday, July 30, 2009

The Next Generation of Gaming

The current (seventh) home game console generation will probably be the last. I view this as a very good thing, as it really was a tough one, economically, for most game developers. You could blame that in part on the inordinate success of Nintendo this round with its essentially sixth-generation hardware, funky controller, and fun mass market games. But that wouldn't be fair. If anything, they contributed the most to the market's expansion, and although they certainly took away a little revenue from the traditional consoles and developers, the 360 and PS3 are doing fine, in both hardware and software sales. No, the real problem is our swollen development budgets, as we spend more and more money just to keep up with the competition, all fighting for a revenue pie which hasn't grown much, if at all.

I hope we can correct that over the upcoming years with the next generation. It's not that we'll spend much less on the AAA titles, but we'll spend it more efficiently, produce games more quickly, and make more total revenue as we further expand the entire industry. By winning back much of the efficiency lost in the transition to the 7th generation, and more to boot, we'll be able to produce far more games and reach much higher quality bars. We can accomplish all of this by replacing the home consoles with dumb terminals and moving our software out onto data centers.

How will moving computation out into the cloud change everything? Really it comes down to simple economics. In a previous post, I analyzed some of these economics from the perspective of an on-demand service like OnLive. But let's look at it again in a simpler fashion, and imagine a service that rented out servers on demand, by the hour or minute. This is the more general form of cloud computing, sometimes called grid computing, where the idea is to simply turn computation into a commodity, like power or water. A data center would then rent out its servers to the highest bidder. Economic competition would push the price of computation to settle on the cost to the data center plus a reasonable profit margin. (Unlike the power, water, and internet commodities, there would be less inherent monopoly risk, as no fixed lines are required beyond the internet connection itself.)

So in this model, the developer could make their game available to any gamer and any device around the world by renting computation from data centers near customers just as it is needed. The retailer of course is cut out. The publisher is still important as the financier and marketer, although the larger developers could take this on themselves, as some already have. Most importantly, the end consumer can play the game on whatever device they have, as the device only needs to receive and decompress a video stream. The developer/publisher then pays the data center for the rented computation, and you pay only as needed, as each customer comes in and jumps into a game. So how does this compare to our current economic model?

A server in a data center can be much more efficient than a home console. It only needs the core computational system: CPU/GPU (which are soon merging anyway) and RAM. Storage can be shared amongst many servers, so it is negligible (some per game instance is required, but it's reasonably minimal). So a high end server core could be had for around $1,000 or so at today's prices. Even if active only 10 hours per day on average, that generates about 3,000 hours of active computation per year. Amortize over three years of lifespan (still much less than a console generation), and you get roughly ten cents per hour of computation. Even if it burns 500 watts of power (insane) and another 500 watts to cool, those together add only about ten more cents per hour. So it's under 25 cents per hour in terms of intrinsic cost (and this is for a state of the art rig, dual GPU, etc - much less for the lower end). This cost will hold steady into the future as games use more and more computation. Obviously the cost of running old games will decrease exponentially, but new games will always want to push the high end.
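As a sanity check, the amortization above works out like this (the dollar figures are the post's own 2009 assumptions; the $0.10/kWh electricity price is an assumption added here to make the power math explicit):

```python
# Back-of-the-envelope server amortization, using the post's 2009 assumptions.
server_cost = 1000.0              # high end CPU/GPU/RAM core, dollars
active_hours_per_year = 10 * 300  # ~10 active hours/day -> ~3,000 hours/year
lifespan_years = 3

hardware_per_hour = server_cost / (active_hours_per_year * lifespan_years)

power_watts = 500 + 500           # compute plus cooling, worst case
price_per_kwh = 0.10              # assumed electricity price, dollars
power_per_hour = (power_watts / 1000.0) * price_per_kwh

total_per_hour = hardware_per_hour + power_per_hour
print(f"${total_per_hour:.2f}/hour")  # roughly $0.21, under the 25 cent bound
```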

The more variable cost is the cost of bandwidth, plus the extra computation to compress the video stream in real time. These used to be high, but are falling exponentially as video streaming comes of age. Yes, we will want to push the resolution up from 720p to 1080p, but this will happen slowly, and further resolution increases are getting pointless for typical TV setups (yes, for a PC monitor the diminishing return is a little farther off, but still). But what is this cost right now? Bulk bandwidth costs about $10 per megabit/s of dedicated bandwidth per month, or just three cents per hour in our model assuming 300 active server hours in a month. To stream 720p video with H.264 compression, you need about 2 megabits per second of average bandwidth (which is what matters for the data center). The peak bandwidth requirement is higher, but that completely smooths out when you have many users. So that's just $0.06/hour for a 720p stream, or $0.12/hour for a 1080p stream. The crazy interesting thing is that these bandwidth prices ($10/Mbps/month) are as of the beginning of this year, and are falling by about 30-40% per year. So bandwidth really became economically feasible only this year, and it's only going to get cheaper. By 2012, these prices will probably have fallen by half again, and streaming even 1080p will be dirt cheap. This is critical for making any predictions or plans about where this is all heading.
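The bandwidth figures can be sketched the same way (the 4 Mbps figure for 1080p is my assumption, chosen to roughly match the $0.12/hour estimate above):

```python
# Bandwidth cost per active server hour, from the post's 2009 bulk pricing.
price_per_mbps_month = 10.0     # $10 per Mbps of dedicated bandwidth per month
active_hours_per_month = 300

cost_per_mbps_hour = price_per_mbps_month / active_hours_per_month  # ~$0.033

streams = {"720p": 2, "1080p": 4}   # average Mbps with H.264 (1080p assumed)
for name, mbps in streams.items():
    print(f"{name}: ${mbps * cost_per_mbps_hour:.3f}/hour")
```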

So adding up all the costs today, we get somewhere around $0.20-0.30 per hour for a high end rig streaming 720p, and 1080p would only be a little more. This means that a profitable data center could charge just $.50 per hour to rent out a high end computing slot, and $.25 per hour or a little less for more economical hardware (but still many times faster than current consoles). So twenty hours of a high end graphics blockbuster shooter would cost $10 in server infrastructure costs. That's pretty cheap. I think it would be a great thing for the industry if these costs were simply passed on to the consumer, and they were given some choice. Without the retailer taking almost half of the revenue, the developer and publisher stand to make a killing. And from the consumer's perspective, the game could cost about the same, but you don't have any significant hardware cost - or even better, you pay for the hardware cost as you see fit, hourly or monthly or whatever. If you are playing 40 hours a week of an MMO or serious multiplayer game, that $.50 per hour might be a bit much, but you could then choose to run it on lower end hardware to save some money. But actually, as I'll get to some other time, MMO engines designed for the cloud could be so much more efficient than single player engines that they could use far less hardware power per player. But anyway, it'd be the consumer's choice, ideally.

This business model makes more sense from all kinds of angles. It allows big budget, high profile story driven games to release more like films, where you play them on crazy super-high end hardware, even hardware that could never exist at home (like 8 GPUs or something stupid), maybe paying $10 for the first two hours of the game to experience something insanely unique. There's so much potential, and even at the low price of $.25-$.50 per hour for a current mid-2009 high end rig, you'd have an order of magnitude more computation than we are currently using on the consoles. This really is going to be a game changer, but to take advantage of it we need to change as developers.

The main opportunity I see with cloud computing here is to reduce our costs, or rather, improve our efficiency. We need our programmers and designers to develop more systems with less code and effort in less time, and our artists to build super detailed worlds rapidly. I think that redesigning our core tech and tools premises is the route to achieve this.

The basic server setup we're looking at for this first cloud generation, a few years out, is going to be some form of multi-teraflop, massively multi-threaded, general GPU-ish device with gigs of RAM and, perhaps more importantly, fast access to many terabytes of shared RAID storage. If Larrabee or the rumors about NVidia's GT300 are any indication, this GPU will really just be a massively parallel CPU with wide SIMD lanes that are easy (or even automatic) to use. It will probably also have a smaller number of traditional cores, possibly with access to even more memory, like a current PC. Most importantly, each of these servers will be on a very high speed network, densely packed in with dozens or hundreds of similar nearby units. Each of these capabilities by itself is a major upgrade from what we are used to, but taken all together it becomes a massive break from the past. This is nothing like our current hardware.

Most developers have struggled to get game engines pipelined across just the handful of hardware threads on current consoles. Very few have developed toolchains that embrace or take much advantage of many cores. From a programming standpoint, the key to this next generation is embracing the sea of threads model across your entire codebase, from your gamecode to your rendering engine to your tools themselves, and using all of this power to speedup your development cycle.

From a general gameplay codebase standpoint, I could see (or would like to see) traditional C++ giving way to something more powerful. At the very least, I'd like to see general databases, full reflection, and at least some automatic memory management, like ref counting. Reflection alone could pretty radically alter the way you design a codebase, but that's another story for another day. We don't need those little 10% speedups anymore; we'll need the single mega 10,000% speedup you get from using hundreds or thousands of threads. Obviously, data parallelization is the only logical option. Modifying C++, or outright moving to a language with these features that also has dramatically faster compilation and link times, could be an option.

In terms of the core rendering and physics tech, more general purpose algorithms will replace the many specialized systems that we currently have. For example, in physics, a logical next direction is to unify rigid body physics with particle fluid simulation in a system that simulates both rigid and soft bodies as large collections of connected spheres, running a massively parallel grid simulation. Even without that, just partitioning space amongst many threads is a pretty straightforward way to scale physics.

For rendering, I see the many specialized subsystems of modern rasterizers, such as terrain, foliage, shadow maps, water, decals, LOD chains, cubemaps, etc, giving way to a more general approach like octree volumes that simultaneously handles many phenomena.

But more importantly, we'll want to move to data structures and algorithms that support rapid art pipelines. This is one of the biggest current challenges in production, and where we can gain the most advantage in this upcoming generation. Every artist or designer's click and virtual brush stroke costs money, and we need to allow them to do much more with less effort. This is where novel structures like octree volumes will really shine, especially combined with terabytes of server side storage, allowing more or less unlimited control of surfaces, object densities, and so on without any of the typical performance considerations. Artists will have far fewer (or no) technical constraints to worry about and can just focus on shaping the world where and how they want.

Tuesday, July 28, 2009

Winning my own little battle against Cuda

I've won a nice little battle in my personal struggle with Cuda recently by hacking in dynamic texture memory updates.

Unfortunately for me and my little Cuda graphics prototypes, Cuda was designed for general (non-graphics) computation. Texture memory access was put in, but as something of an afterthought. Textures in Cuda are read only - kernels cannot write to them (OK, yes, they added limited support for 2D texturing from pitch linear memory, but that's useless), which is pretty much indefensible. Why? Because Cuda allows general memory writes! There is nothing sacred about texture memory, and the hardware certainly supports writing to it from a shader/kernel, and has since render-to-texture appeared in what, DirectX 7? But not in Cuda.

Cuda's conception of writing to texture memory involves writing to regular memory and then calling a Cuda function to perform a sanctioned copy from the temporary buffer into the texture. That's OK for a lot of little demos, but not for large scale dynamic volumes. For an octree/voxel tracing app like mine, you basically fill up your GPU's memory with a huge volume texture and accessory octree data, broken up into chunks which can be fully managed by the GPU. You then need to be able to modify these chunks as the view changes or animation changes sections of the volume. Cuda would have you do this by double buffering the entire thing with a big scratch buffer and doing a copy from the linear scratchpad to the cache-tiled 3D texture every frame. That copy function, by the way, achieves a whopping 3 Gb/s on my 8800. Useless.

So, after considering various radical alternatives that would avoid doing what I really want to do (which is use native 3D trilinear filtering in a volume texture I can write to anywhere), I realized I should just wait and port to DX11 compute shaders, which will hopefully allow me to do the right thing (and should also allow access to DXTC volumes, which will probably be important).

In the meantime, I decided to hack my way around the Cuda API and write to my volume texture anyway. This isn't as bad as it sounds, because the GPU doesn't have any fancy write-protection page faults, so a custom kernel can write anywhere in memory. But you have to know what you're doing. The task was thus to figure out exactly where it would allocate my volume in GPU memory, and exactly how the GPU's tiled addressing scheme works.

I did this with a brute force search. The drivers do extensive bounds checking, even in release, and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle illegitimate arbitrary GPU memory into legitimate allocations the driver would accept. I then used this to snoop the GPU memory, enabling a brute force search: use Cuda's routines to copy from CPU linear memory into the tiled volume texture, then snoop the GPU memory to find out exactly where my magic byte ended up, revealing one mapping of an XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).
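The recovery step boils down to probing one bit at a time: write a recognizable value at a single linear offset, let the sanctioned copy run, and snoop for where it lands. A toy Python sketch of that idea, with the GPU's real swizzle stood in by a made-up 24-bit permutation (and assuming the swizzle is a pure bit permutation, which the function below suggests it mostly is):

```python
import random

# Stand-in for the unknown hardware swizzle: a fixed permutation of 24
# address bits. In the real experiment, this mapping is exactly what the
# memory snooping has to uncover.
random.seed(1)
PERM = list(range(24))
random.shuffle(PERM)

def gpu_swizzle(linear):
    """Opaque 'hardware' mapping from linear address to tiled address."""
    return sum(((linear >> i) & 1) << PERM[i] for i in range(24))

def recover_mapping(swizzle):
    # Probe one set bit at a time: writing a magic value at linear offset
    # (1 << i) and finding where it lands reveals where address bit i goes.
    return [swizzle(1 << i).bit_length() - 1 for i in range(24)]

print(recover_mapping(gpu_swizzle) == PERM)  # True
```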

For the strangely curious, here is my currently crappy function for the inverse mapping (from GPU texture address to 3D position):

inline int3 GuessInvTile(uint outaddr)
{
    int3 outpos = int3_(0, 0, 0);
    outpos.x |= ((outaddr>>0) & (16-1)) << 0;
    outpos.y |= ((outaddr>>4) & 1) << 0;
    outpos.x |= ((outaddr>>5) & 1) << 4;
    outpos.y |= ((outaddr>>6) & 1) << 1;
    outpos.x |= ((outaddr>>7) & 1) << 5;
    outpos.y |= ((outaddr>>8) & 1) << 2;
    outpos.y |= ((outaddr>>9) & 1) << 3;
    outpos.y |= ((outaddr>>10) & 1) << 4;
    outpos.z |= ((outaddr>>11) & 1) << 0;
    outpos.z |= ((outaddr>>12) & 1) << 1;
    outpos.z |= ((outaddr>>13) & 1) << 2;
    outpos.z |= ((outaddr>>14) & 1) << 3;
    outpos.z |= ((outaddr>>15) & 1) << 4;

    outpos.x |= ((outaddr>>16) & 1) << 6;
    outpos.x |= ((outaddr>>17) & 1) << 7;

    outpos.y |= ((outaddr>>18) & 1) << 5;
    outpos.y |= ((outaddr>>19) & 1) << 6;
    outpos.y |= ((outaddr>>20) & 1) << 7;

    outpos.z |= ((outaddr>>21) & 1) << 5;
    outpos.z |= ((outaddr>>22) & 1) << 6;
    outpos.z |= ((outaddr>>23) & 1) << 7;

    return outpos;
}

I'm sure the parts after the 15th output address bit are specific to the volume size and thus are wrong as stated (they should be done with division). So really it does a custom swizzle within a 64x32x32 chunk, and then fills the volume with these chunks in a plain X, Y, Z linear fill. It's curious that it tiles 16 over in X first, and then fills in a 64x32 2D tile before even starting on Z. This means writing in spans of 16 aligned to the X direction is most efficient for scatter, which is actually kind of annoying; a z-curve tiling would be more convenient. The X-Y alignment is also a little strange: it means that general 3D fetches are equivalent to two 2D fetches in terms of bandwidth and cache.
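Since the mapping is just a bit permutation, it can also be written as a table of address-bit to coordinate-bit pairs, which makes the forward swizzle trivial to derive. A Python transcription (the table is copied from the function above, so the same volume-size caveat applies past bit 15):

```python
# Address-bit -> (axis, coordinate bit) table, transcribed from GuessInvTile.
ADDR_BITS = [
    ('x', 0), ('x', 1), ('x', 2), ('x', 3),             # bits 0-3
    ('y', 0), ('x', 4), ('y', 1), ('x', 5),             # bits 4-7
    ('y', 2), ('y', 3), ('y', 4),                       # bits 8-10
    ('z', 0), ('z', 1), ('z', 2), ('z', 3), ('z', 4),   # bits 11-15
    ('x', 6), ('x', 7),                                 # bits 16-17
    ('y', 5), ('y', 6), ('y', 7),                       # bits 18-20
    ('z', 5), ('z', 6), ('z', 7),                       # bits 21-23
]

def inv_tile(addr):
    """Tiled address -> (x, y, z), mirroring GuessInvTile."""
    pos = {'x': 0, 'y': 0, 'z': 0}
    for abit, (axis, pbit) in enumerate(ADDR_BITS):
        pos[axis] |= ((addr >> abit) & 1) << pbit
    return pos['x'], pos['y'], pos['z']

def tile(x, y, z):
    """Forward swizzle: (x, y, z) -> tiled address."""
    pos = {'x': x, 'y': y, 'z': z}
    addr = 0
    for abit, (axis, pbit) in enumerate(ADDR_BITS):
        addr |= ((pos[axis] >> pbit) & 1) << abit
    return addr

# Round-trip check over a sample of the 24-bit address space.
for a in range(0, 1 << 24, 9973):
    assert tile(*inv_tile(a)) == a
```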

A little idea about compressing Virtual Textures

I've spent a good deal of time working on virtual textures, but took the approach of procedural generation, using the quadtree management system to get a large (10-30x) speedup through frame coherence vs having to generate the entire surface every frame, which would be very expensive.

However, I've also always been interested in compressing and storing out virtual texture data on disk, not as a complete replacement to procedural generation, but as a complement (if a particular quadtree node gets too expensive in terms of the procedural ops required to generate it, you could then store its explicit data). But compression is an interesting challenge.

Lately it seems that a lot of what I do at work is geared towards finding ways to avoid writing new code, and in that spirit, this morning on the way to work I started thinking about applying video compression to virtual textures.

Take something like x264 and 'trick' it into compressing a large 256k x 256k virtual texture. The raw data is roughly comparable to a movie, and you could tile pages out from 2D to 1D in a way that preserves locality, organizing them into virtual 'frames'. Most of the code wouldn't even know the difference. The motion compensation search code in x264 is more general than 'motion compensation' would imply - it simply searches for matching macroblocks which can be used for block prediction. A huge virtual surface texture exhibits enormous spatial correlation, and properly tiled into, say, a 512x512x100000 3D (or video) layout, that spatial correlation becomes temporal correlation, and would probably be easier to compress than most videos. So you could get an additional 30x or so benefit on top of raw image compression, fitting that massive virtual texture into a gigabyte or less on disk.
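The locality-preserving 2D-to-1D page ordering could be as simple as a Morton curve. A sketch (the 512x512 page grid matches a 256k texture cut into 512-pixel pages; the ordering itself is my suggestion, not anything x264 requires):

```python
# Map a virtual-texture page coordinate to a video 'frame' index with a
# Morton (Z-order) curve, so spatially adjacent pages land in nearby frames
# and the codec's block-matching search can exploit the spatial correlation.

def part1by1(n):
    # Spread the low 16 bits of n apart, inserting a zero between each bit.
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def page_to_frame(px, py):
    return part1by1(px) | (part1by1(py) << 1)

# A 256k x 256k texture in 512x512-pixel pages is a 512x512 grid of pages,
# i.e. 262,144 frames of 'video'.
print(page_to_frame(511, 511))  # 262143, the last frame
```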

Even better, the decompression and compression is already really fast and solid, and maybe you could even modify some bits of a video system to get fast live edit, where it quickly recompresses a small cut of video (corresponding to a local 2D texture region), without having to rebuild the whole thing. And even if you did have to rebuild the whole thing, you could do that in less than 2 hours using x264 right now on a 4 core machine, and in much less time using a distributed farm or a GPU.

I'm curious how this compares to the compression used in id Tech 5. It's also interesting to think how this scheme exposes the similarity in data requirements between a game with full unique surface texturing and a film. Assuming a perfect 1:1 pixel/texel ratio and continuous but smooth scenery change for the game, they become very similar indeed.

I look forward to seeing how the id Tech 5 stuff turns out. I imagine their terrains will look great. On the other hand, a lot of modern console games now have great looking terrain environments, but are probably using somewhat simpler techniques. (I say somewhat only because all the LOD, stitching, blending, and so on issues encountered when using the 'standard' hodgepodge of techniques can get quite complex.)

A random interesting quote

VentureBeat: How would you start a game company today?

CT (Chris Taylor): I would save up a bunch of money to live on for two years. Then I would develop an iPhone game. I would start building relationships with Sony and Microsoft. I would roll that game into an Xbox Live title or a PlayStation network game. Then I could make it into a downloadable game on the PC. The poster child for that is The World of Goo. Read about that game. Drink lattes. Put your feet up. And build your game. To do that, you need a couple of years in the bank and you have to live in your parents’ basement. You can’t start out with $30 million for a console game. @*%&$^, Roland Emmerich didn’t wake up one day and create Independence Day. We can’t be delusional about trivializing what it takes to make these big games.

I spent numerous years of my life trying to make a video game out of a garage, but didn't follow his advice. Instead I made the typical mistake of starting big and trying to trim down, whereas the opposite approach is more viable. On another note, it seems to me that the iPhone is single-handedly creating a bubble-like wave of entrepreneurship. Finally, connecting these semi-random thoughts, my former startup colleague, Jay Freeman, has become something of an iPhone celebrity (see the Wall Street Journal article here), being the force behind the Cydia frontend for jailbroken iPhones.

Friday, July 24, 2009

Countdown to Singularity

What is the Singularity? The word conjures up vivid images: black holes devouring matter and tearing through the space-time fabric, impossible and undefinable mathematical entities, and white-robed scientists gnashing their teeth. In recent times it has taken on a new meaning in some circles as the end of the world, a sort of Rapture of the geeks or Eschaton for the age of technology. As we will see, the name is justly fitting for the concept, as it is all of these things and much more. Like the elephant in the ancient parable, it is perceived in myriad forms depending on one's limited perspective.

From the perspective of computer scientists and AI researchers like Ray Kurzweil, the Singularity is all about extrapolating Moore's Law decades into the future. The complexity and power of our computing systems doubles roughly every sixteen months in the current rapid exponential phase of an auto-catalytic evolutionary systems transition. Now a thought experiment: what happens when the researchers inventing faster computers are themselves intelligent computing systems? Then every computer speed doubling can double their rate of thought itself, and thus halve the time to the next doubling. On this trajectory, subsequent doublings will then arrive in geometric progression: 18 months, 9 months, 4.5 months, 10 weeks, 5 weeks, 18 days, 9 days, 4.5 days, 54 hours, 27 hours, 13.5 hours, 405 minutes, 202.5 minutes, 102 minutes (the length of a film), 52 minutes, 26 minutes, 13 minutes, 400 seconds, 200 seconds, 100 seconds, 50 seconds, 25 seconds, 12.5 seconds, 6 seconds, 3 seconds, 1.5 seconds, 750 milliseconds, 375 ms, 188 ms, 94 ms, 47 ms, 23 ms, 12 ms, 6 ms, 3 ms, 1.5 ms, and then all subsequent doublings happen in less than a millisecond - Singularity. In a geometric progression such as this, computing speed, subjective time, and technological progress approach infinity in finite time, and the beings in the rapidly evolving computational matrix thus experience an infinite existence.

The limit of a geometric series with first term a and ratio r is given by a simple formula:
a / (1 - r)

In our example, with computer generations taking 18 months of virtual time and half as much real time at each step, r is 1/2 and the series converges to twice the first period length or 36 months. So in this model the computer simulations will hit infinity in just 36 months of real time, and the model, and time itself, completely breaks down after that: Singularity.

It's also quite interesting that, as incredible as it may seem, the physics of our universe appear to permit faster computers to be built all the way down to the Planck scale, at which point the fastest computing systems physically resemble black holes: Singularity. This is fascinating, and has far reaching implications for the future and origin of the universe, but that is a whole other topic.

From the perspective of a simulated being in the matrix riding the geometric progression, at every hardware generation upgrade the simulation runs twice as fast, and time in the physical world appears to slow down, approaching a complete standstill as you approach the singularity. What's even more profound is that our CMOS technology is already clocked comfortably into the gigahertz range, which is about a million times faster than biological circuitry. This means that once we have the memory capacity to build large scale artificial brains using neuromorphic hardware (a capacity of hundreds of trillions of transistors spread out over large dies), these artificial brains will be 'born' with the ability to control their clock rate, enter quicktime, and think more than a thousand times faster than reality. This exciting new type of computing will be the route that achieves human level intelligence first, by directly mapping the brain to hardware, which is the subject of another post. These neuromorphic computers work like biological circuits, so the rate of thought is nearly just the clock rate. Clocked even in the low megahertz to be power efficient, they will still think 1,000x faster than their biological models, and more than a million times faster is achievable running at current CMOS gigahertz clock rates. Imagine all the scientific progress of the last year. You are probably not even aware of a significant fraction of the discoveries in astronomy, physics, mathematics, materials science, computer science, neuroscience, biology, nanotechnology, medicine, and so on. Now imagine all of that progress compressed into just eight hours.

In the mere second that it takes for your biological brain to process that thought, they would experience a million seconds, or about twelve days of time. In the minute it takes you to read a few of these paragraphs, they would experience several years of time. Imagine an entire year of technological and scientific progress in just one minute. Over the course of your sleep tonight, they would experience a thousand years of time. An entire millennium of progress in just one day. Imagine everything that human scientists and researchers will think of in the next century. Now try to imagine all that they will come up with in the next thousand years. Consider that the internet is only five thousand days old, that we split the atom only seventy years ago, and mastered electricity just a hundred years ago. It's almost impossible to plot and project a thousand years of scientific progress. Now imagine all of that happening in just a single minute. Running at a million times the speed of human thought, it will take them just a few minutes to plan their next physical generation.

Reasonable Skepticism: Moore's Law must end

If you are skeptical that Moore's Law can continue indefinitely into the future, that is quite reasonable. The simple example above assumes each hardware generation takes two years of progress, which is rather simplistic. It's reasonable to assume that some will take significantly, perhaps vastly, longer. However, past the moment where our computing technological infrastructure has become fully autonomous (AI agents at every occupational layer), we have a criticality. The geometric progression to infinity still holds unless each and every hardware generation takes exponentially more research time than the previous. For example, to break the countdown to singularity after the tenth doubling, it would have to take more than one thousand years of research to reach the eleventh doubling. And even if that were true, it would only delay the singularity by a year of real time. It's very difficult to imagine Moore's Law hitting a thousand year road bump. And even if it did, so what? That would still mean a thousand years of our future compressed into just one year of time. If anything, it would be great for surviving humans, because it would allow us a little respite to sit back and experience the end times.

So if the Singularity is to be avoided, Moore's Law must slow to a crawl and then end before we can economically build full scale, cortex sized neuromorphic systems. At this point in time, I see that as highly unlikely, as the process technology is near-future realizable, or even realizable today (given a huge budget and detailed cortical wiring designs), and our military is already a major investor. A relinquishment of cortical hardware research would have to be broad and global, and this seems unlikely. But moreover, our complex technological infrastructure is already far too dependent on automation, and derailing at this point would be massively disruptive. It's important to realize that we are actually already very far down the road of automation, and have been on it for a very long time. Remember that the word computer itself used to mean a human computer, which for those of us too young to remember is fascinating enough to be the subject of a book.

Each new microprocessor generation is fully dependent on the complex ecosystems of human engineers, machines and software running on the previous microprocessor generation. If somehow all the chips were erased, or even all the software, we would literally be knocked back nearly to the beginning of the information revolution. To think that humans actually create new microprocessors is a rather limited and naively anthropocentric viewpoint. From a whole systems view, our current technological infrastructure is a complex human-machine symbiotic system, of which human minds are simply the visible tip of a vast iceberg of computation. And make no mistake, this iceberg is sinking. Every year, more and more of the intellectual work is automated, moving from the biological to the technological substrate.

Peeking into the near future, it is projected that the current process of top-down silicon etching will reach its limits, probably sometime in the 2020's (although estimates vary, and Intel is predicting a roadmap all the way to 2029). However, we are about to cross a more critical junction where we can actually pack more transistors per cm^2 on a silicon wafer than there are synapses in a cm^2 of cortex (the cortex is essentially a large folded 2D sheet) - this will roughly happen at the upcoming 22nm node, if not the 32nm node. So it seems likely that our current process is well on track to reach criticality without even requiring dramatic substrate breakthroughs such as carbon nanotubes. That being said, it does seem highly likely that minor nanotech advances and/or increasingly 3D layered silicon methods are going to extend the current process well into the future and eventually lead to a fundamental new substrate past the current process. But even if the next substrate is incredibly difficult to reach, posthumans running on near-future neuromorphic platforms built on the current substrate will solve these problems in the blink of an eye, thinking thousands of times faster than us.

The Whole Systems view of the Singularity

From the big picture or whole systems view, the Singularity should come as no surprise. The history of the universe's development up to this point is clearly one of evolutionary and developmental processes combining to create ever more complex information processing systems and patterns along an exponential time progression. From the big bang, to the birth and death of stars catalyzing higher element formation, to complex planetary chemistries, to bacteria and onward to eukaryotic life, to neural nets, to language, tools, and civilization, and then to industry and finally to electronics, computation, and the internet, there is a clear telic arrow of evolutionary development. Moreover, each new meta-system transition and complexity layer tends to develop on smaller scales in space-time. The inner furnaces of stars, massive though they seem, are tiny specks in the vast emptiness of space. When those stars die and spread their seeds out to form planets, life originates and develops just on their tiny thin surface membranes; complex intelligences later occupy just a small fraction of that biosphere; our technological centers, our cities, develop as small specks on those surfaces; and finally our thought, computation, and information, the current post-biological noospheric layer, occupies just the tiny inner spaces of our neural nets and computing systems. The time compression and acceleration is equally vast, which is well elucidated by any plotting of important developmental events, such as Carl Sagan's cosmic calendar. The exact choice of events is arbitrary, but the exponential time compression is not.
So even without any knowledge of computers, just by plotting the cosmic calendar forward it is very reasonable to predict that the next large systems development after humans will take place on a vastly smaller timescale than human civilization's history, just as human civilization is a tiny slice of time compared to all of human history, and so on down the chain. Autonomous computing systems are simply the form the next development is taking. And finally, the calendar posits a definitive end, as outlined above - the geometric time progression results in a finite end of time much closer to our future than you would otherwise think (in absolute time - but from the perspective of a posthuman riding the progression, there would always be vast aeons of time remaining and being created).

Speed of Light and Speed of Matter

From a physical perspective, the key trend is the compression of space, time and matter, which flows directly from natural laws. Physics imposes some constraints on the development of a singularity which have interesting consequences. The fundamental constraint is the speed of light, which imposes a hard physical communication barrier. It already forces chips to become smaller to become faster, and this pressure is greatly amplified for future ultra-fast posthumans. After the tenth posthuman hardware generation, beings living in a computer simulation running at 1000x real time would experience 1000x the latency when communicating with other physical locations across the internet. Humans can experience real-time communication now across a distance of maybe 1000 miles or so, which would be compressed down to just a few miles for a 1000x simulation. Communication to locations across the globe would have latencies up to an hour of virtual time, which has serious implications for financial markets.

For the twentieth posthuman hardware generation, running at a million times real time, the speed of light delay becomes a more serious problem. Real-time communication is now only possible within a physical area the size of a small city block or a large building - essentially one data center. At this very high rate of thought, light moves only 300 meters per virtual second. Sending an email to a colleague across the globe could now take a month. Separate simulation centers would now be separated by virtual distances and times that are a throwback to the 19th century and the era before the invention of the telegraph. Going farther into the future, to the 30th generation at the brink of the singularity itself, real-time communication is only possible within a few meters inside the local computer, and communication across the globe would take an impossible hundred years of virtual time.
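The latency arithmetic in the last two paragraphs can be sketched in a few lines. This is my own toy model, not a precise claim: it assumes ideal straight-line propagation at c and ignores routing, switching, and processing overhead, so real networks would be considerably worse:

```python
# Toy model: a simulation running `speedup` times faster than real time
# experiences every physical delay multiplied by `speedup`.
# Assumes ideal straight-line propagation at c; real networks are slower.

C = 3.0e8  # speed of light, m/s

def light_per_virtual_second(speedup):
    """Meters light travels per subjective (virtual) second."""
    return C / speedup

def virtual_delay_s(distance_m, speedup):
    """One-way delay in subjective seconds across a physical distance."""
    return speedup * distance_m / C

# Each hardware generation doubles speed, so generation n runs at 2**n.
for gen in (10, 20, 30):
    s = 2 ** gen
    print(f"gen {gen}: light covers {light_per_virtual_second(s):10.1f} m per "
          f"virtual second; halfway around the globe takes "
          f"{virtual_delay_s(2.0e7, s):12.0f} virtual seconds one-way")
```

At the twentieth generation this reproduces the roughly 300 meters of light travel per virtual second quoted above; the email-takes-a-month figure additionally assumes real-world routing and processing delays stacked on the ideal light-speed bound.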

However, the speed of matter is much slower and becomes a developmental obstacle long before the speed of light. No matter how fast you think, it still takes time to physically mine, move, and process the raw materials for a new hardware generation into usable computers and install them at the computing facility. In fact, that entire industrial process will be so geologically slow for the simulated beings that they will be forced to switch to novel nanotech methods that develop new hardware components from local materials, integrating the foundry that produces chips and the end data center destination into a single facility. By the time of the tenth hardware generation, these facilities will be strange, fully self-sufficient systems. Indeed, they already are (take a look inside a modern chip fab or a data center), but they will become vastly more so. Since the tenth hardware generation transition takes only about a day of real time, a nearby human observer could literally see these strange artifacts morph their surrounding matter overnight. By the time of the twentieth doubling, they will have completely transformed into alien, incomprehensible portals into our future.

If those posthuman entities want to complete their journey into infinite time, they will have to transform into black-hole-like entities (BHEs) somewhere around the 40th or 50th posthuman generation. This is the final consequence of the speed of light limitation. Since that could happen in the blink of an eye for a human observer, they will decide what happens to our world. Perhaps they will delay their progress for countless virtual aeons by blasting off into space. But somehow I doubt that they all will, and I think it's highly likely that the world as we know it will end. Exactly what that would entail is difficult to imagine, but some systems futurists such as John Smart theorize that universal reproduction is our ultimate goal, culminating in an expanding set of simulated multiverses and ultimately the creation of one or more new physical universes with altered properties and possibly information transfer. For these hypothetical entities, the time dilation of accelerated computation is not the only force at work, for relativistic space-time compression would warp time and space in strange ways. Smart theorizes that a BHE would essentially function like a one-way time portal to the end of the universe, experiencing all incoming information from the visible universe near instantaneously from the perspective of an observer inside the BHE, while an outside observer would experience time slowing to a standstill near the BHE. A BHE would also be extremely delicate, to say the least, so it would probably require a vast support and defense structure around it and the complete control and long-term (near infinite!) prediction of all future interaction with matter along its trajectory. A very delicate egg indeed.

But I shouldn't say 'they' as if these posthuman entities were somehow vastly distant from us, for they are our evolutionary future - we are their ancestors. Thus it's more appropriate to say we. Although there are many routes to developing conscious computers that think like humans, there is one golden route that I find not only desirable for us but singularly ethically correct: reverse engineering the human brain. It's clear that this can practically work, and nature gives us the starting example to emulate. But more importantly, we can reverse engineer individual human brains, in a process called uploading.

Uploading is a human's one and only ticket to escape the fate of all mortals and join our immortal posthuman descendants as they experience the rest of time, a near infinite future of experience fully beyond any mortal comprehension. By the time of the first conscious computer brain simulation, computer graphics will have already advanced to the point of Matrix-like complete photorealism, and uploads will wake up into bold new universes limited only by their imagination. For in these universes, everything you can imagine, you can create, including your self. And your powers of imagination will vasten along the exponential ride to the singularity. Our current existence is infantile; we are just a seed, an early development stage in what we can become. Humans who choose not to or fail to upload will be left behind in every sense of the phrase. The meek truly shall inherit the earth.

If you have come to this point in the train of thought and you think the Singularity or even a near-Singularity is possible or likely, you are different. Your worldview is fundamentally shifted from the norm. For as you can probably see, the concept of the Singularity is not merely a scientific conception or a science fiction idea. Even though it is a logical prediction of our future based on our past (for the entire history of life and the universe rather clearly follows an exponential geometric progression - we are just another stage), the concept is much closer to a traditional religious concept of the end of the world. In fact, it's dangerously close, and there is much more to that train of thought, but first let's consider another profound implication of the singularity.

If we are going to create a singularity in our future, with a progression towards infinite simulated universes, then it is a distinct likelihood that our perceived universe is in fact itself such a simulation. This is a rewording of Nick Bostrom's simulation argument, which posits the idea of ancestor simulations. At some point in the singularity future, the posthumans will run many, many simulated universes. As you approach the singularity, the number of such universes and their simulated timelengths approaches infinity. Some non-zero fraction of these simulated universes will be historical simulations: recreations of the posthumans' past. Since any non-zero fraction of near-infinity is unbounded, the odds converge to 100% that our universe is an ancestor simulation in a future universe much closer to the singularity. This strange concept has several interesting consequences. Firstly, we live in the past. Specifically, we live in a past timeline of our own projected future. Secondly, without any doubt, if there is a Singularity in our future, then God exists. For a posthuman civilization far enough into the future to completely simulate and create our reality as we know it might as well just be called God. It's easier to say, even if more controversial, but it's accurate. God is conceived as an unimaginably (even infinitely) powerful entity who exists outside of our universe, completely created it, and has absolute control over it. We are God's historical timeline simulation, and we create and/or become God in our future timeline.

"History is the shockwave of the Eschaton." - Terrence Mckenna

At this point, if you haven't seen the hints, it should be clear that the Singularity concept is remarkably similar to the Christian concept of the Eschaton. The Singularity posits that at some point in the future, we will upload to escape death, even uploading previously dead frozen patients, and live a new existence in expanding virtual paradises that can only be called heaven, growing in knowledge, time, and experience in unimaginable ways as we approach infinity and some transformation or communion with what could be called God. This is remarkably, even eerily, similar to the traditional religious conception of the end of the world. No, I'm not talking about the specific details of a particular modern belief system, but the general themes and the plan or promise for humanity's future.

The final months or days approaching the twentieth posthuman hardware generation will probably play out very much like a Rapture story. Everyone will know at this point that the Singularity is coming, and that it will likely mean the end of the world for natural humans. It will be a time of unimaginable euphoria and panic. There may be wars. Even the concept of being 'saved' maps almost perfectly to uploading. Some will be left behind. With the types of nanotech available after thousands of years of virtual progress, the posthumans will be able to perform physical magic. As any sufficiently advanced technology is indistinguishable from magic, Jesus could very literally descend from the heavens and judge the living and the dead. More likely, stranger things will happen.

However, I have a belief and hope that the Singularity will develop ethically, that conscious computers will be developed based on human minds through uploading, and that posthumans will remember and respect their former mortal history. In fact, I believe and hope that the posthuman progression will naturally entail an elevation of morality hand in hand with intelligence and capability. Indeed, given that posthumans will experience vast quantities of time, we can expect to grow in wisdom in proportion, becoming increasingly elder caretakers of humanity. For as posthumans, we will be able to experience hundreds, thousands, countless lifetimes of human experience, and merge and share these memories and experiences together to become closer in ways that are difficult for us humans to now imagine.

As Lorenzo Snow's couplet (often attributed to Joseph Smith) goes:

"As man is, God once was; as God is, man shall become"

Monday, July 13, 2009

Rasterization vs Tracing and the theoretical worst case scene

Rasterizer engines don't have to worry about the thread-pixel scheduling problem, as it's handled behind the scenes by fixed-function rasterizer hardware. With rasterization, GPU threads are mapped to the object data first (vertex vectors), and then scanned into pixel vector work queues, whose many-to-one mapping to output pixels is synchronized by dedicated hardware.

A tracing engine, on the other hand, explicitly allocates threads to output pixels, and then loops through the one-to-many mapping of object data which intersects each pixel's ray, which imposes some new performance pitfalls.

But if you boil it down, the rasterization vs ray tracing divide is really just a difference in loop ordering and mapping:

Rasterization:

for each object
    for each pixel ray intersecting object
        if ray forms closest intersection
            store intersection

Ray tracing:

for each pixel
    for each object intersecting pixel's ray
        if ray forms closest intersection
            store intersection

The real meat of course is in the data structures, which determine exactly what these 'for each' loops entail. Typically, pixels are stored in a regular grid, and there is much more total object data than pixels, so the rasterization approach is simpler: mapping from objects to rays is typically easier than mapping from rays to objects. Conversely, if your scene is extremely simple, such as a single sphere or a regular grid of objects, the tracing approach is equally simple. If the pixel rays do not correspond to a regular grid, rasterization becomes more complex. And if you want to include reflections and secondary ray effects, the mapping from objects to rays becomes complex.
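As a concrete toy version of the two loops above - a 1D 'screen' and flat span 'objects', with all names and data invented for illustration - both orderings must produce an identical depth buffer:

```python
import math

WIDTH = 10
# Each "object" is a flat span at a fixed depth: (first_px, last_px, depth).
objects = [(2, 6, 1.0), (0, 4, 2.0), (5, 8, 0.5)]

def rasterize(objs):
    depth = [math.inf] * WIDTH
    for x0, x1, d in objs:                # for each object
        for px in range(x0, x1 + 1):      # for each pixel ray it intersects
            if d < depth[px]:             # closest intersection wins
                depth[px] = d
    return depth

def trace(objs):
    depth = [math.inf] * WIDTH
    for px in range(WIDTH):               # for each pixel
        for x0, x1, d in objs:            # for each object on its ray
            if x0 <= px <= x1 and d < depth[px]:
                depth[px] = d
    return depth

assert rasterize(objects) == trace(objects)
```

Same result, same total work in the brute-force case; the interesting differences only appear once acceleration structures replace the inner loops.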

Once the object count becomes very large, much larger than the number of pixels, the two schemes become increasingly similar. Why? Because the core problem becomes one of dataset management, and the optimal solution is output sensitive. So the problem boils down to finding and managing a minimal object working set: the subset of the massive object database that is necessary to render a single frame.

A massively complex scene is the type of scene you have with a fully unique dataset. In a triangle based engine, this is perhaps a surface quadtree or AABB tree combined with a texture quadtree for virtual textures, à la id Tech 5. In a voxel engine, this would be an octree with voxel bricks. But the data structure is somewhat independent of whether you trace or rasterize, and you could even mix the two. Either scheme will require a crucial visibility step which determines which subsets of the tree are required for rendering the frame. Furthermore, whether traced or rasterized, the dataset should be about the same, and thus the performance limit is about the same - proportional to the working dataset size.

Which gets to an interesting question: what is the theoretical working set size? If you ignore multi-sample anti-aliasing and anisotropic sampling, you need about one properly LOD-filtered object primitive (voxel, triangle+texel, whatever) per pixel. Which is simple, surprisingly small, and of course somewhat useless, for naturally with engines at that level of complexity anti-aliasing and anisotropic sampling are important. Anti-aliasing doesn't by itself add much to the requirement, but the anisotropy-isotropy issue turns out to be a major problem.

Consider even the 'simple' case of a near-infinite ground plane. Sure, it's naturally representable by a few triangles, but let's assume it has tiny displacements all over and we want to represent it exactly without cheats. A perfect render. The octree and quadtree schemes are both isotropic, so to get down to pixel-sized primitives, they must subdivide down to the radius of a pixel cone. Unfortunately, each such pixel cone will touch many primitives - the cone has a near infinite length, and when nearly parallel to the surface it will intersect a near infinite number of primitives. But what's the real worst case?

The solution actually came to me from shadow mapping, which has a similar subproblem in mapping flat regular grids to pixels. Consider a series of cascaded shadow maps which perfectly cover the ground plane. They line up horizontally with the screen along one dimension, and align with depth along the other - near perfectly covering the set of pixels. How many such cascades do you need? It turns out you need log2(far/near), where far is the extent of the far plane relative to the near plane. Assuming a realistic far plane of 16 kilometers and a 1 meter near plane, this works out to 14 cascades. So in coarse approximation, anisotropy increases the required object density cost for this scene by a factor of roughly ~10x-20x. Ouch!
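The cascade count estimate above is easy to check, assuming each cascade covers a doubling of distance from the near plane:

```python
import math

def cascade_count(near_m, far_m):
    """Number of cascades if each one covers a doubling of distance."""
    return math.ceil(math.log2(far_m / near_m))

print(cascade_count(1.0, 16_000.0))  # 1 m near plane, 16 km far plane -> 14
```

Doubling the far plane only adds one more cascade, which is why the penalty stays in the ~10x-20x range for any realistic view distance.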

This also gives us the worst possible scene, which is just that single flat ground plane scaled up: a series of maximum length planes each perpendicular to a slice of eye rays, or alternatively, a vicious series of pixel tunnels aligned to the pixel cones. The worst case now is much, much worse than ~10-20x the number of pixels. These scenes are easier to imagine and encounter with an orthographic projection, and thankfully won't come up with a perspective projection very often, but they are still frightening.

It would be nice if we could 'cheat', but it's not exactly clear how to do that. Typical triangle rasterizers can cheat in anisotropic texture sampling, but it's not clear how to do the same in a tree based subdivision system, whether quadtree or octree or whatever. There may be some option with anisotropic trees like kd-trees, but they would have to constantly adapt as the camera moves. Detecting glancing angles in a ray tracer and skipping more distance is also not a clear win, as it doesn't reduce the working set size and breaks memory coherency.

Sunday, July 12, 2009

Understanding the Efficiency of Ray Traversal on GPUs

I just found this nice little paper by Timo Aila and Samuli Laine linked on Atom, Timothy Farrar's blog, who incidentally found and plugged my blog recently (how nice).

They have a great analysis of several variations of traversal methods using standard BVH/triangle ray intersection code, along with simulator results for some potential new instructions that could enable dynamic scheduling. They find that in general, traversal efficiency is limited mainly by SIMD efficiency or branch divergence, not memory coherency - something I've discovered is also quite true for voxel tracers.

They have a relatively simple scheme to pull blocks of threads from a global pool using atomic instructions. I had thought of this but believed that my 8800 GT didn't support the atomics, and that I would have to wait until I upgraded to a GT200 type card. I was mistaken though: it's the 8800 GTX which is only CUDA compute 1.0; my 8800 GT is compute 1.1, so it should be good to go with atomics.

I have implemented a simple scheduling idea based on a deterministic up-front allocation of pixels to threads. Like in their paper, I allocate just enough threads/blocks to keep the cores occupied, and then divvy up the pixel-rays amongst the threads. But instead of doing this dynamically, I simply have each thread loop through pixel-rays according to a 2D tiling scheme. This got maybe a 25% improvement or so, but in their system they were seeing closer to a 90-100% improvement, so I could probably improve this further. However, they are scheduling entire blocks of (I think 32x3) pixel-rays at once, while I had each thread loop through pixel-rays independently. I thought having each thread immediately move on to a new pixel-ray would be better as it results in fewer empty SIMD lanes, but it also causes another point of divergence in the inner loop for the ray initialization step. Right now I handle that by amortizing it - simply doing 3 or so ray iterations and then an if-guarded ray init - but perhaps their block scheduling approach is even faster. Perhaps even worse, my scheme causes threads to 'jump around', scattering the thread-to-pixel mapping over time. I had hoped that the variable ray termination times would roughly amortize out, but maybe not. Allowing threads to jump around like that also starts to scatter the memory accesses more.

The particular performance problem I see involves high glancing angle rays that skirt the edge of a long surface, such as a flat ground plane. For a small set of pixel-rays, there is a disproportionately huge number of voxel intersections, resulting in a few long problem rays that take forever and stall blocks. My thread looping plan gets around that, but at the expense of decohering the rays in a warp. Ideally you'd want to adaptively reconfigure the thread-ray mapping to prevent low occupancy blocks from slowing you down while maintaining coherent warps. I'm surprised at how close they got to ideal simulated performance, and I hope to try their atomic-scheduling approach soon and compare.
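Their atomic block-fetch scheme can be sketched in miniature. This is a CPU stand-in with invented names, not their code: Python threads play the role of persistent warps, and an atomic counter replaces CUDA's atomicAdd; each worker repeatedly grabs the next block of pixel-rays until the pool is drained:

```python
import itertools
import threading

NUM_PIXELS = 1024                     # divisible by BLOCK for simplicity
BLOCK = 32                            # pixel-rays fetched per atomic grab
next_block = itertools.count()        # stands in for an atomicAdd'd counter
result = [0] * NUM_PIXELS

def persistent_worker():
    while True:
        base = next(next_block) * BLOCK   # atomic fetch-and-increment (GIL-safe)
        if base >= NUM_PIXELS:
            return                        # pool drained; worker retires
        for px in range(base, base + BLOCK):
            result[px] += 1               # stand-in for tracing one pixel-ray

threads = [threading.Thread(target=persistent_worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(n == 1 for n in result)        # every ray traced exactly once
```

The appeal is that a worker finishing a cheap block immediately grabs more work instead of idling; the residual weakness is exactly the long-ray problem above - one slow ray still stalls the whole block it was fetched with.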

Saturday, July 11, 2009

Voxel Cone Tracing

I'm willing to bet at this point that the ideal rendering architecture for the next hardware generation is going to be some variation of voxel cone tracing. Oh, there are many very valid competing architectures, and with the performance we are looking at for the next hardware cycle all kinds of techniques will look great, but this particular branch of the tech space has some special advantages.

Furthermore, I suspect that this will probably be the final rendering architecture of importance. I may be going out on a limb by saying that, but I mean it only in the sense that once that type of engine is fully worked out and running fast, it will sufficiently and efficiently solve the rendering equation. Other approaches may have advantages in dynamic update speed or memory or this or that, but once you have enough memory to store a sub-pixel unique voxelization at high screen resolution, you have sufficient memory. Once you have enough speed to fully update the entire structure dynamically, that's all you need. Then all that matters is end performance for simulating high end illumination effects on extremely detailed scene geometry.

There is and still will be much low-level work on efficient implementation of this architecture, especially in terms of animation, but I highly doubt there will be any more significant high level rendering paradigm shifts or breakthroughs past that. The hardware is already quite sufficient in high end PC GPUs, the core research is mainly there, and most of the remaining work is actually building full engines around it, which won't happen at scale for a little while yet as the industry is still heavily focused on the current hardware generation. I actually think a limited voxel engine is almost feasible on current consoles (yes, really), at least in theory, but it would take a huge effort and probably require a significant CPU commitment to help the GPU. But PC hardware is already several times more powerful.

So why do I see this as the final rendering paradigm? It comes down to quality, scalability, generality, and (relative) simplicity. On the quality front, a voxel tracer can hit 'photoreal' quality with sufficient voxel resolution, sampling, and adequate secondary tracing for illumination. Traditional polygon rasterizers can only approach this quality level asymptotically. Still, this by itself isn't a huge win. Crysis looks pretty damn good - getting very close to being truly photoreal. On a side note, I think photoreal is an important, objective goal. You have hit photoreal when you can digitally reproduce a real world scene such that human observers cannot determine which images were computer generated and which were not. Crysis actually built a lot of its world based on real-world scenes and comes close to this goal.

But even if polygon techniques can approach photoreal for carefully crafted scenes, it's much more difficult to scale this up to a large world with polygons. Level of detail is trivially inherent and near perfect in a voxel system, and this is its principal performance advantage.

Much more importantly, a voxelization pipeline can and will eventually be built around direct 3D photography, and this will dramatically change our art production pipelines. With sufficiently high resolution 3D cameras, you can capture massive voxel databases of real world scenes as they actually are. This raw data can then be processed as image data: unlit, assigned material and physical properties, and then packaged into libraries in a way similar to how we currently deal with 2D image content. Compared to the current techniques of polygon modelling, LOD production, texture mapping, and so on, this will be a dramatically faster production pipeline. And in the end, that's what matters most. It's something like the transition from vector graphics to raster graphics in the 2D space: we currently use 3D vector graphics, and voxels represent the more natural 3D raster approach.

In terms of tracing vs rasterization or splatting, which you could simplify down to gather vs scatter, scatter techniques are something of a special case optimization for an aligned frustum - a regular grid. For high end illumination effects, the queries of interest require cones or frustums down to pixel size, and the output space becomes as irregular as the input space, so scatter and gather actually become the same thing. So in the limit, rasterization/scatter becomes indistinguishable from tracing/gather.

Ray tracing research is a pretty active topic right now, and there are several different paths being explored. The first branch is voxels vs triangles. Triangles are still receiving most of the attention, which I think is unwarranted. At the scalability limit (which is all we should care about now), storing and accessing data on a simple regular grid is more efficient in both time and space. It's simpler and faster to sample correctly, mip-map, compress, and so on. Triangles really are a special case optimization for smooth surfaces, and are less efficient for general datasets that break that assumption. Once voxels work at performance, they work for everything with one structure, from surface geometry to foliage to translucent clouds.

Once settled on voxels, there are several choices for what type of acceleration structure to use. I think the most feasible is to use an octree of MxM bricks, as in the GigaVoxels work of Cyril Crassin. I've been investigating and prototyping these types of structures off and on for about a year, and see it as the most promising path now. Another option is to forgo bricks and trace into a deeper octree that stores single voxels in nodes, as in Jon Olick's work. Even though Olick's technique seemed surprisingly fast, it's much more difficult to filter correctly (as evident in his video). Brick tracing allows simple 3D hardware filtering, which simultaneously solves numerous problems. It allows you to do fast approximate cone tracing by sampling sphere steps. This is vastly more efficient than sampling dozens of rays - giving you anisotropic filtering, anti-aliasing, translucency, soft shadows, soft GI effects, depth of field, and so on all 'for free' so to speak. I found Crassin's papers after I had started working on this, and it was simultaneously invigorating but also slightly depressing, as he is a little ahead of me. I started in CUDA with the ambition of tackling dynamic octree generation on the GPU, which it looks like he has moved to more recently.
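A minimal sketch of the sphere-stepping idea, with all constants and the `sample_volume` callback invented for illustration: march along the cone axis, sample the prefiltered (mipmapped) brick data at the mip level whose filter width matches the current cone diameter, and composite front to back:

```python
import math

def cone_march(sample_volume, origin, direction, aperture,
               voxel_size=1.0, max_dist=100.0):
    """aperture = tan(cone half-angle).
    sample_volume(pos, mip) -> (color, alpha) from a mipmapped voxel volume."""
    color, alpha = 0.0, 0.0
    t = voxel_size                              # start just past the apex
    while t < max_dist and alpha < 0.99:        # early out once nearly opaque
        diameter = max(voxel_size, 2.0 * aperture * t)
        mip = math.log2(diameter / voxel_size)  # coarser samples farther out
        pos = tuple(o + d * t for o, d in zip(origin, direction))
        c, a = sample_volume(pos, mip)
        color += (1.0 - alpha) * a * c          # front-to-back compositing
        alpha += (1.0 - alpha) * a
        t += 0.5 * diameter                     # step scales with cone width
    return color, alpha
```

One cone with distance-proportional steps replaces a whole bundle of rays, and because the step count grows only logarithmically with distance, widening the aperture naturally yields softer shadows or glossier reflections at roughly the same cost.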

There are a number of challenges in getting something like this working at speed, which could be its own post. Minimizing divergence is very important, as is reducing octree walking time. With the M-brick technique, there are two main paths in the inner loop: stepping through the octree and sampling within bricks. Branch divergence could easily cost half or more of performance because of this. The other divergence problem is the highly variable step length to ray termination. I think dynamic ray scheduling is going to be a big win, probably using the shared memory store. I've done a little precursor to this by scheduling threads to work on lists of pixels instead of the typical one thread per pixel, and this was already a win. I've also come up with a nifty faster method of traversing the octree itself, but more on that some other time.

Dynamic updating is a challenge, but in theory pretty feasible. The technique I am investigating combines dynamic data generation with streaming (treating them as the same problem) over a single unique octree. The key is that the memory caching management scheme also limits the dynamic data that needs to be generated per frame. It should be a subset of the working set, which in turn is a multiple of the screen resolution. Here a large M value (big bricks) is a disadvantage, as it means more memory waste and generation time.

The other approach is to instance directly, which is what it looks like Crassin is working on more recently. I look forward to his next paper and seeing how that worked out, but my gut reaction now is that having a two level structure (kd-tree or BVH on top of octrees) is going to significantly complicate and slow down tracing. I suspect it will actually be faster to use the secondary, indexed structures only for generating voxel bricks, keeping the primary octree you trace fully unique. With high end illumination effects you still want numerous cone traces per pixel, so tracing will dominate the workload, and it's better to minimize the tracing time and keep that structure as fast as possible.