Thursday, July 30, 2009
The Next Generation of Gaming
I hope we can correct that over the upcoming years with the next generation. It's not that we'll spend much less on the AAA titles, but we'll spend it more efficiently, produce games more quickly, and make more total revenue as we further expand the entire industry. By gaining back much of the efficiency lost in the transition to the 7th generation, and then some, we'll be able to produce far more games and reach much higher quality bars. We can accomplish all of this by replacing the home consoles with dumb terminals and moving our software out onto data centers.
How will moving computation out into the cloud change everything? Really it comes down to simple economics. In a previous post, I analyzed some of these economics from the perspective of an on-demand service like OnLive. But let's look at it again in a simpler fashion, and imagine a service that rented out servers on demand, by the hour or minute. This is the more general form of cloud computing, sometimes called grid computing, where the idea is simply to turn computation into a commodity, like power or water. A data center would then rent out its servers to the highest bidder. Economic competition would push the price of computation down to the data center's cost plus a reasonable profit margin. (Unlike power, water, and internet service, there is less inherent monopoly risk, as no fixed lines are required beyond the internet connection itself.)
So in this model, the developer could make their game available to any gamer on any device around the world by renting computation from data centers near customers, just as it is needed. The retailer, of course, is cut out. The publisher is still important as the financier and marketer, although the larger developers could take this on themselves, as some already have. Most importantly, the end consumer can play the game on whatever device they have, as the device only needs to receive and decompress a video stream. The developer/publisher then pays the data center for the rented computation, and pays only as needed, as each customer comes in and jumps into a game. So how does this compare to our current economic model?
A server in a data center can be much more efficient than a home console. It only needs the core computational system: CPU/GPU (which are soon merging anyway) and RAM. Storage can be shared amongst many servers, so its cost is negligible (some storage per game instance is required, but it's reasonably minimal). So a high end server core could be had for around $1,000 or so at today's prices. Even if it's active only 10 hours per day on average, that generates about 3,000 hours of active computation per year. Amortize that over a three year lifespan (still much less than a console generation) and you get roughly ten cents per hour of computation. Even if it burns 500 watts of power (insane) and 500 watts to cool, those together add only about another ten cents per hour. So it's under 25 cents per hour in terms of intrinsic cost (and this is for a state of the art rig, dual GPU, etc; much less for lower end hardware). This cost will hold steady into the future as games use more and more computation. Obviously the cost of running old games will decrease exponentially, but new games will always want to push the high end.
The more variable cost is the cost of bandwidth, and the extra computation to compress the video stream in real time. These used to be high, but are falling exponentially as video streaming comes of age. Yes, we will want to push the resolution up from 720p to 1080p, but this will happen slowly, and further resolution increases are getting pointless for typical TV setups (for a PC monitor the diminishing return is a little farther off, but still). But what is this cost right now? Bulk bandwidth costs about $10 per megabit/s of dedicated bandwidth per month, or just over three cents per hour per megabit in our model, assuming 300 active server hours in a month. To stream 720p video with H.264 compression, you need about 2 megabits per second of average bandwidth (which is what matters for the data center). The peak bandwidth requirement is higher, but that smooths out completely when you have many users. So that's just $0.06/hour for a 720p stream, or $0.12/hour for a 1080p stream. The crazy interesting thing is that these bandwidth prices ($10/Mbps per month) are as of the beginning of this year, and are falling by about 30-40% per year. So really the bandwidth only became economically feasible this year, and it's only going to get cheaper. By 2012, these prices will probably have fallen by half again, and streaming even 1080p will be dirt cheap. This is critical for making any predictions or plans about where this is all heading.
So adding up all the costs today, we get somewhere around $0.20-0.30 per hour for a high end rig streaming 720p, and 1080p would only be a little more. This means that a profitable data center could charge just $0.50 per hour to rent out a high end computing slot, and $0.25 per hour or a little less for more economical hardware (but still many times faster than current consoles). So twenty hours of a high end graphics blockbuster shooter would cost $10 in server infrastructure costs. That's pretty cheap. I think it would be a great thing for the industry if these costs were simply passed on to the consumer, and they were given some choice. Without the retailer taking almost half of the revenue, the developer and publisher stand to make a killing. And from the consumer's perspective, the game could cost about the same, but you don't have any significant hardware cost; or even better, you pay for the hardware cost as you see fit, hourly or monthly or whatever. If you are playing 40 hours a week of an MMO or serious multiplayer game, that $0.50 per hour might be a bit much, but you could then choose to run it on lower end hardware if you want to save some money. But actually, as I'll get to some other time, MMO engines designed for the cloud could be super efficient, so much more so than single player engines that they could use far less hardware power per player. But anyway, it'd be the consumer's choice, ideally.
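To make the arithmetic concrete, here's a minimal sketch of the cost model in code. Every number in it (hardware price, active hours, the roughly $0.10/kWh power rate implied above, bandwidth price, stream bitrate) is the rough mid-2009 assumption from the paragraphs above, not a measured figure:

#include <cstdio>

int main()
{
    // Hardware: ~$1000 server core, ~3000 active hours/year, 3 year lifespan.
    double hwPerHour = 1000.0 / (3000.0 * 3.0);             // ~$0.11/hour

    // Power + cooling: 500W + 500W = 1 kW at roughly $0.10 per kWh.
    double powerPerHour = 1.0 * 0.10;                        // ~$0.10/hour

    // Bandwidth: ~$10 per Mbps per month, ~300 active hours/month,
    // ~2 Mbps average for a 720p H.264 stream (double that for 1080p).
    double bwPerHour = (10.0 / 300.0) * 2.0;                 // ~$0.07/hour

    double total = hwPerHour + powerPerHour + bwPerHour;
    printf("intrinsic cost: ~$%.2f per active hour\n", total);   // ~$0.27
    return 0;
}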
This business model makes more sense from all kinds of angles. It allows big budget, high profile story driven games to release more like films, where you play them on crazy super-high end hardware, even hardware that could never exist at home (like 8 GPUs or something stupid), maybe paying $10 for the first two hours of the game to experience something insanely unique. There's so much potential, and even at the low price of $0.25-$0.50 per hour for a current mid-2009 high end rig, you'd have an order of magnitude more computation than we are currently using on the consoles. This really is going to be a game changer, but to take advantage of it we need to change as developers.
The main opportunity I see with cloud computing here is to reduce our costs, or rather, improve our efficiency. We need our programmers and designers to develop more systems with less code and effort in less time, and our artists to build super detailed worlds rapidly. I think that rethinking the premises of our core tech and tools is the route to achieve this.
The basic server setup we're looking at for this 1st cloud generation a few years out is going to be some form of multi-teraflop, massively multi-threaded, general GPU-ish device, with gigs of RAM, and perhaps more importantly, fast access to many terabytes of shared RAID storage. If Larrabee or the rumors about NVidia's GT300 are any indication, this GPU will really just be a massively parallel CPU with wide SIMD lanes that are easy to use (or even automatic). It will probably also have a smaller number of traditional cores, possibly with access to even more memory, like a current PC. Most importantly, each of these servers will be on a very high speed network, densely packed in with dozens or hundreds of similar nearby units. Each of these capabilities by itself is a major upgrade from what we are used to, but taken all together it becomes a massive break from the past. This is nothing like our current hardware.
Most developers have struggled to get game engines pipelined across just the handful of hardware threads on current consoles. Very few have developed toolchains that embrace or take much advantage of many cores. From a programming standpoint, the key to this next generation is embracing the sea of threads model across your entire codebase, from your gamecode to your rendering engine to your tools themselves, and using all of this power to speed up your development cycle.
From a general gameplay codebase standpoint, I could see (or would like to see) traditional C++ giving way to something more powerful. At the very least, I'd like to see general databases, full reflection, and at least some automatic memory management, such as reference counting. Reflection alone could pretty radically alter the way you design a codebase, but that's another story for another day. We don't need those little 10% speedups anymore; we need the single mega 10,000% speedup you get from using hundreds or thousands of threads, and data parallelization is the only logical way to get it. Modifying C++, or outright moving to a language with these features that also has dramatically faster compilation and linking, could be an option.
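As a rough, hypothetical illustration of what that data parallel style looks like for plain gamecode (names and structure made up here, not a description of any actual engine), entity updates become a flat pass over arrays that can be split across however many threads the hardware gives you:

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

struct Entity { float pos[3]; float vel[3]; };

// Update a contiguous range of entities; each entity is independent,
// so ranges can run on separate threads with no locking.
static void UpdateRange(std::vector<Entity>& ents, size_t begin, size_t end, float dt)
{
    for (size_t i = begin; i < end; ++i)
        for (int k = 0; k < 3; ++k)
            ents[i].pos[k] += ents[i].vel[k] * dt;
}

// Split the entity array into one chunk per thread, however many we have.
static void UpdateAll(std::vector<Entity>& ents, float dt, unsigned numThreads)
{
    std::vector<std::thread> pool;
    size_t chunk = (ents.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        size_t begin = t * chunk;
        size_t end = std::min(ents.size(), begin + chunk);
        if (begin < end)
            pool.emplace_back(UpdateRange, std::ref(ents), begin, end, dt);
    }
    for (auto& th : pool)
        th.join();
}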
In terms of the core rendering and physics tech, more general purpose algorithms will replace the many specialized systems that we currently have. For example, in physics, a logical next direction is to unify rigid body physics with particle fluid simulation in a system that represents both rigid and soft bodies as large collections of connected spheres, running as a massively parallel grid simulation. Even without that, just partitioning space amongst many threads is a pretty straightforward way to scale physics.
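A toy sketch of the "partition space amongst many threads" idea, assuming a simple sphere-based representation (connections between spheres and cross-slab collision pairs are ignored here; a real system would handle those in a separate boundary pass):

#include <functional>
#include <thread>
#include <vector>

struct Sphere { float x, y, z, r; float vx, vy, vz; };

// Each thread integrates only the spheres whose centers fall in its X slab.
static void StepSlab(std::vector<Sphere>& s, float minX, float maxX, float dt)
{
    for (auto& p : s) {
        if (p.x < minX || p.x >= maxX)
            continue;                                   // owned by another thread
        p.vz -= 9.8f * dt;                              // gravity
        p.x += p.vx * dt; p.y += p.vy * dt; p.z += p.vz * dt;
        if (p.z < p.r) { p.z = p.r; p.vz *= -0.5f; }    // bounce off the ground plane
    }
}

static void StepWorld(std::vector<Sphere>& s, float worldMinX, float worldMaxX,
                      unsigned numThreads, float dt)
{
    std::vector<std::thread> pool;
    float slabW = (worldMaxX - worldMinX) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t)
        pool.emplace_back(StepSlab, std::ref(s), worldMinX + t * slabW,
                          worldMinX + (t + 1) * slabW, dt);
    for (auto& th : pool)
        th.join();
}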
For rendering, I see the many specialized sub systems of modern rasterizers, such as terrain, foliage, shadowmaps, water, decals, LOD chains, cubemaps, etc, giving way to a more general approach like octree volumes that simultaneously handles many phenomena.
But more importantly, we'll want to move to data structures and algorithms that support rapid art pipelines. This is one of the biggest current challenges in production, and where we can get the most advantage in this upcoming generation. Every artist or designer's click and virtual brush stroke costs money, and we need to allow them to do much more with less effort. This is where novel structures like octree volumes will really shine, especially combined with terabytes of server side storage, allowing more or less unlimited control of surfaces, object densities, and so on without any of the typical performance considerations. Artists will have far fewer (or no) technical constraints to worry about and can just focus on shaping the world where and how they want.
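To sketch what such a structure might look like (purely illustrative field names and sizes, not a description of any particular engine), an octree volume is basically a pool of small nodes plus a pool of voxel bricks, where detail exists only where someone actually sculpted it and bricks can be paged in from that server side storage on demand:

#include <cstdint>

static const int BRICK_DIM = 8;                 // 8^3 voxels per leaf brick

struct Voxel { uint8_t r, g, b, a; };           // color + coverage/occupancy
struct Brick { Voxel v[BRICK_DIM][BRICK_DIM][BRICK_DIM]; };

struct OctNode {
    uint32_t childBase;                         // index of first of 8 children, 0 if leaf
    uint32_t brickIndex;                        // index into a (pageable) brick pool
};

// Walk from the root to the leaf containing a point in [0,1)^3.
static uint32_t FindLeafBrick(const OctNode* nodes, float x, float y, float z)
{
    uint32_t n = 0;                             // root node
    while (nodes[n].childBase != 0) {
        int cx = x >= 0.5f, cy = y >= 0.5f, cz = z >= 0.5f;
        n = nodes[n].childBase + (cx | (cy << 1) | (cz << 2));
        x = (x - 0.5f * cx) * 2.0f;             // remap into the chosen octant
        y = (y - 0.5f * cy) * 2.0f;
        z = (z - 0.5f * cz) * 2.0f;
    }
    return nodes[n].brickIndex;
}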
Tuesday, July 28, 2009
Winning my own little battle against Cuda
I did this with a brute force search. The drivers do extensive bounds checking even in release mode and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle illegitimate, arbitrary GPU memory into legitimate allocations the driver would accept. Then I used this to snoop the GPU memory, which enabled the brute force search: use CUDA's routines to copy from CPU linear memory into the tiled volume texture, then snoop the GPU memory to find out exactly where my magic byte ended up, revealing the mapping from an XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).
inline int3 GuessInvTile(uint outaddr)
{
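// Recovered mapping: route each bit of the linear (tiled) address to its
// x, y, or z voxel coordinate bit. The bits handled below cover 8 bits per
// axis, i.e. a 256^3 region of the volume texture.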
int3 outpos = int3_(0, 0, 0);
outpos.x |= ((outaddr>>0) & (16-1)) << 0;
outpos.y |= ((outaddr>>4) & 1) << 0;
outpos.x |= ((outaddr>>5) & 1) << 4;
outpos.y |= ((outaddr>>6) & 1) << 1;
outpos.x |= ((outaddr>>7) & 1) << 5;
outpos.y |= ((outaddr>>8) & 1) << 2;
outpos.y |= ((outaddr>>9) & 1) << 3;
outpos.y |= ((outaddr>>10) & 1) << 4;
outpos.z |= ((outaddr>>11) & 1) << 0;
outpos.z |= ((outaddr>>12) & 1) << 1;
outpos.z |= ((outaddr>>13) & 1) << 2;
outpos.z |= ((outaddr>>14) & 1) << 3;
outpos.z |= ((outaddr>>15) & 1) << 4;
outpos.x |= ((outaddr>>16) & 1) << 6;
outpos.x |= ((outaddr>>17) & 1) << 7;
outpos.y |= ((outaddr>>18) & 1) << 5;
outpos.y |= ((outaddr>>19) & 1) << 6;
outpos.y |= ((outaddr>>20) & 1) << 7;
outpos.z |= ((outaddr>>21) & 1) << 5;
outpos.z |= ((outaddr>>22) & 1) << 6;
outpos.z |= ((outaddr>>23) & 1) << 7;
return outpos;
}
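For completeness, the forward direction can be written by mechanically inverting the bit routing above, packing a voxel coordinate back into its linear tiled address. This is just the algebraic inverse of the function shown, not something snooped separately, so treat it as a sketch:

inline uint GuessTile(int3 pos)
{
    uint addr = 0;
    addr |= ((pos.x >> 0) & (16-1)) << 0;
    addr |= ((pos.y >> 0) & 1) << 4;
    addr |= ((pos.x >> 4) & 1) << 5;
    addr |= ((pos.y >> 1) & 1) << 6;
    addr |= ((pos.x >> 5) & 1) << 7;
    addr |= ((pos.y >> 2) & 1) << 8;
    addr |= ((pos.y >> 3) & 1) << 9;
    addr |= ((pos.y >> 4) & 1) << 10;
    addr |= ((pos.z >> 0) & 1) << 11;
    addr |= ((pos.z >> 1) & 1) << 12;
    addr |= ((pos.z >> 2) & 1) << 13;
    addr |= ((pos.z >> 3) & 1) << 14;
    addr |= ((pos.z >> 4) & 1) << 15;
    addr |= ((pos.x >> 6) & 1) << 16;
    addr |= ((pos.x >> 7) & 1) << 17;
    addr |= ((pos.y >> 5) & 1) << 18;
    addr |= ((pos.y >> 6) & 1) << 19;
    addr |= ((pos.y >> 7) & 1) << 20;
    addr |= ((pos.z >> 5) & 1) << 21;
    addr |= ((pos.z >> 6) & 1) << 22;
    addr |= ((pos.z >> 7) & 1) << 23;
    return addr;
}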
A little idea about compressing Virtual Textures
However, I've also always been interested in compressing and storing out virtual texture data on disk, not as a complete replacement for procedural generation, but as a complement (if a particular quadtree node gets too expensive in terms of the procedural ops required to generate it, you could then store its explicit data). But compression is an interesting challenge.
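As a tiny sketch of that hybrid (all names and the threshold are made up for illustration), each quadtree node would carry a cost estimate for regenerating its tile procedurally, and only the nodes that blow past some budget get baked out and stored as explicit compressed data:

#include <cstdint>

struct VTNode {
    uint32_t proceduralOps;   // estimated cost to regenerate this tile on the fly
    bool     hasStoredTile;   // true if an explicit compressed tile exists on disk
    uint64_t diskOffset;      // where that compressed tile lives, if stored
};

// Bake a tile out only when re-running its procedural recipe would cost more
// than we are willing to pay each time the tile is requested.
static bool ShouldStoreExplicit(const VTNode& n, uint32_t opBudgetPerTile)
{
    return n.proceduralOps > opBudgetPerTile;
}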
A random interesting quote
VentureBeat: How would you start a game company today?
CT (Chris Taylor): I would save up a bunch of money to live on for two years. Then I would develop an iPhone game. I would start building relationships with Sony and Microsoft. I would roll that game into an Xbox Live title or a PlayStation network game. Then I could make it into a downloadable game on the PC. The poster child for that is The World of Goo. Read about that game. Drink lattes. Put your feet up. And build your game. To do that, you need a couple of years in the bank and you have to live in your parents’ basement. You can’t start out with $30 million for a console game. @*%&$^, Roland Emmerich didn’t wake up one day and create Independence Day. We can’t be delusional about trivializing what it takes to make these big games.
I spent numerous years of my life trying to make a video game out of a garage, but didn't follow his advice. Instead I made the typical mistake of starting big and trying to trim down, whereas the opposite approach is more viable. On another note, it seems to me that the iPhone is single-handedly creating a bubble-like wave of entrepreneurship. Finally, connecting these semi-random thoughts, my former startup colleague, Jay Freeman, has become something of an iPhone celebrity (see the article in the Wall Street Journal here), being the force behind the Cydia frontend for jailbroken iPhones.
Friday, July 24, 2009
Countdown to Singularity
Monday, July 13, 2009
Rasterization vs Tracing and the theoretical worst case scene
Sunday, July 12, 2009
Understanding the Efficiency of Ray Traversal on GPUs
They have a relatively simple scheme to pull blocks of threads from a global pool using atomic instructions. I had thought of this, but believed that my 8800 GT didn't support the atomics and that I would have to wait until I upgraded to a GT200 type card. I was mistaken though: it's the 8800 GTX which is only CUDA compute 1.0; my 8800 GT is compute 1.1, so it should be good to go with atomics.
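As I understand it, the scheme boils down to a persistent kernel that grabs chunks of rays from a global counter with an atomic add instead of relying on the hardware scheduler. Here's a minimal CUDA sketch of the idea; the traversal itself is elided, and the per-block granularity and batch size are my own simplifications rather than the paper's exact setup:

#define BATCH_SIZE 32                       // rays fetched per grab (illustrative)

__device__ unsigned int g_nextRay = 0;      // global work pool cursor

__global__ void TraceRaysPersistent(int numRays /*, rays, nodes, results ... */)
{
    __shared__ unsigned int baseRay;
    while (true) {
        if (threadIdx.x == 0)
            baseRay = atomicAdd(&g_nextRay, BATCH_SIZE);   // one atomic per batch
        __syncthreads();

        if (baseRay >= (unsigned int)numRays)
            return;                                        // pool exhausted, whole block exits

        unsigned int rayIdx = baseRay + threadIdx.x;
        if (rayIdx < (unsigned int)numRays) {
            // ... traverse the acceleration structure for ray rayIdx ...
        }
        __syncthreads();    // don't let thread 0 refill baseRay until everyone is done
    }
}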
Saturday, July 11, 2009
Voxel Cone Tracing
There are a number of challenges in getting something like this working at speed, which could be its own post. Minimizing divergence is very important, as is reducing octree walking time. With the M-brick technique, there are two main paths in the inner loop: stepping through the octree and sampling within bricks. Branch divergence could easily cost half or more of the performance because of this. The other divergence problem is the highly variable step length to ray termination. I think dynamic ray scheduling is going to be a big win, probably using the shared memory store. I've done a little precursor to this by scheduling threads to work on lists of pixels instead of one thread per pixel, as is typical, and this was already a win. I've also come up with a nifty faster method of traversing the octree itself, but more on that some other time.
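For reference, the pixel-list scheduling I mentioned is roughly this shape (a hedged sketch; the cone march itself and the exact list construction are omitted): each thread owns a short list of pixels rather than exactly one, so the work per thread averages out over several rays instead of being dominated by a single long one.

#define PIXELS_PER_THREAD 4                 // illustrative list length

__global__ void ConeTracePixelLists(const int* pixelList, int numPixels
                                    /*, octree nodes, bricks, output ... */)
{
    int firstSlot = (blockIdx.x * blockDim.x + threadIdx.x) * PIXELS_PER_THREAD;

    for (int i = 0; i < PIXELS_PER_THREAD; ++i) {
        int slot = firstSlot + i;
        if (slot >= numPixels)
            return;
        int pixel = pixelList[slot];

        // March the cone for this pixel: alternate between stepping the octree
        // and sampling within bricks until the cone terminates (not shown).
        (void)pixel;
    }
}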