Thursday, October 15, 2009

Nvidia's Fermi and other new things

I've been ignoring this blog lately as work calls, and in the meantime there have been a few interesting developments:
* Nvidia announced/hyped/unveiled their next-gen architecture, Fermi, aka Nvidia's Larrabee
* Nvidia is apparently abandoning/getting squeezed out of the chipset market in the near term
* But, they also apparently have won a contract for the next gen DS using Tegra
* OnLive is supposedly in open Beta (although it's unclear how 'open' it is just yet)
* OnLive also received a large new round of funding, presumably to build up more data centers for launch. Interestingly, AT&T led this round, instead of Time Warner. Rumour is they are up to a billion dollar valuation, which if true, is rather insane. Consider for example that AMD has a current market cap of just $4 billion.

The sum of these converging trends points to a future computing market dominated on one hand by pervasive, super-cheap hand-held devices and on the other by large-scale industrial computing in the cloud.

1. Moore's law and PC marginalization. Moore's law is squeezing the typical commodity PC into increasingly smaller and cheaper forms. What does the typical customer need a computer for? For perhaps 80% of customers, 99% of the time, it's for web, video and word processing or other simple apps (which these days all just fall into the web category). The PC was designed for an era when these tasks were formidable, and more importantly, before pervasive high speed internet. This trend is realized in system designs such as Nvidia's Tegra or Intel's Atom, which integrate a cheap low power CPU with dedicated hardware for video decode/encode, audio and the other common tasks. For most users, there just isn't a compelling reason for more powerful hardware, unless you want to use it to play games.

In the end this is very bad for Intel, AMD and Nvidia, and they all know it. In the short to medium term they can offset losses in the traditional PC market with their low-power designs, but if you extrapolate the trend into the decade ahead, eventually the typical computational needs of the average user will be adequately met by a device that costs just a few dozen bucks. This is a long term disaster for all parties involved unless you can find a new market or sell customers on new processor intensive features.

2. Evolution of the game industry. Moore's law has vastly expanded the game landscape. On the high end, you have the technology leaders, such as Crysis, which utilize the latest CPU/GPU tech. But increasingly the high end is less of the total landscape, not because there is less interest in high end games, but simply because the landscape is so vast. The end result of years of rapid evolutionary adaptive radiation is a huge range of games across the whole spectrum of processing complexity, from Crysis on one end to Nintendo DS or flash games on the other. Crysis doesn't quite compete with free web games; they largely occupy different niches. In the early days of the PC, the landscape was simple and all the games were more or less 'high end' for the time. But as technology marches on and allows you to do more in a high end game, this never kills the market for simpler games on the low end.

The other shift in games is the rise of console dominance, both in terms of the living room and the market. The modern console has come a long way, and now provides a competitive experience in most genres, quality multiplayer, media and apps. The PC game market still exists, but mainly in the form of genres that really depend on keyboard and mouse or are by nature less suitable to playing on a couch. Basically the genres that Blizzard dominates. Unfortunately for the hardware people, Blizzard is rather slow in pushing the hardware.

3. The slow but inexorable deployment of pervasive high speed broadband. It's definitely taking time, but this is where we are headed sooner rather than later. Ultimately this means that the minimal cheap low power device described above is all you need or will ever need for local computation (basically video decompression), and any heavy lifting that you need can be made available from the cloud on demand. This doesn't mean that there won't still be a market for high end PCs, as some people will always want their own powerful computers, but it will be increasingly marginal and hobbyist.

4. The speed of light barrier. Moore's law generally allows exponential increase in the number of transistors per unit area as process technology advances and shrinks, but only more marginal improvements in clock rate. Signal propagation is firmly limited by the speed of light, and so the round trip time of a typical fetch/execute/store operation is relatively huge, and has been for quite some time. The strategy up to fairly recently for CPU architects was to use ever more transistors to hide this latency and increase execution rate through pipelining with caches, instruction scheduling and even prediction. GPUs, like DSPs and even Cray vector processors before them, took the simpler route of massive parallelization. Now the complex superscalar design has long since reached its limits, and architects are left with massive parallelization as the only route forward to take advantage of additional transistors. In the very long term, the brain stands as a sort of example of where computing might head eventually, faced with the same constraints.

This is the future, and I think it's clear enough that the folks at Intel, Nvidia and AMD can all see the writing on the wall. The bigger question is what to do about it. As discussed above, I don't think the low end netbook/smartphone/whatever market is enough to sustain these companies in the longer term; there will only be more competition and lower margins going forward.

Where is the long term growth potential? It's in the cloud. Especially as gaming starts to move into this space, this is the one market that Moore's law will never marginalize.

This is why Nvidia's strategy with Fermi makes good sense to me, just as Larrabee does for Intel. With Fermi Nvidia is betting that paying the extra die space for the remaining functionality to elevate their GPU cores into something more like CPU cores is the correct long term decision.

When you think about it, there is a huge difference between a chip like Larrabee or (apparently) Fermi which can run full C++, and more limited GPUs like the GT2xx series or AMD's latest. Yes you can port many algorithms to run on CUDA or OpenCL or whatever, but port is the key word.

With Larrabee or Fermi you actually should be able to port over existing CPU code, as they support local memory caches, unified addressing and function pointers/indirect jumps, and thus even interrupts. I.e., they are complete, and really should be called wide-vector massively threaded CPUs. The difference between that kind of 'GPU' and upcoming 'CPUs' really just comes down to vector-width, cache sizes and hardware threading decisions.

But really, porting existing code is largely irrelevant. Existing CPU code, whether single or multi threaded, is a very different beast than mega-threaded code. The transition from a design based on one to a handful of threads to a design for thousands of threads is the important transition. The vector-width or instruction set details are tiny details in comparison (and actually, I agree with Nvidia's decision to largely hide the SIMD width, having them simulate scalar threads). Larrabee went with a somewhat less ambitious model, supporting 4-way hyper-threading vs the massive threading of current GPUs, and I think this is a primary mistake. Why? Because future architectures will only get faster by adding more threads, so you had better design for massive thread scalability now.
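To make the 'scalar threads over hidden SIMD' idea concrete, here is a toy sketch in plain Python. It is only a sequential emulation, not how any real GPU or driver works: `launch`, `saxpy_thread` and `run_saxpy` are names made up for illustration, and the `simd_width` grouping just stands in for the hardware batching threads into lockstep groups. The point is that the per-thread code never mentions the vector width, so the same program gives the same result whether the machine is 4 wide or 32 wide.

```python
def saxpy_thread(tid, a, x, y, out):
    """Body of one scalar 'thread': handles exactly one element.

    Nothing in here knows or cares how wide the hardware SIMD unit is.
    """
    out[tid] = a * x[tid] + y[tid]


def launch(kernel, n, *args, simd_width=32):
    """Run `kernel` once per thread id in [0, n).

    Threads are walked in groups of `simd_width`, standing in for the
    hardware executing a group of threads in lockstep. This is a
    sequential emulation, so it shows the programming model only, not
    any real performance behaviour.
    """
    for base in range(0, n, simd_width):
        for tid in range(base, min(base + simd_width, n)):
            kernel(tid, *args)


def run_saxpy(n, simd_width):
    """Helper for the example: out[i] = 2 * i + 10 for i in [0, n)."""
    x = [float(i) for i in range(n)]
    y = [10.0] * n
    out = [0.0] * n
    launch(saxpy_thread, n, 2.0, x, y, out, simd_width=simd_width)
    return out


if __name__ == "__main__":
    # Same per-thread code, two different (hypothetical) hardware widths.
    print(run_saxpy(10, simd_width=4) == run_saxpy(10, simd_width=32))
```

Scaling to a wider chip here means changing a launch parameter, not rewriting the kernel, which is the property that matters if future hardware only gets faster by adding threads.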

What about fusion, and CPU/GPU integration?

There's a lot of talk now about integrating the CPU and GPU onto a single die, and indeed ATI is massively marketing/hyping this idea. In the near term it probably makes sense in some manner, but in the longer term it's largely irrelevant.

Why? Because the long term trend is and must be software designed for a sea of threads. This is the physical reality, like it or not. So what's the role of the traditional CPU in this model? Larrabee and Fermi point to GPU cores taking on CPU features. Compare upcoming Intel CPU designs to Fermi or Larrabee. Intel will soon move to 16 superscalar 4-way SIMD cores on a chip at 2-3 GHz. Fermi will be 16 'multi-processors' with 32 scalar units each at 1-2 GHz. Larrabee is somewhere in between, but closer to Fermi.
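A quick back-of-envelope on those figures, as a rough sketch only. It assumes one operation per lane per clock and ignores superscalar issue width, fused multiply-add, memory bandwidth and everything else that decides real throughput. The `Chip` class and its field names are made up for this illustration; the core counts, widths and clock ranges are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int            # CPU cores or GPU 'multi-processors'
    lanes_per_core: int   # SIMD width or scalar units per core
    clock_ghz_low: float
    clock_ghz_high: float

    def lanes(self) -> int:
        """Total parallel lanes on the chip."""
        return self.cores * self.lanes_per_core

    def peak_gops(self) -> Tuple[float, float]:
        """(low, high) billions of lane-ops per second.

        Assumes one op per lane per clock, which is a simplification.
        """
        return (self.lanes() * self.clock_ghz_low,
                self.lanes() * self.clock_ghz_high)


# Figures taken from the paragraph above.
cpu = Chip("16-core 4-way SIMD CPU", 16, 4, 2.0, 3.0)
fermi = Chip("Fermi", 16, 32, 1.0, 2.0)

if __name__ == "__main__":
    for chip in (cpu, fermi):
        low, high = chip.peak_gops()
        print(f"{chip.name}: {chip.lanes()} lanes, {low:.0f}-{high:.0f} G lane-ops/s")
```

Under these assumptions that is 64 lanes against 512: the GPU-style chip has 8x the lanes at roughly half the clock, so something like 4-5x the raw lane throughput, but only if the software can actually fill that many threads.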

It's also pretty clear at this point that most software or algorithms designed to be massively parallel perform far better on the more GPU-ish designs above (most, but not all). So in the long term CPU and GPU become historical terms - representing just points on a spectrum between superscalar and supervector, and we just have tons of processors, and the whole fusion idea really just amounts to a heterogeneous vs homogeneous design. As a case study, compare the 360 to the PS3. The 360 with 3 general CPUs and a 48-unit GPU is clearly easier to work with than the PS3 with its CPU, 7 weird SPUs, and 24-unit GPU. Homogeneity is generally the better choice.

Now going farther forward into the next decade, looking at a 100+ core design, would you rather have the die split between CPU cores and GPU cores? One CPU as coordinator and then a bunch of GPU cores, or, just all cGPU cores? In the end the latter is the most attractive if the cGPU cores have all the features of a CPU. If the same C++ code runs on all the cores, then perhaps it doesn't matter.


Rex Guo said...

Hi Jake, nice article and great thoughts. My main concern remains with the issue of bandwidth and latency. Broadband speeds have improved as little as battery life over the years and I can't see that changing soon. Although there's a ton of applications that do not require low latencies, like interactive real-estate walkthroughs, a huge chunk of the market is still driven by games.

Here's a follow-up of the RealityServer cloud from the GTC event that complements some of your points:

Jake Cannell said...

Rex, deploying a full streaming game service on the current internet is a big technical challenge, but not due to one issue like bandwidth or latency.

I read somewhere recently that the average US connection is up to around 3 Mbps. This is enough to stream H.264 video at 720p. OnLive's compression requires a little more juice for 720p, but most gaming households already have it or have access to 5 Mbps+ connections.
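As a sanity check on what 3 Mbps means for 720p, here is a small back-of-envelope sketch. The 30 fps frame rate and the 24 bits per raw pixel are my assumptions for illustration (they are not OnLive or H.264 specifics), and the function names are made up.

```python
def bits_per_pixel(mbps, width, height, fps):
    """Average compressed bits available per pixel at a given bitrate."""
    return mbps * 1_000_000 / (width * height * fps)


def compression_ratio(mbps, width, height, fps, raw_bits_per_pixel=24):
    """How much smaller the stream is than uncompressed video.

    raw_bits_per_pixel=24 assumes 8-bit RGB with no chroma subsampling.
    """
    return raw_bits_per_pixel / bits_per_pixel(mbps, width, height, fps)


if __name__ == "__main__":
    # 720p is 1280x720; 30 fps is an assumed frame rate.
    bpp = bits_per_pixel(3, 1280, 720, 30)
    ratio = compression_ratio(3, 1280, 720, 30)
    print(f"{bpp:.3f} bits per pixel, about {ratio:.0f}:1 compression")
```

With those assumptions the codec gets roughly 0.11 bits per pixel, a bit over 200:1 against raw video. Doubling the frame rate to 60 fps halves the per-pixel budget, which gives a feel for why a low-latency game stream wants a bit more than the 3 Mbps average.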

I've written more extensively about latency in previous posts, and it's probably the harder challenge now, but it's solvable mainly on the provider side through efficient compression and fast servers. You only need to get a <150ms round trip to replicate the current home console experience.
