Thursday, August 13, 2009

Unique Voxel Storage

How much memory does a unique voxelization of a given scene cost? Considering anistropic filtering and translucency a pixel will be covered by more than one voxel in general. An upper bound is rather straightforwad to calculate. For a single viewport with a limited nearZ and farZ range, there are a finite number of pixel radius voxels extending out to fill the projection volume. The depth dimension of this volume is given by viewportDim * log2(farz/nearz). For a 1024x1024 viewport, a nearZ of 1 meter and a view distance of 16 kilometers, this works out to about log2(16000)*1024, or 14,000 voxels per pixel, or 14 billion voxels for the frustum's projection volume, and around ~100 billion voxels for the entire spherical viewing volume. This represents the maximum possible data size of any unique scene when sampled at proper pixel sampling rate with unlimited translucency and AA precision.

Now obviously, this is the theoretical worst case, which is interesting to know, but wouldn't come up in reality. A straightforward, tighter bound can be reached if we use discrete multi-sampling for the AA and anistropic filtering, which means that each sub-sample hits just one voxel, and we only need to store the visible (closest) voxels. In this case, considering occlusion, the voxel cost is dramatically lower, being just ScreenArea*AAFactor. For an average of 10 sub-samples and the same viewport setup as above, this is just around 100 million voxels for the entire viewing volume. Anistropic filtering quickly hits diminishing returns by around 16x maximum samples per pixel, and most pixels need much less, so a 10x average is quite reasonable.

For translucent voxels, a 10x coverage multiplier is quite generous, as the contribution of high frequencies decreases with decreasing opacity (which current game rasterizers exploit by rendering translucent particles at lower resolution). This would mean that voxels at around 10% opacity would get full pixel resolution, and voxels at about 1.5% or lower would get half-pixel resolution, roughly.

The octree subdivision can be guided with the z occlusion information. Ideally we would update a node's visibility during the ray traversal, but due to the scattered memory write ineffeciency it will probably be better to write out some form of z-buffer and then back-project the nodes to determine visibility.

A brute force multi-sampling approach sounds expensive, but would still be feasible on future hardware, as Nvidia's recent siggraph paper "Alternative Rendering Pipelines with Nvidia Cuda" demonstrates in the case of implementing a Reyes micropolygon rasterizer in Cuda. With enough multi-samples, you don't even need bilinear filtering - simple point sampling will suffice. But for voxel tracing, discrete multi-sampling isn't all that effecient compared to the more obvious and desireable path, which is simply to accumulate coverage/alpha directly while tracing. This is by far the fastest route to high quality AA & filtering. However it does pose a problem for the visibility determination mentioned above - without a discrete z-buffer, you don't have an obvious way of calculating voxel visibility for subdivision.

One approach would be to use an alpha-to-coverage scheme, which would still be faster than true multi-sampled tracing. This would require updating a number of AA z samples inside the tracing inner loop, which is still much more work then just alpha blending. A more interesting alternative is to store an explicit depth function. One scheme would be to store a series of depths representing equal alpha intervals. Or better yet, store arbitrary piecewise segments of the depth/opacity function. In the heirarchical tracing scheme, these could be written out and stored at a lower resolution mip level, such as the quarter res level, and then be used both to accelerate tracing for the finer levels and for determing octree node visibility. During the subdivision step, nodes would project to the screen and sample their visibility from the appropriate depth interval from this structure.

I think the impact of anisotropy and translucency can be limited or capped just as in the discrete z-buffer case by appropriate node reweighting based on occlusion or opacity contribution. A node which finds that it is only 25% visible would only get slightly penalized, but a 5% visibile node more heavily so, effectively emulating a maximum effective voxel/pixel limit, after which resolution is lost. (which is fine, as the less a node contributes, the less important the loss of its high frequency content). Or more precisely, node scores would decrease in proportion to their screen coverage when it falled below the threshold 1/AA, where AA is the super-sampling limit you want to emulate.

Friday, August 7, 2009

Hierarchical Cone Tracing (half baked rough sketch)

High quality ray tracing involves the computation of huge numbers of ray-scene intersections. As most scenes are not random, the set of rays traced for a particular frame are highly structured and spatially correlated. Just as real world images feature significant local spatial coherence which can be exploited by image compression, real world scenes feature high spatial coherence which ray tracers can exploit. Current real time ray tracers exploit spatial coherence at the algorithm level through hierachical packet/cone/fustrum tracing, and at the hardware level through wide SIMD lanes, wide memory packet transactions, and so on. Coherence is rather easy to maintain for primary rays and shadow rays, but becomes increasingly difficult with specular reflection and refraction rays in the presence of high frequency surface normals, or the wide dispersion patterns of diffuse tracing. However, taking inspiration from both image compression and collision detection, it should be possible to both significantly reduce the number of traces per pixel and trace all rays (or cones) in a structured, coherent and even heirarchical manner.

A set of rays with similar origins and directions can be conservatively bounded or approximated by a frustum or cone. Using these higher order shapes as direct intersection primitives can replace many individual ray traces, but usually at the expense of much more complex intersections for traditional primitives such as triangles. However, alternative primitives such as voxels (or distance fields) permit cheap sphere intersections, and thus fast approximate cone tracing through spherical stepping. Cones have the further advantage that they permit fairly quick bounding and clustering operations.

Building on cones as the base primitive, we can further improve on ray tracing in several directions. First we can treat the entire set of cone traces for a frame as a large collision detection problem, intersecting a set of cones with the scene. Instead of intersecting each cone individually, HCT builds up a cone heirachy on the fly and uses this to quickly exclude large sections of potential sphere-scene intersections. The second area of improvement is to use clustering approximations at the finer levels to reduce the total set of cones, replacing clusters of similar cones with approximations. Finally, the heiarchy exposed can be navigated in a coherent fashion which is well suited to modern GPUs.

In short sketch, hierachical cone tracing amounts to taking a cluster of cone segments (which in turn are clusters of points+rays), and then building up some sort of hierachical cone tree organization, and then using this tree to more effeciently trace and find intersections. As tracing a cone involves testing a set of spheres along the way, testing enclosing spheres up the hierachy can be used to skip steps lower down in the hierachy. However, instead of building the hierarchy from bottom up (which would require the solution to already be known), the cone tree is built from the top down, using adaptive subdivision. Starting with a single cone (or small set of cones) from the camera, it calculates an intersection slice (or ranges of slices) with the scene. These intersection spheres are tested for bounds on the normals, which allow computation of secondary cones, with origins bounding the interesection volumes and angular widths sufficient to bound the secondary illumination of interest. This space forms a 3D (2 angular + depth) dependency tree structure, which can then be adaptively subdivided, eventually down to the level of near pixel width primary cones which typically intersect a small slice of the scene and have similar normals (small radius normal bounding cone). Refraction secondary cones have a similar width to the incoming cone's width. Specular secondary cones have a width ranging from the incoming width to much wider, depending on the glossiness term. Diffuse bounce secondary cones are essentially equivalent to the widest specular, expanding the cone to maximum width. Subdivision can proceed in several ways at each step, either in primary cone direction (2D), intersection depth, or secondary cone direction (2D) or intersection depth. (considering more bounces in one step would add additional dimensions) The subdivision is error guided, terminating roughly when a low error approximation is reached. This framework is complete, and can simultaneously handle all illumination effects.

First, consider just the simple case of primary rays. Here, hierachical cone tracing is pretty straightforward. You trace the image pyramid from coarse to finest. Each mip trace finds the first intersection or conservative near-z for that mip level, and the next mip level trace uses this coarse near-z to start tracing, instead of starting at the camera. Just even this basic idea can result in more than a 2x speedup vs regular tracing. Further speedup is possible if the lower res mip traces farther in, possibly outputting a set of segments instead of just the 1st intersection. This can be done rather easily with alpha tracing in a voxel system. Considering a range of segments, and not just the 1st intersection amounts to treating it as a 3D subdivision problem instead of 2D. The global bounding cone is approximated by a few dozen spheres, each of which subdivide into about 8 child spheres, and so on. Testing a sphere high up in the heirachy can exclude that entire subtree.

Next lets consider a directional or point light. Starting with a very large cone for the entire scene, we get a set of intersection slices along the entire frustum each of which has a range of normals. Since these regions are so huge, the normals will probably cover the full sphere, so the cones for any secondary effects will basically be larger spheres emenating out across the entire world. Not suprisingly, the orgin of the cone hierachy doesn't tell you much other than you have a camera frustum, it hits a bunch of stuff, and the secondary illumination could come from anywhere in the scene. Subdivision can proceed in several different ways. You could next subdivide the screen, or you could even subdivide in 3D (screen + depth), splitting into 8 child traces, but the farther depth slices are provisional (they may not accumulate anything). However, another dimension of subdivision, which makes sense here, is to subdivide the 2D direction of secondary (illumination) cones. This should be error guided, subdividing cones that contain high frequency illumination (stored in the octree) and contribute the most error. A bright point light will be found this way, or a direct light will just show up as a huge illumination spike in one cone direction. Subdivision of direction will then find the incoming light direction logarithimically. If there is no indirect illumination (all zeroed out in the octree), then the error guidance will expand in the direction dimension until it finds the primary light direction, and ignore the other directions.

How would the error guidance work more specifically? Its goal would seek to minimize the final tree size (and thus total computation). It would do this in a greedy, approximate fashion using the local information available at each step. When subdividing a cone, there are several splitting options to choose from, and one dimension (depth, ie the cone's length) is rather special as it involves a temporal dependency. The closer cone section potentially masks the farther section, so if the near section is completely opaque (blocks all light), the far section can be cut. Tracing a cone as a sphere approximation amounts to fully subdividing along just the depth dimension, and reveals the full alpha interval function along that depth (which could be saved as just a min and max, or in a more explicit format). Subdividing a cone then along either the spatial dimension or angular dimension would depend on the outcome of the trace, illumination info found, and the current angle relative to the spherical origin bound.

Careful error guidance will result in an effecient traversal for the directional light case. Once the secondary cone direction is sufficiently subdivided, finding that a secondary cone trace towards the light direction results in a 0 (fully shadowed) will automatically stop further subdivision of that entire subtree. Likewise, subdividing in secondary cone depth will reveal entire cone subsections that are empty, and do not need to be traced. The fully lit regions will then end up exploring increasingly short cone traces near the surface. Full cone traces down to pixel level are only required near shadow edges, and only when the shadow is near pixel resolution, as the error guidance can terminate softer shadows earlier. Secondary bounce diffuse illumination is similar, but the energy is smeared out (lower frequency), so it explores a broader subtree in cone direction, but can usually terminate much earlier in the spatial dimension. It again ends up terminating with just a few shallow traces at the finest resolution, representing local searches. Specular and reflection traces are handled in a similar fashion, and really aren't that different.

Surface Clustering

Further improvement comes from clustering approximation. The coarse level tests can use bounds on the normals and depths in an intersection region to approximate the intersection and terminate early (for example, finding that the intersection for a 4x4 pixel block is a smooth, nearly coplanar surface section allows early termination and computation of intersection through simple interpolation for the 2 finer levels). This is related to the concept of shading in a compressed space. Consider a simple block based compression scheme, such as DXTC, which essentially uses a PCA approach to compress a set of values (a 4x4 block) by a common 1D line segment through the space (low frequency component) combined with a per pixel distance along the line (high frequency and lower accuracy). The scheme compresses smooth regions of the image with little error, and the error in noisy regions of the image is masked by the noise.

Now lets first apply this compression scheme in the context of a traditional shading. In fact, it is directly applicable in current raster engines for complex but lower frequency shading effects, like screen space AO or GI. Downsampling the depth buffer to compute AO on less samples, and then upsampling with a bilateral filter can be considered a form of compression that exploits the lower frequency dominant AO. A related scheme, closer to DXTC, is to perform a min/max depth downsampling, evaluate the AO on the min&max samples per block, and then use these to upsample - with a dual bilateral or even without. The min/max scheme better represents noisy depth distributions and works much better in those more complex cases. (although similar results could be obtained with only storing 1 z-sample per block and stipple-alternating min and max)

So the generalized form of this spatial clustering reduction is to 1. reduce the set of samples into a compressed, smaller examplar set, often reducing the number of dimensions, 2. run your per-sample algorithm on the reduced exemplar set of samples, and then 3. use a bilateral-type filter to interpolate the results from the exemplars back onto the originals. If the exemplars are chosen correctly (and are sufficient) to capture the structure and frequency of the function on the sample set, there will be little error.

As another example, lets consider somewhat blurry, lower frequency reflections on a specular surface. After primary ray hit points are generated for a block of pixels (or rasterized), you have a normal buffer. Then you run a block compression on this buffer to find a best fit min and max normal for the block, and the per pixel interpolators. Using this block information, you can compute specular illumination on the 2 per block normal directions, and then interpolate the results for all of the pixels in the block. In regions where the normal is smooth, such as a smooth surface like the hood of a car, the reduced block normals are a very close approximation. In regions with high frequency noise in the normals, the noise breaks up any coherent pattern in the specular reflection and hides any error due to the block compression. Of course, on depth edges there is additional error due to the depth/position. To handle this, we need to extend the idea to also block compress the depth buffer, resulting in a multi-dimensional clustering, based on 2 dimensions along a sphere for the normal, and one dimension of depth. This could still be approximated by two points along a line, but there are other possibilities, such as a 3 points (forming a triangle), or even storing 2 depth clusters with 2 normals each (4 points in the 3D space). It would be most effecient to make the algorithm adaptive, using perhaps 1-3 candidate examplars for a block. The later bilateral filtering would pick the appropriate neighbor candidates and weights for each full res sample.

Integrating this concept into the hierachical cone tracer amounts to adding a new potential step when considering expansion sub tasks. As I described earlier about primary rays, you can stop subdivision early if you know that the hit intersection is simple and reducable to a plane. Generalizing that idea, the finer levels of the tree expansion can perform an approximation substitution instead of fully expanding and evaulating in each dimension, terminating early. A plane fit is one simple example, with an examplar set being the more general case. The 5D cone subtree at a particular lower level node (like 4x4), which may have many dozens of fully expanded children can be substituted with a lower dimensional approximation and a few candidate samples. This opens up a whole family of algorithms which adaptively compress and reduce the data and computation space. Triangle like planar primitives can be considered a spot optimization that may be cheaper than subidiving and tracing additional spheres.

Its certainly complex, or potentially complex, but I think it represents a sketch of the direction of what an optimal or near optimal tracer would look like. Just as in image compression, increasing code complexity hits some asymptotic wall at some point. I've only considered a single frame here, going further would require integrating these ideas with temporal coherence. Temporal coherence is something I've discussed earlier, although there's many options to how to approach that in a heirarchical cone tracer.

What are the potential advantages? Perhaps an order of magnitude speedup or more if you really add alot of the special case optimizations. Of course, this speedup is only at the algorithmic level, and really depends very much on implementation details, which are complex. I think the queue task generation models now being explored on GPUs are the way to go here, and could implement an adaptive tree subdivision system like this effeciently. Coherence can be maintained by organizing expansions in spatially related clusters, and enforcing this through clustering.

But the real advantage would be combining all lighting effects into a single unified framework. Everything would cast, receive, bounce and bend light, with no special cases. Lights would just be objects in the scene, and everything in the voxelized scene would store illumination info. A real system would have to cache this and what not, but it can be considered an intermediate computation in this framework, one of many things you could potentially cache.

I've only explored the baby steps of this direction so far, but its interesting to ponder. Of course, the streaming and compression issues in a voxel system are far higher priority. Complex illumination can come later.

Sunday, August 2, 2009

More on grid computing costs

I did a little searching recently to see how my conjectured cost estimates for cloud gaming compared to the current market for grid computing. The prices quoted for server rentals vary tremendously, but I found this NewServers 'Bare Metal Cloud' service as an interesting example of raw compute server rental by the hour or month (same rate, apparently no bulk discount).

Their 'Jumbo' option for 38 cents per hour is within my previous estimate of 25-50 cents per hour. It provides dual quad cores and 8GB of RAM. It doesn't have a GPU of course, but instead has two large drives. You could substitute those drives for a GPU and keep the cost roughly the same (using a shared network drive for every 32 or 64 servers or whatever - which they also offer). Nobody needs GPU's in server rooms right now, which is the biggest difference between a game service and anything else you'd run in the cloud, but I expect that to change in the years ahead with Larrabbee and upcoming more general GPUs. (and coming from the other angle, CPU rendering is becoming increasingly viable) These will continue to penetrate into the grid space, driven by video encoding, film rendering, and yes, cloud gaming.

What about bandwidth?
Each server includes 3 GB of Pure Internap bandwidth per hour

So adequate bandwidth for live video streaming is already included. Whats missing, besides the GPU? Fast, low latency video compression, of course. Its interesting that x264, the open source encoder, can do realtime software encoding using 4 intel cores (and its certainly not the fastest out there). So if you had a low latency H.264 encoder, you could just use 4 of the cpus for encoding and 4 to run the game. Low latency H.264 encoders do exist of course, and I suspect that is the route Dave Perry's Gaikai is taking.

Of course, in the near-term, datacenters for cloud gaming will be custom built, such as what OnLive and OToy are attempting. Speaking of which, the other interesting trend is the adoption of GPU's for feature film use, as used recently in the latest Harry Potter film. OToy is banking on this trend, as their AMD powered datacenters will provide computation for both film and games. This makes all kinds of sense, because the film rendering jobs can often run at night and use otherwise idle capacity. From an economic perspective, film render farms are already well established, and charge significantly more per server hour - usually measured per Ghz-hour. Typical prices are around 12-6 cents per Ghz in bulk, which would be around a dollar or two per hour for the server example given above. I imagine that this is mainly due to the software expense, which for a render server could add up to be many times the hardware cost.

So, here are the key trends:
- GPU/CPU convergence, leading to a common general server platform that can handle film/game rendering, video compression, or anything really
- next gen game rendering moving into ray tracing and the high end approaches of film
- bulk bandwidth already fairly inexpensive for 720p streaming, and falling 30-40% per year
- steadily improving video compression tech, with H.265 on the horizon, targeting a further 50% improvement in bitrate

Will film and game rendering systems eventually unify? I think this is the route we are heading. Both industries want to simulate large virtual worlds from numerous camera angles. The difference is that games are interesting in live simulation and simultaneous broadcast of many viewpoints, while films aim to produce a single very high quality 2 hour viewpoint. However, live simulation and numerous camera angles are also required during a film's production, as large teams of artists each work on small pieces of the eventual film (many of which are later cut), and need to be able to quickly preview (even at reduced detail). So the rendering needs of a film production are similar to that of a live game service.

Could we eventually see unified art pipelines and render packages between games and films? Perhaps. (indeed, the art tools are largelly unified already, except world editing is usually handled by propriatary game tools) The current software model for high end rendering packages is not well suited to cloud computing, but the software as a service model would make alot of sense. As a gamer logs in (through a laptop, cable box, microconsole, whatever) and starts a game, that would connect to a service provider to find a host server nearby, possibly installing the rendering software as needed and streaming the data (cached at each datacenter, of course). The hardware and the software could both be rented on demand. Eventually you could even create games without licensing an engine in the traditional sense, but simply by using completely off the shelf software.

Saturday, August 1, 2009

Some thoughts on metaprogramming, reflection, and templates

The thought struck me recently that C++ templates really are a downright awful metaprogramming system. Don't get me wrong, they are very powerful and I definitely use them, but recently I've realized that whatever power they have is soley due to enabling metaprogramming, and there are numerous other ways of approaching metaprogramming that actually make sense and are more powerful. We use templates in C++ because thats all we have, but they are an ugly, ugly feature of the language. It would be much better to combine full reflection (like Java or C#) with the capability to invoke reflective code at compile time to get all the performance benefits of C++. Templates do allow you to invoke code at compile time, but through a horribly obfuscated functional style that is completely out of synch with the imperative style of C++. I can see how templates probably evolved into such a mess, starting as a simple extension of the language that allowed a programmer to bind a whole set of function instantiations at compile time, and then someone realizing that its turing complete, and finally resulting in a metaprogramming abomination that never should have been.

Look at some typical simple real world metaprogramming cases. For example, take a generic container, like std::vector, where you want to have a type-specialized function such as a copy routine that uses copy constructors for complex types, but uses an optimized memcpy routine for types where that is equivalent to invoking the copy constructor. For simple types, this is quite easy to do with C++ templates. But using it with more complex user defined structs requires a type function such as IsMemCopyable which can determine if the copy constructor is equivalent to a memcpy. Abstractly, this is simple: the type is mem-copyable if it has a default copy constructor and all of its members are mem-copyable. However, its anything but simple to implement with templates, requiring all kinds of ugly functional code.

Now keep in my mind I havent used Java in many years, and then only briefly, I'm not familar with its reflection, and I know almost nothing of C#, although I understand both have reflection. In my ideal C++ with reflection language, you could do this very simply and naturally with an imperative meta-function with reflection, instead of templates (maybe this is like C#, but i digress):

struct vector {
generic* start, end;
generic* begin() {return start;}
generic* end() {return end;}
int size() {return end-start;}

type vector(type datatype) {start::type = end::type = datatype*;}

void SmartCopy(vector& output, vector& input)
if ( IsMemCopyable( typeof( *input.begin() ) ) {
memcpy(output.begin(), input.begin(), input.size());
else {
for_each(output, input) {output[i] = input[i];}

bool IsMemCopyable(type dtype) {
bool copyable (dtype.CopyConstructor == memcpy );
for_each(type.members) {
copyable &= IsMemCopyable(type.members[i]);
return copyable;

The idea is that using reflection, you can unify compile time and run-time metaprogramming into the same framework, with compile time metaprogramming just becoming an important optimization. In my pseudo-C++ syntax, the reflection is accesable through type variables, which actually represent types themselves: pods, structs, classes. Generic types are specified with the 'generic' keyword, instead of templates. Classes can be constructed simply through functions, and I added a new type of constructor, a class constructor which returns a type. This allows full metaprogramming, but all your metafunctions are still written in the same imperative language. Most importantly, the meta functions are accessible at runtime, but can be evaluated at compile time as well, as an optimization. For example, to construct a vector instantiation, you would do so explicitly, by invoking a function:

vector(float) myfloats;

Here vector(float) actually calls a function which returns a type, which is more natural than templates. This type constructor for vector assigns the actual types of the two data pointers, and is the largest deviation from C++:

type vector(type datatype) {start::type = end::type = datatype*;}

Everything has a ::type, which can be set and manipulated just like any other data. Also, anything can be made a pointer or reference by adding the appropriate * or &.

if ( IsMemCopyable(typeof( *input.begin() ) ) {

There the * is used to get past the pointer returned by begin() to the underlying data.

When the compiler sees a static instantiation, such as:
vector(float) myfloats;

It knows that the type generated by vector's type constructor is static and it can optimize the whole thing, compiling a particular instantiation of vector, just as in C++ templates. However, you could also do:

type dynamictype = figure_out_a_type();
vector(dynamictype) mystuff;

Where dynamictype is a type not known at compile time and could be determined by other functions, loaded from disk, or whatever. Its interesting to note that in this particular example, the unspecialized version is not all that much slower as the branch in the copy function is invoked only once per copy, not once per copy constructor.

My little example is somewhat contrived and admittedly simple, but the power of reflective metaprogramming can make formly complex big systems tasks mucher simpler. Take for example the construction of a game's world editor.

The world editor of a modern game engine is a very complex beast, but at its heart is a simple concept: it exposes a user interface to all of the game's data, as well as tools to manipulate and process that data, which crunch it into an optimized form that must be streamed from disk into the game's memory and thereafter parsed, decompressed, or what have you. Reflection allows the automated generation of GUI components from your code itself. Consider a simple example where you want to add dynamic light volumes to an engine. You may have something like this:

struct ConeLight {
HDRcolorRGB intensity_;
BoundedFloat(0,180) angleWidth_;
WorldPosition pos_;
Direction dir_;
TextureRef cookie_;
static HelpComment description_ = "A cone-shaped light with a projected texture."

The editor could then automatically connect a GUI for creating and manipulating ConeLights just based on analysis of the type. The presence of a WorldPosition member would allow it to be placed in the world, the Direction member would allow a rotational control, and the intensity would use an HDR color picker control. The BoundedFloat is actually a type constructor function, which sets custom min and max static members. The cookie_ member (a projected texture) would automatically have a texture locator control and would know about asset dependencies, and so on. Furthermore, custom annotations are possible through the static members. Complex data processing, compression, disk packing and storage, and so on could happen automatically, without having to write any custom code for each data type.

This isn't revolutionary, in fact our game editor and generic database system are based on similar principles. The difference is they are built on a complex, custom infrastructure that has to parse specially formatted C++ and lua code to generate everything. I imagine most big game editors have some similar custom reflection system. Its just a shame though, because it would be so much easier and more powerful if built into the language.

Just to show how powerful metaprogramming could be, lets go a step farther and tackle the potentially hairy task of a graphics pipeline, from art assets down to the GPU command buffer. For our purposes, art packages expose several special asset structures, namely geometry, textures, and shaders. Materials, segments, meshes and all the like are just custom structures built out of these core concepts. On the other hand, a GPU command buffer is typically built out of fundemental render calls which look something like this (again somewhat simplified):

error GPUDrawPrimitive(VertexShader* vshader, PixelShader* pshader, Primitive* prim, vector samplers, vector vconstants, vector pconstants);

Lets start with a simpler example, that of a 2D screenpass effect (which, these days, encompasses alot).

Since this hypothetical reflexive C language could also feature JIT compilation, it could function as our scripting language as well, the effect could be coded completely in the editor or art package if desired.

struct RainEffect : public FullScreenEffect {

function(RainPShader) pshader;

float4 RainPShader(RenderContext rcontext, Sampler(wrap) fallingRain, Sampler(wrap) surfaceRain, AnimFloat density, AnimFloat speed)
// ... do pixel shader stuf

// where the RenderContext is the typical global collection of stuff
struct RenderContext {
Sampler(clamp) Zbuffer;
Sampler(clamp) HDRframebuffer;
float curtime;
// etc ....

The 'function' keyword specifies a function object, much like a type object with the parameters as members. The function is statically bound to RainPshader in this example. The GUI can display the appropriate interface for this shader and it can be controlled from the editor by inspecting the parameters, including those of the function object. The base class FullScreenEffect has the quad geometry and the other glue stuff. The pixel shader itself would be written in this reflexive C language, with a straightforward metaprogram to actually convert that into HLSL/cg and compile as needed for the platform.

Now here is the interesting part: all the code required to actual render this effect on the GPU can be generated automatically from the parameter type information emedded in the RainPShader function object. The generation of the appropriate GPUDrawPrimitive function instance is thus just another metaprogram task, which uses reflection to pack all the samplers into the appropriate state, set the textures, pack all the float4s and floats into registers, and so on. For a screen effect, invoking this translator function automatically wouldn't be too much of a performance hit, but for lower level draw calls you'd want to instantiate (optimize) it offline for the particular platform.

I use that example because I actually created a very similar automatic draw call generator for 2D screen effects, but all done through templates. It ended up looking more like how cuda is implemented, and also allowed compilation of the code as HLSL or C++ for debugging. It was doable, but involved alot of ugly templates and macros. I built that system to simplify procedural surface operators for geometry image terrain.

But anyway, you get the idea now, and going from a screen effect you could then tackle 3D geometry and make a completely generic, data driven art pipeline, all based on reflective functions that parse data and translate or reorganize it. Some art pipelines are actually built on this principle already, but oh my wouldn't it be easier in a more advanced, reflective language.