Sunday, July 12, 2009

Understanding the Effeciency of Ray Traversal on GPUs

I just found this nice little paper by Timo Alia linked on Atom, Timothy Farrar's blog, who incidentally found and plugged my blog recently (how nice).

They have a great analysis of several variations of traversal methods using a standard BVH/triangle ray intersector code, along with simulator results for some potential new instructions that could enable dynamic scheduling. They find that in general, traversal effeciency is limited mainly by SIMD effeciency or branch divergence, not memory coherency - something I've discovered is also still quite true for voxel tracers.

They have a relatively simple scheme to pull blocks of threads from a global pool using atomic instructions. I had thought of this but believed that my 8800 GT didn't support the atomics, and I would have to wait until i upgraded to a GT200 type card. I was mistaken though and its the 8800GTX which is only cuda 1.0, my 8800GT is cuda 1.1 so should be good to go with atomics.

I have implemented a simple scheduling idea based on a deterministic up-front allocation of pixels to threads. Like in their paper, I allocate just enough threads/blocks to keep the cores occupied, and then divy up up the pixel-rays amongst the threads. But instead of doing this dynamically, I simply have each thread loop through pixel-rays according to a 2d tiling scheme. This got maybe a 25% improvement or so, but in their system they were seeing closer to a 90-100% improvement, so I could probably improve this further. However, they are scheduling entire blocks of (i think 32x3) pixel-rays at once, while I had each thread loop through pixel-rays independently. I thought having each thread immediately move on to a new pixel-ray would be better as it results in less empty SIMD lanes, but it also causes another point of divergence in the inner loop for the ray initialization step. Right now I handle that by amortizing it - simply doing 3 or so ray iterations and then an iffed ray init, but perhaps their block scheduling approach is even faster. Perhaps even worse, my scheme causes threads to 'jump around', scattering the thread to pixel mapping over time. I had hoped that the variable ray termination times would roughly amortize out, but maybe not. Allowing threads to jump around like that also starts to scatter the memory accesses more.

The particular performance problem which I see involves high glancing angle rays that skirt the edge of a long surface, such as a flat ground plane. For a small set of pixel-rays, there is a disproportionately huge number of voxel intersections, resulting in a few long problem rays that take forever and stall blocks. My thread looping plan gets around that, but at the expensive of decohering the rays in a warp. Ideally you'd want to adaptively reconfigure the thread-ray mapping to prevent low occupancy blocks from slowing you down while maintaining coherent warps. I'm suprised at how close they got to ideal simulated performance, and I hope to try their atomic-scheduling approach soon and compare.




10 comments:

TimothyFarrar said...

Jake, with edge on surfaces and things with incredible depth complexity (such as field of grass or grove of trees), have you thought about using a hybrid solution of raycasting into instanced objects in vector friendly tiles in combination with a tile cache reprojected via rasterization? Seems to me that some method which amortizes search cost taking advantage of temporal consistency will end up as the optimal solution.

Jake Cannell said...

Yes taking advantage of temporal coherence is definitely important to making this all really fast, but I'm not quite sure what you mean by a 'tile cache reprojected via rasterization'.

I've thought about reprojection a good deal, but haven't implemented it yet. I suspect that OTOY uses a reprojection system. Some of the screenshots exhibit a strange artifact which is best explained by a 2D mesh based reprojection system.

The edge on surface problem is really just the problem that rays have highly variable costs and the GPU scheduling is naive.


I'd like to avoid tracing into an instanced structure - I prefer keeping a unique cache. I think there is enough memory now to store all of the voxels you have time to visit in a frame. (I mean, at 1 byte per voxel, 1 Gig of memory and 1 million pixels gives 1,000 unique voxel hits per pixel - more than you need or can even hit)

TimothyFarrar said...

I just posted a more detailed description of the idea of a "tile cache reprojected via rasterization" on my blog.

BTW, have you thought about skipping storing the distance field for sphere step tracing into voxel bricks, and just doing a standard iso-surface raycast into a RGBA brick volume (stopping when accumulated alpha coverage reaches a set limit)?

Tomat said...

Do you meen "distance field for sphere step tracing into voxel bricks" - caclulating distance field for brick offline?

TimothyFarrar said...

"caclulating distance field for brick offline?"

Yes, the distance field would be calculated offline and stored along with the color data for the static geometry stored in the voxel bricks. It is similar to storing the cone size in relaxed cone step mapping.

Jake Cannell said...

Tim - yes actually I have done some experiments with distance fields. Its an interesting alternative to standard octree/brick casting ala GigaVoxels. It is a more derived structure, which has the disadvantage that it has some extra cost if you want to generate it dynamically. (which I think is important) I have also tried an implementation closer to the octree/brick tracing, and my current approach is faster, but I also don't think I have a very efficient octree/brick implementation yet. Time will tell. If it does end up being faster, then it might be worthy of a paper or article. Otherwise I'll just have to switch to regular octree/bricks ;)

Tomat said...

Will wait article on distance fields (dynamic?) calculation.

Timothy Farrar said...

BTW, I've found on of the OTOY artifact images again,

http://www.tgdaily.com/images/slideshows/200806271/ati_rt_06.jpg

It could very well be that they are building a mesh out of a prior frame-buffer. Prior adjacent pixels with large z distances might explain the sheets in that shot.

Jake Cannell said...

Yes, thats the shot. Its strange though - must be a bug, because it would be quite easy to detect the error pixels based on edge length and retrace.

In some interview he talks about the tech and mentions they just use use pixel shaders and the tessellator, which is odd if its a voxel tracing engine - but its a great fit for mesh reprojection - allowing you to easily adapt the triangle density to the screen edges to make it really fast.

Timothy Farrar said...

Interesting, I hadn't thought about tessellation for the reprojection mesh, sure does make a lot of sense though.

Followers