They have a great analysis of several variations of traversal methods using a standard BVH/triangle ray intersector code, along with simulator results for some potential new instructions that could enable dynamic scheduling. They find that in general, traversal effeciency is limited mainly by SIMD effeciency or branch divergence, not memory coherency - something I've discovered is also still quite true for voxel tracers.
They have a relatively simple scheme to pull blocks of threads from a global pool using atomic instructions. I had thought of this but believed that my 8800 GT didn't support the atomics, and I would have to wait until i upgraded to a GT200 type card. I was mistaken though and its the 8800GTX which is only cuda 1.0, my 8800GT is cuda 1.1 so should be good to go with atomics.
I have implemented a simple scheduling idea based on a deterministic up-front allocation of pixels to threads. Like in their paper, I allocate just enough threads/blocks to keep the cores occupied, and then divy up up the pixel-rays amongst the threads. But instead of doing this dynamically, I simply have each thread loop through pixel-rays according to a 2d tiling scheme. This got maybe a 25% improvement or so, but in their system they were seeing closer to a 90-100% improvement, so I could probably improve this further. However, they are scheduling entire blocks of (i think 32x3) pixel-rays at once, while I had each thread loop through pixel-rays independently. I thought having each thread immediately move on to a new pixel-ray would be better as it results in less empty SIMD lanes, but it also causes another point of divergence in the inner loop for the ray initialization step. Right now I handle that by amortizing it - simply doing 3 or so ray iterations and then an iffed ray init, but perhaps their block scheduling approach is even faster. Perhaps even worse, my scheme causes threads to 'jump around', scattering the thread to pixel mapping over time. I had hoped that the variable ray termination times would roughly amortize out, but maybe not. Allowing threads to jump around like that also starts to scatter the memory accesses more.
The particular performance problem which I see involves high glancing angle rays that skirt the edge of a long surface, such as a flat ground plane. For a small set of pixel-rays, there is a disproportionately huge number of voxel intersections, resulting in a few long problem rays that take forever and stall blocks. My thread looping plan gets around that, but at the expensive of decohering the rays in a warp. Ideally you'd want to adaptively reconfigure the thread-ray mapping to prevent low occupancy blocks from slowing you down while maintaining coherent warps. I'm suprised at how close they got to ideal simulated performance, and I hope to try their atomic-scheduling approach soon and compare.