Tuesday, July 28, 2009

Winning my own little battle against Cuda

I've won a nice little battle in my personal struggle with Cuda recently by hacking in dynamic texture memory updates.

Unfortunately for me and my little Cuda graphics prototypes, Cuda was designed for general (non-graphics) computation, and texture memory access was put in as something of an afterthought. Textures in Cuda are read-only: kernels cannot write to them (ok, yes, they added limited support for 2D texturing from pitch-linear memory, but that's useless), which is pretty much just absurd. Why? Because Cuda allows general memory writes! There is nothing sacred about texture memory, and the hardware certainly supports writing to it in a shader/kernel, and has since render-to-texture appeared in what, DirectX 7? But not in Cuda.

Cuda's conception of writing to texture memory involves writing to regular memory and then calling a Cuda function to perform a sanctioned copy from the temporary buffer into the texture. That's OK for a lot of little demos, but not for large-scale dynamic volumes. For an octree/voxel tracing app like mine, you basically fill up your GPU's memory with a huge volume texture and accessory octree data, broken up into chunks which can be fully managed by the GPU. You then need to be able to modify these chunks as the view changes or animation changes sections of the volume. Cuda would have you do this by double buffering the entire thing with a big scratch buffer and doing a copy from the linear scratchpad to the cache-tiled 3D texture every frame. That copy function, by the way, achieves a whopping 3 GB/s on my 8800. Useless.
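For reference, the sanctioned update path looks roughly like this (a sketch only; `volumeArray`, `UpdateVolumeKernel`, and the dimensions are my placeholder names, not anything from a real codebase):

```cpp
// Sketch of Cuda's sanctioned update path: the kernel writes to a
// linear scratch buffer, then the driver copies it into the tiled
// cudaArray behind the 3D texture.
cudaExtent extent = make_cudaExtent(dimX, dimY, dimZ);

// 1. Allocate a plain linear scratch volume and let a kernel fill it.
cudaPitchedPtr scratch;
cudaMalloc3D(&scratch, extent);
UpdateVolumeKernel<<<grid, block>>>(scratch /* ... */);

// 2. Driver-managed copy from linear scratch into the tiled array.
cudaMemcpy3DParms p = {0};
p.srcPtr   = scratch;
p.dstArray = volumeArray;          // the cudaArray bound to the texture
p.extent   = extent;
p.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&p);                  // this copy is the ~3 GB/s bottleneck
```

Every dynamic chunk pays for that second copy, which is why this path falls over for a volume that fills most of GPU memory.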

So, after considering various radical alternatives that would avoid doing what I really want to do (which is use native 3D trilinear filtering in a volume texture I can write to anywhere), I realized I should just wait and port to DX11 compute shaders, which will hopefully allow me to do the right thing (and should also allow access to DXTC volumes, which will probably be important).

In the meantime, I decided to hack my way around the Cuda API and write to my volume texture anyway. This isn't as bad as it sounds, because the GPU doesn't have any fancy write-protection page faults, so a custom kernel can write anywhere in memory. But you have to know what you're doing. The task was thus to figure out where exactly the driver would allocate my volume in GPU memory, and exactly how the GPU's tiled addressing scheme works.

I did this with a brute-force search. The drivers do extensive bounds checking even in release and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle illegitimate arbitrary GPU memory into legitimate allocations the driver would accept. With that in place I could snoop GPU memory: use Cuda's routines to copy from CPU linear memory into the tiled volume texture, then snoop the GPU memory to find out exactly where my magic byte ended up. Each probe reveals the mapping from one XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).
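The probe loop is conceptually simple. Here is a sketch of one iteration (`SnoopGpuMem`, `CopyHostToArray`, `FindByte`, and `RecordMapping` stand in for my own snooping and bookkeeping routines; none of them are part of the Cuda API):

```cpp
// One probe of the brute-force search: plant a single magic byte at a
// known linear (x,y,z) offset, push it through the sanctioned copy,
// then scan the snooped texture memory to see where it landed.
memset(hostBuf, 0, volBytes);
hostBuf[x + y * dimX + z * dimX * dimY] = MAGIC_BYTE;

// Sanctioned host -> cudaArray copy of the whole volume.
CopyHostToArray(volumeArray, hostBuf, dimX, dimY, dimZ);

// Scan a laundered read-back of the texture's raw GPU memory.
const unsigned char* snoop = SnoopGpuMem(arrayBase, volBytes);
unsigned int tiledAddr = FindByte(snoop, volBytes, MAGIC_BYTE);

RecordMapping(x, y, z, tiledAddr);   // one (x,y,z) -> tiled address pair
```

Repeat over enough (x,y,z) positions and the bit pattern of the swizzle falls out.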

For the strangely curious, here is my currently crappy function for the inverse mapping (from GPU texture address to 3D position):

inline int3 GuessInvTile(uint outaddr)
{
    int3 outpos = int3_(0, 0, 0);

    // bits 0-15: swizzle within a 64x32x32 chunk
    outpos.x |= ((outaddr>>0) & (16-1)) << 0;
    outpos.y |= ((outaddr>>4) & 1) << 0;
    outpos.x |= ((outaddr>>5) & 1) << 4;
    outpos.y |= ((outaddr>>6) & 1) << 1;
    outpos.x |= ((outaddr>>7) & 1) << 5;
    outpos.y |= ((outaddr>>8) & 1) << 2;
    outpos.y |= ((outaddr>>9) & 1) << 3;
    outpos.y |= ((outaddr>>10) & 1) << 4;
    outpos.z |= ((outaddr>>11) & 1) << 0;
    outpos.z |= ((outaddr>>12) & 1) << 1;
    outpos.z |= ((outaddr>>13) & 1) << 2;
    outpos.z |= ((outaddr>>14) & 1) << 3;
    outpos.z |= ((outaddr>>15) & 1) << 4;

    // bits 16+: chunk index (specific to this volume size)
    outpos.x |= ((outaddr>>16) & 1) << 6;
    outpos.x |= ((outaddr>>17) & 1) << 7;

    outpos.y |= ((outaddr>>18) & 1) << 5;
    outpos.y |= ((outaddr>>19) & 1) << 6;
    outpos.y |= ((outaddr>>20) & 1) << 7;

    outpos.z |= ((outaddr>>21) & 1) << 5;
    outpos.z |= ((outaddr>>22) & 1) << 6;
    outpos.z |= ((outaddr>>23) & 1) << 7;

    return outpos;
}
I'm sure the parts after the 15th output address bit are specific to the volume size and thus wrong as stated (they should be computed with division). So really the function does a custom swizzle within a 64x32x32 chunk, and then fills the volume with these chunks in a plain X, Y, Z linear order. It's curious that it tiles 16 over in X first, and then fills in a 64x32 2D tile before even starting on Z. This means writing in spans of 16 aligned to the X direction is most efficient for scatter, which is actually kind of annoying; a z-curve tiling would be more convenient. The X-Y interleaving is also a little strange: it means that general 3D fetches are equivalent to two 2D fetches in terms of bandwidth and cache.
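Inverting the low 16 bits gives the forward mapping, which is what you actually need for scattered writes. This is just the bit layout above turned around, valid only within a single 64x32x32 chunk:

```cpp
// Forward swizzle within one 64x32x32 chunk: position -> tiled address.
// Derived by inverting the low 16 bits of GuessInvTile; chunk-index
// bits above bit 15 are volume-size specific and not handled here.
static unsigned int TileAddr(int x, int y, int z)
{
    unsigned int addr = 0;
    addr |= (unsigned int)(x & 15);              // x bits 0-3 -> addr 0-3
    addr |= (unsigned int)(y & 1)        << 4;   // y bit 0    -> addr 4
    addr |= (unsigned int)((x >> 4) & 1) << 5;   // x bit 4    -> addr 5
    addr |= (unsigned int)((y >> 1) & 1) << 6;   // y bit 1    -> addr 6
    addr |= (unsigned int)((x >> 5) & 1) << 7;   // x bit 5    -> addr 7
    addr |= (unsigned int)((y >> 2) & 7) << 8;   // y bits 2-4 -> addr 8-10
    addr |= (unsigned int)(z & 31)       << 11;  // z bits 0-4 -> addr 11-15
    return addr;
}
```

A quick sanity check is that round-tripping any in-chunk position through TileAddr and then GuessInvTile gets you back where you started.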


Cyril Crassin said...

Hi Jake, congratulations on this very nice investigation! I am also very interested in being able to write into texture memory from a CUDA kernel (in fact for exactly the same application as yours, for my GigaVoxels stuff). Have you been able to find a predictable way of knowing the starting addresses of allocated 3D arrays?
In the same way, it should be possible to write directly into OpenGL textures, which would also be very interesting...

Jake Cannell said...

Hey Cyril, good work with the GigaVoxels research, very interesting stuff! I've been working on similar research in my limited free time. I wish I could have seen your SIGGRAPH presentation.

But anyway, getting the starting address is actually the easier part. The Cuda driver seems to have a simple linear memory mapping, and it allocates cudaArrays (textures) in the same space as regular large cudaAllocs. So basically, what I do is allocate a regular device memory chunk of the same size as the array (properly padded and whatnot), record the pointer where Cuda allocates it, and then free the memory and allocate the cudaArray. I have some verification functions that can then copy into it, read back the data, and verify it's where it's supposed to be.
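A minimal sketch of that trick (my names throughout; it assumes the driver reuses the just-freed linear address for an allocation of the same padded size, which held on my setup but is in no way guaranteed):

```cpp
// Guess where the driver will place the cudaArray: allocate a plain
// buffer of the same padded size, note its address, free it, then
// allocate the array into the hole it left behind.
size_t paddedBytes = PaddedVolumeSize(dimX, dimY, dimZ);  // my helper

void* probe = 0;
cudaMalloc(&probe, paddedBytes);        // 1. probe allocation
uintptr_t arrayBase = (uintptr_t)probe; //    record where it landed
cudaFree(probe);                        // 2. free it...

cudaArray* volumeArray = 0;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar1>();
cudaExtent extent = make_cudaExtent(dimX, dimY, dimZ);
cudaMalloc3DArray(&volumeArray, &desc, extent);  // ...array reuses the hole

// 3. Verify: copy a known pattern in through the sanctioned API and
//    snoop arrayBase to confirm the bytes actually show up there.
```

The verification step is essential; if the driver places the array elsewhere, raw writes to arrayBase will corrupt whatever lives there instead.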

There are some caveats: textures are always allocated with large pages and thus come from the large-page heap. There is another memory area for smaller allocations, so this trick probably doesn't work for smaller textures. Also, you have to make sure to pad your texture to the chunk size (at least in the X & Y dimensions). I could post some more example code if you're interested.
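For the padding, the arithmetic is just rounding each dimension up to the chunk granularity I found above (64 in X, 32 in Y; RoundUp is my own helper, not a Cuda call):

```cpp
// Round volume dimensions up to the observed 64x32x32 chunk granularity.
static int RoundUp(int v, int m) { return (v + m - 1) / m * m; }

// paddedX = RoundUp(dimX, 64);
// paddedY = RoundUp(dimY, 32);
```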

One thing I'd really like to do is be able to use DXTC textures, because the memory cost is already a problem for voxel bricks if you want to store alpha/coverage, albedo, normals, specular, cached illumination, etc.

Sadly DXTC isn't supported in Cuda and I don't know if Nvidia will ever add it to Cuda, but maybe OpenCL or DirectX Compute will allow DXTC textures.

Cyril Crassin said...

Hey, thanks for the kind words on GigaVoxels; too bad you missed the talk, but you can find the slides here: Slides.
Thanks for the starting-address trick. I am also definitely interested in some more example code!
Like you, I would be very interested in being able to access DXTC textures in CUDA. It's a request I have already made to NVIDIA. As you mentioned, it must be only a software problem.