How much memory does a unique voxelization of a given scene cost? Considering anisotropic filtering and translucency, a pixel will in general be covered by more than one voxel. An upper bound is rather straightforward to calculate. For a single viewport with a limited nearZ and farZ range, there are a finite number of pixel-radius voxels extending out to fill the projection volume. The depth dimension of this volume is given by viewportDim * log2(farZ/nearZ). For a 1024x1024 viewport, a nearZ of 1 meter and a view distance of 16 kilometers, this works out to log2(16000)*1024, or about 14,000 voxels per pixel, or roughly 14 billion voxels for the frustum's projection volume, and around 100 billion voxels for the entire spherical viewing volume. This represents the maximum possible data size of any unique scene when sampled at the proper pixel sampling rate with unlimited translucency and AA precision.
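As a quick sanity check on those numbers, here is a small Python sketch of the bound. The function names are mine, and the factor of six for the full sphere is an assumption (six 90-degree cube-face frusta); the text above only gives the rounded totals.

```python
import math

def depth_voxels_per_pixel(viewport_dim, near_z, far_z):
    """Pixel-sized voxels stacked along one pixel's ray between near_z and far_z."""
    return viewport_dim * math.log2(far_z / near_z)

def frustum_voxel_bound(viewport_dim, near_z, far_z):
    """Upper bound on voxels filling the projection volume of a square viewport."""
    per_pixel = depth_voxels_per_pixel(viewport_dim, near_z, far_z)
    return viewport_dim * viewport_dim * per_pixel

per_pixel = depth_voxels_per_pixel(1024, 1.0, 16000.0)
frustum = frustum_voxel_bound(1024, 1.0, 16000.0)
# Assumption: the sphere is covered by six 90-degree frusta (cube faces).
sphere = 6 * frustum

print(f"{per_pixel:,.0f} voxels per pixel")
print(f"{frustum:.2e} voxels per frustum")
print(f"{sphere:.2e} voxels for the full sphere")
```

This gives about 14,300 voxels per pixel, roughly 15 billion for the frustum and roughly 90 billion for the sphere, in line with the rounded figures above.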
Now obviously, this is the theoretical worst case, which is interesting to know, but wouldn't come up in reality. A straightforward, tighter bound can be reached if we use discrete multi-sampling for the AA and anisotropic filtering, which means that each sub-sample hits just one voxel, and we only need to store the visible (closest) voxels. In this case, considering occlusion, the voxel cost is dramatically lower, being just ScreenArea*AAFactor. For an average of 10 sub-samples and the same viewport setup as above, this is just around 100 million voxels for the entire viewing volume. Anisotropic filtering quickly hits diminishing returns by around 16x maximum samples per pixel, and most pixels need much less, so a 10x average is quite reasonable.
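The same back-of-the-envelope calculation for the occlusion-limited case, as a minimal sketch. The `num_views` parameter is my own addition, and using six views for the whole sphere is an assumption rather than something stated above.

```python
def visible_voxel_bound(width, height, aa_factor, num_views=1):
    """ScreenArea * AAFactor, optionally repeated over several views."""
    return width * height * aa_factor * num_views

one_view = visible_voxel_bound(1024, 1024, 10)
# Assumption: six 90-degree views (cube faces) cover the full viewing sphere.
all_views = visible_voxel_bound(1024, 1024, 10, num_views=6)

print(f"{one_view:,} voxels for one viewport")
print(f"{all_views:,} voxels for six viewports")
```

Under that assumption the total comes to about 63 million voxels, the same order of magnitude as the figure quoted above and roughly a thousand times smaller than the unoccluded bound.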
For translucent voxels, a 10x coverage multiplier is quite generous, as the contribution of high frequencies decreases with decreasing opacity (which current game rasterizers exploit by rendering translucent particles at lower resolution). This would mean that voxels at around 10% opacity would get full pixel resolution, and voxels at about 1.5% opacity or lower would get roughly half-pixel resolution.
The octree subdivision can be guided by the z occlusion information. Ideally we would update a node's visibility during the ray traversal, but because scattered memory writes are inefficient, it will probably be better to write out some form of z-buffer and then back-project the nodes to determine visibility.
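To make the back-projection step concrete, here is a rough sketch of testing a single node against a z-buffer. Everything about the setup is assumed for illustration: a pinhole camera looking down +z, a focal length given in pixels, a z-buffer stored as a list of rows, and a node approximated by a bounding sphere and tested only at its center pixel. A real version would test the node's whole screen footprint.

```python
import math

def node_is_visible(center, radius, z_buffer, focal_px):
    """Back-project a node into the z-buffer and compare depths.

    center:   (x, y, z) in camera space, z pointing forward (assumed layout).
    radius:   bounding-sphere radius of the node.
    z_buffer: z_buffer[row][col] holds the nearest traced depth per pixel.
    focal_px: focal length in pixels (hypothetical camera parameter).
    """
    x, y, z = center
    if z <= 0.0:
        return False  # behind the camera
    height = len(z_buffer)
    width = len(z_buffer[0])
    col = math.floor(width / 2 + focal_px * x / z)
    row = math.floor(height / 2 - focal_px * y / z)
    if not (0 <= row < height and 0 <= col < width):
        return False  # projects outside the viewport
    # Visible if the node's nearest point is not behind the stored depth.
    return z - radius <= z_buffer[row][col]

z_buffer = [[10.0] * 4 for _ in range(4)]
print(node_is_visible((0.0, 0.0, 5.0), 1.0, z_buffer, focal_px=2.0))
print(node_is_visible((0.0, 0.0, 20.0), 1.0, z_buffer, focal_px=2.0))
```

The first node sits in front of the stored depth and passes; the second is fully behind it and would be a candidate for collapsing rather than subdividing.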
A brute force multi-sampling approach sounds expensive, but would still be feasible on future hardware, as Nvidia's recent SIGGRAPH paper "Alternative Rendering Pipelines with Nvidia Cuda" demonstrates by implementing a Reyes micropolygon rasterizer in CUDA. With enough multi-samples, you don't even need bilinear filtering - simple point sampling will suffice. But for voxel tracing, discrete multi-sampling isn't all that efficient compared to the more obvious and desirable path, which is simply to accumulate coverage/alpha directly while tracing. This is by far the fastest route to high quality AA & filtering. However, it does pose a problem for the visibility determination mentioned above - without a discrete z-buffer, you don't have an obvious way of calculating voxel visibility for subdivision.
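For reference, accumulating coverage/alpha while tracing amounts to ordinary front-to-back compositing along the ray. A minimal sketch, assuming each voxel hit yields a scalar color and an alpha; the early-out cutoff of 0.99 is an arbitrary choice of mine:

```python
def accumulate_ray(samples, opacity_cutoff=0.99):
    """Front-to-back alpha accumulation.

    samples: iterable of (color, alpha) pairs ordered near to far.
    Returns (accumulated_color, accumulated_alpha).
    """
    color = 0.0
    alpha = 0.0
    for sample_color, sample_alpha in samples:
        # Each sample only contributes through what is still unoccluded.
        weight = (1.0 - alpha) * sample_alpha
        color += weight * sample_color
        alpha += weight
        if alpha >= opacity_cutoff:
            break  # ray is effectively opaque, stop tracing
    return color, alpha

print(accumulate_ray([(1.0, 0.5), (0.0, 0.5)]))  # (0.5, 0.75)
```

Note that nothing in this loop records a depth, which is exactly the visibility problem described above.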
One approach would be to use an alpha-to-coverage scheme, which would still be faster than true multi-sampled tracing. This would require updating a number of AA z samples inside the tracing inner loop, which is still much more work than just alpha blending. A more interesting alternative is to store an explicit depth function. One scheme would be to store a series of depths representing equal alpha intervals. Or better yet, store arbitrary piecewise segments of the depth/opacity function. In the hierarchical tracing scheme, these could be written out and stored at a lower resolution mip level, such as the quarter-res level, and then be used both to accelerate tracing for the finer levels and for determining octree node visibility. During the subdivision step, nodes would project to the screen and sample their visibility from the appropriate depth interval in this structure.
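Here is a sketch of the simpler of the two schemes, storing the depths at which accumulated opacity crosses equal alpha intervals for one pixel. The function names, the choice of four intervals, and the step-function lookup are all assumptions for illustration; the piecewise-segment variant would store (depth, opacity) breakpoints instead.

```python
import bisect

def build_depth_function(samples, num_intervals=4):
    """Record the depth at which accumulated opacity first reaches
    1/N, 2/N, ..., N/N along one ray.

    samples: (depth, alpha) pairs ordered near to far.
    Returns a non-decreasing list of at most num_intervals depths.
    """
    depths = []
    accumulated = 0.0
    for depth, alpha in samples:
        accumulated += (1.0 - accumulated) * alpha
        # One sample may cross several thresholds at once.
        while (len(depths) < num_intervals
               and accumulated >= (len(depths) + 1) / num_intervals):
            depths.append(depth)
    return depths

def visibility_at(depths, num_intervals, z):
    """Step-function estimate of how much of the pixel is still
    unoccluded for a node at depth z."""
    crossed = bisect.bisect_left(depths, z)  # thresholds crossed in front of z
    return 1.0 - crossed / num_intervals

depths = build_depth_function([(10.0, 0.5), (20.0, 0.5), (30.0, 1.0)])
print(depths)                            # [10.0, 10.0, 20.0, 30.0]
print(visibility_at(depths, 4, 15.0))    # 0.5
```

During subdivision, a node would project to the pixel, look up `visibility_at` for its depth, and feed that fraction into its score.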
I think the impact of anisotropy and translucency can be limited or capped, just as in the discrete z-buffer case, by appropriate node reweighting based on occlusion or opacity contribution. A node which finds that it is only 25% visible would only get slightly penalized, but a 5% visible node more heavily so, effectively emulating a maximum voxel/pixel limit, after which resolution is lost (which is fine, as the less a node contributes, the less important the loss of its high frequency content). Or more precisely, node scores would decrease in proportion to their screen coverage once it falls below the threshold 1/AA, where AA is the super-sampling limit you want to emulate.
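The "more precise" version of that rule is easy to write down. A minimal sketch, where `base_score` stands in for whatever subdivision priority the node would otherwise get (a hypothetical quantity, not defined above) and AA defaults to 10:

```python
def visibility_weight(visible_fraction, aa_limit):
    """Full weight at or above 1/AA coverage, proportional falloff below it."""
    threshold = 1.0 / aa_limit
    if visible_fraction >= threshold:
        return 1.0
    return visible_fraction / threshold

def node_score(base_score, visible_fraction, aa_limit=10):
    """Scale a node's subdivision priority by its visibility weight."""
    return base_score * visibility_weight(visible_fraction, aa_limit)

for fraction in (0.25, 0.10, 0.05, 0.01):
    print(fraction, visibility_weight(fraction, aa_limit=10))
```

With AA = 10, nodes that are 10% visible or more keep their full score, a 5% visible node gets half, and a 1% visible node gets a tenth. A softer curve above the threshold would reproduce the "slightly penalized at 25%" behaviour if that is preferred.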