Edit: current, very tentative profile data for my demo is as follows:
- 1.9 sec per 16 million points voxelized at octree max depth 11. (4 or 6 CPU threads, I forget which).
- 480p software raycast of entire depth-10 octree in 8ms (6 CPU threads).
Hey there ye people,
I’ve been making an octree demo. It’s not out yet, but it will be. I’ve studied octrees far too extensively for my own good, so allow me to share with you some of the knowledge that has allowed me to optimize this wonderful data structure.
Now, for the subject of this rant: why I think octrees might (MIGHT!) be marginally better than kd-trees, at least for sparse voxels (aka brick maps, which are very old in the film industry, but mostly used for caching). They might (MIGHT!) also be marginally more cache coherent, for the following reason:
Say you have a contiguous, unfragmentable allocator, like a pool allocator of fixed-size structures. Now say each of those structures is an entire subdivided node (that is, 8 nodes per allocation). Well, now you have better per-level spatial coherence, and thus better cache coherence for /parametric/ traversal (that is, testing every adjacent leaf node entered for intersection with an object). This is easy to see if you draw eight nodes out and conceive of a ray that passes through three of them. Those three nodes will all be contiguous in memory; as such, they’re all likely to be in L1 (this is especially true if you’re using an unsigned short to store offsets, as a kind SE user pointed out is feasible and has been done in research).
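To make the layout concrete, here’s a minimal sketch of that pooled eight-children-per-allocation idea. The names (`OctNode`, `OctPool`, `child_block`) are my own illustrations, not the demo’s actual structure; the point is just that all eight siblings live in one contiguous block, addressed by a 16-bit offset as described above:

```c
#include <stdint.h>

/* Hypothetical node layout: every allocation is a whole sibling set of 8,
   referenced by an unsigned short index into a fixed-size pool. */
typedef struct OctNode {
    uint16_t child_block; /* pool index of this node's 8-child block; 0 = leaf */
    uint16_t payload;     /* placeholder for per-node data */
} OctNode;

typedef struct OctPool {
    OctNode blocks[4096][8]; /* pool of 8-node blocks, addressable by uint16_t */
    uint16_t next_free;      /* bump index; freed blocks would go on a free list */
} OctPool;

/* Siblings are contiguous, so a ray that crosses three adjacent children
   touches a handful of adjacent bytes rather than three scattered lines. */
static inline OctNode *child(OctPool *pool, const OctNode *node, int i)
{
    return &pool->blocks[node->child_block][i];
}
```

With 4-byte nodes, a whole sibling block is 32 bytes — half a typical 64-byte cache line — which is where the L1 argument comes from.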
Now, each node of a kd-tree has only two children, far fewer than the octree’s eight. So, if you use a pool-allocated structure as above, in which nodes are deleted and subdivided pretty quickly, you get this funny situation where two spatially adjacent nodes end up a good distance apart in memory; not very good for raycasting. In particular, this happens when the parent node hasn’t been re-allocated recently but the child has. Again, this is easier to see if you draw it out.
The caveat here is that traversing a kd-tree to add, search for, or delete nodes is theoretically much more cache-efficient than pulling several L1 cache lines per sublevel of an octree, many of which will be extraneous. I haven’t profiled this, so I can’t say for sure; all I can tell you is that the bounds test for searching an octree can be written without a stack and looks something like this (pardon the AT&T syntax; just remember, the destination register is always last):
"movaps %1, %%xmm7\n\t"
"cmpleps %2, %%xmm7\n\t"
"xorps %%xmm15, %%xmm15\n\t"
"cmpeqps %%xmm14, %%xmm14\n\t"
"xorps %%xmm15, %%xmm14\n\t"
"movaps %2, %%xmm8\n\t"
"andnps %%xmm15, %%xmm8\n\t"
"andnps %%xmm14, %%xmm7\n\t"
"dpps $0xFF, %%xmm8, %%xmm7\n\t"
"cvttss2si %%xmm7, %0\n\t"
What we’re doing here is effectively “index = (x < center.x) + (y < center.y)*2 + (z < center.z)*4”, only we’ve turned it into a 3-component dot product to save instructions. The rest of the traversal code isn’t much more than this.
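For readers who’d rather not squint at AT&T syntax, the same branchless child-index computation can be sketched with SSE intrinsics in C. This is a sketch of the idea, not the demo’s exact code — it uses a sign-bit movemask rather than the dot-product variant in the asm above, but it computes the same formula:

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Branchless octant index, equivalent to
   index = (x < center.x) + (y < center.y)*2 + (z < center.z)*4.
   _mm_cmplt_ps sets each lane to all-ones where point < center;
   _mm_movemask_ps packs the lane sign bits into the low bits of an int,
   and the & 7 discards the unused fourth lane. */
static inline int octant_index(__m128 point, __m128 center)
{
    return _mm_movemask_ps(_mm_cmplt_ps(point, center)) & 7;
}
```

Note that `_mm_set_ps` takes lanes in (w, z, y, x) order, so a point (x, y, z) is built as `_mm_set_ps(0.0f, z, y, x)`.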
The point is, kd-tree construction is probably just as fast if not faster, but it’s not something to cry over when we’re talking about inherently static data structures. In fact, with this traversal code and my optimized structure, I’m able to voxelize 16 million points into a 2048x2048x2048 grid in well under 2 seconds. Once you get to be that fast in an offline process, close enough is close enough.
At this point, you’re probably yelling through your screen. I can hear it now: “BUT KEVIN!!!1 THERE ARE FEWER INTERSECTION TESTS NECESSARY FOR THE KD-TREE THAN FOR THE OCTREE! KD-TREES HAVE BEEN SHOWN TO BE BETTER BY REAL RESEARCH PAPERS!” But then you’d be missing my point, because I’m not talking about the theoretical performance of data structures on hardware without a cache. Yes, a naive implementation of a kd-tree for triangular meshes is faster than an octree…usually. But in my experience, octrees are much more traversable in practice than people give them credit for; the implementation is everything.
So I think for voxels they could be faster than kd-trees in many cases, particularly with more and larger nodes, and particularly when rapidly streaming in new data and modifying LODs. I don’t claim to be right about this, as I haven’t even seen parametric kd-tree traversal code except the sloppy stuff on Stack Overflow with branches and jumps everywhere. My octree code uses about two easily mispredicted branches per ray, and you could likely achieve similar results with a kd-tree if you vectorized the entire traversal math and hand-optimized much of it, as I did.
I guess my point is, it’s worth giving real thought to this stuff.