the kd-tree vs. octree debate

Edit: current, very tentative profile data for my demo is as follows:

  • 1.9 sec per 16 million points voxelized at octree max depth 11. (4 or 6 CPU threads, I forget which).
  • 480p software raycast of entire depth-10 octree in 8ms (6 CPU threads).

Hey there ye people,

I’ve been making an octree demo.  It’s not out yet, but it will be.  I’ve studied octrees far too extensively for my own good, so allow me to share with you some of the knowledge that has allowed me to optimize this wonderful data structure.

See here:
for the subject of this rant.  Namely, why I think they might (MIGHT!) be marginally better than KD-trees, at least for sparse voxels (aka brick maps, which are very old in the film industry, but mostly used for caching).  They might (MIGHT!) also be marginally more cache coherent, for the following reason:

Say you have a contiguous, unfragmentable allocator like a pool allocator of fixed size structures.   Now say each of those structures is an entire node subdivided (that is, 8 nodes per allocation).  Well, now you have better per-level spatial coherence, and thus better cache coherence for /parametric/ traversal (that is, testing every adjacent leaf node entered for intersection with an object).  This is easy to see if you draw eight nodes out and conceive of a ray that passes through three of them.  Those three nodes will all be contiguous in memory; as such, they’re all likely to be in L1 (this is especially true if you’re using an unsigned short to store offsets, as a kind SE user pointed out is feasible and has been done in research).  

Now, each node of a kd-tree only has k subnodes (k is usually a lot less than 8).  So, if using a pool-allocated structure, as above, wherein nodes are deleted and subdivided pretty quickly, you get this funny situation where two spatially adjacent nodes have been placed a good distance from each other in the cache; not very good for raycasting.  Particularly, this happens if the parent node hasn’t been re-allocated recently but the child has.  Again, this is easier to see if you draw it out.

The caveat here is that traversing a kd-tree to add, search for, or delete nodes is theoretically much more cache-efficient than jumping down L1 several lines for each node of each sublevel, many of which will be extraneous.  I haven’t profiled this, so I can’t say for sure; all I can tell you is that the bounds test for searching an octree can be written without the stack and looks something like this (pardon the AT&T syntax; just remember, destination register is always last):

                                           asm volatile(

"movaps %1, %%xmm7\n\t"

"cmpleps %2, %%xmm7\n\t"

"xorps %%xmm15, %%xmm15\n\t"

"cmpeqps %%xmm14, %%xmm14\n\t"

"xorps %%xmm15, %%xmm14\n\t"

"movaps %2, %%xmm8\n\t"

"andnps %%xmm15, %%xmm8\n\t\n\t"

"andnps %%xmm14, %%xmm7\n\t"

"dpps $0xFF, %%xmm8, %%xmm7\n\t"

"cvttss2si %%xmm7, %0\n\t"

: “=r”(index)




 What we’re doing here is effectively “index=(x<center.x)+(y<center.y)*2+(z<center.z)*4”, only we’ve turned it into a 3-component dot product to save instructions.  The code for the entire traversal isn’t much else.  

The point is, kd-tree construction is probably just as fast if not faster, but it’s not something to cry over when we’re talking about inherently static data structures.  In fact, with this traversal code and my optimized structure, I’m able to voxelize 16 million points into a 2048x2048x2048 grid in well under 2 seconds.  Once you get to be that fast in an offline process, close enough is close enough.

At this point, you’re probably yelling through your screen.  I can hear it now: “BUT KEVIN!!!1 THERE ARE FEWER INTERSECTION TESTS NECESSARY FOR THE KD-TREE THAN FOR THE OCTREE!  KD-TREES HAVE SHOWN TO BE BETTER BY REAL RESEARCH PAPERS!”  But then you’d be missing my point, because I’m not talking about the theoretical performance of data structures on hardware without a cache.  Yes, a naive implementation of a kd-tree for triangular meshes is faster than an octree…usually.  But I have found in my experience that octrees are much more traversable in practice than people give them credit for; the implementation is everything.    
So I think for voxels they could be faster than kd-trees in many cases, particularly with more and larger nodes, and particularly when rapidly streaming in new data and modifying lods.   I don’t claim to be right about this, as I haven’t even seen parametric kd-tree traversal code except the sloppy stuff on stackoverflow with branches and jumps everywhere.  My octree code uses about 2 easily-mispredicted branches per ray, and it’s likely you could achieve similar results with a kd-tree if you vectorized the entire traversal math and hand-optimized much of it, as I did.  
I guess my point is, it’s worth giving real thought to this stuff.

code standards (pffffft!)

Another post about something Mike Acton said or did.

But really, you should take a look at the insomniac games coding standards.  They are available online.  They are awesome and not what the spurt of giggling in the title was about.

…That would be my reaction to my /own/ coding standards :D.

I mean, don’t get me wrong, my code is fast as hell and I could never give that up.  Not when I have a third of an ms to raycast through an entire octree in software.  But…such is life, you have to make sacrifices.  In my case, I take all concept of readability and sacrifice it before the accursed anti-throne of the crimson Satan voodoo doll thing….

Point is, I have to rewrite everything in the comments so people can read it.  But here are some more performance-based, much less readability-based code standards (for whatever they’re worth.  I use them whenever I work on my own projects.):

1.  Never use virtual (not even in destructors; see casting rules below). Avoid function pointers at all costs.  Exception: member functions that get passed as parameters to CreateThread.
2. Never overload operators.
3. Never use stl.
4.  Never use getsetters; they add pointless instructions in the best case (inline) and in the worst case they shuffle the stack (not inlined; see below).  Instead use classes like structs with a few inlined utility members, because IMO structs work in Linux, Git, Blender, Gimp, and so on and encourage better design.  Also, using _thiscall to access everything is patently nuts.
4.  Never explicitly “inline” on class members.  The compiler can do that for you.
5.  Never rely on compiler autovectorization for critical optimizations.  If it’s that important, do it by hand to ensure nothing (or almost nothing) is moved through memory.
6.  Don’t dynamically allocate anything less than 1kb.
7.  Use a statically-allocated, lock-free smart array class with capped size for all buffers in the heap unless you don’t know the size to the nearest 65536 bytes. 
8.  Never use malloc(), free(), new[], or delete[]; always use templated pool allocators with pools allocated at link time.
9.  Never cast between two types unless you know the size and binary layout of both.  
10.  Move all the most commonly accessed data to the front of the class.  For classes with more than 512b of data, create a struct that aliases to that type and contains only the smaller, more frequently accessed header data, so as to not pollute the cache e.g. with statically typed arrays.
11.  Don’t use pointers to single elements unless an algorithm requires it, and that algorithm is known to be faster on all supported hardware than its linear alternative.
12.  Don’t use pointers within classes (exception: statically allocated arrays when you don’t want to cache the whole thing, but ABSOLUTELY must cache the entire object). 
13.  Use static-length character arrays instead of const char*, and then document all the maximum string lengths.
14.   Don’t print stuff to screen, save it to the heap in some easily readable hex format, then access it in the debugger.
15.  Don’t use system library functions in code that needs realtime execution (exception: a few OpenGL routines).  Code which gets executed more than a few times per frame should not depend on ANYTHING, stdio and stdlib included (exception: OpenGL state switches).    Use thread pooling and replace fread() with memory-mapped streaming whenever possible.  From c standard lib use only strcpy, strlen and strcmp and re-implement them since they’re so elementary, along with functions like exp() and sqrt().
16.  Always increment with ++x instead of x++.
17.  On windows, code specifically for 32-bit architectures, since this allows more efficient packing of structs with pointers in them (and since this also allows inline assembly).
18.  Don’t add third-party software as a link-time dependency unless you’re making an offline tool.
19.  Don’t add proprietary software as any kind of dependency.
20.  Don’t use default assert(), use something like the following code, inspecting the resulting regdump in your debugger:

static inline __fastcall void __NZassert(unsigned int condition, const char* _fileline){

__asm volatile{

test ecx; //MSVC fastcall puts assert condition in this reg

jz 617373657274696f6e206661696c6564h; //if zero, jump to “assertion failed” in hex ;).  



21.  Don’t use -> with large structs that get dereferenced multiple times in the same function.  Dereference the entire object, read it, and then write it back after the necessary modifications are made.

22.  For small arrays and structs, instead of copying variables in a loop, cast to a char* and use memcpy.  For big arrays, if possible, align to 64 or 128 bytes so memcpy will autovectorize (and so loads won’t fragment the cache).

23.  Use globals in lieu of singleton to save a few nasty treks past L2.
24.  Avoid dynamic branching in critical code unless you save a few loads or more than 600 cycles.  Invalidating the I$ is nasty business, and AFAIK branch predictors haven’t gotten any more magical!  Instead, multiply by a lot of bools.
25.  Don’t use int when short will work.  Don’t use short when char will work.  Period.  Because passing big stuff by reference pushes big stuff onto the stack (== loads), and putting big stuff in registers pushes other stuff onto the stack, and __cdecl is the worst of both worlds.
26.  Always build in Release mode.

All right, that’s enough for now! You get the idea: I’m not just being a cowboy here, I’m following some rules.  And you may hate them, and you may think they’re a bit nutty (for a larger project with multiple people, they probably are!).   But they’re rules nonetheless, very strict ones as a matter of fact, and you can’t argue with the raw numbers (from cachegrind, profiling, etc :D).  

Special thanks to twitter people @TomHammersley, @MikeActon, and @okonomiyonda for reminding me to go learn how computers work.


(Request) How to learn #gamedev

Okay, okay.  So someone asked Mike Acton on twitter about resources to learn game programming.  I’m pretty up-to-date in this subject, so I’ll try to answer it as best I can.  Keep in mind, though, I’m probably less than half as knowledgeable about this stuff as Mike is, and many times less experienced.  But this is what’s helped me so far:

"The Practice of Programming" by Kernighan, Pike. 

This book immediately comes to mind when I think of the early days I spent learning about CS.  It’s not a perfect overview, but it gives you a deep understanding of C (and you should learn C first, C++ second or third), plenty of data structures for a beginner, and some best practices from the co-author of the C programming language himself.

But before you read that you should read Kernighan’s “The C Programming Language”; that’ll give you a better idea of the basic syntax you’ll need before delving deeper.  I recommend Kernighan’s books alone for learning this stuff; it’s all very lucid and professional, and the code is well-written/elegant and plentiful.  I have the opinion of many other,  better-respected programmers to back me up there.

Once you learn C pretty in-depth, I find you can learn C++ pretty easily (from any book or comprehensive set of tutorials).  Most people learn C++ first, I think that’s a mistake because while it may be easier to do, it’s much harder to do /well/.  To do that requires a knowledge of C, from which C++ extends easily.

Python is a decent language for scripting and hacking and stuff.  Learn all the basic constructs of the language.  I recommend the tutorials by Guido Van Rossum.  Also, read the source code of the python interpreter, so you get an idea of what a fully functional C software looks like.

Learn something about practical multithreading; the Wikipedia articles on Spinlocks, Mutexes, ThreadPools and Threads should be enough. You should learn how they work and how to invoke them from a library.  If you’re just starting out, there’s a lot of stuff here you’ll want to come back to later.

After that you should be ready for learning something about games.  Depending on your idea of fun, this can go two main ways.  For me it went sorta both; you should do some of each, though:

This is a very technical, low-level, and math-heavy field.  Don’t be afraid, but don’t dwell on it too much if it isn’t fun. 
1.  Learn matrix math/linear algebra for computer graphics.  Here’s a book on the subject: http://www.amazon.com/Mathematics-Computer-Graphics-John-Vince/dp/1846280346.  I don’t know if this one in particular is any good, but this stuff is pretty intuitive and you just need a basic grasp of it to do beginning stuff.

2.  VERY IMPORTANT: get a very firm grasp of how the current rendering pipeline works, internally and otherwise.  I recommend reading very current material on this, since it’s changed.  But if you want a general overview there’s a book called A Trip Through the Graphics Pipeline by Jim Blinn, who is considered by many to be the father of computer graphics.  Also read http://fgiesen.wordpress.com this up-to-date source, specifically the series “A trip through the 2011 graphics pipeline”, otherwise Blinn’s book will trip you up.

3.  I recommend using a 3D engine like OGRE right around this point.  OGRE is a very good learning tool and a powerful renderer in its own right.  You don’t have to do much with it to learn how it works.  I don’t recommend buying an entire book on this, it’s better just to read the online manual, a tutorial or two, the wiki, and some sample code; but if you want to, there’s a new one out describing version 1.7.  Get that one.  If you want to learn a lot about designing these types of engines, or even just get a feel for how the structure is, I /strongly/ recommend diving into its source code.  Reading is as important, if not more, as writing in this field.  I recommend doing this only once you have a feel for the actual engine documentation though, as otherwise it’ll be far more difficult.

4.  Read some tutorial shader code that’s considered to be very good.  I recommend the GPU Pro or GPU Gems series; anything with snippets and a CD-Rom.  Get an idea of how effects like http://www.jshopf.com/blog/?p=16 linearized depth, http://www.kxcad.net/Softimage_XSI/Softimage_XSI_Documentation/realtime_dx_HLSLProgrammableShaderExamplePerPixelPhongLighting.htm phong shading, and http://www.john-chapman.net/content.php?id=8 SSAO actually work.  Implement something along these lines in OGRE; it’s pretty straightforward.

1.  This is a much more synthetic field, and it has less of a barrier to entry.  That shouldn’t discourage you from trying your hand at graphics, but I personally have found I like this stuff better.  Some are the opposite.
2.  There’s no standard system for implementing game features.  You just have to try things.  They used to say, make a couple of arcade style games.  But I find that’s really not the best way anymore.  Try making very small mods.  No new content, or at least nothing special, just simple gameplay situations.  Try to get on a mod team; this will, if nothing else, teach you how to work with a team and use version control.  But be careful not to join a huge, insanely ambitious and understaffed project.  You can tell these because they have a lot of art but they’ll accept new coders on a dime.

Instead, join openOutcast </shamelessplug> ;).

3.  Learn something about how AI works, even if you’ll be using external tools for this.   I recommend http://www.policyalmanac.org/games/aStarTutorial.htm this, http://bjoernknafla.com/introduction-to-behavior-trees this (if you have trouble reading it at first, skip down to the example), and some tutorial explaining NavMeshes (google helps here).  Also, the AI Game Programming Wisdom series comes highly recommended, and I can verify from the table of contents alone that it provides both a good overview and an excellent trove of newer research and ideas.

So that about wraps it up. There’s all sorts of other stuff you can learn, and a lot of other books that didn’t make it into this post; I’ll probably continually update this stuff, including links as I find more things to mention. Finally, if you have any questions, PM me on twitter (@NeuzeitStudios). 


Agent Fastforwarding

WARNING: The following post describes a planned feature which has not yet been fully implemented, let alone tested.

Last time I spoke about MAS in open world games.  Today I want to share with you one of the major technical problems that immediately surfaced with this type of AI.  It is by no means an insurmountable one, and in fact its solution is more generally applicable as an optimization technique for Agent systems.  So let’s take a look:

The interesting thing about implementing agent-based AI in openOutcast is that there are these portals called “Daoka” that can teleport a player between levels.  As such, any level, and thus almost any part of the game, can be accessed at any time.  Other open-world games have similar mechanisms.  For the most part, this is fine, since most games turn of the NPC behaviors when the player is away.  But we can’t do that, because we’re simulating an entire living society of Talan NPCs…this means that we need a mechanism to keep all AI running, for every ingame agent, at all times.

Luckily, there are a few ways we can optimize this.  Given the assumption that the player is unable to see, hear, or directly interact with a given set of agents, a few things can be modified:

1.  The agents are no longer individual characters, but groups of characters.  They are, for gameplay, an abstract distribution of states with some effect—-nothing more.  So we can use a clustering mechanism similar to graphical LOD techniques to incorporate agents into a single super-agent with a uniform state and a set of shared tasks.  Then, when the player teleports back, the super-agent is subdivided into the correct number of children and distributes its tasks among them.

2.  The agents’ every movement cannot be observed; only the final state.  So tasks can be simulated approximately; instead of walking, an NPC can “commute” to another position.  Instead of picking up items and moving them, it simply “absorbs” items and transports them.  Instead of engaging each other in combat, the outcome of battles is determined statistically.  These “fastforward” behaviors, while carefully timed, are totally instantaneous once triggered. 

3.  MultiAgent plans are generated once per faction/level and the resulting tasks delegated randomly to NPCs or super-agents.

This simplification of the AI should be great enough so that the AI processing for other levels is less unwieldy, but small enough so that the AI upon Cutter’s return is reasonably (i.e., visually) intact.

By the way, did anyone else not know that Weta’s MASSIVE middleware uses MultiAgent Systems (hence the MAS in the acronym; obviously these agents use very simple rulesets and few animations since their function is almost purely visual). 

MAS in Open-World Games


I thought I’d do a little post on what I’ve been working on with openOutcast recently.  I can’t say too much because the demo hasn’t been released yet, but I can give a description of some techniques I’ve been using to enhance combat, gameplay, and NPC behaviors.

Multi-Agent Systems (MAS) is an ongoing field of research in AI, and a fitting starting-point for a sequel to the original Outcast.  Some of the implementation gets very complicated, but the essential technique, and the results it achieves, aren’t difficult to understand, much less to appreciate.  Basically, software “agents” are  entities with well-defined position in a metric space and limited observational capacity.  They communicate their knowledge via some observer pattern and perform some sort of distributed scheduling—cooperative, competitive, or some hybrid approach.  All of this is done asynchronously.

The tremendous advantage here for games, and indeed the conformity of NPCs and groups of NPCs to the notion of software agents, should be pretty apparent even from this minimal description.  After all, the ability to simulate emergent, dynamic, cooperative, and unpredictable behaviors is by all accounts the holy grail of open-world AI, if such a thing exists (I’m quickly reminded of all the pre-release hype surrounding TESV: Oblivion’s “Radiant AI” system, and the subsequent, very real, disappointment when the game was actually released).  And here we are provided with a perfect framework to achieve this ideal.

MAS lends itself very well to a number of different AI features, none of which are novel in an academic sense, but all of which are deemed somewhat radical by lazy AAA-developers (I am aware that that’s a strong word, but given the sheer number of times a Spec-ops agent in any given ego-shooter has taken cover behind volatile explosives, nobody really has an excuse). 

One is the ability to convey uncertain or incomplete information about a situation.  We do this with simple Bayesian logic: an observer pattern and a “certainty value”.  If an agent receives a message saying “I’m 60% sure that such is so”, and he is 20% certain it is not so, his view will be corrected and, if he has a tendency to spread rumors, propagated.  If he is 20% sure that said thing is thus, and his nearby ally agrees with 60% certainty, then both statements are confirmed.  Of course, we have to attenuate this to prevent amplification of signals from two nearby NPCs having the same exact evidence, but that’s easy enough to do.

Another thing we can do is plan tasks long into the future according to a collective set of goals and criteria.  The way we do this is with a modified sort of decision tree and Dijkstra’s algorithm.  Thus, a plan can be formed to satisfy a set of goals given the current set of satisfied goals, and that plan can be executed by multiple NPCs.  This system is not only extremely versatile but very fast, with complex plans frequently being generated in under a second.

On the topic of distributed scheduling, we can use anything from a simple deterministic ruleset to more advanced fuzzy logic or negotiation to determine what state the AI should be in as a group.  Agents will have a likelihood of acceptance for each behavior; this can be interpreted as a probability, as utility function for computing negotiation costs, or something else.  It can be determined by behavioral and emotional tendencies, instinct, reason, or what have you.  The key is, it’s just a number; this means two things.  First, it can be reinforced arbitrarily, introducing an adaptive effect if desired.  Second, though I haven’t tried this yet, complex group behaviors can be easily constructed from simple experimental data.  This should be especially useful when we go to do the animal AI—-twon-has, gamor, and kawaii play an enormous part in shaping the diverse environment of Adelpha.

Combat and neutral behaviors are naturally separated in our system; fighting involves simpler behaviors which are repeated and synthesized according to immediate situational awareness, while Neutral behaviors are more complex, usually being scheduled according to a daily clock.  Nonetheless, the methods used are essentially the same.

The immediate result of all this is A.I. that can think, feel, and behave more realistically.  I don’t care what the folks at AltDevBlogADay will tell you about fighting games where the AI’s sole objective is to place itself in front of explosives before dying, or how the terrorists in Call of Duty just have to /look/ like they’re shooting at you.  The only reason anyone plays those games is because he’s 12 and doesn’t know any better.  Besides, that just doesn’t cut it for an open-world game, in which a real, living society is being conveyed and must withstand as much scrutiny as time and technology allows.  It doesn’t always have to be perfect, ideal, unbeatable, or even remotely sensible, but it shouldn’t be predictable.

Take a hard look at the latest Skyrim gameplay teasers.  Do you see any characters that AREN’T effectively bosses?   What about massive hordes of enemies storming a castle?  What about massive hordes of enemies waiting while a few sneak /into/ the castle at night and unlock the door?  What about bandits attacking merchants on their way out of the city?  What about bandits knowing where on the merchants’ route is optimal to attack, and ambushing them there?

I’m not saying openOutcast is at that point yet.  I’m not even saying game technology as a whole is capable of that.  I’m saying if we /really/ cared all those years that we pretended to—-if we weren’t distracted by the shiny stuff—-we would be there, doing things beyond anything I just imagined.  And MAS is the way forward.