KD-Trees and MagLev
MagLev KD-Tree Example
A few weeks ago, I came across this blog post by Adam Doppelt from Urbanspoon. In the post, Adam describes how Urbanspoon solved a problem they had managing their data to enable nearest neighbor searches. They need to find the nearest N items to a given location, e.g., find the 10 nearest coffee shops to my current location. It’s an interesting post and I encourage you to read it to see the various solutions they tried and rejected. In the end, they solved their problem by writing a KD-Tree C-extension, which they have published as a RubyGem.
When I read the post, it seemed that the problem was a natural fit for MagLev. So, I implemented a pure ruby version of a KD-Tree and compared it to the kdtree gem and to the pure ruby solution running on MRI 1.8.6 and 1.9.2 1.
Bottom line: If performance is your primary concern, then C is still the king of speed. MagLev posts second place in performance, but MagLev really shines when you consider of the other aspects of the problem: deployment, memory footprint, ease of programming etc.
First, we’ll take a look at the problem, then we’ll dive into comparisons of the various solutions. A KD-Tree is a binary tree that stores K-Dimensional data and allows efficient searching of that data. In our case, we have two dimensional data (latitude and longitude), so K=2. See the wikipedia KD-Tree entry for a discussion of KD-Trees.
To get the performance they needed, Urbanspoon wrote the KD-Tree as a ruby C-Extension. This solution allowed quick searches, and the ability to persist the tree to a file, but it also came with a few limitations.
I was curious how a pure ruby implementation running on MagLev would fare against the C-Extension. You can find my implementation and the test code on GitHub. In my comparison, I’ll consider performance and look at the limitations Adam mentioned in his post.
For performance, I got the following back-of-the-envelope numbers from a single run on a 2.4GHz Intel Core 2 Duo (MacBook) with 2G Ram. I created a tree with one million random locations, then I measured the performance of a search for the nearest one hundred nodes to each of a thousand different random locations.
MRI 1.8.6 + C-extension kdtree gem: user system total real Time: 0.040000 0.000000 0.040000 ( 0.037840) Per Query: 0.000040 0.000000 0.000040 ( 0.000038) Maglev: user system total real Time: 0.910000 0.000000 0.910000 ( 0.921060) Per Query: 0.000910 0.000000 0.000910 ( 0.000921) MRI v1_9_2_preview2: user system total real Time: 3.330000 0.160000 3.490000 ( 3.486725) Per Query: 0.003330 0.000160 0.003490 ( 0.003487) MRI v1_8_6_383: user system total real Time: 31.170000 0.070000 31.240000 ( 31.293104) Per Query: 0.031170 0.000070 0.031240 ( 0.031293)
No surprise that the C-code is the fastest, but MagLev running pure ruby code came in second place, with performance that should be acceptable in a real world application.
While MagLev performance is good, MagLev really shines when you take a look at all of the other aspects of writing, deploying and maintaining a web app. First, I’ll take a look at the limitations Adam lists for the kdtree gem, and then I’ll briefly describe some of the additional benefits MagLev offers.
The first item Adam mentions is thread safety:
Not thread safe. In fact, due to my laziness it uses a single static block for storing results. You should only use one kdtree at a time!.
Ok…given motivation, I’m sure the kdtree gem could be re-written to allow multiple instances, so that’s not an intrinsic issue with a C version. The MagLev version, since it uses plain old Ruby objects, allows multiple trees without any fuss or bother, plus, its all Ruby. As to thread safety, since neither implementation supports updates, the trees are essentially read-only, and so both are thread safe from that perspective. The only snag is in updating the tree, which I cover next.
No editing allowed!
One of the difficulties with a KD-Tree is how to handle inserts. When you create the tree, it is, by construction, balanced. That means finding nodes in the tree is O(log n). But unlike a Red-Black Tree or AVL Tree, there isn’t an easy way to re-balance a KD-Tree after you insert new nodes. With repeated inserts of new nodes, eventually, your tree will become unbalanced enough that you’ll lose the O(log n) performance. Urbanspoon’s solution is to disallow live tree updates, and instead generate a new (balanced) tree once a day, that includes the new nodes. Both implementations have this limitation.
So, given periodic tree re-generation, how does MagLev compare to the kdtree gem? With the kdtree gem, you’ll have to (a) generate the new tree (b) persist it to a file (c) copy the new file to all of your servers and (d) kick each server to (safely) read the new tree. You’ll need code and/or processes to support each of those steps.
With MagLev, you’ll have to (a) generate the new tree and (b) commit it. That’s it. Since all VMs share a transactional view of the single tree, when one of the VMs re-generates the tree, all of the other VMs will simply get the updated view at their next transactional boundary. MagLev handles the caching and distribution to multiple machines and VMs for you. In the meantime, each VM has a consistent view of the old tree (the “I” in “ACID” is for “Isolation”).
The tree is stored in one big memory block
This one is a double-edged sword. A lot of the speed that Adam gets in generating a tree is due to the fact he has one big C array of nodes, and he can feed quicksort any sub-array he wants by just passing the appropriate offsets (generating a KD-Tree requires O(n log(n)) sorts of arrays of exponentially diminishing size). C really shines here. On the ruby side, using the standard libraries, I have to create a new array for each sort. That’s a lot of copying and GC overhead. But…re-generation is once a day, and queries are forever.
The other side of this sword is that the C kdtree is essentially one big array. Each ruby process has to have the complete tree in memory to do anything. That’s a lot of duplicated memory. I assume there is no memory overhead for the indexed data, as that is likely to be stored in the DB, so you’ll query the tree, then pull the data of interest out of the DB, suffering no memory bloat on the data side.
With MagLev, instead of holding the index in one array, you have a lot of Tree2D nodes, each with a reference to a right and left child Tree2D. MagLev can take advantage of this finer granularity and keep only the Tree2D objects you need (the working set) loaded in a VM at any given time. Like the gem version, MagLev stores the indexed data as separate objects, so, MagLev will just pull the data of interest into the current VM.
The next caveat in Adam’s list is
Persisted trees are architecture dependent, and may not work across different machines due to endian issues.
I doubt that this is a huge issue in practice. I imagine that most or all of your servers are going to be the same architecture, and this issue will be moot. But if you do find yourself in a multi-architecture situation, you’ll have to add another layer of complexity to the kdtree solution. The MagLev repository, on the other hand, has already dealt with the architectural issues for you. A single MagLev repository can simultaneously serve VMs from several different architectures and MagLev handles the architectural differences for you.
Limit on nearest-k search
Again, this is an artificial limitation in the kdtree gem implementation that Adam admits is “due to my laziness”. The pure ruby implementation does not have this limit.
Other Nice Things with MagLev
This post is already getting a bit long, so I’ll just briefly list a few other interesting aspects of the MagLev solution.
- No ORM overhead. I’ve made the assumption that in the kdtree gem example, the data of interest (descriptions of coffee shops, restaurants etc) is stored in a DB, and that you’ll be accessing it via ActiveRecord, DataMapper or some other ORM. In MagLev, you simply have Ruby objects…no ORM overhead or nonsense to deal with.
- The MagLev solution is pure Ruby, you don’t have to switch gears and write in C (nor have to deal with the supporting tool chain and skill set).
- The C version indexes a set of integers. You have to manage the mapping of integer to actual object you care about on your own. The pure Ruby implementation indexes arbitrary Ruby objects, so no indirection is needed 2.
If you need ultimate speed, and you can live within the known limitations, and extra code needed to support it, then the kdtree gem is a good fit. If you want to have multiple trees, not worry about the persistence format, ORM issues and stay in a pure ruby world, then MagLev offers a compelling alternative. As always, the only real way to compare the different approaches is to try both solutions on your application in your deployment environment.
1 Technically, both the kdtree gem and the implementation I wrote are 2D-Trees, not generic K-dimensional trees. Either implementation could be enhanced to multiple dimensions with little effort.
2 My current implementation saves the KD-Tree and the data all mixed together. To really make things more efficient, I should probably separate the index from the data by first committing just the data Ruby objects, then, in a separate transaction, create and commit the KD-Tree (index). The advantage here is that having separated my index (the KD-Tree) from my data, I can then traverse the index without paging in unnecessary data objects.