Indexing Collections in MagLev
MagLev brings built-in, scalable object persistence to Ruby. You can store large numbers of objects persistently on disk and retrieve them from any VM connected to the repository, all with ACID transactions. Large collections of data often need to be searched. This post introduces MagLev’s indexing and collection querying features.
Primary Tool for Query: Objects and References
In MagLev, the primary tool for finding the right data should be the application’s object graph. Since MagLev natively stores objects, including references to other objects, the saved object graph can go a long way in organizing data for efficient retrieval. E.g., a BlogPost object can simply have an array of Comment objects; no need to go searching for all the comments whose post_id matches the blog post. Likewise, comments can just refer directly to their author object without the need to do an SQL join. This is one of the advantages of a persistent graph versus a relational schema: the structure is explicitly stored and can be used directly without having to reconstruct it on every query.
Ruby Arrays and Hashes provide the means to do simple searches for objects by number or by key. If your data is more richly connected, there are often custom data structures that support efficient searching, e.g., KDTrees for searching spatial data.
Fall-back for Query: Indexed Collections
Sometimes you need to search a collection on criteria that isn’t already organized in an application specific data-structure, or you don’t think it is worth the effort to build and maintain one. MagLev’s indexed collections provide a simple and efficient solution for these cases.
Suppose we have customer data, and we want to find subsets of our customer base that meet various criteria. We don’t know in advance all the queries we’ll be doing, so we’ll put all of our customers into a collection and use indexes for fast search and retrieval.
We have a Person class with some attributes (full code here):
class Person attr_reader :name, :age, :gender, :address, :marital_status # ... end
Where the address is:
class Address attr_reader :street, :city, :state, :zip # ... end
And suppose we have a persistent collection of person objects, named people, that holds all the people we know. We want to be able to search our collection and find various subsets, such as all the males aged 18-25 in the lucrative 45678 zip code. We could use the standard Ruby Enumerable interface:
lucrative_market = people.find_all do |p| p.gender == :male && (18..25).include?(p.age) && p.address.zip == 45678 end
But that requires looking at each object in the collection (the equivalent of a database table scan). We have to read the object off of disk, examine it, possibly remember it, and then move on to the next person object. This works ok for small collections, but is intolerable for large collections.
MagLev has the ability to index collections, which allows us to query the collection without loading each object into memory. Since arrays and hashes each have a natural index, the GemStone VM supports indexes only on unordered collections (bags and sets). We’ll be using an IdentitySet to hold our collection. IdentitySet is a Ruby wrapper around one of the Smalltalk unordered collection classes. It maintains set semantics where “the same” is interpreted as object identity (there are also equality, or equal value, based sets).
Step one: Use the search* API
The first step on the path to an efficient, indexed query, is to re-write the query using the search API (see IdentitySet.rb for API documentation). The strategy is to find subsets of the original set for each of our criteria, and then to intersect the sets.
# Find all males males = people.search([:@gender], :equal, :male) # Find all people between 18 and 25 age_group = people.search_range([:@age], (18..25)) # Find all people whose @address has a @zip equal to 45678 in_zipcode = people.search([:@address, :@zip], :equal, 45678) # Intersect the three sets to get our lucrative_market lucrative_market = males & age_group & in_zipcode
Because we have not yet created indexes on our people set, if we were to run the code, each of the search* calls would do a table scan, i.e., all of the person objects in the set would be read into VM memory. We obviously want to avoid that for large sets.
Step two: Add Indexes
To prevent the table scans, we’ll add indexes to people. There are two kinds of index, identity indexes and equality indexes. An identity index allows queries for all elements of a collection whose instance variable is identical to (the same object as) a target value. An equality index supports identity searches, but also allows search based on the ordering of elements (e.g., search for items less than a given value, or between two values). We have two queries that are candidates for identity indexes:
males = people.search([:@gender], :equal, :male) in_zipcode = people.search([:@address, :@zip], :equal, 45678)
Both of these queries are for identical (:equal) objects. Since there is only one occurrence of a symbol for a given string, and there is only one occurrence of a number for any Fixnum, both of these queries can be done on an identity index.
To create the index, we must specify the path of instance variable access used to get to the instance variable to be indexed. Each element of the path is the name of an instance variable, and they are separated by ‘.’:
For the gender index, since Person has a gender attribute, the path is simply '@gender'. For the zipcode index, we start at a person, look at the address attribute, and then look at the zip attribute of the address. The path we need is '@address.@zip'.
Our other query is for people within a certain age range. We are not searching for all people of a single age, e.g., 25, but rather everyone whose age is in the range (18..25). An identity index does not support queries based on ordering, so we must use an equality index.
The path is simply '@age', and we specify that the age field will be, a Fixnum. The indexing system will use Fixnum‘s comparison methods to do the sorting and comparing.
Now, this becomes an efficient query:
males = people.search([:@gender], :equal, :male) age_group = people.search_range([:@age], (18..25)) in_zipcode = people.search([:@address, :@zip], :equal, 45678) lucrative_market = males & age_group & in_zipcode
This query, with the indexes in place, does not need to read any person objects. The indexes provide enough information to construct the set of matching object ids without the need to look at individual objects. The intersection is also done without the need to access any person objects. It is only if/when the elements of the lucrative_market set are accessed, that objects finally get pulled off of disk and into memory.
You can read more about indexing and queries in chapter 5 of The GemStone Programming Guide. This is the Smalltalk version of the reference, but the ideas apply to MagLev as well.