txRiakIdx is a superset of txRiak that implements transparent secondary indexing of keys. As long as you store your keys as valid JSON dictionaries, txRiakIdx can index them and query those indexes. This requires Riak 0.14.0 or newer since we leverage Key Filters to make index searching faster.
Let's say you run a diner and store every order in Riak under a primary key built from the current UNIX timestamp (for example, order_1299648212). Inside each order key you store a JSON dictionary representing the order:
{"order_number" : 123456789,
"diner_name" : "Bobbie Jo Rickelbacker"}
Now what if you wanted to find every order "Bobbie Jo Rickelbacker" has placed in the diner? Without secondary indexes, you'd have to write a custom MapReduce job. But with txRiakIdx, you just define an index on diner_name, and then finding all those orders is as simple as:
name_index.query("eq", "Bobbie Jo Rickelbacker")
BSD licensed. Check the headers of the source files for the specifics.
- Python 2.6 or newer
- txRiak 0.3.2 or newer http://github.com/williamsjj/txriak
- Riak 0.14.0 or newer (for key filter support) http://wiki.basho.com/
Installing is as simple as cloning this repo or grabbing it from the "Downloads" button and running:
cd ./txriakidx
python setup.py install
The compatibility_test directory contains a utility called test.py. It is designed for anyone who would like to write their own Riak indexing library compatible with txRiakIdx's indexing scheme. It will validate your index output against the reference implementation.
- test.py --load will build a reference set of indexes against the JSON dictionary in test_record.json (useful for seeing what the output should be).
- test.py --validate will validate that the expected indexes exist for the contents of test_record.json.
Indexes work by matching a key prefix and a bucket. For a key name like order_12345, the key prefix would be order. We do this so you can target your indexes on individual types of keys even within a bucket. Every create, update or delete on a key that matches the bucket and key prefix triggers the indexes for that key to be created, updated or deleted respectively.
You can index multiple fields in the key too. Just make a RiakIndex definition for each field you want indexed.
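The matching rule above can be sketched in a few lines of plain Python. This is only an illustration, not txRiakIdx's code: the helper names and the dictionary-based index definitions are hypothetical, and it assumes the prefix is everything before the first _ in the key name.

```python
def key_prefix(key_name, separator="_"):
    """Return the part of a key name before the first separator, or None."""
    prefix, sep, _rest = key_name.partition(separator)
    return prefix if sep else None


def matching_indexes(index_defs, bucket, key_name):
    """Return the indexed fields whose bucket and key prefix match this key."""
    prefix = key_prefix(key_name)
    return [d["indexed_field"] for d in index_defs
            if d["bucket"] == bucket and d["key_prefix"] == prefix]


# Hypothetical index definitions, written as plain dictionaries.
index_defs = [
    {"bucket": "my_orders", "key_prefix": "order", "indexed_field": "diner_name"},
    {"bucket": "my_orders", "key_prefix": "order", "indexed_field": "order_number"},
    {"bucket": "my_orders", "key_prefix": "invoice", "indexed_field": "total"},
]

# Two indexes apply to an order key; the invoice index is skipped.
print(matching_indexes(index_defs, "my_orders", "order_12345"))
```

A write to order_12345 in my_orders would therefore touch both the diner_name and order_number indexes, while keys with another prefix or in another bucket are left alone.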
An index entry is simply another Riak key in a specially-named bucket, with a specially-named key name. It's easier to show with an example. Let's say you have this Riak key:
Bucket: my_orders
Key Name: order_12345
Value: {"order_number" : 123456789, "diner_name" : "Bobbie Jo Rickelbacker"}
And you define your index on the diner_name field of the JSON dictionary:
idx = riakidx.RiakIndex(bucket="my_orders",
                        key_prefix="order",
                        indexed_field="diner_name",
                        indexed_type="str")
That index definition tells txRiakIdx to create an index for the diner_name field in keys within the my_orders bucket that start with the order key prefix (txRiakIdx expects the key prefix and key name to be separated by _). When you create the order_12345 order key, here's the index key that txRiakIdx creates:
Bucket: idx=my_orders=order=diner_name
Key Name: order_12345/Bobbie Jo Rickelbacker
Value: (empty)
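As a rough sketch of that naming scheme (before any URL encoding is applied), the index bucket and index key name can be derived from the data key like this. The index_entry helper is hypothetical and only mirrors the layout shown above; it is not taken from the txRiakIdx source.

```python
import json


def index_entry(bucket, key_prefix, indexed_field, key_name, record):
    """Build the (bucket, key name, value) triple for one index entry."""
    index_bucket = "=".join(["idx", bucket, key_prefix, indexed_field])
    index_key = "%s/%s" % (key_name, record[indexed_field])
    return index_bucket, index_key, ""  # the index entry stores no value


record = json.loads(
    '{"order_number": 123456789, "diner_name": "Bobbie Jo Rickelbacker"}'
)
entry = index_entry("my_orders", "order", "diner_name", "order_12345", record)
print(entry)
```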
So when you decide you want to see all the keys with a diner_name of "Bobbie Jo Rickelbacker", txRiakIdx tells Riak (using key filters) to find all of the keys in idx=my_orders=order=diner_name where "Bobbie Jo Rickelbacker" is found after the / in the index key name. But all your program sees back from the .query() command is:
[("my_orders", "order_12345", "Bobbie Jo Rickelbacker")]
In other words, you get back a list of the original data keys (and their buckets) whose "diner_name" field is "Bobbie Jo Rickelbacker".
You might have noticed that the RiakIndex definition takes an argument called indexed_type. This can be str, unicode, int, bool or float. This allows txRiakIdx to perform searches like: show me all of the order numbers smaller than 2000. txRiakIdx will transparently convert the integer to a string when storing the index key, and will then tell the key filter to treat the value like an integer when you query.
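To see why the type matters, here is a small stand-alone sketch that stores integers as strings and then compares them both as integers and as plain strings. The CASTS table and helper names are made up for the illustration, and only three of the supported types are covered.

```python
# Hypothetical mapping from an indexed_type name to a Python conversion.
CASTS = {"str": str, "int": int, "float": float}


def stored_value(value):
    """Index keys are strings, so every value is stored in string form."""
    return str(value)


def less_than(stored, threshold, indexed_type):
    """Compare a stored string with a threshold after casting both."""
    cast = CASTS[indexed_type]
    return cast(stored) < cast(threshold)


stored = [stored_value(n) for n in (300, 1999, 2000, 15000)]

as_int = [s for s in stored if less_than(s, 2000, "int")]
as_str = [s for s in stored if less_than(s, 2000, "str")]

print(as_int)  # numeric comparison
print(as_str)  # lexicographic comparison gives a different answer
```

Compared as integers, 300 and 1999 are below 2000. Compared as strings, "300" sorts after "2000" and "15000" sorts before it, which is why the index needs to know the field's type at query time.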
The RiakIndex.query() function accepts any Riak key filter predicate function as a comparison operator. A list of the predicate function names is here: http://wiki.basho.com/Key-Filters.html#Predicate-functions
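For orientation, a Riak key filter is a list of steps, where each step is itself a list holding a function name followed by its arguments. The sketch below builds one plausible filter list for an index query. It is an assumption about the general shape, not the exact filters txRiakIdx sends: build_key_filters is a hypothetical helper, the order_number index bucket is only an example, and URL decoding of the indexed value is left out for brevity.

```python
import json


def build_key_filters(predicate, value, indexed_type="str"):
    """Build a key filter list: pick the value token, cast it, then compare."""
    # Split the index key name on "/" and keep the second token (the value).
    filters = [["tokenize", "/", 2]]
    if indexed_type == "int":
        # Ask Riak to treat the token as an integer before comparing.
        filters.append(["string_to_int"])
    filters.append([predicate, value])
    return filters


filters = build_key_filters("less_than", 2000, indexed_type="int")

# The filters travel as part of the MapReduce "inputs" specification.
inputs = {"bucket": "idx=my_orders=order=order_number", "key_filters": filters}
print(json.dumps(inputs))
```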
Since we're using the Riak REST/HTTP API, all of our bucket and key names are URL encoded. So idx=my_orders=order=diner_name becomes idx%3Dmy_orders%3Dorder%3Ddiner_name, and order_12345/joe becomes order_12345%2Fjoe. However, we could run into the issue where the value being indexed contains a / character, which would confuse Riak's key filter tokenizer. So first we URL encode the value being indexed, then concatenate it to the key name, and finally URL encode the entire key name.
So order_12345/Bobbie Jo Rickelbacker becomes order_12345%2FBobbie%2520Jo%2520Rickelbacker.
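The double encoding can be reproduced with the standard library. This sketch uses Python 3's urllib.parse (the library itself targets Python 2), and encode_index_key is a hypothetical helper name rather than part of txRiakIdx.

```python
from urllib.parse import quote


def encode_index_key(data_key, value):
    """URL encode the value, join it to the data key, then encode the whole."""
    # safe="" makes sure "/" in the value is escaped as well.
    encoded_value = quote(str(value), safe="")
    return quote("%s/%s" % (data_key, encoded_value), safe="")


print(encode_index_key("order_12345", "Bobbie Jo Rickelbacker"))
# order_12345%2FBobbie%2520Jo%2520Rickelbacker

# A value containing "/" no longer collides with the separator.
print(encode_index_key("order_12345", "ham/eggs"))
# order_12345%2Fham%252Feggs

# Bucket names only need the single, outer encoding.
print(quote("idx=my_orders=order=diner_name", safe=""))
# idx%3Dmy_orders%3Dorder%3Ddiner_name
```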
To unpack an index key name:
- URL decode the entire key name.
- Split the decoded key name on / to get the data key name and the indexed value.
- URL decode the indexed value.
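The unpacking steps above can be sketched as follows, again with Python 3's urllib.parse. The decode_index_key name is hypothetical. Splitting on the last / is a choice made for this sketch: after the first decode the indexed value is still encoded once, so it cannot contain a raw /.

```python
from urllib.parse import unquote


def decode_index_key(encoded_key):
    """Return (data key name, indexed value) for an encoded index key name."""
    decoded = unquote(encoded_key)                      # step 1
    data_key, sep, encoded_value = decoded.rpartition("/")  # step 2
    if not sep:
        raise ValueError("not an index key name: %r" % encoded_key)
    return data_key, unquote(encoded_value)             # step 3


print(decode_index_key("order_12345%2FBobbie%2520Jo%2520Rickelbacker"))
# ('order_12345', 'Bobbie Jo Rickelbacker')

print(decode_index_key("order_12345%2Fham%252Feggs"))
# ('order_12345', 'ham/eggs')
```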
txRiakIdx fully supports indexing Unicode field values (buckets, prefixes and field names must be ASCII though). Just make sure the field values are UTF-8 (no other non-ASCII encodings are supported). All ASCII field values are first converted to UTF-8 before being URL encoded and indexed.
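A short illustration of what "UTF-8, then URL encoded" means for a non-ASCII value, written against Python 3's urllib.parse. The encode_value helper is hypothetical and only demonstrates the byte-level result, not txRiakIdx's internals.

```python
from urllib.parse import quote, unquote


def encode_value(value):
    """Encode text as UTF-8 bytes, then percent-encode every unsafe byte."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return quote(value, safe="")


encoded = encode_value("José")
print(encoded)           # Jos%C3%A9
print(unquote(encoded))  # José (unquote decodes as UTF-8 by default)
```

Text and already-encoded UTF-8 bytes produce the same result, so encode_value(b"Jos\xc3\xa9") also yields Jos%C3%A9.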