Ordered Search Autocompletion With Redis

综合编程 2012-06-09

I spent the last weekend building a caching system for AnyGood
using Redis
to save API responses for certain amount of time. Since the next item on my TODO list was autocompletion for my search form I started googling around looking for solutions involving Redis. I stumbled upon two great posts examining this particular topic: The first one
written by Salvatore Sanfilippo is a great explanation on how to use Redis for autocompletion and goes in great detail when explainin the algorithm. The second one
by Pat Shaughnessy is a great help when explaining Sanfilippos algorithm and comparing it to Soulmate
, "a tool to help solve the common problem of developing a fast autocomplete feature. It uses Redis's sorted sets to build an index of partially completed words and the corresponding top matching items, and provides a simple sinatra app to query them."

So I spent a good amount of time studying those articles and reading through the source code of Soulmate before I decided to write my own solution. Why? Because it was a rainy sunday afternoon, playing around with Redis is fun and I wanted to learn more about it. Plus: I wanted multiple phrase matching and search result ordering. Soulmate did all that and a bit more but just setting it up wouldn't have the same learning effect. And since I was out to play, I might as well have fun. So let's see how to this...

What it should do

Let's suppose we want autocompletion for our search form that not only returns a single value for each proposed item but also comes with some data that we can present. Also, let's go one step further and say that the search form should present the items available to the user in an ordered way, e.g. by popularity.

So when a user types into the search form, we want to to show him all the possible movies matching his search phrases. That means the first thing we've got to do is dump the data we want to present into Redis. Like Soulmate, we use a Redis hash
here, where each movie has its own unique key. The key can be anything as long as it's unique. If you don't have unique IDs for your data, you could use MD5 to generate some. But let's suppose we do have an unique ID, dumping the data into Redis is pretty simple:

HSET moviesearch:data 1 "{"name":"Kill Bill","year":2003}"
HSET moviesearch:data 2 "{"name":"King Kong","year":2005}"
HSET moviesearch:data 3 "{"name":"Killer Elite","year":2011}"
HSET moviesearch:data 4 "{"name":"Kill Bill 2","year":2004}"
HSET moviesearch:data 5 "{"name":"Kilts for Bill","year":2027}"
HSET moviesearch:data 6 "{"name":"Kids","year":1995}"
HSET moviesearch:data 7 "{"name":"Kindergarten Cop","year":1990}"
HSET moviesearch:data 8 "{"name":"The Green Mile","year":1999}"
HSET moviesearch:data 9 "{"name":"The Dark Knight","year":2008}"
HSET moviesearch:data 10 "{"name":"The Dark Knight Rises","year":2012}"

is the Redis command to save something into a hash. The key for this hash is moviesearch:data
. The single keys for every movie in this hash are the unique IDs: 1, 2, 3, ...
. And the data we're saving are JSON strings, which represent a pretty convenient way of saving objects to Redis. Also, it's easy enough in Ruby to convert a Ruby hash to JSON and after retrieving it from Redis back to a hash:

require 'json'

# Before dumping to Redis:
{name: 'Kill Bill', year: 2003}.to_json
# => "{"name":"Kill Bill","year":2003}"

# After retrieving from Redis:
JSON.parse("{"name":"Kill Bill","year":2003}")
# => {"name"=>"Kill Bill", "year"=>2003}

Oh and btw. I'm using redis-rb
as the Ruby client to connect to Redis, which does not only support Redis transactions and pipelining, but the best thing about this client is that the Redis commands
have the same name and take the same arguments (well, most of the time), which is especially great when getting to know Redis and looking up commands in the documentation. So now we've got the movies in Redis, how do we find them again?

Prefixes everywhere!

People are most likely trying to search for a movie by starting to type its name into the input field. And we want to show them the matching movies before they're even finished typing the whole name, right? That's why we're talking about autocompletion here. That means we need an association between word prefixes and the movies. If someone were to type in The Dar
we want to show The Dark Knight
and The Dark Knight Rises
as possible search terms. Long story short: we need to get the prefixes of every movie we just dumped in our moviesearch:data
hash. For that I'm using a simple method, which is heavily based on the one Sanfilippo uses in his example script:

def prefixes_for(string)
  prefixes = []
  words    = string.downcase.split(' ')

  words.each do |word|
  (1..word.length).each {|i| prefixes << word[0...i] unless i == 1}


The movie with the name The Dark Knight
will result in the following prefixes: ```ruby prefixes_for('The Dark Knight')

=> ["th", "the", "da", "dar", "dark", "kn", "kni", "knig", "knigh", "knight"]

And *The Dark Knight Rises* will result in the following prefixes:
prefixes_for('The Dark Knight Rises')
# => ["th", "the", "da", "dar", "dark", "kn", "kni", "knig", "knigh", "knight", "ri", "ris", "rise", "rises"]

We need to generate those prefixes for every movie we want to use in our search autocompletion and so I use a minimum prefix length of two characters here, because using one character prefixes is a lot of overhead for search completion where most people are going to type in more than one character.

(Of course it's entirely possible to not only use prefixes, but use every range of characters from any position in the word. Instead of using fi, fis, fish
for Fish
, we could use fi, fis, fish, is, ish, sh
. But people are more likely to type the beginning of a word, I guess.)

Save those prefixes to sorted sets!

Redis uses sorted sets
as a list of unique strings that can be ranked by a score. For now, we will ignore the score and just use this as a set where every entry is unique and trying to add one with the same name won't result in a new entry. So let's create a sorted set for every prefix. This set will include the key of the movies in which the prefix occurs and the pattern for those keys is the one Soulmate uses: moviesearch:index:$PREFIX

In order to associate the movie The Dark Knight
with its prefixes we need to do the following for every prefix:

ZADD moviesearch:index:dar 0 9

As you might have guessed, the prefix here is dar
. The next number in this command is the score the member of this sorted set will have, but as I said, ignore this for now and keep in mind that the 9
is the key of our moviesearch:data
hash pointing to the The Dark Knight
. So we need to associate all prefixes of every moviename with its key in the moviesearch:data
hash. Written in Ruby a method doing exactly that would look like this:

def add_movie(movie_name, data_hash_key)
  prefixes = prefixes_for(movie_name)

  prefixes.each do |prefix|
    REDIS.zadd("moviesearch:index:#{prefix}", 0, data_hash_key)

That's pretty simple, isn't it? The method takes two arguments: the name of the movie and the key of the moviesearch:data
hash pointing to its data. After using that method for all the movies we added to our data hash, we have a lot of sorted sets with its members being the keys for our data hash. That means, after adding The Dark Knight
and The Dark Knight Rises
the moviesearch:index:dark
set has two members: 9
and 10
. So, what does that give us?

Let's imagine a user is visiting our website and typing dar
into the input field of the search form.

We now get all the entries in moviesearch:index:dar
, which are the keys of our moviesearch:data hash. The sorted set with the key moviesearch:index:dar
contains 9
and 10
as members. With those numbers we can now just fetch all the hash entries with these as key and present them to the user. But let's see how that works.

Matching more than one word using Redis' ZINTERSTORE

If we want to get all the entries for moviesearch:index:ki
we use the ZRANGE command provided by Redis:

ZRANGE moviesearch:index:dar 0 -1

Starting from the first element ( 0
since Redis starts indexing with 0) and going to the last ( -1
) we get the hash keys for movies whose names contain the prefix 'dar':

$ redis-cli ZRANGE moviesearch:index:dar 0 -1
1) "9"
2) "10"

And now, let's fetch all the entries from our moviesearch:data
with those keys:

$ redis-cli HMGET moviesearch:data 9 10
1) "{"name":"The Dark Knight Rises","year":2008}"
2) "{"name":"The Dark Knight","year":2008}"

Instead of using HGET
we use HMGET
to fetch multiple hash entries. That's quite neat! Now we have all the movies containing a word that starts with 'dar'. And we could present those to the user, who is typing and waiting for suggestions. Let's go one step further though:

Let's suppose a user has typed ki bi
into our search form. We now want to present him Kill Bill
, Kill Bill 2
, Kilts for Bill
as suggestions, but not Killer Elite
and not King Kong
. That means we need to find a movie containing both those prefixes in its name and not only one of them. And this is exactly where Redis' ZINTERSTORE
comes out to play.

creates a new sorted with a given key containing all the members occuring in the sets passed to it. Let's use it to create a temporary set containing the hash keys of the movies having 'ki bi' in their names:

$ redis-cli ZINTERSTORE moviesearch:index:ki|bi 2 moviesearch:index:ki moviesearch:index:bi

What happens here? ZINTERSTORE
looks up which members are in both moviesearch:index:ki
and moviesearch:index:bi
and creates a new sorted set with the key moviesearch:index:ki|bi
containing those members. The pattern for this key is also from Soulmate: dig through the code as there are lots of great ideas! Now we can use the ZRANGE
command again and see which movies contain those prefixes:

$ redis-cli ZRANGE 'moviesearch:index:ki|bi'
1) "1"
2) "4"
3) "5"

And those are exactly the keys pointing to Kill Bill
, Kill Bill 2
and Kilts For Bill
in the moviesearch:data
hash! Great! All we have to do now is use HMGET
to get the data for those keys from the hash and present them to the user.

Score & Popularity

Until now we ignored the score of our sorted sets. But let's say all our users are looking up Kill Bill
by typing in Ki Bi
and hitting enter. Let's also assume there are a lot more users looking up Kill Bill
than there are people interested in Kindergarten Cop
(as weird as this assumption may sound, bear with me here). Remember when we associated the movies with the prefixes? We did this, using 0
as the score:

ZADD moviesearch:index:ki 0 1

Now, everytime a person looks up Kill Bill
instead of Kindergarten Cop
we can increment the score of the 1
entry (pointing to Kill Bill
) in the moviesearch:index:ki
set (and all the other sets containing 1
) using ZINCRBY
increments the score of a given member in a given set by a given value. In order to sort our results by popularity we could increment the score of a given movie everytime a user looks that movie up. A simple method for doing this would probably look like this:

def incr_score_for(movie_name, data_hash_key)
  prefixes    = prefixes_for(movie_name)

  prefixes.each do |prefix|
    REDIS.zincrby(index_key_for(prefix), 1, data_hash_key)

We take every prefix occuring in the movie name and increment the score of the member pointing to the movie's data in the moviesearch:data

Now, before thinking about scores of set members we used ZRANGE
to get the members of a given set. Working with scores now, the next time we try to match a movie with the given prefixes we'll use ZREVRANGE
to return the matching hash keys ordered by their respective score. The following should then give us Kill Bill
at the top after we incremented the score for this movie every time a user looks it up.

ZREVRANGE  moviesearch:index:ki 0 -1

combined we can write a Ruby method to look up movies matching the passed prefixes and order them by score:

def find_by_prefixes(prefixes)
  intersection_key = index_key_for(prefixes)
  index_keys       = prefixes.map {|prefix| index_key_for(prefix)}

  REDIS.zinterstore(intersection_key, index_keys)
  REDIS.expire(intersection_key, 7200)

  data_hash_keys  = REDIS.zrevrange(intersection_key, 0, -1)
  matching_movies = REDIS.hmget(data_key, *data_hash_keys)

  matching_movies.map {|movie| JSON.parse(movie, symbolize_names: true)}

This method takes an array of prefixes as argument, creating the index keys for them, then it creates a temporary sorted set using ZINTERSTORE
tells Redis to delete a given key after the specified time in seconds) containing the data hash keys pointing to the movies. After that ZREVRANGE
gives us all the members of this set ordered by their score and finally HMGET
is used to get the data for all the matching keys from the moviesearch:data
hash. There are a couple of helper methods involved, but it should still be pretty clear what happens. If not, look at the code of the whole MovieMatcher
class I use in Anygood here

So everything combined we use Redis hashes to save our data, sorted sets for every prefix whose members point at our movies in the hash and in order to find movies containing multiple prefixes we use ZINTERSTORE
as a cache to point us to the movies containing both. And now we've got search autocompletion presenting ordered results to the user matching multiple phrases!

If you want to dig deeper, please read through the source code of Soulmate and study those two articles mentioned in the first paragraph. They both do a great job at explaining what exactly is going on here and why to use sorted sets and the other data types as we do.

And if you're not following me on Twitter
you're missing out!

责编内容by:Thorsten Ball (源链)。感谢您的支持!


Redis集群实现原理探讨 Redis集群是一个distribute、fault-tolerant的Redis实现,主要设计目标是达到线性可扩展性、可用性、数据一致性。 ...
PubSub in Service Fabric with Redis I’ve been working on a project that uses Fabric and I’m hosting Redis inside the...
redis数据库安装及简单的增删改查 redis下载地址: https://github.com/MSOpenTech/redis/releases 。 解压之后,运行 redi...
一个Redis Cache实现 需求 应用中需要通过HTTP调用远程的数据,但是这个获取过程需要执行较长时间,而且这个数据本身的变化也不频繁,这种情况最适合用一个cache来优化。 ...
CentOS 7 安装 redis 3.0.6 集群 # yum -y install gcc openssl-devel libyaml-devel libffi-devel readline-devel zli...