Building a Vector Space Indexing Engine in Python


Building a Vector Space Indexing Engine in Python

Ever wanted to code a search engine from scratch? Well actually its a pretty simple thing to do. Here is an example indexer I coded up in less then an hour using Python.

Get the source to everything below here in

The first thing we need to do is have a way to take our documents we want to search on and turn them into an concordance. A concordance for those not in the know is a count of every word that occurs in a document.

def concordance(document):
  if type(document) != str:
    raise ValueError('Supplied Argument should be of type string')
  con = {}
  for word in document.split(' '):
    if con.has_key(word):
      con[word] = con[word] + 1
      con[word] = 1
  return con

The above method simply allows us to pass in a clean text document and get back a concordance of the words in that document.

The only other thing we need is a vector space. A vector space for those not in the know is a way of calculating the distances between two points. Essentially it works the same way calculating the 3rd side of a triangle. Except that instead of 2 planes (x and y) or even 3 planes (x,y,z) you can have as many planes as you want. The actual idea takes a while to understand but you can read about it here, Vector Space Search Engine Theory (PDF).

Thankfully I already have implemented the algorithm in my Decoding CAPTCHA’s post and can just copy paste it from there. I have modified it a little bit to avoid divide by zero issues, check types and to add the above concordance method in since it really does belong together.

class VectorCompare:
  def magnitude(self,concordance):
    if type(concordance) != dict:
      raise ValueError('Supplied Argument should be of type dict')
    total = 0
    for word,count in concordance.iteritems():
      total += count ** 2
    return math.sqrt(total)

  def relation(self,concordance1, concordance2):
    if type(concordance1) != dict:
      raise ValueError('Supplied Argument 1 should be of type dict')
    if type(concordance2) != dict:
      raise ValueError('Supplied Argument 2 should be of type dict')
    relevance = 0
    topvalue = 0
    for word, count in concordance1.iteritems():
      if concordance2.has_key(word):
        topvalue += count * concordance2[word]
    if (self.magnitude(concordance1) * self.magnitude(concordance2)) != 0:
      return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
      return 0

  def concordance(self,document):
    if type(document) != str:
      raise ValueError('Supplied Argument should be of type string')
    con = {}
    for word in document.split(' '):
      if con.has_key(word):
        con[word] = con[word] + 1
        con[word] = 1
    return con

To use it you just supply two concordances (one the document and the other the query) and  it returns a number from 0 to 1 of how related they are. The higher the number the more relevant the search terms are to the document.

So now all we need do, is take every document, build a concordance for it, then compare each one to our search terms, sort the results by the number returned and we are set. The documents I decided to use are the titles and first paragraph of the last 7 blogs I have posted here.

v = VectorCompare()

documents = {
  0:'''At Scale You Will Hit Every Performance Issue I used to think I knew a bit about performance scalability and how to keep things trucking when you hit large amounts of data Truth is I know diddly squat on the subject since the most I have ever done is read about how its done To understand how I came about realising this you need some background''',
  1:'''Richard Stallman to visit Australia Im not usually one to promote events and the like unless I feel there is a genuine benefit to be had by attending but this is one stands out Richard M Stallman the guru of Free Software is coming Down Under to hold a talk You can read about him here Open Source Celebrity to visit Australia''',
  2:'''MySQL Backups Done Easily One thing that comes up a lot on sites like Stackoverflow and the like is how to backup MySQL databases The first answer is usually use mysqldump This is all fine and good till you start to want to dump multiple databases You can do this all in one like using the all databases option however this makes restoring a single database an issue since you have to parse out the parts you want which can be a pain''',
  3:'''Why You Shouldnt roll your own CAPTCHA At a TechEd I attended a few years ago I was watching a presentation about Security presented by Rocky Heckman read his blog its quite good In it he was talking about security algorithms The part that really stuck with me went like this''',
  4:'''The Great Benefit of Test Driven Development Nobody Talks About The feeling of productivity because you are writing lots of code Think about that for a moment Ask any developer who wants to develop why they became a developer One of the first things that comes up is I enjoy writing code This is one of the things that I personally enjoy doing Writing code any code especially when its solving my current problem makes me feel productive It makes me feel like Im getting somewhere Its empowering''',
  5:'''Setting up GIT to use a Subversion SVN style workflow Moving from Subversion SVN to GIT can be a little confusing at first I think the biggest thing I noticed was that GIT doesnt have a specific workflow you have to pick your own Personally I wanted to stick to my Subversion like work-flow with a central server which all my machines would pull and push too Since it took a while to set up I thought I would throw up a blog post on how to do it''',
  6:'''Why CAPTCHA Never Use Numbers 0 1 5 7 Interestingly this sort of question pops up a lot in my referring search term stats Why CAPTCHAs never use the numbers 0 1 5 7 Its a relativity simple question with a reasonably simple answer Its because each of the above numbers are easy to confuse with a letter See the below''',

index = {

searchterm = raw_input('Enter Search Term: ')
matches = []

for i in range(len(index)):
  relation = v.relation(v.concordance(searchterm.lower()),index[i])
  if relation != 0:


for i in matches:
  print i[0],i[1]

Now running it and trying some searches.

Enter Search Term: captcha
0.124034734589 Why You Shouldnt roll your own CAPTCHA At a TechEd I attended a few years ago I was watching a prese
0.0957826285221 Why CAPTCHA Never Use Numbers 0 1 5 7 Interestingly this sort of question pops up a lot in my referr

Enter Search Term: mysql stallman
0.140028008403 Richard Stallman to visit Australia Im not usually one to promote events and the like unless I feel
0.110096376513 MySQL Backups Done Easily One thing that comes up a lot on siteslike Stackoverflow and the like is

Results are not too bad I think! Now there are some problems with this technique. Firstly it doesn’t support boolean searches which can be an issue, although most people tend to just type some terms. Secondly it has problems with larger documents. The way the vector space works is biased towards smaller documents since they are closer to the search term space. You can get around this by breaking larger documents up into smaller ones though. The final and biggest issue though is that it is pretty CPU intensive. I have tested a search like this with 50,000 documents and it was OK but you wouldn’t want to go much further then that. It is a pretty naive implementation though. With some caching and checking which documents are worth comparing you could take this up to millions of documents.

I remember reading somewhere (no source sorry) that Altavista and some of the other early search engines used a technique similar to the above for calculating rankings, so it seems the idea really can be taken to a large scale.

By now I am sure someone is thinking, “Hang on, if its that simple then why is it so hard to make the next Google?”. Well the answer is that its pretty easy to index 10,000 to 100,000,000 pages it gets considerably more difficult to index 1,000,000,000+ pages. You have to shard out to multiple computers and the margin for error is pretty low. You can read this post Why Writing a Search Engine is Hard written by Anna Patterson (one of the co-founders of  Cuil) which explains the problem nicely.

Damn Cool Algorithms: Levenshtein Automata


Damn Cool Algorithms: Levenshtein Automata

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance - or any other metric that obeys the triangle inequality. Today, I'm going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.


The basic insight behind Levenshtein automata is that it's possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It's thus immediately apparrent that Levenshtein automaton provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance - not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There's much more to come, but first let's see what a Levenshtein automaton looks like, and how we can build one.

Construction and evaluation

The diagram on the right shows the NFA for a Levenshtein automaton for the word 'food', with maximum edit distance 2. As you can see, it's very regular, and the construction is fairly straightforward. The start state is in the lower left. States are named using a ne style notation, where n is the number of characters consumed so far, and e is the number of errors. Horizontal transitions represent unmodified characters, vertical transitions represent insertions, and the two types of diagonal transitions represent substitutions (the ones marked with a *) and deletions, respectively.

Let's see how we can construct an NFA such as this given an input word and a maximum edit distance. I won't include the source for the NFA class here, since it's fairly standard; for gory details, see the Gist. Here's the relevant function in Python:

def levenshtein_automata(term, k):
= NFA((0, 0))
for i, c in enumerate(term):
for e in range(k + 1):
# Correct character
.add_transition((i, e), c, (i + 1, e))
if e < k:
# Deletion
.add_transition((i, e), NFA.ANY, (i, e + 1))
# Insertion
.add_transition((i, e), NFA.EPSILON, (i + 1, e + 1))
# Substitution
.add_transition((i, e), NFA.ANY, (i + 1, e + 1))
for e in range(k + 1):
if e < k:
.add_transition((len(term), e), NFA.ANY, (len(term), e + 1))
.add_final_state((len(term), e))
return nfa

This should be easy to follow; we're basically constructing the transitions you see in the diagram in the most straightforward manner possible, as well as denoting the correct set of final states. State labels are tuples, rather than strings, with the tuples using the same notation we described above.

Because this is an NFA, there can be multiple active states. Between them, these represent the possible interpretations of the string processed so far. For example, consider the active states after consuming the characters 'f' and 'x':

This indicates there are several possible variations that are consistent with the first two characters 'f' and 'x': A single substitution, as in 'fxod', a single insertion, as in 'fxood', two insertions, as in 'fxfood', or a substitution and a deletion, as in 'fxd'. Also included are several redundant hypotheses, such as a deletion and an insertion, also resulting in 'fxod'. As more characters are processed, some of these possibilities will be eliminated, while other new ones will be introduced. If, after consuming all the characters in the word, an accepting (bolded) state is in the set of current states, there's a way to convert the input word into the target word with two or fewer changes, and we know we can accept the word as valid.

Actually evaluating an NFA directly tends to be fairly computationally expensive, due to the presence of multiple active states, and epsilon transitions (that is, transitions that require no input symbol), so the standard practice is to first convert the NFA to a DFA using powerset construction. Using this algorithm, a DFA is constructed in which each state corresponds to a set of active states in the original NFA. We won't go into detail about powerset construction here, since it's tangential to the main topic. Here's an example of a DFA corresponding to the NFA for the input word 'food' with one allowable error:

Note that we depicted a DFA for one error, as the DFA corresponding to our NFA above is a bit too complex to fit comfortably in a blog post! The DFA above will accept exactly the words that have an edit distance to the word 'food' of 1 or less. Try it out: pick any word, and trace its path through the DFA. If you end up in a bolded state, the word is valid.

I won't include the source for powerset construction here; it's also in the gist for those who care.

Briefly returning to the issue of runtime efficiency, you may be wondering how efficient Levenshtein DFA construction is. We can construct the NFA in O(kn) time, where k is the edit distance and n is the length of the target word. Conversion to a DFA has a worst case of O(2n) time - which leads to a pretty extreme worst-case of O(2kn) runtime! Not all is doom and gloom, though, for two reasons: First of all, Levenshtein automata won't come anywhere near the 2n worst-case for DFA construction*. Second, some very clever computer scientists have come up with algorithms to construct the DFA directly in O(n) time, [SCHULZ2002FAST] and even a method to skip the DFA construction entirely and use a table-based evaluation method!


Now that we've established that it's possible to construct Levenshtein automata, and demonstrated how they work, let's take a look at how we can use them to search an index for fuzzy matches efficiently. The first insight, and the approach many papers [SCHULZ2002FAST] [MIHOV2004FAST] take is to observe that a dictionary - that is, the set of records you want to search - can itself be represented as a DFA. In fact, they're frequently stored as a Trie or a DAWG, both of which can be viewed as special cases of DFAs. Given that both the dictionary and the criteria (the Levenshtein automata) are represented as DFAs, it's then possible to efficiently intersect the two DFAs to find exactly the words in the dictionary that match our criteria, using a very simple procedure that looks something like this:

def intersect(dfa1, dfa2):
= [("", dfa1.start_state, dfa2.start_state)]
while stack:
, state1, state2 = stack.pop()
for edge in set(dfa1.edges(state1)).intersect(dfa2.edges(state2)):
=, edge)
=, edge)
if state1 and state2:
= s + edge
.append((s, state1, state2))
if dfa1.is_final(state1) and dfa2.is_final(state2):
yield s

That is, we traverse both DFAs in lockstep, only following edges that both DFAs have in common, and keeping track of the path we've followed so far. Any time both DFAs are in a final state, that word is in the output set, so we output it.

This is all very well if your index is stored as a DFA (or a trie or DAWG), but many indexes aren't: if they're in-memory, they're probably in a sorted list, and if they're on disk, they're probably in a BTree or similar structure. Is there a way we can modify this scheme to work with these sort of indexes, and still provide a speedup on brute-force approaches? It turns out that there is.

The critical insight here is that with our criteria expressed as a DFA, we can, given an input string that doesn't match, find the next string (lexicographically speaking) that does. Intuitively, this is fairly easy to do: we evaluate the input string against the DFA until we cannot proceed further - for example, because there's no valid transition for the next character. Then, we repeatedly follow the edge that has the lexicographically smallest label until we reach a final state. Two special cases apply here: First, on the first transition, we need to follow the lexicographically smallest label greater than character that had no valid transition in the preliminary step. Second, if we reach a state with no valid outwards edge, we should backtrack to the previous state, and try again. This is pretty much the 'wall following' maze solving algorithm, as applied to a DFA.

For an example of this, take a look at the DFA for food(1), above, and consider the input word 'foogle'. We consume the first four characters fine, leaving us in state 3141. The only out edge from here is 'd', while the next character is 'l', so we backtrack one step to the previous state, 21303141. From here, our next character is 'g', and there's an out-edge for 'f', so we take that edge, leaving us in an accepting state (the same state as previously, in fact, but with a different path to it) with the output string 'fooh' - the lexicographically next string in the DFA after 'foogle'.

Here's the Python code for it, as a method on the DFA class. As previously, I haven't included boilerplate for the DFA, which is all here.

  def next_valid_string(self, input):
= self.start_state
= []

# Evaluate the DFA as far as possible
for i, x in enumerate(input):
.append((input[:i], state, x))
= self.next_state(state, x)
if not state: break
.append((input[:i+1], state, None))

if self.is_final(state):
# Input word is already valid
return input

# Perform a 'wall following' search for the lexicographically smallest
# accepting state.
while stack:
, state, x = stack.pop()
= self.find_next_edge(state, x)
if x:
+= x
= self.next_state(state, x)
if self.is_final(state):
return path
.append((path, state, None))
return None

In the first part of the function, we evaluate the DFA in the normal fashion, keeping a stack of visited states, along with the path so far and the edge we intend to attempt to follow out of them. Then, assuming we didn't find an exact match, we perform the backtracking search we described above, attempting to find the smallest set of transitions we can follow to come to an accepting state. For some caveats about the generality of this function, read on...

Also needed is the utility function find_next_edge, which finds the lexicographically smallest outwards edge from a state that's greater than some given input:

  def find_next_edge(self, s, x):
if x is None:
= u'\0'
= unichr(ord(x) + 1)
= self.transitions.get(s, {})
if x in state_transitions or s in self.defaults:
return x
= sorted(state_transitions.keys())
= bisect.bisect_left(labels, x)
if pos < len(labels):
return labels[pos]
return None

With some preprocessing, this could be made substantially more efficient - for example, by pregenerating a mapping from each character to the first outgoing edge greater than it, rather than using binary search to find it in many cases. I once again leave such optimizations as an exercise for the reader.

Now that we have this procedure, we can finally describe how to search the index with it. The algorithm is surprisingly simple:

  1. Obtain the first element from your index - or alternately, a string you know to be less than any valid string in your index - and call it the 'current' string.
  2. Feed the current string into the 'DFA successor' algorithm we outlined above, obtaining the 'next' string.
  3. If the next string is equal to the current string, you have found a match - output it, fetch the next element from the index as the current string, and repeat from step 2.
  4. If the next string is not equal to the current string, search your index for the first string greater than or equal to the next string. Make this the new current string, and repeat from step 2.

And once again, here's the implementation of this procedure in Python:

def find_all_matches(word, k, lookup_func):
"""Uses lookup_func to find all words within levenshtein distance k of word.

word: The word to look up
k: Maximum edit distance
lookup_func: A single argument function that returns the first word in the
database that is greater than or equal to the input argument.
Every matching word within levenshtein distance k from the database.

= levenshtein_automata(word, k).to_dfa()
= lev.next_valid_string(u'\0')
while match:
next = lookup_func(match)
if not next:
if match == next:
yield match
next = next + u'\0'
= lev.next_valid_string(next)

One way of looking at this algorithm is to think of both the Levenshtein DFA and the index as sorted lists, and the procedure above to be similar to App Engine's "zigzag merge join" strategy. We repeatedly look up a string on one side, and use that to jump to the appropriate place on the other side. If there's no matching entry, we use the result of the lookup to jump ahead on the first side, and so forth. The result is that we skip over large numbers of non-matching index entries, as well as large numbers of non-matching Levenshtein strings, saving us the effort of exhaustively enumerating either of them. Hopefully it's apparrent from the description that this procedure has the potential to avoid the need to evaluate either all of the index entries, or all of the candidate Levenshtein strings.

As a side note, it's not true that for all DFAs it's possible to find a lexicographically minimal successor to any string. For example, consider the successor to the string 'a' in the DFA that recognizes the pattern 'a+b'. The answer is that there isn't one: it would have to consist of an infinite number of 'a' characters followed by a single 'b' character! It's possible to make a fairly simple modification to the procedure outlined above such that it returns a string that's guaranteed to be a prefix of the next string recognized by the DFA, which is sufficient for our purposes. Since Levenshtein DFAs are always finite, though, and thus always have a finite length successor (except for the last string, naturally), we leave such an extension as an exercise for the reader. There are potentially interesting applications one could put this approach to, such as indexed regular expression search, which would require this modification.


First, let's see this in action. We'll define a simple Matcher class, which provides an implementation of the lookup_func required by our find_all_matches function:

class Matcher(object):
def __init__(self, l):
self.l = l
self.probes = 0

def __call__(self, w):
self.probes += 1
= bisect.bisect_left(self.l, w)
if pos < len(self.l):
return self.l[pos]
return None

Note that the only reason we implemented a callable class here is because we want to extract some metrics - specifically, the number of probes made - from the procedure; normally a regular or nested function would be perfectly sufficient. Now, we need a sample dataset. Let's load the web2 dictionary for that:

>>> words = [x.strip().lower().decode('utf-8') for x in open('/usr/share/dict/web2')]
>>> words.sort()
>>> len(words)

We could also use a couple of subsets for testing how things change with scale:

>>> words10 = [x for x in words if random.random() <= 0.1]
>>> words100 = [x for x in words if random.random() <= 0.01]

And here it is in action:

>>> m = Matcher(words)
>>> list(automata.find_all_matches('nice', 1, m))
[u'anice', u'bice', u'dice', u'fice', u'ice', u'mice', u'nace', u'nice', u'niche', u'nick', u'nide', u'niece', u'nife', u'nile', u'nine', u'niue', u'pice', u'rice', u'sice', u'tice', u'unice', u'vice', u'wice']
>>> len(_)
>>> m.probes

Working perfectly! Finding the 23 fuzzy matches for 'nice' in the dictionary of nearly 235,000 words required 142 probes. Note that if we assume an alphabet of 26 characters, there are 4+26*4+26*5=238 strings within levenshtein distance 1 of 'nice', so we've made a reasonable saving over exhaustively testing all of them. With larger alphabets, longer strings, or larger edit distances, this saving should be more pronounced. It may be instructive to see how the number of probes varies as a function of word length and dictionary size, by testing it with a variety of inputs:

String length Max strings Small dict Med dict Full dict
1 79 47 (59%) 54 (68%) 81 (100%)
2 132 81 (61%) 103 (78%) 129 (97%)
3 185 94 (50%) 120 (64%) 147 (79%)
4 238 94 (39%) 123 (51%) 155 (65%)
5 291 94 (32%) 124 (43%) 161 (55%)

In this table, 'max strings' is the total number of strings within edit distance one of the input string, and the values for small, med, and full dict represent the number of probes required to search the three dictionaries (consisting of 1%, 10% and 100% of the web2 dictionary). All the following rows, at least until 10 characters, required a similar number of probes as row 5. The sample input string used consisted of prefixes of the word 'abracadabra'.

Several observations are immediately apparrent:

  1. For very short strings and large dictionaries, the number of probes is not much lower - if at all - than the maximum number of valid strings, so there's little saving.
  2. As the string gets longer, the number of probes required increases significantly slower than the number of potential results, so that at 10 characters, we need only probe 161 of 821 (about 20%) possible results. At commonly encountered word lengths (97% of words in the web2 dictionary are at least 5 characters long), the savings over naively checking every string variation are already significant.
  3. Although the size of the sample dictionaries differs by an order of magnitude, the number of probes required increases only a little each time. This provides encouraging evidence that this will scale well for very large indexes.

It's also instructive to see how this varies for different edit distance thresholds. Here's the same table, for a max edit distance of 2:

String length Max strings Small dict Med dict Full dict
1 2054 413 (20%) 843 (41%) 1531 (75%)
2 10428 486 (5%) 1226 (12%) 2600 (25%)
3 24420 644 (3%) 1643 (7%) 3229 (13%)
4 44030 646 (1.5%) 1676 (4%) 3366 (8%)
5 69258 648 (0.9%) 1676 (2%) 3377 (5%)

This is also promising: with an edit distance of 2, although we're having to do a lot more probes, it's a much smaller percentage of the number of candidate strings. With a word length of 5 and an edit distance of 2, having to do 3377 probes is definitely far preferable to doing 69,258 (one for every matching string) or 234,936 (one for every word in the dictionary)!

As a quick comparison, a straightforward BK-tree implementation with the same dictionary requires examining 5858 nodes for a string length of 5 and an edit distance of 1 (with the same sample string used above), while the same query with an edit distance of 2 required 58,928 nodes to be examined! Admittedly, many of these nodes are likely to be on the same disk page, if structured properly, but there's still a startling difference in the number of lookups.

One final note: The second paper we referenced in this article, [MIHOV2004FAST] describes a very nifty construction: a universal Levenshtein automata. This is a DFA that determines, in linear time, if any pair of words are within a given fixed Levenshtein distance of each other. Adapting the above scheme to this system is, also, left as an exercise for the reader.

The complete source code for this article can be found here.

* Robust proofs of this hypothesis are welcome.

[SCHULZ2002FAST] Fast string correction with Levenshtein automata

[MIHOV2004FAST] Fast approximate search in large dictionaries