Ferret: A Resourceful Substring Search Engine in Go
posted Monday, Aug. 29 by Mark Canning


Building the backend infrastructure at Tamber, we have to solve difficult challenges with limited resources and tight budgets. Our entire infrastructure, making real-time recommendations for thousands of users, runs on a simple Amazon EC2 setup for $150/month. We need to make every CPU cycle count.

One of our greatest challenges has been building a flexible yet high-performance search engine that interfaces with our Go backend. The result is the open-source project Ferret: it handles error correction, sorts its results, and doesn't hog resources.

A demo runs on a single micro EC2 instance (free tier).

The micro instance's network performance limits the demo to 1,000 queries/second, but with a medium EC2 instance we've measured upwards of 100,000 queries/second.

Ferret is open-source and available here.

Designing Ferret

Before Ferret, we had set up a simple regex call that performed a prefix search over our Mongo database. This had a number of problems:

  1. Reading from the hard drive is very slow. On a good day, our server might have handled 1,000 queries per second this way, all while locking the database and dragging the server to a halt.

  2. It couldn't handle error correction, and could only do prefix searches, not full substring searches.

  3. It couldn't handle artists with multiple names, only searching their "official" name.

  4. It could be easily hit with a regex injection, where a specially crafted search string could take exponential time and grind the server to a halt.

Before jumping into writing our own search engine, we took some time to research our options. Most existing search engines (e.g. Lucene) were bloated and had poor (or no) interfaces with Go. A simple prefix search wouldn't meet our needs, and naive implementations of the remaining features would be just as slow as the earlier database approach. We needed a new algorithm.

Enter Ferret

Ferret meets our needs for a flexible and high-performance search engine with an algorithm that we've internally dubbed the "Inverted Suffix." The Inverted Suffix combines an Inverted Index and a Suffix Array. A pure Suffix Array only works for text searches, not dictionary searches, and a pure Inverted Index is unsorted and intended to be used over words, not characters.

Combining them, however, creates an efficient search engine that can perform queries in O(log(m) + n) time, where m is the number of characters in the dictionary and n is the length of the query. A search over a large dictionary of 100 million characters can be done in fewer than 100 comparisons.

Sorted queries are done by sorting over the matches in O(log(n)*r) time, where n is the number of results to return and r is the total number of matches; this is fairly efficient, generally adding little time on top of the search itself. Error correction is done by enumerating all the words within Levenshtein distance 1 of the query, then querying for each of them (assuming exact matches weren't already found).
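To make the error-correction step concrete, here is a minimal sketch of that enumeration (an illustration of the idea only, not Ferret's own ErrorCorrect code; the function name and example query are made up):

import "fmt"

// levenshtein1 enumerates the byte strings within Levenshtein distance 1 of
// the query over the given alphabet: every single deletion, substitution,
// and insertion.
func levenshtein1(query, alphabet []byte) [][]byte {
	variants := make([][]byte, 0)
	for i := 0; i <= len(query); i++ {
		if i < len(query) {
			// Deletion of the character at position i.
			del := append(append([]byte{}, query[:i]...), query[i+1:]...)
			variants = append(variants, del)
		}
		for _, c := range alphabet {
			if i < len(query) {
				// Substitution of the character at position i with c.
				sub := append([]byte{}, query...)
				sub[i] = c
				variants = append(variants, sub)
			}
			// Insertion of c before position i.
			ins := append(append(append([]byte{}, query[:i]...), c), query[i:]...)
			variants = append(variants, ins)
		}
	}
	return variants
}

func main() {
	// For a 5-letter query over a 26-letter alphabet this prints 291:
	// 5 deletions + 5*26 substitutions + 6*26 insertions.
	fmt.Println(len(levenshtein1([]byte("flyod"), []byte("abcdefghijklmnopqrstuvwxyz"))))
}

Each of these variants is then run through the same substring query as the original string.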

Why Go?

Ferret was built in Go for a number of reasons. First and foremost, our backend is written in Go, and we wanted our search engine to interface with the backend. But then, why write the backend in Go? And why not use Cgo or Plan 9 assembly for better performance?

  1. Go is not much slower than C or Assembly. Rewriting core Ferret functions in Assembly produces only a 20% improvement in query time and requires a distinct version for each architecture, while the C interface to Go requires converting types (especially slices), which dramatically slows each query.

  2. Go is expressive. When we originally ported the backend from Python to Go a year ago, the code-base only grew by 10%, with many functions getting shorter, thanks to Go's wonderful standard library. Development is also correspondingly faster.

Talk is Cheap: Show me the code!

Let's take a look at how some of the Ferret code works:

func New(Words, Results []string, Data []interface{}, Converter func(string) []byte) *InvertedSuffix {
	WordIndex := make([]int, 0)   // the word containing each suffix
	SuffixIndex := make([]int, 0) // the position of each suffix within its word
	NewWords := make([][]byte, 0)
	for i, Word := range Words {
		word := Converter(Word)
		// Record one (word, offset) pair per character, i.e. one per suffix.
		for j := 0; j < len(word); j++ {
			WordIndex = append(WordIndex, i)
			SuffixIndex = append(SuffixIndex, j)
		}
		NewWords = append(NewWords, word)
	}
	// Sort the suffixes NewWords[WordIndex[i]][SuffixIndex[i]:] alphabetically.
	sort.Sort(&sortWrapper{WordIndex, SuffixIndex, NewWords})
	Suffixes := &InvertedSuffix{WordIndex, SuffixIndex, NewWords, Results, Data, Converter}
	return Suffixes
}

This code sample initializes the Ferret search engine in only 16 lines of code, largely a testament to the elegance of Go. The sort.Sort call (from the standard library) sorts the suffixes Words[WordIndex[i]][SuffixIndex[i]:] alphabetically, where WordIndex maps each character (indexed by i) to its containing word, and SuffixIndex holds the position of that character in the word, so that the Python-esque slice Word[SuffixIndex[i]:] is the suffix starting at that character.
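To make the structure concrete, here is a tiny, made-up dictionary and the suffix ordering the constructor produces for it (sort.Slice stands in for Ferret's sortWrapper in this sketch):

import (
	"fmt"
	"sort"
)

func main() {
	// A made-up two-word dictionary.
	words := [][]byte{[]byte("cat"), []byte("cab")}

	// One entry per character: the containing word and the offset where the
	// suffix starts (Ferret keeps these in two parallel arrays).
	type entry struct{ wordIndex, suffixIndex int }
	var entries []entry
	for i, w := range words {
		for j := range w {
			entries = append(entries, entry{i, j})
		}
	}

	// Sort the suffixes words[wordIndex][suffixIndex:] alphabetically, which
	// is what sort.Sort(&sortWrapper{...}) does in Ferret's constructor.
	sort.Slice(entries, func(a, b int) bool {
		sa := words[entries[a].wordIndex][entries[a].suffixIndex:]
		sb := words[entries[b].wordIndex][entries[b].suffixIndex:]
		return string(sa) < string(sb)
	})

	for _, e := range entries {
		fmt.Printf("%s (word %d, offset %d)\n", words[e.wordIndex][e.suffixIndex:], e.wordIndex, e.suffixIndex)
	}
	// Prints: ab, at, b, cab, cat, t -- a binary search over this ordering
	// finds every word containing a given substring.
}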

Once you've initialized Ferret, it's time to start searching. A search maintains two bounds over the data: a lower bound marking the lowest suffix (alphabetically) that matches the query substring, and an upper bound marking the highest. To produce results, Ferret updates both of these boundaries for each character in the query, leaving you with the final range of substring matches.

Here's a segment of the low-level query code:

c := Query[a] // the current query character
i := low
j := high
// Raise the lower-bound
for i < j {
	h := (i + j) >> 1 // binary-search midpoint
	Index := IS.WordIndex[h]
	Word := IS.Words[Index]
	Length := len(Word)
	d := IS.SuffixIndex[h] + a // where the current query character falls in this suffix's word
	if d >= Length {
		// Suffix is too short to match: move the lower bound up
		i = h + 1
	} else {
		e := Word[d]
		if e < c {
			i = h + 1
		} else {
			j = h
			if e > c {
				// This suffix already sorts past the query character,
				// so it also tightens the final upper bound
				high = h
			}
		}
	}
}

This updating of the boundaries is done twice for each character in the query (indexed by 'a', and stored in 'c'): first the lower boundary is raised, then the upper boundary is lowered. The code itself is a basic binary search. However, instead of using sort.Search, which implements mostly the same thing, we write the loop out ourselves. This is for two reasons. First, writing it out gives a significant performance improvement over calling the library function, and second, we also tighten the upper bound inside this loop (if e > c { high = h }), which saves a lot of comparisons.
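For comparison, here is roughly what the same boundary updates look like when expressed with sort.Search, as a sketch over a minimal stand-in for the index fields used above (Ferret itself deliberately does not do this, for the reasons just given):

import "sort"

// miniIndex is a minimal stand-in with just the fields the low-level search
// touches; the real InvertedSuffix has more.
type miniIndex struct {
	WordIndex   []int
	SuffixIndex []int
	Words       [][]byte
}

// narrow shrinks [low, high) to the suffixes whose byte at query offset a
// equals c, using two sort.Search calls. Ferret's hand-written loop does the
// same work, but tightens high while it raises low, saving comparisons.
func (IS *miniIndex) narrow(low, high, a int, c byte) (int, int) {
	charAt := func(h int) int {
		Word := IS.Words[IS.WordIndex[h]]
		d := IS.SuffixIndex[h] + a
		if d >= len(Word) {
			return -1 // suffixes that end before offset a sort first
		}
		return int(Word[d])
	}
	newLow := low + sort.Search(high-low, func(k int) bool { return charAt(low+k) >= int(c) })
	newHigh := low + sort.Search(high-low, func(k int) bool { return charAt(low+k) > int(c) })
	return newLow, newHigh
}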

In the full code, all the higher-level functions (Query, SortedQuery, etc.) wrap the low-level Search function partially shown above, which you'll hopefully never have to use in your own code.

Using Ferret

Ferret is designed with flexibility in mind, so its interface has a lot of options available, many of which can be left unused.

Initializing Ferret takes several parameters for maximum flexibility, following the definition func New(Words, Results []string, Data []interface{}, Converter func(string) []byte) *InvertedSuffix:

  1. Words: the dictionary of strings you want to run substring searches over.

  2. Results: the result string that each word in the dictionary corresponds to (in most use cases Results will be identical to Words). This is useful for having multiple search values point to the same result (e.g. Dr. -> Doctor), or for mapping related values to a single result (e.g. Dog/Cat -> Pets).

  3. Data: a list of values (as interface{}) which are returned alongside the results and can also be used for sorting (e.g. artist popularities).

  4. Converter: a conversion function that takes a word or query string and returns the byte array used for the low-level searches. It exists for performance (Go strings are slow), and can also be used for easy search-space reduction (for example, case-insensitive searches or Unicode-to-ASCII conversion).
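For example, a minimal, hypothetical setup (the dictionary and popularity values below are made up purely for illustration) could look like:

import (
	"fmt"

	"github.com/argusdusty/Ferret"
)

func demoSearch() {
	// Words and Results are identical here; Data carries a popularity score per word.
	words := []string{"doctor", "dog", "cat"}
	data := []interface{}{0.9, 0.5, 0.7}

	// UnicodeToLowerASCII is the same converter used at Tamber below; it makes
	// searches case-insensitive.
	index := ferret.New(words, words, data, ferret.UnicodeToLowerASCII)

	// Ask for at most 5 matches for the substring "do".
	results, _ := index.Query("do", 5)
	fmt.Println(results) // e.g. [doctor dog]
}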

Result sorting is also done with several parameters, following the definition func(string, interface{}, int, int) float64:

  1. The string representing the matching result (a catch-all for any unusual sorting needs).

  2. The value (from Data) mapped to the result (which can hold outside values, such as frequency or artist popularity).

  3. The length of the matching byte array (which can differ from the length of the result string).

  4. The index of the start of the substring match within the word (which can be used to prioritize prefix matches).
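As a toy example of that signature, a sorter that ignores the result string and the data value, and simply favors shorter words and earlier matches, could look like this (purely illustrative; it assumes, as with the Tamber sorter below, that higher scores rank first):

// A hypothetical scorer: shorter matching words and earlier match positions
// produce higher scores, so they rank first.
func shortAndEarlySorter(result string, value interface{}, length int, index int) float64 {
	return -float64(length + index)
}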

Ferret at Tamber

Well, let me demonstrate exactly how Ferret is used at Tamber:

import "github.com/argusdusty/Ferret"

// ArtistSearchData holds the Ferret index, built at startup by InitSearch.
var ArtistSearchData *ferret.InvertedSuffix

var ArtistSearchConverter = ferret.UnicodeToLowerASCII

// ArtistSearchSorter ranks popular artists, shorter names, and earlier match
// positions higher.
func ArtistSearchSorter(s string, v interface{}, l int, i int) float64 {
	switch x := v.(type) {
	case float64:
		return (x - 1.0) * float64(l+i)
	}
	return -float64(l + i)
}

func ArtistSearchCorrection(b []byte) [][]byte { return ferret.ErrorCorrect(b, ferret.LowercaseLetters) }

func InitSearch() {
	// ArtistNames and PopularityMap are loaded elsewhere at startup.
	artists := make([]string, 0)
	data := make([]interface{}, 0)
	for a := range ArtistNames {
		artists = append(artists, a)
		if f, ok := PopularityMap[a]; ok {
			data = append(data, f)
		} else {
			data = append(data, 0.0)
		}
	}
	ArtistSearchData = ferret.New(artists, artists, data, ArtistSearchConverter)
}

func ArtistSearchQuery(Query string, Fast bool) []string {
	if Fast {
		Results, _ := ArtistSearchData.Query(Query, 6)
		return Results
	}
	Results, _, _ := ArtistSearchData.SortedErrorCorrectingQuery(Query, 25, ArtistSearchCorrection, ArtistSearchSorter)
	return Results
}

There you have it. Those few dozen lines cover all the code needed to run Ferret on our artist database at Tamber. InitSearch() is run at startup, after the artist names and familiarity scores are loaded. When we want to run an auto-complete search we just call ArtistSearchQuery(Query, true), and for a full search, ArtistSearchQuery(Query, false).

The sorter scores results in inverse proportion to the length of the match and the index of the substring, and in proportion to the popularity (familiarity) of the artist. The converter enables lower-case matching, so queries like "floyd" and "FLOYD" both match "Floyd". The error-corrector enumerates Levenshtein-distance-1 variants using lowercase letters; since the converter already maps the query to lowercase, there's no need to add upper-case ASCII characters into the mix, which makes error correction faster.

Overall, I had a lot of fun working on this challenge, and I hope you find Ferret (and Tamber) useful. Again, Ferret is open source and available on GitHub here.