Fast and Easy Levenshtein distance using a Trie (in C++)

I implemented this clever algorithm described on Steven Hanov’s blog in C++. Actually this algorithm is a bit different from the original because I wanted to know what is the least Levenshtein distance given a word.

#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>

/*
 * Algorithm: Edit distance using a trie-tree (Dynamic Programming)
 * Author: Murilo Adriano Vasconcelos <muriloufg@gmail.com>
 */

using namespace std;

// Trie's node
struct trie
{
	typedef map<char, trie*> next_t;

	// The set with all the letters which this node is prefix
	next_t next;

	// If word is equal to "" is because there is no word in the
	//	dictionary which ends here.
	string word;

	trie() : next(map<char, trie*>()) {}

	void insert(string w)
	{
		w = string("$") + w;
		
		int sz = w.size();
		
		trie* n = this;
		for (int i = 0; i < sz; ++i) {
			if (n->next.find(w[i]) == n->next.end()) {
				n->next[w[i]] = new trie();
			}

			n = n->next[w[i]];
		}

		n->word = w;
	}
};

// The tree
trie tree;

// The minimum cost of a given word to be changed to a word of the dictionary
int min_cost;

//
void search_impl(trie* tree, char ch, vector<int> last_row, const string& word)
{
	int sz = last_row.size();

	vector<int> current_row(sz);
	current_row[0] = last_row[0] + 1;

	// Calculate the min cost of insertion, deletion, match or substution
	int insert_or_del, replace;
	for (int i = 1; i < sz; ++i) {
		insert_or_del = min(current_row[i-1] + 1, last_row[i] + 1);
		replace = (word[i-1] == ch) ? last_row[i-1] : (last_row[i-1] + 1);

		current_row[i] = min(insert_or_del, replace);
	}

	// When we find a cost that is less than the min_cost, is because
	// it is the minimum until the current row, so we update
	if ((current_row[sz-1] < min_cost) && (tree->word != "")) {
		min_cost = current_row[sz-1];
	}

	// If there is an element wich is smaller than the current minimum cost,
	// 	we can have another cost smaller than the current minimum cost
	if (*min_element(current_row.begin(), current_row.end()) < min_cost) {
		for (trie::next_t::iterator it = tree->next.begin(); it != tree->next.end(); ++it) {
			search_impl(it->second, it->first, current_row, word);
		}
	}
}

int search(string word)
{
	word = string("$") + word;
	
	int sz = word.size();
	min_cost = 0x3f3f3f3f;

	vector<int> current_row(sz + 1);

	// Naive DP initialization
	for (int i = 0; i < sz; ++i) current_row[i] = i;
	current_row[sz] = sz;
	
	
	
	// For each letter in the root map wich matches with a
	// 	letter in word, we must call the search
	for (int i = 0 ; i < sz; ++i) {
		if (tree.next.find(word[i]) != tree.next.end()) {
			search_impl(tree.next[word[i]], word[i], current_row, word);
		}
	}

	return min_cost;
}

The modification for match all words within a given Leveshtein distance and returning that set is trivial. Just modify the line 72 for storing that word if its Leveshtein distance is less than or equal to a given distance.

For more details, read this: Fast and Easy Levenshtein distance using a Trie
See ya!

Written by Murilo Adriano

1 de February de 2011 at 21:46

Posted in C/C++, Estruturas de Dados

Tagged with algorithm, c, data structure, string, trie

8 Responses

Subscribe to comments with RSS.

Véi, seu blog tá massa! =)

Ewerton Araújo

2 de February de 2011 at 19:20

Reply
- Que bom que gostou!
  
  Murilo Adriano
  
  3 de February de 2011 at 14:01
  
  Reply
Great algorithm! Thanks for sharing! =o)

However, performance can be easily improved by having arrays instead of vectors, given that we already know their sizes (word.size()+1). It’s an almost 2x speedup with my big test files.

Also, the init can entirely be done inside the loop, and if the word already exists inside the dictionary, it’s quicker to create a specialized function for that and run it before the Levenshtein algorithm.

Mic

6 de February de 2011 at 20:24

Reply
- Hi Mic,
  
  I liked the suggestion of creating a function to know if the words already exists on the dictionary because no extra memory is required and can save a lot of time.
  About the use of arrays instead of vectors, yes, we can save time by using arrays but I think the most expensive thing is passig the last_row by value instead of passing by reference. I think that this modification can save some time and the performance will be similar of using raw arrays.
  
  Thank you for letting me know about your tests and for commenting!
  
  Murilo Adriano
  
  7 de February de 2011 at 15:39
  
  Reply
Hi,

it seems this code does not work correctly if you exchange the first character in the word. For example, if you insert “smukk”, then search for “bmukk”, you will get garbage (or rather, 0x3f3f3f3f), and not 1.

Audun Mathias Øygard

26 de April de 2011 at 17:05

Reply
- Hi Audun,
  
  Thank you for letting me know this problem.
  I’ve updated the code by adding a common mark (‘$’) on the strings for starting the computation.
  
  Murilo Adriano
  
  27 de April de 2011 at 09:27
  
  Reply
  - Great! Works now, thanks!
    
    Audun Mathias Øygard
    
    27 de April de 2011 at 16:51
[…] this to work for most of my test cases. My implementation is practically a direct translation from Murilo’s C++ version of Steve Hanov’s algorithm. So how should I refactor this algorithm and/or make […]

Implementing a simple Trie for efficient Levenshtein Distance calculation – Java | Ask Programming & Technology

5 de November de 2013 at 05:08

Reply

	affordable lasik on Escrevendo um Código PHP …
	Implementing a simpl… on Fast and Easy Levenshtein dist…
	Murilo Adriano on Deeper look at PHP’s arr…
	aajjbb on Deeper look at PHP’s arr…
	Deeper look at PHP… on PHP’s sort functions are…
	Mais uma dica de blo… on Comparação de desempenho: std:…
	Mais uma dica de blo… on Linguagens de programação…

Murilo :P