Murilo :P

C++, Computação, Programação, Web e afins :)

Fast and Easy Levenshtein distance using a Trie (in C++)

com 7 comentários

I implemented this clever algorithm described on Steven Hanov’s blog in C++. Actually this algorithm is a bit different from the original because I wanted to know what is the least Levenshtein distance given a word.

#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>

/*
 * Algorithm: Edit distance using a trie-tree (Dynamic Programming)
 * Author: Murilo Adriano Vasconcelos <muriloufg@gmail.com>
 */

using namespace std;

// Trie's node
struct trie
{
	typedef map<char, trie*> next_t;

	// The set with all the letters which this node is prefix
	next_t next;

	// If word is equal to "" is because there is no word in the
	//	dictionary which ends here.
	string word;

	trie() : next(map<char, trie*>()) {}

	void insert(string w)
	{
		w = string("$") + w;
		
		int sz = w.size();
		
		trie* n = this;
		for (int i = 0; i < sz; ++i) {
			if (n->next.find(w[i]) == n->next.end()) {
				n->next[w[i]] = new trie();
			}

			n = n->next[w[i]];
		}

		n->word = w;
	}
};

// The tree
trie tree;

// The minimum cost of a given word to be changed to a word of the dictionary
int min_cost;

//
void search_impl(trie* tree, char ch, vector<int> last_row, const string& word)
{
	int sz = last_row.size();

	vector<int> current_row(sz);
	current_row[0] = last_row[0] + 1;

	// Calculate the min cost of insertion, deletion, match or substution
	int insert_or_del, replace;
	for (int i = 1; i < sz; ++i) {
		insert_or_del = min(current_row[i-1] + 1, last_row[i] + 1);
		replace = (word[i-1] == ch) ? last_row[i-1] : (last_row[i-1] + 1);

		current_row[i] = min(insert_or_del, replace);
	}

	// When we find a cost that is less than the min_cost, is because
	// it is the minimum until the current row, so we update
	if ((current_row[sz-1] < min_cost) && (tree->word != "")) {
		min_cost = current_row[sz-1];
	}

	// If there is an element wich is smaller than the current minimum cost,
	// 	we can have another cost smaller than the current minimum cost
	if (*min_element(current_row.begin(), current_row.end()) < min_cost) {
		for (trie::next_t::iterator it = tree->next.begin(); it != tree->next.end(); ++it) {
			search_impl(it->second, it->first, current_row, word);
		}
	}
}

int search(string word)
{
	word = string("$") + word;
	
	int sz = word.size();
	min_cost = 0x3f3f3f3f;

	vector<int> current_row(sz + 1);

	// Naive DP initialization
	for (int i = 0; i < sz; ++i) current_row[i] = i;
	current_row[sz] = sz;
	
	
	
	// For each letter in the root map wich matches with a
	// 	letter in word, we must call the search
	for (int i = 0 ; i < sz; ++i) {
		if (tree.next.find(word[i]) != tree.next.end()) {
			search_impl(tree.next[word[i]], word[i], current_row, word);
		}
	}

	return min_cost;
}

The modification for match all words within a given Leveshtein distance and returning that set is trivial. Just modify the line 72 for storing that word if its Leveshtein distance is less than or equal to a given distance.

For more details, read this: Fast and Easy Levenshtein distance using a Trie
See ya!

Escrito por Murilo Adriano

terça-feira 1 de fevereiro de 2011 às 21:46

Publicado em C/C++, Estruturas de Dados

Etiquetado com , , , ,

7 Respostas

Assinar os comentários com RSS.

  1. Véi, seu blog tá massa! =)

    Ewerton Araújo

    quarta-feira 2 de fevereiro de 2011 em 19:20

  2. Great algorithm! Thanks for sharing! =o)

    However, performance can be easily improved by having arrays instead of vectors, given that we already know their sizes (word.size()+1). It’s an almost 2x speedup with my big test files.

    Also, the init can entirely be done inside the loop, and if the word already exists inside the dictionary, it’s quicker to create a specialized function for that and run it before the Levenshtein algorithm.

    Mic

    domingo 6 de fevereiro de 2011 em 20:24

    • Hi Mic,

      I liked the suggestion of creating a function to know if the words already exists on the dictionary because no extra memory is required and can save a lot of time.
      About the use of arrays instead of vectors, yes, we can save time by using arrays but I think the most expensive thing is passig the last_row by value instead of passing by reference. I think that this modification can save some time and the performance will be similar of using raw arrays.

      Thank you for letting me know about your tests and for commenting!

      Murilo Adriano

      segunda-feira 7 de fevereiro de 2011 em 15:39

  3. Hi,

    it seems this code does not work correctly if you exchange the first character in the word. For example, if you insert “smukk”, then search for “bmukk”, you will get garbage (or rather, 0x3f3f3f3f), and not 1.

    Audun Mathias Øygard

    terça-feira 26 de abril de 2011 em 17:05

    • Hi Audun,

      Thank you for letting me know this problem.
      I’ve updated the code by adding a common mark (‘$’) on the strings for starting the computation.

      Murilo Adriano

      quarta-feira 27 de abril de 2011 em 09:27

      • Great! Works now, thanks!

        Audun Mathias Øygard

        quarta-feira 27 de abril de 2011 em 16:51


Deixe uma resposta

Preencha os seus dados abaixo ou clique em um ícone para log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Sair / Alterar )

Imagem do Twitter

You are commenting using your Twitter account. Sair / Alterar )

Foto do Facebook

You are commenting using your Facebook account. Sair / Alterar )

Connecting to %s

Seguir

Obtenha todo post novo entregue na sua caixa de entrada.