## Fast and Easy Levenshtein distance using a Trie (in C++)

I implemented this clever algorithm described on Steven Hanov’s blog in C++. Actually this algorithm is a bit different from the original because I wanted to know what is the least Levenshtein distance given a word.

#include <iostream> #include <map> #include <string> #include <vector> #include <algorithm> #include <cctype> /* * Algorithm: Edit distance using a trie-tree (Dynamic Programming) * Author: Murilo Adriano Vasconcelos <muriloufg@gmail.com> */ using namespace std; // Trie's node struct trie { typedef map<char, trie*> next_t; // The set with all the letters which this node is prefix next_t next; // If word is equal to "" is because there is no word in the // dictionary which ends here. string word; trie() : next(map<char, trie*>()) {} void insert(string w) { w = string("$") + w; int sz = w.size(); trie* n = this; for (int i = 0; i < sz; ++i) { if (n->next.find(w[i]) == n->next.end()) { n->next[w[i]] = new trie(); } n = n->next[w[i]]; } n->word = w; } }; // The tree trie tree; // The minimum cost of a given word to be changed to a word of the dictionary int min_cost; // void search_impl(trie* tree, char ch, vector<int> last_row, const string& word) { int sz = last_row.size(); vector<int> current_row(sz); current_row[0] = last_row[0] + 1; // Calculate the min cost of insertion, deletion, match or substution int insert_or_del, replace; for (int i = 1; i < sz; ++i) { insert_or_del = min(current_row[i-1] + 1, last_row[i] + 1); replace = (word[i-1] == ch) ? last_row[i-1] : (last_row[i-1] + 1); current_row[i] = min(insert_or_del, replace); } // When we find a cost that is less than the min_cost, is because // it is the minimum until the current row, so we update if ((current_row[sz-1] < min_cost) && (tree->word != "")) { min_cost = current_row[sz-1]; } // If there is an element wich is smaller than the current minimum cost, // we can have another cost smaller than the current minimum cost if (*min_element(current_row.begin(), current_row.end()) < min_cost) { for (trie::next_t::iterator it = tree->next.begin(); it != tree->next.end(); ++it) { search_impl(it->second, it->first, current_row, word); } } } int search(string word) { word = string("$") + word; int sz = word.size(); min_cost = 0x3f3f3f3f; vector<int> current_row(sz + 1); // Naive DP initialization for (int i = 0; i < sz; ++i) current_row[i] = i; current_row[sz] = sz; // For each letter in the root map wich matches with a // letter in word, we must call the search for (int i = 0 ; i < sz; ++i) { if (tree.next.find(word[i]) != tree.next.end()) { search_impl(tree.next[word[i]], word[i], current_row, word); } } return min_cost; }

The modification for match all words within a given Leveshtein distance and returning that set is trivial. Just modify the line 72 for storing that word if its Leveshtein distance is less than or equal to a given distance.

For more details, read this: Fast and Easy Levenshtein distance using a Trie

See ya!

Véi, seu blog tá massa! =)

Ewerton Araújo2 de February de 2011 at 19:20

Que bom que gostou!

Murilo Adriano3 de February de 2011 at 14:01

Great algorithm! Thanks for sharing! =o)

However, performance can be easily improved by having arrays instead of vectors, given that we already know their sizes (word.size()+1). It’s an almost 2x speedup with my big test files.

Also, the init can entirely be done inside the loop, and if the word already exists inside the dictionary, it’s quicker to create a specialized function for that and run it before the Levenshtein algorithm.

Mic6 de February de 2011 at 20:24

Hi Mic,

I liked the suggestion of creating a function to know if the words already exists on the dictionary because no extra memory is required and can save a lot of time.

About the use of arrays instead of vectors, yes, we can save time by using arrays but I think the most expensive thing is passig the

`last_row`

by value instead of passing by reference. I think that this modification can save some time and the performance will be similar of using raw arrays.Thank you for letting me know about your tests and for commenting!

Murilo Adriano7 de February de 2011 at 15:39

Hi,

it seems this code does not work correctly if you exchange the first character in the word. For example, if you insert “smukk”, then search for “bmukk”, you will get garbage (or rather, 0x3f3f3f3f), and not 1.

Audun Mathias Øygard26 de April de 2011 at 17:05

Hi Audun,

Thank you for letting me know this problem.

I’ve updated the code by adding a common mark (‘$’) on the strings for starting the computation.

Murilo Adriano27 de April de 2011 at 09:27

Great! Works now, thanks!

Audun Mathias Øygard27 de April de 2011 at 16:51

[…] this to work for most of my test cases. My implementation is practically a direct translation from Murilo’s C++ version of Steve Hanov’s algorithm. So how should I refactor this algorithm and/or make […]

Implementing a simple Trie for efficient Levenshtein Distance calculation – Java | Ask Programming & Technology5 de November de 2013 at 05:08