Murilo :P

C++, Computação, Programação, Web e afins :)

Archive for February 2011

PHP’s sort functions are poorly designed

with 6 comments

Everyone knows that is possible to sort an array in O(N log N) where N is the array’s length.

We know also, that Quicksort has expected complexity of O(N log N) but in worst case its complextity is O(N^2), and Quicksort is used more than Mergesort (worst case O(N log N)) because constants are smaller and it requires no extra memory and is easy to avoid Quicksort running on the worst case.

The easiest way to avoid bad inputs for Quicksort is choosing the pivot randomly, which means that a malicious user cannot guess a permutation for cracking it.

What I found at the PHP source is that the pivot choosing is done by picking the element at middle of array.

Here is the code (link line 76):

ZEND_API void zend_qsort(void *base, size_t nmemb, size_t siz, compare_func_t compare TSRMLS_DC)
{
	void           *begin_stack[QSORT_STACK_SIZE];
	void           *end_stack[QSORT_STACK_SIZE];
	register char  *begin;
	register char  *end;
	register char  *seg1;
	register char  *seg2;
	register char  *seg2p;
	register int    loop;
	uint            offset;

	begin_stack[0] = (char *) base;
	end_stack[0]   = (char *) base + ((nmemb - 1) * siz);

	for (loop = 0; loop >= 0; --loop) {
		begin = begin_stack[loop];
		end   = end_stack[loop];

		while (begin < end) {
			offset = (end - begin) >> 1; // HERE: choosing the current subarray's middle element
			_zend_qsort_swap(begin, begin + (offset - (offset % siz)), siz);

			seg1 = begin + siz;
			seg2 = end;

			while (1) {
				for (; seg1 < seg2 && compare(begin, seg1 TSRMLS_CC) > 0;
				     seg1 += siz);

				for (; seg2 >= seg1 && compare(seg2, begin TSRMLS_CC) > 0;
				     seg2 -= siz);

				if (seg1 >= seg2)
					break;

				_zend_qsort_swap(seg1, seg2, siz);

				seg1 += siz;
				seg2 -= siz;
			}

			_zend_qsort_swap(begin, seg2, siz);

			seg2p = seg2;

			if ((seg2p - begin) <= (end - seg2p)) {
				if ((seg2p + siz) < end) {
					begin_stack[loop] = seg2p + siz;
					end_stack[loop++] = end;
				}
				end = seg2p - siz;
			}
			else {
				if ((seg2p - siz) > begin) {
					begin_stack[loop] = begin;
					end_stack[loop++] = seg2p - siz;
				}
				begin = seg2p + siz;
			}
		}
	}
}

The pivot is named offset here.

Tests

I used the usort() function for counting how many comparisons was needed to sort the array in my cmp() function.
So, lets see the tests:

<?php

$counter = 0; // counts how many comparisons
function cmp($a, $b)
{
	global $counter;
	++$counter;
	return $a - $b;
}

$normal = array();
for ($i = 0; $i < 15; ++$i) {
	$normal[] = $i;
}
shuffle($normal);

$counter = 0;
print "<pre>Normal array:\n";
usort($normal, 'cmp');
print $counter . " comparisons\n\n";

$malicious = array(2, 9, 4, 12, 6, 11, 8, 1, 3, 5, 7, 9, 10, 14, 12);

$counter = 0;
print "Malicious array:\n";
usort($malicious, 'cmp');
print $counter . " comparisons\n</pre>";

?>

And the results:
Normal array:
64 comparisons // variable because of the shuffle

Malicious array:
208 comparisons // constant

It means that the PHP’s sort() family depends on the input array’s length and content while a Randomized Quicksort depends only on the input array’s length.

As we can see, usort() sorted the randomly generated array of length 20 very fast with a few number of comparisons, but in the case of a manipulated array of length 20, the number of comparisons are much more.

In the same way, taking an array of length 10.000, these functions do around 50.000.000 comparisons when the expected is around 132.000. This can be a problem if some “bad” guy inserts manipulated data which your script will sort.

Solution

The naive solution is shuffle()‘n the array before sorting.

<?php

$malicious = array(11, 3, 20, 5, 13, 7, 17, 9, 15, 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 19);
shuffle($malicious);

$counter = 0;
print "No longer malicious:\n";
usort($malicious, 'cmp');
print $counter . " coparisons\n</pre>";

?>

Output:
No longer malicious:
76 comparisons

References

zend_qsort source
PHP: sort() function
PHP: usort() function
PHP: shuffle() function
Quicksort – Wikipedia
Mergesort – Wikipedia

Written by Murilo Adriano

5 de February de 2011 at 14:24

Fast and Easy Levenshtein distance using a Trie (in C++)

with 8 comments

I implemented this clever algorithm described on Steven Hanov’s blog in C++. Actually this algorithm is a bit different from the original because I wanted to know what is the least Levenshtein distance given a word.

#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>

/*
 * Algorithm: Edit distance using a trie-tree (Dynamic Programming)
 * Author: Murilo Adriano Vasconcelos <muriloufg@gmail.com>
 */

using namespace std;

// Trie's node
struct trie
{
	typedef map<char, trie*> next_t;

	// The set with all the letters which this node is prefix
	next_t next;

	// If word is equal to "" is because there is no word in the
	//	dictionary which ends here.
	string word;

	trie() : next(map<char, trie*>()) {}

	void insert(string w)
	{
		w = string("$") + w;
		
		int sz = w.size();
		
		trie* n = this;
		for (int i = 0; i < sz; ++i) {
			if (n->next.find(w[i]) == n->next.end()) {
				n->next[w[i]] = new trie();
			}

			n = n->next[w[i]];
		}

		n->word = w;
	}
};

// The tree
trie tree;

// The minimum cost of a given word to be changed to a word of the dictionary
int min_cost;

//
void search_impl(trie* tree, char ch, vector<int> last_row, const string& word)
{
	int sz = last_row.size();

	vector<int> current_row(sz);
	current_row[0] = last_row[0] + 1;

	// Calculate the min cost of insertion, deletion, match or substution
	int insert_or_del, replace;
	for (int i = 1; i < sz; ++i) {
		insert_or_del = min(current_row[i-1] + 1, last_row[i] + 1);
		replace = (word[i-1] == ch) ? last_row[i-1] : (last_row[i-1] + 1);

		current_row[i] = min(insert_or_del, replace);
	}

	// When we find a cost that is less than the min_cost, is because
	// it is the minimum until the current row, so we update
	if ((current_row[sz-1] < min_cost) && (tree->word != "")) {
		min_cost = current_row[sz-1];
	}

	// If there is an element wich is smaller than the current minimum cost,
	// 	we can have another cost smaller than the current minimum cost
	if (*min_element(current_row.begin(), current_row.end()) < min_cost) {
		for (trie::next_t::iterator it = tree->next.begin(); it != tree->next.end(); ++it) {
			search_impl(it->second, it->first, current_row, word);
		}
	}
}

int search(string word)
{
	word = string("$") + word;
	
	int sz = word.size();
	min_cost = 0x3f3f3f3f;

	vector<int> current_row(sz + 1);

	// Naive DP initialization
	for (int i = 0; i < sz; ++i) current_row[i] = i;
	current_row[sz] = sz;
	
	
	
	// For each letter in the root map wich matches with a
	// 	letter in word, we must call the search
	for (int i = 0 ; i < sz; ++i) {
		if (tree.next.find(word[i]) != tree.next.end()) {
			search_impl(tree.next[word[i]], word[i], current_row, word);
		}
	}

	return min_cost;
}

The modification for match all words within a given Leveshtein distance and returning that set is trivial. Just modify the line 72 for storing that word if its Leveshtein distance is less than or equal to a given distance.

For more details, read this: Fast and Easy Levenshtein distance using a Trie
See ya!

Written by Murilo Adriano

1 de February de 2011 at 21:46