## Posts Tagged ‘algorithm’

## Deeper look at PHP’s array: worst case

As in the post about PHP’s sort functions, I was looking at the PHP source when I found something interesting: the hashing function used by PHP’s hash tables (a.k.a. associative arrays).

Above the function definition, a PHP contributor explained the function:

This is Daniel J. Bernstein’s popular `times 33` hash function as posted by him years ago on comp.lang.c. It basically uses a function like `hash(i) = hash(i-1) * 33 + str[i]`. This is one of the best known hash functions for strings, because it is both computed very fast and distributes very well.

This function, known as DJBX33A, is indeed very fast, and it does distribute very well when the data follows a uniform distribution. That is exactly the problem: a user can easily produce a set of keys that all collide into the same bucket, making the array’s operations drastically slower than usual.
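To make this concrete, here is a small C++ sketch of my own (not PHP’s actual code, which also unrolls the loop) of the times-33 recurrence, together with a classic way of building colliding keys: the two-character blocks "Ez" and "FY" contribute exactly the same value to the hash, so every string assembled from those blocks collides with all the others.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Sketch of the times-33 recurrence (DJBX33A): hash(i) = hash(i-1) * 33 + str[i],
// starting from Bernstein's seed 5381.
std::uint64_t djbx33a(const std::string& str) {
    std::uint64_t hash = 5381;
    for (unsigned char c : str) {
        hash = hash * 33 + c;
    }
    return hash;
}

int main() {
    // "Ez" and "FY" contribute the same value ('E'*33 + 'z' == 'F'*33 + 'Y' == 2399),
    // so every string built from these two blocks collides with all the others.
    std::vector<std::string> keys = {"EzEzEz", "EzEzFY", "EzFYEz", "EzFYFY",
                                     "FYEzEz", "FYEzFY", "FYFYEz", "FYFYFY"};
    for (const std::string& k : keys) {
        std::cout << k << " -> " << djbx33a(k) << '\n';  // all print the same hash
    }
    return 0;
}
```

A set like this (extended to longer blocks) is one way to produce the kind of key file used in the experiment below.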

The expected complexity of array operations such as lookup, insert and update is constant (`O(1)`), as it is for any hash table. The problem is that it is easy to generate keys that make such operations degrade to linear time (`O(N)`). This could be exploited by malicious users to produce a certain level of DoS on unaware systems.

### The experiment

To demonstrate the point, I generated more than 100,000 keys that all hash to the same value, filled one array with those keys, and filled another array of the same size with random keys:

```php
define('ARRAY_LEN', 100000);

$keys = file('php_hash_output.txt'); // Where I stored the keys

$biased = array();
$normal = array();

$i = 0;
foreach ($keys as $k) {
    $biased[$k] = ++$i;
    $normal[rand()] = $i;

    if ($i > ARRAY_LEN) break;
}
```

Then I measured the time spent to iterate over the keys of the two arrays (PHP 5.3.15 cli):

```php
$normal_keys = array_keys($normal);

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
    $var += $normal[$normal_keys[$i]];
}
echo 'Normal: ' . (microtime(true) - $t) . PHP_EOL;

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
    $var += $biased[$keys[$i]];
}
echo 'Biased: ' . (microtime(true) - $t) . PHP_EOL;
```

The output on my 2.3 GHz Intel Core i7 was as expected:

```
Normal: 0.048759937286377
Biased: 123.01204395294
```

That means a task that runs in less than 0.05 seconds could take more than 2 minutes, i.e., become about 2,500 times slower. This happens because the `for` loop iterating over the `$biased` array performs around 10,000,000,000 (10^10, which does not fit in a 32-bit integer) operations, while the iteration over the `$normal` array performs only about 100,000 (10^5).

### What could be done?

#### Universal Hashing

One possible solution is to use the universal hashing approach. Here is the definition given in the Wikipedia article on universal hashing:

Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

That would guarantee that nobody could generate a set of colliding keys, simply because he/she would not know which hash function is being used. On the other hand, it would be difficult (not to say impossible) to keep a feature of PHP’s associative arrays: their order. This leads to our second alternative.
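Just as an illustration of the idea (a sketch of mine, not a proposal of how PHP would implement it, and not a formally analyzed universal family), the essential ingredient is that the function’s parameters are drawn at random when the process starts, so an attacker cannot precompute colliding keys:

```cpp
#include <cstdint>
#include <random>
#include <string>

// Illustration of the universal-hashing idea: the parameters of the hash function
// are chosen at random at startup, so which keys collide cannot be known in advance.
// This is a simplified sketch, not PHP's implementation.
struct RandomizedStringHash {
    std::uint64_t seed;
    std::uint64_t multiplier;

    RandomizedStringHash() {
        std::random_device rd;
        std::mt19937_64 gen(rd());
        seed = gen();
        multiplier = gen() | 1;  // keep the multiplier odd
    }

    std::uint64_t operator()(const std::string& str) const {
        std::uint64_t hash = seed;
        for (unsigned char c : str) {
            hash = hash * multiplier + c;
        }
        return hash;
    }
};
```

With the parameters unknown, a precomputed set of keys like the "Ez"/"FY" blocks above no longer helps the attacker.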

#### Implement as Balanced Binary Trees

Implementing PHP’s arrays as balanced binary search trees, such as Red-Black or AVL trees, would allow them to keep the keys ordered, at the cost of each operation no longer being constant on average but logarithmic in the array’s size.
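For a feel of the trade-off, C++’s `std::map` (typically a red-black tree) behaves this way: operations are `O(log N)` no matter how the keys were crafted, though note that it keeps keys in sorted order, whereas PHP arrays preserve insertion order, which a tree-based implementation would still have to track. A minimal sketch:

```cpp
#include <iostream>
#include <map>
#include <string>

// std::map is typically a red-black tree, so lookups and insertions are O(log N)
// regardless of how the keys were crafted: no hashing is involved at all.
int main() {
    std::map<std::string, int> m;
    m["FYFYFY"] = 1;   // keys that collide under DJBX33A...
    m["EzEzEz"] = 2;   // ...are harmless here
    m["EzFYEz"] = 3;

    for (const auto& kv : m) {
        std::cout << kv.first << " => " << kv.second << '\n';  // iterated in key order
    }
    return 0;
}
```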

### Conclusion

None of the proposed ideas is ideal, and a good solution will be hard to come by, given that PHP is one of the most used programming languages and any major change would have a huge impact. As Rasmus himself said a couple of years ago:

We are looking into it. Changing the core hash function in PHP isn’t a trivial change and it will take us some time.

– Rasmus Lerdorf

Of course, most uses of arrays in PHP do not suffer from the behavior described in this post, because the keys do not depend on user input. Still, programmers should keep this in mind in order to handle user input properly.

### References

- Murilo 😛 – PHP’s sort functions are poorly designed
- Github – PHP Source
- Wikipedia – Hash tables
- Wikipedia – Uniform distribution
- Wikipedia – Universal hashing
- Wikipedia – Balanced binary search trees

## PHP’s sort functions are poorly designed

Everyone knows that it is possible to sort an array in `O(N log N)`, where `N` is the array’s length.

We also know that Quicksort has an expected complexity of `O(N log N)` but a worst case of `O(N^2)`. Still, Quicksort is used more often than Mergesort (worst case `O(N log N)`) because its constants are smaller, it requires no extra memory, and it is easy to keep Quicksort away from its worst case.

The easiest way to avoid bad inputs for Quicksort is to choose the pivot randomly, which means that a malicious user cannot craft a permutation that triggers the worst case.
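Here is a minimal randomized-quicksort sketch of my own (not PHP’s code) just to show how little is needed: the only change from a textbook quicksort is that the pivot index is drawn at random.

```cpp
#include <random>
#include <utility>
#include <vector>

// Randomized quicksort: the pivot index is drawn uniformly at random, so no fixed
// input permutation can deterministically force the O(N^2) behavior.
void quicksort(std::vector<int>& a, int lo, int hi, std::mt19937& gen) {
    if (lo >= hi) return;

    std::uniform_int_distribution<int> dist(lo, hi);
    std::swap(a[lo], a[dist(gen)]);          // random pivot moved to the front
    int pivot = a[lo];

    int i = lo;
    for (int j = lo + 1; j <= hi; ++j) {     // Lomuto-style partition
        if (a[j] < pivot) {
            ++i;
            std::swap(a[i], a[j]);
        }
    }
    std::swap(a[lo], a[i]);                  // pivot lands at its final position

    quicksort(a, lo, i - 1, gen);
    quicksort(a, i + 1, hi, gen);
}

int main() {
    std::vector<int> v = {2, 9, 4, 12, 6, 11, 8, 1, 3, 5, 7, 9, 10, 14, 12};
    std::mt19937 gen(std::random_device{}());
    quicksort(v, 0, static_cast<int>(v.size()) - 1, gen);
    return 0;
}
```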

What I found in the PHP source is that the pivot is chosen by picking the element in the middle of the current subarray.

Here is the code (see line 76 of the linked source):

```c
ZEND_API void zend_qsort(void *base, size_t nmemb, size_t siz, compare_func_t compare TSRMLS_DC)
{
    void           *begin_stack[QSORT_STACK_SIZE];
    void           *end_stack[QSORT_STACK_SIZE];
    register char  *begin;
    register char  *end;
    register char  *seg1;
    register char  *seg2;
    register char  *seg2p;
    register int    loop;
    uint            offset;

    begin_stack[0] = (char *) base;
    end_stack[0]   = (char *) base + ((nmemb - 1) * siz);

    for (loop = 0; loop >= 0; --loop) {
        begin = begin_stack[loop];
        end   = end_stack[loop];

        while (begin < end) {
            offset = (end - begin) >> 1; // HERE: choosing the current subarray's middle element
            _zend_qsort_swap(begin, begin + (offset - (offset % siz)), siz);

            seg1 = begin + siz;
            seg2 = end;

            while (1) {
                for (; seg1 < seg2 && compare(begin, seg1 TSRMLS_CC) > 0; seg1 += siz);
                for (; seg2 >= seg1 && compare(seg2, begin TSRMLS_CC) > 0; seg2 -= siz);

                if (seg1 >= seg2)
                    break;

                _zend_qsort_swap(seg1, seg2, siz);

                seg1 += siz;
                seg2 -= siz;
            }

            _zend_qsort_swap(begin, seg2, siz);

            seg2p = seg2;

            if ((seg2p - begin) <= (end - seg2p)) {
                if ((seg2p + siz) < end) {
                    begin_stack[loop] = seg2p + siz;
                    end_stack[loop++] = end;
                }
                end = seg2p - siz;
            } else {
                if ((seg2p - siz) > begin) {
                    begin_stack[loop] = begin;
                    end_stack[loop++] = seg2p - siz;
                }
                begin = seg2p + siz;
            }
        }
    }
}
```

The pivot is named `offset` here.

### Tests

I used the `usort()` function and counted, inside my `cmp()` comparator, how many comparisons were needed to sort the array.

So, let’s see the tests:

```php
<?php
$counter = 0; // counts how many comparisons

function cmp($a, $b)
{
    global $counter;
    ++$counter;

    return $a - $b;
}

$normal = array();
for ($i = 0; $i < 15; ++$i) {
    $normal[] = $i;
}
shuffle($normal);

$counter = 0;
print "<pre>Normal array:\n";
usort($normal, 'cmp');
print $counter . " comparisons\n\n";

$malicious = array(2, 9, 4, 12, 6, 11, 8, 1, 3, 5, 7, 9, 10, 14, 12);

$counter = 0;
print "Malicious array:\n";
usort($malicious, 'cmp');
print $counter . " comparisons\n</pre>";
?>
```

And the results:

```
Normal array:
64 comparisons   // variable because of the shuffle

Malicious array:
208 comparisons  // constant
```

It means that the number of comparisons made by PHP’s `sort()` family depends on both the input array’s length and its content, while a randomized Quicksort depends (in expectation) only on the input array’s length.

As we can see, `usort()` sorted the randomly shuffled array very quickly, with few comparisons, but for the manipulated array of the same length the number of comparisons is much higher.

In the same way, for an array of length 10,000 these functions can be forced into around 50,000,000 comparisons, when the expected number is around 132,000. This can be a problem if some “bad” guy inserts manipulated data that your script will sort.

### Solution

The naive solution is to `shuffle()` the array before sorting.

```php
<?php
$malicious = array(11, 3, 20, 5, 13, 7, 17, 9, 15, 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 19);
shuffle($malicious);

$counter = 0;
print "No longer malicious:\n";
usort($malicious, 'cmp');
print $counter . " comparisons\n</pre>";
?>
```

Output:

```
No longer malicious:
76 comparisons
```

### References

- zend_qsort source
- PHP: sort() function
- PHP: usort() function
- PHP: shuffle() function
- Wikipedia – Quicksort
- Wikipedia – Mergesort

## Fast and Easy Levenshtein distance using a Trie (in C++)

I implemented in C++ this clever algorithm described on Steven Hanov’s blog. Actually, this version is a bit different from the original because I wanted to find the least Levenshtein distance between a given word and any word in the dictionary.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>

/*
 * Algorithm: Edit distance using a trie-tree (Dynamic Programming)
 * Author: Murilo Adriano Vasconcelos <muriloufg@gmail.com>
 */

using namespace std;

// Trie's node
struct trie {
    typedef map<char, trie*> next_t;

    // The set with all the letters for which this node is a prefix
    next_t next;

    // If word is equal to "" it is because no word in the
    // dictionary ends here.
    string word;

    trie() : next(map<char, trie*>()) {}

    void insert(string w)
    {
        w = string("$") + w;
        int sz = w.size();

        trie* n = this;
        for (int i = 0; i < sz; ++i) {
            if (n->next.find(w[i]) == n->next.end()) {
                n->next[w[i]] = new trie();
            }

            n = n->next[w[i]];
        }

        n->word = w;
    }
};

// The tree
trie tree;

// The minimum cost for a given word to be changed into a word of the dictionary
int min_cost;

void search_impl(trie* tree, char ch, vector<int> last_row, const string& word)
{
    int sz = last_row.size();

    vector<int> current_row(sz);
    current_row[0] = last_row[0] + 1;

    // Calculate the min cost of insertion, deletion, match or substitution
    int insert_or_del, replace;
    for (int i = 1; i < sz; ++i) {
        insert_or_del = min(current_row[i-1] + 1, last_row[i] + 1);
        replace = (word[i-1] == ch) ? last_row[i-1] : (last_row[i-1] + 1);

        current_row[i] = min(insert_or_del, replace);
    }

    // When we find a cost that is less than min_cost at a node that ends a
    // dictionary word, it is the minimum so far, so we update it
    if ((current_row[sz-1] < min_cost) && (tree->word != "")) {
        min_cost = current_row[sz-1];
    }

    // If there is an element which is smaller than the current minimum cost,
    // we may still find a cost smaller than the current minimum cost deeper down
    if (*min_element(current_row.begin(), current_row.end()) < min_cost) {
        for (trie::next_t::iterator it = tree->next.begin(); it != tree->next.end(); ++it) {
            search_impl(it->second, it->first, current_row, word);
        }
    }
}

int search(string word)
{
    word = string("$") + word;
    int sz = word.size();

    min_cost = 0x3f3f3f3f;

    vector<int> current_row(sz + 1);

    // Naive DP initialization
    for (int i = 0; i < sz; ++i) current_row[i] = i;
    current_row[sz] = sz;

    // For each letter in the root map which matches a
    // letter in word, we must call the search
    for (int i = 0; i < sz; ++i) {
        if (tree.next.find(word[i]) != tree.next.end()) {
            search_impl(tree.next[word[i]], word[i], current_row, word);
        }
    }

    return min_cost;
}
```

The modification to match all words within a given Levenshtein distance and return that set is trivial: just change the point in `search_impl()` where `min_cost` is updated so that it stores the word whenever its Levenshtein distance is less than or equal to the given bound.
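For completeness, here is a small usage sketch, assuming it is compiled together with the code above (the dictionary words and the query are just examples):

```cpp
int main() {
    // Build the dictionary in the global `tree` defined above.
    tree.insert("algorithm");
    tree.insert("altruistic");
    tree.insert("logarithm");

    // Least Levenshtein distance from the query to any dictionary word.
    cout << search("algorythm") << endl;  // expected output: 1 (one substitution away from "algorithm")
    return 0;
}
```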

For more details, read this: Fast and Easy Levenshtein distance using a Trie

See ya!