Deeper look at PHP’s array: worst case

As I did in the post about the PHP sort functions, I was looking through the PHP source and found something interesting: the hashing function used by PHP's hash tables (a.k.a. associative arrays).

Above its definition, a PHP contributor explains the function:

This is Daniel J. Bernstein's popular `times 33' hash function as posted by him years ago on comp.lang.c. It basically uses a function like hash(i) = hash(i-1) * 33 + str[i]. This is one of the best known hash functions for strings. Because it is both computed very fast and distributes very well.

This function, known as DJBX33A, is indeed very fast, and it distributes very well when the keys follow a uniform distribution. That is precisely the problem: a user can easily craft keys that all collide into the same hash, making the array's operations drastically slower than usual.
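To make that concrete, here is a minimal PHP sketch of the times-33 scheme. The real implementation lives in C inside the Zend engine; the start value 5381 matches the one used there, but the function name and the 32-bit masking are mine:

// Sketch of DJBX33A (times 33); not PHP's actual C code, just the recipe
function djbx33a($key) {
    $hash = 5381; // Bernstein's start value, also used by the Zend engine
    for ($i = 0, $len = strlen($key); $i < $len; ++$i) {
        // hash(i) = hash(i-1) * 33 + str[i], truncated to 32 bits
        // (the mask assumes 64-bit PHP integers)
        $hash = (($hash * 33) + ord($key[$i])) & 0xFFFFFFFF;
    }
    return $hash;
}

echo djbx33a('Ez') . ' ' . djbx33a('FY') . PHP_EOL; // same value: a collision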

The expected complexity of array operations such as lookup, insertion, and update is constant (O(1)), as it is for any hash table. The problem is that it is easy to generate keys that turn these operations linear (O(N)), something a malicious user could exploit to mount a denial-of-service (DoS) attack against unaware systems.
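Here is one way such keys can be generated (this helper is hypothetical, not from the original experiment, but it produces the kind of file read below). It relies on the classic observation that the two-character blocks 'Ez' and 'FY' contribute the same amount to a times-33 hash ('E' * 33 + 'z' == 'F' * 33 + 'Y' == 2399), so every string built by concatenating them collides with all the others of the same length:

// Hypothetical generator for the colliding keys used in the experiment.
// All 2^$blocks concatenations of "Ez" and "FY" share one hash value.
function colliding_keys($blocks) {
    $keys = array('');
    for ($i = 0; $i < $blocks; ++$i) {
        $next = array();
        foreach ($keys as $k) {
            $next[] = $k . 'Ez';
            $next[] = $k . 'FY';
        }
        $keys = $next;
    }
    return $keys;
}

// 2^17 = 131,072 keys: more than the 100,000 the experiment needs
file_put_contents('php_hash_output.txt', implode(PHP_EOL, colliding_keys(17)));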

The experiment

To demonstrate the point, I generated more than 100,000 keys that hash to the same value, filled one array with those keys, and filled another array of the same size with random keys:

define('ARRAY_LEN', 100000);

// Where I stored the colliding keys, one per line
$keys = file('php_hash_output.txt', FILE_IGNORE_NEW_LINES);
$biased = array();
$normal = array();

$i = 0;
foreach ($keys as $k) {
    $biased[$k] = ++$i;   // every key lands in the same bucket
    $normal[rand()] = $i; // random keys distribute well

    if ($i > ARRAY_LEN) break;
}

Then I measured the time spent looking up all the keys of the two arrays (PHP 5.3.15 CLI):

$normal_keys = array_keys($normal);

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
    $var += $normal[$normal_keys[$i]]; // O(1) expected per lookup
}
echo 'Normal: ' . (microtime(true) - $t) . PHP_EOL;

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
    $var += $biased[$keys[$i]]; // O(N) per lookup: one long collision chain
}
echo 'Biased: ' . (microtime(true) - $t) . PHP_EOL;

The output on my 2.3 GHz Intel Core i7 was as expected:

Normal: 0.048759937286377
Biased: 123.01204395294

That means a task which runs in less than 0.05 seconds can instead take more than 2 minutes, i.e. become more than 2,510 times slower. This happens because the loop over the $biased array performs around 10,000,000,000 key comparisons (10^10, which doesn't even fit in a 32-bit integer), while the loop over the $normal array performs only about 100,000 (10^5).

What could be done?

Universal Hashing

One of the possible solutions is to use the universal hashing approach. Here is the definition given in the universal hashing article on Wikipedia:

Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

That would guarantee that nobody could craft a colliding set of keys, simply because they wouldn't know which hash function is being used. On the other hand, it would be difficult (not to say impossible) to keep a feature of PHP's associative arrays: key order. This leads to our second alternative.
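As a toy illustration of the idea (not a truly universal family, and certainly not how PHP works; it only shows the flavor of picking a function at random): start the times-33 recurrence from a random per-process seed instead of a public constant.

// Toy illustration only: a randomly seeded times-33 variant.
// A real universal family needs stronger mathematical guarantees.
$seed = mt_rand(); // per-process secret, unknown to the attacker

function seeded_hash($key, $seed) {
    $hash = $seed; // instead of the fixed, publicly known 5381
    for ($i = 0, $len = strlen($key); $i < $len; ++$i) {
        $hash = (($hash * 33) + ord($key[$i])) & 0xFFFFFFFF;
    }
    return $hash;
}

Note that the 'Ez'/'FY' trick above still collides under this variant (those blocks contribute equally regardless of the seed), which is why real-world fixes rely on stronger seeded functions such as SipHash rather than just a random start value.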

Implement as Balanced Binary Trees

Implementing PHP's arrays as balanced binary search trees, such as red-black or AVL trees, would allow them to keep the keys in order, at the cost of each operation no longer being constant on average, but logarithmic in the array's size. For the arrays above, that would mean roughly log2(100,000) ≈ 17 comparisons per lookup, whether or not the keys are adversarial.
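A toy sketch of what such a map could look like in user-land PHP (unbalanced here for brevity; a real replacement would add red-black or AVL rebalancing, and would of course live in C inside the engine):

// Toy (unbalanced) binary search tree map. With rebalancing (red-black
// or AVL) every operation is O(log N) even on adversarial keys.
class BstNode {
    public $key, $value, $left = null, $right = null;
    public function __construct($key, $value) {
        $this->key = $key;
        $this->value = $value;
    }
}

class BstMap {
    private $root = null;

    public function set($key, $value) {
        $this->root = $this->insert($this->root, $key, $value);
    }

    public function get($key) {
        $node = $this->root;
        while ($node !== null) {
            $cmp = strcmp($key, $node->key);
            if ($cmp === 0) return $node->value;
            $node = ($cmp < 0) ? $node->left : $node->right;
        }
        return null;
    }

    private function insert($node, $key, $value) {
        if ($node === null) return new BstNode($key, $value);
        $cmp = strcmp($key, $node->key);
        if ($cmp === 0) {
            $node->value = $value;
        } elseif ($cmp < 0) {
            $node->left = $this->insert($node->left, $key, $value);
        } else {
            $node->right = $this->insert($node->right, $key, $value);
        }
        return $node;
    }
}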

Conclusion

Neither of the proposed ideas is ideal, and a good solution would be very hard to come by, given that PHP is one of the most widely used programming languages and any major change would have a huge impact. As Rasmus himself said a couple of years ago:

We are looking into it. Changing the core hash function in PHP isn’t a trivial change and it will take us some time.

– Rasmus Lerdorf

Of course, most uses of arrays in PHP do not suffer from the behavior described in this post, because their keys do not depend on user input. Still, programmers should keep these things in mind in order to handle user input properly.

Written by Murilo Adriano

16 October 2013 at 17:06