Murilo :P

C++, Computação, Programação, Web e afins :)

Deeper look at PHP’s array: worst case

with 2 comments

As I did in the post about the PHP sort functions, I was looking at the PHP source, I found this an interesting thing: the hashing function used on PHP’s hash tables (aka associative arrays).

Above the function definition, a PHP contributor explained the function:

This is Daniel J. Bernstein’s popular `times 33′ hash function as posted by him years ago on comp.lang.c. It basically uses a function like hash(i) hash(i-1) * 33 + str[i]. This is one of the best known hash functions for strings. Because it is both computed very fast and distributes very well.

This function, known as DJBX33A, is very fast and indeed it distributes very well when the data follows an uniform distribution. That is the problem. A user can easily produce keys for an array that would collide in the same hash, making the array have a drastically lower performance on its operations than the usual.

The expected complexity of array operations such as lookup, add and update is constant (O(1)) as it is for any hash table. The thing is that is easy to generate keys so that the complexity of such operations turn into linear (O(N)). Such thing could be used by malicious users to produce a certain level of DoS on unaware systems.

The experiment

In the intent to observe my point, I generate more than 100,000 keys that hashed in the same key and filled an array with those keys and filled other array with the same size with random keys:

define(ARRAY_LEN, 100000);

$keys = file('php_hash_output.txt'); // Where I stored the keys
$biased = array();
$normal = array();

$i = 0;
foreach ($keys as $k) {
    $biased[$k] = ++$i;
    $normal[rand()] = $i;

    if ($i > ARRAY_LEN) break;
}

Then I measured the time spent to iterate over the keys of the two arrays (PHP 5.3.15 cli):

$normal_keys = array_keys($normal);
$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
        $var += $normal[$normal_keys[$i]];
}
echo 'Normal: ' . (microtime(true) - $t) . PHP_EOL;

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
        $var += $biased[$keys[$i]];
}
echo 'Biased: ' . (microtime(true) - $t) . PHP_EOL;

The output on my 2,3 GHz Intel Core i7 was as expected:

Normal: 0.048759937286377
Biased: 123.01204395294

That means that a task which runs in less than 0.05 seconds could run in more than 2 minutes, that is, more than 2.510 times worse. This behavior happened because the for iterating over the $biased array performed around 10,000,000,000 (10^10, doesn’t fit on 32-bit integers) operations while the iteration over the $normal array performed only 100,000 (10^5).

What could be done?

Universal Hashing

One of the possible solutions is to use the universal hashing approach. Here is the definition given in the universal hashing article on Wikipedia:

Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

That would guarantee that nobody could generate a set of keys that would collide simply because he/she wouldn’t know what hashing function is being used. But because of that, it would be difficult (to not say impossible) to keep a feature of PHP’s associative arrays: order. This leads to our second alternative.

Implement as Balanced Binary Trees

Implementing PHP’s arrays balanced binary search trees like Red-Black Tree or AVL would allow them to keep the order of the keys at the cost of each operation no longer being constant, on the average case, but logarithmic on its size.

Conclusion

None of the proposed ideas to what can be done is ideal and a good solution would be very hard to come given that PHP is one of the most used programming languages and any major change would impact greatly. As Rasmus himself said a couple of years ago:

We are looking into it. Changing the core hash function in PHP isn’t a trivial change and it will take us some time.

- Rasmus Lerdorf

It is obvious that most part of the array uses in PHP do not suffer from the behavior described in this post because it is used independently of what user inputs. The thing is that programmers should have these things in mind  to know how to properly handle user inputs.

References

About these ads

Written by Murilo Adriano

16 de October de 2013 at 17:06

2 Responses

Subscribe to comments with RSS.

  1. This 33 base is really tricky. I had some bad experiences even in codeforces with this base. I know it’s a complicated feature to change in a language but I think it could be softened by using a predefined set of bases (I would say (37, 43, 71)), and choose one of them randomly in array constructor.

    aajjbb

    16 de October de 2013 at 19:29

    • Working with few bases would result on the same problem. Say we have 4 bases in this set, cracking the keys for one of them would result on bad performance on 1/4 of the runs. If you generate 4 times more keys (1/4 colliding in each base) you would have bad performance on all the time.

      I guess that could be done if there are enough bases with good distribution, but I can’t say as I didn’t investigate that.

      Murilo Adriano

      17 de October de 2013 at 00:05


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 581 other followers

%d bloggers like this: