Php | Murilo :P

Archive for the ‘Php’ Category

Deeper look at PHP’s array: worst case

As I did in the post about the PHP sort functions, I was looking at the PHP source, I found this an interesting thing: the hashing function used on PHP’s hash tables (aka associative arrays).

Above the function definition, a PHP contributor explained the function:

This is Daniel J. Bernstein’s popular `times 33′ hash function as posted by him years ago on comp.lang.c. It basically uses a function like hash(i) hash(i-1) * 33 + str[i]. This is one of the best known hash functions for strings. Because it is both computed very fast and distributes very well.

This function, known as DJBX33A, is very fast and indeed it distributes very well when the data follows an uniform distribution. That is the problem. A user can easily produce keys for an array that would collide in the same hash, making the array have a drastically lower performance on its operations than the usual.

The expected complexity of array operations such as lookup, add and update is constant (O(1)) as it is for any hash table. The thing is that is easy to generate keys so that the complexity of such operations turn into linear (O(N)). Such thing could be used by malicious users to produce a certain level of DoS on unaware systems.

The experiment

In the intent to observe my point, I generate more than 100,000 keys that hashed in the same key and filled an array with those keys and filled other array with the same size with random keys:

define(ARRAY_LEN, 100000);

$keys = file('php_hash_output.txt'); // Where I stored the keys
$biased = array();
$normal = array();

$i = 0;
foreach ($keys as $k) {
    $biased[$k] = ++$i;
    $normal[rand()] = $i;

    if ($i > ARRAY_LEN) break;
}

Then I measured the time spent to iterate over the keys of the two arrays (PHP 5.3.15 cli):

$normal_keys = array_keys($normal);
$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
        $var += $normal[$normal_keys[$i]];
}
echo 'Normal: ' . (microtime(true) - $t) . PHP_EOL;

$t = microtime(true);
$var = 0;
for ($i = 0; $i < ARRAY_LEN; ++$i) {
        $var += $biased[$keys[$i]];
}
echo 'Biased: ' . (microtime(true) - $t) . PHP_EOL;

The output on my 2,3 GHz Intel Core i7 was as expected:

Normal: 0.048759937286377 Biased: 123.01204395294

That means that a task which runs in less than 0.05 seconds could run in more than 2 minutes, that is, more than 2.510 times worse. This behavior happened because the for iterating over the $biased array performed around 10,000,000,000 (10^10, doesn’t fit on 32-bit integers) operations while the iteration over the $normal array performed only 100,000 (10^5).

What could be done?

Universal Hashing

One of the possible solutions is to use the universal hashing approach. Here is the definition given in the universal hashing article on Wikipedia:

Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

That would guarantee that nobody could generate a set of keys that would collide simply because he/she wouldn’t know what hashing function is being used. But because of that, it would be difficult (to not say impossible) to keep a feature of PHP’s associative arrays: order. This leads to our second alternative.

Implement as Balanced Binary Trees

Implementing PHP’s arrays balanced binary search trees like Red-Black Tree or AVL would allow them to keep the order of the keys at the cost of each operation no longer being constant, on the average case, but logarithmic on its size.

Conclusion

None of the proposed ideas to what can be done is ideal and a good solution would be very hard to come given that PHP is one of the most used programming languages and any major change would impact greatly. As Rasmus himself said a couple of years ago:

We are looking into it. Changing the core hash function in PHP isn’t a trivial change and it will take us some time.

– Rasmus Lerdorf

It is obvious that most part of the array uses in PHP do not suffer from the behavior described in this post because it is used independently of what user inputs. The thing is that programmers should have these things in mind to know how to properly handle user inputs.

References

Murilo 😛 – PHP’s sort functions are poorly designed
Github – PHP Source
Wikipedia – Hash tables
Wikipedia – Uniform distribution
Wikipedia – Universal hashing
Wikipedia – Balanced binary search trees

Written by Murilo Adriano

16 de October de 2013 at 17:06

Posted in C/C++, Estruturas de Dados, Php, Programação, Segurança

Tagged with algorithm, analysis, array, benchmark, complexity, dos, hash, hash table, Php, php internal

PHP’s sort functions are poorly designed

with 6 comments

Everyone knows that is possible to sort an array in O(N log N) where N is the array’s length.

We know also, that Quicksort has expected complexity of O(N log N) but in worst case its complextity is O(N^2), and Quicksort is used more than Mergesort (worst case O(N log N)) because constants are smaller and it requires no extra memory and is easy to avoid Quicksort running on the worst case.

The easiest way to avoid bad inputs for Quicksort is choosing the pivot randomly, which means that a malicious user cannot guess a permutation for cracking it.

What I found at the PHP source is that the pivot choosing is done by picking the element at middle of array.

Here is the code (link line 76):

ZEND_API void zend_qsort(void *base, size_t nmemb, size_t siz, compare_func_t compare TSRMLS_DC)
{
	void           *begin_stack[QSORT_STACK_SIZE];
	void           *end_stack[QSORT_STACK_SIZE];
	register char  *begin;
	register char  *end;
	register char  *seg1;
	register char  *seg2;
	register char  *seg2p;
	register int    loop;
	uint            offset;

	begin_stack[0] = (char *) base;
	end_stack[0]   = (char *) base + ((nmemb - 1) * siz);

	for (loop = 0; loop >= 0; --loop) {
		begin = begin_stack[loop];
		end   = end_stack[loop];

		while (begin < end) {
			offset = (end - begin) >> 1; // HERE: choosing the current subarray's middle element
			_zend_qsort_swap(begin, begin + (offset - (offset % siz)), siz);

			seg1 = begin + siz;
			seg2 = end;

			while (1) {
				for (; seg1 < seg2 && compare(begin, seg1 TSRMLS_CC) > 0;
				     seg1 += siz);

				for (; seg2 >= seg1 && compare(seg2, begin TSRMLS_CC) > 0;
				     seg2 -= siz);

				if (seg1 >= seg2)
					break;

				_zend_qsort_swap(seg1, seg2, siz);

				seg1 += siz;
				seg2 -= siz;
			}

			_zend_qsort_swap(begin, seg2, siz);

			seg2p = seg2;

			if ((seg2p - begin) <= (end - seg2p)) {
				if ((seg2p + siz) < end) {
					begin_stack[loop] = seg2p + siz;
					end_stack[loop++] = end;
				}
				end = seg2p - siz;
			}
			else {
				if ((seg2p - siz) > begin) {
					begin_stack[loop] = begin;
					end_stack[loop++] = seg2p - siz;
				}
				begin = seg2p + siz;
			}
		}
	}
}

The pivot is named offset here.

Tests

I used the usort() function for counting how many comparisons was needed to sort the array in my cmp() function.
So, lets see the tests:

<?php

$counter = 0; // counts how many comparisons
function cmp($a, $b)
{
	global $counter;
	++$counter;
	return $a - $b;
}

$normal = array();
for ($i = 0; $i < 15; ++$i) {
	$normal[] = $i;
}
shuffle($normal);

$counter = 0;
print "<pre>Normal array:\n";
usort($normal, 'cmp');
print $counter . " comparisons\n\n";

$malicious = array(2, 9, 4, 12, 6, 11, 8, 1, 3, 5, 7, 9, 10, 14, 12);

$counter = 0;
print "Malicious array:\n";
usort($malicious, 'cmp');
print $counter . " comparisons\n</pre>";

?>

And the results:
Normal array: 64 comparisons // variable because of the shuffle

Malicious array:
208 comparisons // constant

It means that the PHP’s sort() family depends on the input array’s length and content while a Randomized Quicksort depends only on the input array’s length.

As we can see, usort() sorted the randomly generated array of length 20 very fast with a few number of comparisons, but in the case of a manipulated array of length 20, the number of comparisons are much more.

In the same way, taking an array of length 10.000, these functions do around 50.000.000 comparisons when the expected is around 132.000. This can be a problem if some “bad” guy inserts manipulated data which your script will sort.

Solution

The naive solution is shuffle()‘n the array before sorting.

<?php

$malicious = array(11, 3, 20, 5, 13, 7, 17, 9, 15, 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 19);
shuffle($malicious);

$counter = 0;
print "No longer malicious:\n";
usort($malicious, 'cmp');
print $counter . " coparisons\n</pre>";

?>

Output:
No longer malicious: 76 comparisons

References

zend_qsort source
PHP: sort() function
PHP: usort() function
PHP: shuffle() function
Quicksort – Wikipedia
Mergesort – Wikipedia

Written by Murilo Adriano

5 de February de 2011 at 14:24

Posted in C/C++, Estruturas de Dados, Php, Programação

Tagged with algorithm, Php, php internal, quicksort, sort

Escrevendo um Código PHP Seguro

with 4 comments

Há uns dois anos atrás o Diego (rand) e eu (na época com nick “ckR”) fizemos uma adaptação/tradução de um artigo postado no Solidox que aborda alguns métodos para tornar seu código mais seguro. O que acontece é que até hoje vejo inúmeras falhas que muitas vezes passam despercebidas para um leigo mas que, para alguém que conhece um mínimo de programação, facilmente poderia causar problemas para esses sitemas.

Aí eu pensei em colocar aqui para divulgar e que para os recém programadores PHP possam ler e levar consigo ao longo de sua jornada com essa fiel linguagem de programação.

O artigo saiu em diversos sites de certo prestígio na área como o PHPBrasil.com o PHPAvancado.net e o HTMLStaff.org. Vale a pena ler.

Segue abaixo, o texto na integra.

Read the rest of this entry »

Written by Murilo Adriano

4 de October de 2008 at 23:09

Posted in Php, Programação, Segurança

Tagged with Php, Programação, Segurança

	affordable lasik on Escrevendo um Código PHP …
	Implementing a simpl… on Fast and Easy Levenshtein dist…
	Murilo Adriano on Deeper look at PHP’s arr…
	aajjbb on Deeper look at PHP’s arr…
	Deeper look at PHP… on PHP’s sort functions are…
	Mais uma dica de blo… on Comparação de desempenho: std:…
	Mais uma dica de blo… on Linguagens de programação…

Murilo :P

Archive for the ‘Php’ Category

Deeper look at PHP’s array: worst case

The experiment

What could be done?

Universal Hashing

Implement as Balanced Binary Trees

Conclusion

References

PHP’s sort functions are poorly designed

Tests

Solution

References

Escrevendo um Código PHP Seguro

@muriloadriano

[murilo ~]$ ls /var/www/

[murilo ~]$ ls ./categorias/

[murilo@wp ~]$ tags

[murilo ~]$ ls -la

[murilo ~]$ ls ./comentarios/

[murilo ~]$ ls ./blog/

[murilo ~]$ delicious | grep murilo

Blogroll

[murilo ~]$ meta

[murilo ~]$ stats

Categories

[murilo ~]$ top