Murilo :P

C++, Computação, Programação, Web e afins :)

PHP’s sort functions are poorly designed

with 6 comments

Everyone knows that is possible to sort an array in O(N log N) where N is the array’s length.

We know also, that Quicksort has expected complexity of O(N log N) but in worst case its complextity is O(N^2), and Quicksort is used more than Mergesort (worst case O(N log N)) because constants are smaller and it requires no extra memory and is easy to avoid Quicksort running on the worst case.

The easiest way to avoid bad inputs for Quicksort is choosing the pivot randomly, which means that a malicious user cannot guess a permutation for cracking it.

What I found at the PHP source is that the pivot choosing is done by picking the element at middle of array.

Here is the code (link line 76):

ZEND_API void zend_qsort(void *base, size_t nmemb, size_t siz, compare_func_t compare TSRMLS_DC)
{
	void           *begin_stack[QSORT_STACK_SIZE];
	void           *end_stack[QSORT_STACK_SIZE];
	register char  *begin;
	register char  *end;
	register char  *seg1;
	register char  *seg2;
	register char  *seg2p;
	register int    loop;
	uint            offset;

	begin_stack[0] = (char *) base;
	end_stack[0]   = (char *) base + ((nmemb - 1) * siz);

	for (loop = 0; loop >= 0; --loop) {
		begin = begin_stack[loop];
		end   = end_stack[loop];

		while (begin < end) {
			offset = (end - begin) >> 1; // HERE: choosing the current subarray's middle element
			_zend_qsort_swap(begin, begin + (offset - (offset % siz)), siz);

			seg1 = begin + siz;
			seg2 = end;

			while (1) {
				for (; seg1 < seg2 && compare(begin, seg1 TSRMLS_CC) > 0;
				     seg1 += siz);

				for (; seg2 >= seg1 && compare(seg2, begin TSRMLS_CC) > 0;
				     seg2 -= siz);

				if (seg1 >= seg2)
					break;

				_zend_qsort_swap(seg1, seg2, siz);

				seg1 += siz;
				seg2 -= siz;
			}

			_zend_qsort_swap(begin, seg2, siz);

			seg2p = seg2;

			if ((seg2p - begin) <= (end - seg2p)) {
				if ((seg2p + siz) < end) {
					begin_stack[loop] = seg2p + siz;
					end_stack[loop++] = end;
				}
				end = seg2p - siz;
			}
			else {
				if ((seg2p - siz) > begin) {
					begin_stack[loop] = begin;
					end_stack[loop++] = seg2p - siz;
				}
				begin = seg2p + siz;
			}
		}
	}
}

The pivot is named offset here.

Tests

I used the usort() function for counting how many comparisons was needed to sort the array in my cmp() function.
So, lets see the tests:

<?php

$counter = 0; // counts how many comparisons
function cmp($a, $b)
{
	global $counter;
	++$counter;
	return $a - $b;
}

$normal = array();
for ($i = 0; $i < 15; ++$i) {
	$normal[] = $i;
}
shuffle($normal);

$counter = 0;
print "<pre>Normal array:\n";
usort($normal, 'cmp');
print $counter . " comparisons\n\n";

$malicious = array(2, 9, 4, 12, 6, 11, 8, 1, 3, 5, 7, 9, 10, 14, 12);

$counter = 0;
print "Malicious array:\n";
usort($malicious, 'cmp');
print $counter . " comparisons\n</pre>";

?>

And the results:
Normal array:
64 comparisons // variable because of the shuffle

Malicious array:
208 comparisons // constant

It means that the PHP’s sort() family depends on the input array’s length and content while a Randomized Quicksort depends only on the input array’s length.

As we can see, usort() sorted the randomly generated array of length 20 very fast with a few number of comparisons, but in the case of a manipulated array of length 20, the number of comparisons are much more.

In the same way, taking an array of length 10.000, these functions do around 50.000.000 comparisons when the expected is around 132.000. This can be a problem if some “bad” guy inserts manipulated data which your script will sort.

Solution

The naive solution is shuffle()‘n the array before sorting.

<?php

$malicious = array(11, 3, 20, 5, 13, 7, 17, 9, 15, 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 19);
shuffle($malicious);

$counter = 0;
print "No longer malicious:\n";
usort($malicious, 'cmp');
print $counter . " coparisons\n</pre>";

?>

Output:
No longer malicious:
76 comparisons

References

zend_qsort source
PHP: sort() function
PHP: usort() function
PHP: shuffle() function
Quicksort – Wikipedia
Mergesort – Wikipedia

About these ads

Written by Murilo Adriano

5 de February de 2011 at 14:24

6 Responses

Subscribe to comments with RSS.

  1. Na verdade, escolher o pivô randomicamente não nos livra de, ainda assim, escolhermos o menor elemento do vetor (pela Lei de Murphy, isso VAI acontecer).

    O ideal é utilizar uma mediana de N, escolhendo N elementos arbitrários do array e calcular a média entre eles. Estes N números, sim, podem ser selecionados randomicamente (apesar de não precisar, uma sequência qualquer de 3 números nos daria uma mediana válida).

    Em geral uma mediana de 3 já é o suficiente! Abraço!

    André

    7 de February de 2011 at 13:20

    • Olá André, obrigado pelo comentário!

      Sim, pode acontecer de às vezes escolher o menor (ou o maior) elemento do vetor mas acredito que as forças de Murphy não sejam tão poderosas para que em todos os passos do algoritmo seja escolhido um desses dois casos ruins.

      Utilizar a mediana de N é uma estratégia melhor que a de pegar um elemento fixo porém como a cada passo do algoritmo é possível saber qual elemento será escolhido, essa técnica também é facilmente manipulável.

      Do artigo Introsort no Wikipedia:

      Niklaus Wirth’s variant uses the middle element to prevent these occurrences, degenerating to O(n²) for contrived sequences. The median-of-3 pivot selection algorithm takes the median of the first, middle, and last elements of the list; however, even though this performs well on many real-world inputs, it is still possible to contrive a median-of-3 killer list that will cause dramatic slowdown of a quicksort based on this pivot selection technique. Such inputs could potentially be exploited by an aggressor, for example by sending such a list to an Internet server for sorting as a denial of service attack.

      Porém existem sim técnicas reais para evitar que um Quicksort passe de um certo nível de recursão. Uma delas é utilizada no algoritmo Introsort citado acima que é um Quicksort que se transforma em um Heapsort após um certo nível (que pode ser ajustado) garantindo um pior caso de O(N log N) (se o limite escolhido for log N).

      Um abraço!

      Murilo Adriano

      7 de February de 2011 at 15:26

  2. Excelente artigo!

    Abraços

    felipe tonello

    7 de April de 2011 at 09:19

  3. Holy crap, that falling snow effect made me think i was on drugs!! or that my 1 month old laptop screen was going …. hahah nice effect.

    MaerF0x0

    12 de December de 2012 at 18:49

  4. […] I did in the post about the PHP sort functions, I was looking at the PHP source, I found this an interesting thing: the hashing function used on […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 580 other followers

%d bloggers like this: