Bug in 'randomize' primitive

Started by rickyboy, May 22, 2008, 07:23:45 PM

Previous topic - Next topic

rickyboy

I just discovered today that the randomize function in newLISP is not an unbiased shuffler.



For those who don't know what randomize does: it takes a list and gives it back to you with the elements randomly rearranged.  Another way of saying it is that it gives you a random permutation of the list you provide.  Now, http://www.newlisp.org/downloads/newlisp_manual.html#randomize">in the manual, it doesn't say that the permutation you get is as equally likely as any other you would get.  So technically, randomize is working as advertised, but I think that Lutz and the user base would agree that it would be best to get an equally likely random permutation from randomize.



Here is the culprit code, from nl-math.c, lines 1104-1110, in function p_randomize():
for(i = 0; i < (length - 1); i++)
  {
  j = random() % length;
  cell = vector[i];
  vector[i] = vector[j];
  vector[j] = cell;
  }

According to Don Knuth (The Art of Computer Programming, Volume 2: Seminumerical Algorithms), the index j should be drawn (uniformly) randomly from i to length - 1.



In newLISP, this would be:
;; From Algorithm P of Knuth in TAOCP: Seminumerical Algorithms.
(define (randomize* lst)
  (let ((N-1 (- (length lst) 1)))
    (for (i 0 (- N-1 1))
      (swap i (unif i N-1) lst))
    lst))

;; Uniformly draw an integer in the range {a,...,b}.
(define (unif a b) (+ a (rand (+ 1 (- b a)))))
(λx. x x) (λx. x x)

rickyboy

#1
Here's a frequency analysis in newLISP to demonstrate the difference between the shufflers randomize and randomize*.



Consider the permutations of the characters of the string "12345".  Define this as all-perms:
(define all-perms
  (map (curry apply string)
       (make-k-permutations 5 '(1 2 3 4 5))))

(The function make-k-permutations is defined http://www.alh.net/newlisp/phpbb/viewtopic.php?p=3787#3787">here.)



Now, make a list of 12,000 random permutations of "12345", by repeatedly calling randomize -- call this rs.
(define rs
  (map (fn (x) (apply string (randomize (sequence 1 5) 1)))
       (dup nil 12000)))

Next, setup a hash table to map each permutation to the number of times it shows up on rs.
;; Initialize the frequency hash table.
(define-hash-table freqs)
(dolist (p all-perms) (freqs p 0))

;; Compute the frequencies of the random permutations in 'rs'.
(dolist (r rs) (freqs r (+ 1 (freqs r))))

(define-hash-table is just a macro which makes a hash table in the same manner Lutz does in the manual.)



Now that the frequencies are collected into the hash table, harvest out just the frequencies and put them on a list called just-the-freqs.
(define (just-freqs ftable)
  (let ((res '())
        (default-functor (sym ftable ftable)))
    (dolist (sm (symbols ftable))
      (if (not (= default-functor sm))
          (let ((val (eval sm))
                (key (name sm)))
            (push val res))))
    res))

(define just-the-freqs (just-freqs 'freqs))

Now, run the same gauntlet, this time using randomize*, the Knuth shuffler.  We'll use asterisks in the variable names to distinguish them from the other set.
(define rs*
  (map (fn (x) (apply string (randomize* (sequence 1 5) 1)))
       (dup nil 12000)))
(define-hash-table freqs*)
(dolist (p all-perms) (freqs* p 0))
(dolist (r rs*) (freqs* r (+ 1 (freqs* r))))
(define just-the-freqs* (just-freqs 'freqs*))

Well, since there are 5! = 120 permutations of "12345", if we generated 12,000 random permutations and expected them to be equally likely to be drawn, we should anticipate that the expected number (or average) of the frequencies would be 100.  (Each permutation having a 1 in 120 chance of being drawn.)



But the frequencies on the randomize draws look like:
> just-the-freqs
(74 34 99 58 69 32 100 97 91 52 44 52 115 38 79 46 39 15 34 47 37 21 26 24 66 81
 78 38 71 61 134 75 79 196 128 189 51 84 71 144 65 162 102 86 48 154 64 166 134 88
 69 66 87 32 87 115 188 180 81 221 66 45 92 210 70 182 54 70 161 207 74 203 70 94
 76 26 37 36 156 94 100 221 134 220 100 87 252 261 87 254 78 34 85 306 97 198 96
 40 62 36 52 23 59 94 78 194 55 219 75 43 92 282 94 198 64 42 73 168 65 195)

Whereas the frequencies based on randomize* look like:
> just-the-freqs*
(92 120 93 90 118 118 89 100 103 80 115 106 117 115 95 100 103 87 110 97 119 100
 95 111 87 113 90 97 85 87 100 110 96 91 101 108 110 117 83 96 107 82 112 118 113
 92 83 102 121 89 101 88 100 129 91 119 100 121 92 105 96 89 108 106 100 118 91 95
 94 99 86 92 103 88 115 96 96 103 127 89 91 80 103 84 116 84 97 111 103 127 101 100
 99 86 94 77 75 86 89 100 110 69 103 97 98 105 96 94 106 104 110 84 104 107 105 105
 98 102 112 88)

Here are the standard deviations of each:
> (sqrt (stat:var just-the-freqs))
64.75045829
> (sqrt (stat:var just-the-freqs*))
11.99159369
(λx. x x) (λx. x x)

Lutz

#2
Quotehe index j should be drawn (uniformly) randomly from i to length - 1


The code in nl-math.c:


j = random() % length;

does produce random evenly distributed numbers between 0 and length -1. The number the random function produces gets truncated towards 0 by the integer modulo operation. So this part is correct.



The difference between the 2 methods is based on the way the sequence gets shuffled, while Knuth shuffles by swapping 2 randomly drawn indexes, newLISP shuffles by increasing the first index and only drawing the second swap index randomly.



This could be part of the reason you see a larger standard deviation of positions in the Knuth method. The standard deviation itself is no measure in this case to prove that one or the other randomizing process is working better. It only proves that the Knuth method has more permutations which are at the extreme of the frequency distribution, occurring much more or much less than expected. Relatively speaking the 120 numbers should come closer to N/120 with incleasing N of trials.



Also when using 'randomize' and you want true randomness in shuffling, make sure you specify the optional 'true' flag as in (randomize mylist true). Without the flag outcomes of randomize equal to the input list will be thrown away, which technically is not correct, because strictly speaking a random shuffeling can produce the input list.



Omitting the optional flag is an additional source of variance leading to a higher standard deviation.

rickyboy

#3
Hi Lutz,



Apparently, there is some kind of misunderstanding here.  I'll chalk it up to me not being clear in my message before.



The dispute about the code
j = random() % length;
is NOT whether it draws uniformly, but rather about which RANGE it draws (uniformly) from, at each iteration.



The reason why I claim that randomize is biased is because it draws j from Unif(0, length-1).  Knuth's analysis shows that, for such an algorithm to be unbiased, it instead should draw j  from Unif(i, length-1), 0 <= i <= length - 2.  (Please note the differences marked in red.)



Another way to see why randomize is biased is to consider the outcomes of all the possible steps the algorithm takes, working with a small example -- this time, for N = 3]http://img372.imageshack.us/img372/3596/randomizeoutcomecomparewq4.jpg">http://img372.imageshack.us/img372/3596 ... arewq4.jpg">http://img372.imageshack.us/img372/3596/randomizeoutcomecomparewq4.jpg[/img]

(Please excuse the crappy graphics resolution.)



The root node each  tree contains the input.  Each level in the tree from the root indicates the possible results at the end of an iteration.



Recall that in the current method, at each iteration slot i gets swapped with any other slot; whereas in Knuth's method slot i gets swapped with any slot to the right of it.



Clearly, the current method yields biased results.  The outcome probabilities under the current method can be computed by the diagram above: 123 has probability 2/9,  132 has probability 1/9,  213 has probability 2/9,  231 has probability 2/9,  312 has probability 1/9,  321 has probability 1/9.



By contrast, all outcomes by the Knuth method are equally likely, i.e. they each occur with probability 1/3! = 1/6, as expected.



Also, please note that the larger standard deviation in my frequency analysis was due to randomize, and not randomize* (i.e. Knuth's shuffle).



With that in mind, my previous messages on this thread might be a bit more clear.  I hope that helps.
(λx. x x) (λx. x x)

Lutz

#4
Thanks Rickyboy, the diagram shows it, we will go with Knuth's algorithm for the coming development version.

Lutz

#5
Here the results after implementing Knuth's algorithm:



(dotimes (x 12000) (push (randomize '(1 2 3 4 5) true) r))

(map (fn (x) (apply + x)) (transpose r))



old: (33138 34967 35804 35716 40375)



new: (35784 36098 36142 36193 35783)



on the new version the values are much close to the expected 36000 = (/ (* (+ 1 2 3 4 5) 12000) 5), not favoring higher positions in the list

rickyboy

#6
Quote from: "Lutz"Here the results after implementing Knuth's algorithm:



(dotimes (x 12000) (push (randomize '(1 2 3 4 5) true) r))

(map (fn (x) (apply + x)) (transpose r))



old: (33138 34967 35804 35716 40375)



new: (35784 36098 36142 36193 35783)



on the new version the values are much close to the expected 36000 = (/ (* (+ 1 2 3 4 5) 12000) 5), not favoring higher positions in the list

Wow!  I very much like this way of doing the frequency analysis.  It counts the occurrences of {1,2,3,4,5} in each position by taking the sum of the numbers in each position.



In some cases, though, you'd have to worry about what are the addends that make up the sum.  For instance, as in this case, if I get a sum of 36000, it doesn't necessarily entail that I got 2400 1s, 2400 2s, 2400 3s, 2400 4s and 2400 fives.  I could have had no 1s, 4800 2s, 2400 3s, 4800 4s and no fives.  There are also other distributions of addends to get to a sum of 36000.



But in your analysis you got 40375, probably far enough from 36000 to indicate a bias.  (I don't have enough stats knowledge to confirm it.)



I also like the pithiness of this analysis.  There is no drawn out setup, as in mine.  Very elegant!
(λx. x x) (λx. x x)