Shuffling an array into random order is a common task in machine learning algorithms. For example, if an array holds array-index values such as idxs = [0, 1, 2, 3, . . ] then shuffling the idxs array gives you a way to iterate through training data in random order.
The Python language NumPy library has a built-in numpy.random.shuffle() function. But there are times when you want to implement a custom shuffle() function, and some programming languages don’t have a built-in shuffle() function.
The usual algorithm to shuffle the contents is called the Fisher-Yates shuffle, or sometimes the Knuth shuffle. There are several different variations. The variation I prefer, implemented in Python is:
def shuffle_basic(arr, rndobj): # last i iteration not necessary n = len(arr) for i in range(n): # 0 to n-1 inclusive j = rndobj.randint(i, n) # i to n-1 inclusive tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp
I iterate forward, i ranging from 0 to n-1 inclusive. This is not the most common technique. The last iteration will always swap the value in the last cell with itself. Therefore, it’s more common to write for i in range(n-1). I don’t mind the extra iteration because 1.) if an array has just one cell, range(n-1) will be out-of-range for some programming languages (not Python), and 2.) the for-loop range and the randint() range are the same which has a pleasant symmetry.
Somewhat weirdly, the Wikipedia article on Fisher-Yates gives a basic algorithm that iterates backwards with the i variable in the outer loop. This baffles me — I can think of no reason to awkwardly iterate backwards when a forward iteration works just fine.
Additionally, I prefer to pass in a local random_object rather than use the global NumPy random_object. This makes reproducibility much easier because no other functions are modifying the random_object.
It’s well known that it’s easy to mess up the Fisher-Yates shuffle algorithm. Specifically:
def shuffle_bad(arr, rndobj): # last i iteration not necessary n = len(arr) for i in range(n): # 0 to n-1 inclusive j = rndobj.randint(0, n) # 0 to n-1 is wrong tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp
The call to the randint() function looks correct, and the algorithm does spit out random-looking arrangements. However this version is biased towards certain permutations. Specifically, if arr = [0,1,2], then first, fifth and sixth arrangements [0,1,2], [2,0,1], [2,1,0] are slightly less likely than the second, third and fourth arrangements [0,2,1], [1,0,2], [1,2,0].
Strangely, I implemented the bad version of Fisher-Yates and ran it to shuffle [0, 1, 2] 100,000 times. If all six permutations are equally likely, you’d get 1/6 = 0.1667 of each. In my demo I got [ 0.1665 0.1673 0.1641 0.1676 0.1682 0.1664]. This is not at all what theory predicts and I’m not sure what’s going on. I know from a lifetime of experience that this problem is now in the back of my mind and will stay there, nagging at me, until I solve it. But I will.
Demo code below. Continue reading