## The Epsilon-Greedy Algorithm

The epsilon-greedy algorithm (often written using the actual Greek letter epsilon, ε, as in the image below) is very simple and occurs in several areas of machine learning. One common use of epsilon-greedy is in the so-called multi-armed bandit problem. Suppose you are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. Suppose also that you can play a total of 100 times.

You have two goals. The first goal is to experiment with a few coins to try to determine which machine pays out the best. The second, related goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that you have to use some coins to explore to find the best machine, and you want to use as many coins as possible on the best machine to exploit your knowledge.

Epsilon-greedy is almost too simple. As you play the machines, you keep track of the average payout of each machine. Then, you select the machine with the highest current average payout with probability = (1 - epsilon) + (epsilon / k), where epsilon is a small value like 0.10 and k is the number of machines. Each of the other machines, the ones that don’t have the highest current average payout, is selected with probability = epsilon / k.
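To make the two probabilities concrete, here is a minimal Python sketch. The function name `selection_probabilities` and the three average payouts are made up for illustration, and breaking ties in favor of the lowest-numbered machine is an assumption, not part of the algorithm as described.

```python
def selection_probabilities(averages, epsilon):
    """Probability that epsilon-greedy picks each machine on the next play."""
    k = len(averages)
    # Index of the machine with the highest current average payout.
    # Assumption: ties go to the lowest index.
    best = max(range(k), key=lambda i: averages[i])
    # Every machine can be chosen by the random (explore) pick.
    probs = [epsilon / k] * k
    # The greedy (exploit) pick adds (1 - epsilon) to the current best machine.
    probs[best] += 1.0 - epsilon
    return probs

# Hypothetical current averages for k = 3 machines.
probs = selection_probabilities([0.20, 0.70, 0.40], epsilon=0.10)
for i, p in enumerate(probs, start=1):
    print(f"machine #{i}: {p:.4f}")
```

The probabilities always sum to 1, because (1 - epsilon) + k * (epsilon / k) = 1.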

It's much easier to understand with a concrete example. Suppose, after your first 12 pulls, you played machine #1 four times and won \$1 two times and \$0 two times. The average for machine #1 is \$2/4 = \$0.50.

And suppose you’ve played machine #2 five times and won \$1 three times and \$0 two times. The average payout for machine #2 is \$3/5 = \$0.60.

And suppose you’ve played machine #3 three times and won \$1 one time and \$0 two times. The average payout for machine #3 is \$1/3 = \$0.33.
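The bookkeeping for these three averages can be sketched in a few lines of Python. The order of wins and losses within each machine is an assumption (the example only gives the totals), and the incremental-mean update is just one convenient way to keep a running average; storing a total and a count per machine would work equally well.

```python
# Payouts in dollars for the first 12 pulls.
# Assumption: the order of wins and losses is invented; only the totals matter.
history = {
    1: [1, 0, 1, 0],      # machine #1: 4 pulls, 2 wins
    2: [1, 1, 0, 1, 0],   # machine #2: 5 pulls, 3 wins
    3: [0, 1, 0],         # machine #3: 3 pulls, 1 win
}

counts = {m: 0 for m in history}
averages = {m: 0.0 for m in history}

for machine, payouts in history.items():
    for payout in payouts:
        counts[machine] += 1
        # Incremental mean: new_avg = old_avg + (payout - old_avg) / n
        averages[machine] += (payout - averages[machine]) / counts[machine]

for machine in sorted(averages):
    print(f"machine #{machine}: {counts[machine]} pulls, average ${averages[machine]:.2f}")
```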

Now you have to select a machine to play on try number 13. You generate a random number p, between 0.0 and 1.0. Suppose you have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), you select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), you select a random machine, so each machine has a 1/3 chance of being selected.
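The selection step for pull number 13 might look like the sketch below. The function name `select_machine` is hypothetical, and the fixed seed and the repeated trials are only there to check that the pick frequencies behave as expected; in a real run you would call the function once per pull.

```python
import random

def select_machine(averages, epsilon, rng):
    """Return the 0-based index of the machine to play next."""
    p = rng.random()                       # uniform in [0.0, 1.0)
    if p < epsilon:
        # Explore: every machine, including the current best, has a 1/k chance.
        return rng.randrange(len(averages))
    # Exploit: the machine with the highest current average payout.
    return max(range(len(averages)), key=lambda i: averages[i])

averages = [0.50, 0.60, 1.0 / 3.0]         # machines #1, #2, #3 after 12 pulls
rng = random.Random(0)                     # seeded only to make the run repeatable

trials = 30_000
picks = [0, 0, 0]
for _ in range(trials):
    picks[select_machine(averages, 0.10, rng)] += 1

for i, n in enumerate(picks, start=1):
    print(f"machine #{i} picked {n / trials:.3f} of the time")
```

Machine #2 should come out near 0.90 + 0.10/3, and the other two near 0.10/3 each.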

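Notice that machine #2 might get picked anyway because you select randomly from all machines. So the overall chance of playing machine #2 on this pull is 0.90 + (0.10 / 3) ≈ 0.93, and each of the other two machines has a 0.10 / 3 ≈ 0.03 chance.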

Over time, the best machine will be played more and more often, because as the averages become more accurate it will usually be the one with the highest current average payout. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.
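Putting the pieces together, a whole 100-play session can be simulated with a short sketch like the one below. Several things here are assumptions for illustration: the function name `run_epsilon_greedy`, the win probabilities of the three machines, the \$1-or-\$0 payouts, and the choice to treat a machine that has never been played as having an average of 0.0.

```python
import random

def run_epsilon_greedy(win_probs, n_plays=100, epsilon=0.10, seed=0):
    """Simulate epsilon-greedy on machines that pay $1 with the given probabilities."""
    rng = random.Random(seed)
    k = len(win_probs)
    counts = [0] * k      # number of times each machine was played
    totals = [0.0] * k    # total payout from each machine

    for _ in range(n_plays):
        # Assumption: a machine that has not been played yet has average 0.0.
        averages = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
        if rng.random() < epsilon:
            machine = rng.randrange(k)                           # explore
        else:
            machine = max(range(k), key=lambda i: averages[i])   # exploit
        payout = 1.0 if rng.random() < win_probs[machine] else 0.0
        counts[machine] += 1
        totals[machine] += payout

    return counts, sum(totals)

# Hypothetical machines; in the real problem these probabilities are unknown to you.
counts, winnings = run_epsilon_greedy([0.3, 0.6, 0.4])
print("plays per machine:", counts)
print("total winnings: $", winnings)
```

With only 100 plays the result depends a lot on the seed, so it is worth trying several seeds and a few values of epsilon to see how the explore/exploit balance changes.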

There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.

“Greed” (about 1500) – Hieronymus Bosch
