The Epsilon-Greedy Algorithm for Multi-Armed Bandit Problems

Imagine you are facing three slot machines. You have 8 tokens. The machines pay $1 if you win, and $0 if you lose. Each machine pays according to a different probability distribution that is unknown to you. You want to use a technique that gives you a good chance of winning the most money for your 8 pulls.

There are many algorithms for the multi-armed bandit problem. The four most common are epsilon-first, epsilon-greedy, UCB1 (upper confidence bound), and Thompson Sampling.

The epsilon-greedy algorithm is simple but clever. You keep track of the number of times each machine wins and loses, and for each trial calculate the empirical (“by observation”) probability of each machine winning. For example, if machine [0] currently has 3 wins and 2 losses, its empirical probability of winning is 3 / 5 = 0.60. To pick a machine to play, you set some small value epsilon, say 0.20. You generate a random p in [0.0, 1.0] and if p is less than epsilon you select a machine at random. Otherwise you select the machine with the current highest empirical probability of winning.

The random-epsilon mechanism allows the algorithm to explore other arms and not get stuck on pulling a single arm.

The multi-armed bandit problem occurs in many real-life scenarios. For example, suppose a commercial Web site has three different advertising campaigns to try. They’ll want to determine the most effective campaign as quickly as possible.

In practice, epsilon-greedy isn’t used very often. There are two problems. First, there’s rarely any good way to pick a good value for epsilon. Second, there are other techniques that work better in practice. But epsilon-greedy is often useful to establish a baseline result.

One thing that has always irritated me is that code like:

if p < epsilon then
  m = random(0, N)  // random machine
  m = argmax(empir_probs)  // best curr machine

gives the best machine more than a (1.0 – epsilon) probability of being selected because the best machine might be selected randomly. I’ve never seen an implementation that deals with this issue.

Internet results of a search for “greedy”. From left: Barbie, Marilyn, Olga, Jack.

This entry was posted in Machine Learning. Bookmark the permalink.