


Probability matching is related to machine learning's classic multi-armed bandit problem (Gittins, 1989). In this problem, an agent is in a room with k different "one-armed bandit" gambling machines. When a machine's arm is pulled, the machine pays either 1 or 0 units; each machine has a different (and usually fixed) probability of paying off, which is not known to the agent. At each time t, the agent pulls the arm of one machine. The goal is to choose machines so as to maximize the total payoff over the duration of the game. A variety of strategies exist, all of which involve a tradeoff between exploration (testing different machines) and exploitation (staying with one machine that pays off frequently) (Kaelbling, Littman, & Moore, 1996). A "greedy" strategy only pulls the arm of the machine with the highest estimated payoff probability, but it may not be the best strategy, particularly if the early probability estimates were inaccurate. Probability matching reflects a different strategy, the pursuit algorithm (Thathachar & Sastry, 1985), which updates an estimate of the probability of reward at each bandit with each visit, and visits different machines with a frequency based on the computed likelihood of reward. When faced with a multi-armed bandit, perceptrons match probabilities, and behave as if they are using the pursuit algorithm (Dawson, Dupuis, Spetch, & Kelly, 2009).
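The matching behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pursuit update of Thathachar and Sastry: the agent keeps a running estimate of each machine's payoff probability and then chooses machines with a frequency proportional to those estimates, rather than greedily pulling only the best one. The function name and the optimistic one-win initialization are assumptions made for the sketch.

```python
import random

def matching_bandit(pay_probs, steps=10000, seed=0):
    """Play a k-armed bandit, choosing each machine in proportion to
    its estimated probability of paying off (probability matching)."""
    rng = random.Random(seed)
    k = len(pay_probs)
    pulls = [1] * k  # assumed optimistic start: one imagined pull...
    wins = [1] * k   # ...and one imagined win per machine
    total = 0
    for _ in range(steps):
        # Current payoff-probability estimate for each machine.
        est = [w / n for w, n in zip(wins, pulls)]
        # Sample an arm with probability proportional to its estimate,
        # instead of always taking the arm with the highest estimate.
        arm = rng.choices(range(k), weights=est)[0]
        reward = 1 if rng.random() < pay_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    # Fraction of visits each machine received.
    freqs = [n / sum(pulls) for n in pulls]
    return total, freqs
```

With two machines paying off at 0.2 and 0.8, the visit frequencies settle near 0.2 and 0.8, the matching pattern, whereas a greedy agent would concentrate almost all pulls on the better machine.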
References:
 Dawson, M. R. W., Dupuis, B., Spetch, M. L., & Kelly, D. M. (2009). Simple artificial neural networks that match probability and exploit and explore when confronting a multiarmed bandit. IEEE Transactions on Neural Networks, 20(8), 1368–1371.
 Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices. Chichester; New York: Wiley.
 Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
 Thathachar, M. A. L., & Sastry, P. S. (1985). A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(1), 168–175.
(Added October 2010)



