Self-Play and Using an Expert to Learn to Play Backgammon with Temporal Difference Learning
Marco A. Wiering
DOI: 10.4236/jilsa.2010.22009

Abstract

A promising approach to learning to play board games is to use reinforcement learning algorithms that learn a game position evaluation function. In this paper we examine and compare three different methods for generating training games: 1) learning by self-play, 2) learning by playing against an expert program, and 3) learning from watching experts play against each other. Although the third option produces high-quality games from the start, in contrast to the initially random games generated by self-play, its drawback is that the learning program is never allowed to try the moves it prefers itself. Since our expert program uses an evaluation function similar to that of the learning program, we also examine whether it is helpful to learn directly from the board evaluations given by the expert. We compare these methods using temporal difference learning with neural networks on the game of backgammon.
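For a concrete picture of the learning setup described above, the following is a minimal, illustrative sketch (not the paper's code) of TD(λ) updates applied to a small neural-network position evaluation trained from self-play games. The input size, hidden-layer size, learning rate, trace decay, and the placeholder play_self_play_game function are assumptions for illustration only; a real implementation would encode actual backgammon boards and generate positions by playing legal moves.

# Illustrative sketch of TD(lambda) learning of a position-evaluation network
# from self-play games. The backgammon environment is abstracted away: random
# feature vectors and a random final outcome stand in for real board encodings
# and game results.

import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HID = 196, 40          # assumed board-encoding size and hidden units
ALPHA, LAMBDA = 0.1, 0.7       # illustrative learning rate and trace decay

# Single hidden layer, sigmoid activations; V(s) in (0, 1) estimates P(win).
W1 = rng.normal(0, 0.1, (N_HID, N_IN))
W2 = rng.normal(0, 0.1, (1, N_HID))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evaluate(x):
    """Forward pass: return value estimate and hidden activations."""
    h = sigmoid(W1 @ x)
    v = sigmoid(W2 @ h)[0]
    return v, h

def gradients(x, h, v):
    """Gradients of V(s) with respect to W1 and W2 for the sigmoid MLP."""
    dv = v * (1.0 - v)                                  # sigmoid derivative at the output
    gW2 = dv * h[None, :]
    gW1 = (dv * W2[0] * h * (1.0 - h))[:, None] * x[None, :]
    return gW1, gW2

def play_self_play_game(n_moves=30):
    """Hypothetical placeholder: a sequence of positions plus the final result."""
    positions = [rng.random(N_IN) for _ in range(n_moves)]
    outcome = float(rng.integers(0, 2))                 # 1 = win, 0 = loss
    return positions, outcome

def train_one_game():
    """One TD(lambda) pass over a self-play game (no discounting, gamma = 1)."""
    global W1, W2
    positions, outcome = play_self_play_game()
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)       # eligibility traces
    v_prev, h_prev, x_prev = None, None, None
    for x in positions:
        v, h = evaluate(x)
        if v_prev is not None:
            delta = v - v_prev                          # TD error between successive positions
            g1, g2 = gradients(x_prev, h_prev, v_prev)
            e1, e2 = LAMBDA * e1 + g1, LAMBDA * e2 + g2
            W1 += ALPHA * delta * e1
            W2 += ALPHA * delta * e2
        v_prev, h_prev, x_prev = v, h, x
    # Final update moves the last prediction toward the actual game outcome.
    delta = outcome - v_prev
    g1, g2 = gradients(x_prev, h_prev, v_prev)
    e1, e2 = LAMBDA * e1 + g1, LAMBDA * e2 + g2
    W1 += ALPHA * delta * e1
    W2 += ALPHA * delta * e2

for _ in range(10):
    train_one_game()

The same update rule covers the other two training-game generation schemes: only the source of the position sequence changes (an expert opponent or observed expert-versus-expert games instead of self-play).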

Share and Cite:

M. Wiering, "Self-Play and Using an Expert to Learn to Play Backgammon with Temporal Difference Learning," Journal of Intelligent Learning Systems and Applications, Vol. 2 No. 2, 2010, pp. 57-68. doi: 10.4236/jilsa.2010.22009.

Conflicts of Interest

The author declares no conflicts of interest.


Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.