TD-Gammon: From neural network to OS/2 game
by Jeri Dube
When most people think of IBM Research, they tend to think of fractals,
scanning tunneling microscopes, or high-temperature superconductivity.
Games do not usually come to mind. However, the
TD-Gammon game included in the IBM Family FunPak for OS/2 Warp was
developed by IBM Research.
Having been created at such an august place, this version of backgammon ought
to be quite special. Well, to be quite honest and not so
humble, it is! TD-Gammon is the most advanced computer version of
backgammon, and it can play at the highest levels. If the system were a human,
it would be rated as a World Class Master.
TD-Gammon was developed by IBM Research Staff Member Gerry Tesauro.
Gerry is not a game developer; rather, he is a theoretical physicist who has been
working in the area of neural networks and artificial intelligence for several
years. He did not initially intend to develop an OS/2 game for the Family
FunPak. All he wanted to develop was a basic research project to study
learning algorithms that would enable a computer to teach itself a task.
Gerry chose backgammon as the task because it appeared to be a good domain in
which a neural network might work well. Now that I've mentioned it twice, you
may be wondering: what is a neural network? Well, in short, it's a
model of interconnected neurons (also known as nodes) that was inspired by the
biological neurons in the human nervous system. Each connection between neurons has
a particular weight value associated with it.
In the case of backgammon, the state of the backgammon board is fed into
input neurons that have connections to hidden neurons (or units). These hidden
neurons in turn connect to an output layer that holds the value of the state
(that is, the chances of winning from that particular state). Each hidden
neuron computes a weighted linear sum of the values of all the input neurons.
The result of the summation is put through a thresholding function, which
compresses the value to lie within a fixed range. (In case it ever comes up in
conversation, this is known as a squashing function.) The squashing function is
non-linear, and this non-linearity allows the system to learn more complex
functions.
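For the curious, the computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the board is encoded or how many neurons TD-Gammon uses, so the layer sizes, the random stand-in for the board inputs, the logistic form of the squashing function, and the names `squash` and `forward` are all hypothetical.

```python
import math
import random


def squash(z):
    """Squashing function (logistic sigmoid, assumed here): maps any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, output_weights):
    """Return the network's estimate of the chance of winning for one board state."""
    # Each hidden neuron: weighted linear sum of all inputs, then squashed.
    hidden = [squash(sum(w * x for w, x in zip(row, inputs)))
              for row in hidden_weights]
    # The output neuron is assumed to combine the hidden values the same way.
    return squash(sum(w * h for w, h in zip(output_weights, hidden)))


# Toy sizes, chosen only for illustration.
N_INPUTS, N_HIDDEN = 8, 3
rng = random.Random(0)

board = [rng.random() for _ in range(N_INPUTS)]  # stand-in for an encoded board
hidden_weights = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)]
                  for _ in range(N_HIDDEN)]
output_weights = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]

value = forward(board, hidden_weights, output_weights)
print(f"estimated chance of winning: {value:.3f}")
```

With random weights the printed number means nothing yet. It only becomes a useful estimate once the weights have been adjusted by learning.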
To use this model to teach a system backgammon, all the initial weights
between the neurons are randomly set. The neural network starts from the opening
backgammon position and plays both sides until one of the sides wins. The
outcome of the game is used as a reward signal for reinforcement learning. That
is, the neural network takes the outcome of the game and adjusts the weights
accordingly. The adjustments improve the network's ability to evaluate board
states for subsequent plays of the game.
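To make the self-play idea concrete, here is a minimal sketch in Python. It is not TD-Gammon's code. Backgammon is replaced by a made-up dice race (each side needs to travel 6 squares, rolls two dice numbered 1 to 3 on its turn, and picks one of them), and the weight adjustment is a simplified one-step temporal-difference update, which may differ from the exact rule TD-Gammon uses. The game rules, network size, learning rate, and every function name are assumptions made for illustration.

```python
import math
import random

TRACK = 6     # squares each side must travel (toy game, not backgammon)
HIDDEN = 4    # number of hidden neurons (assumed)
ALPHA = 0.2   # learning rate (assumed)


def squash(z):
    return 1.0 / (1.0 + math.exp(-z))


def new_network(rng):
    """Start with random weights: 4 inputs -> HIDDEN neurons -> 1 output."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(HIDDEN)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
    return hidden, output


def evaluate(net, state):
    """Estimate the chance that side 0 wins; also return the values used to get there."""
    left0, left1, turn = state
    x = [left0 / TRACK, left1 / TRACK, float(turn), 1.0]  # last input is a constant bias
    h = [squash(sum(w * xi for w, xi in zip(row, x))) for row in net[0]]
    v = squash(sum(w * hj for w, hj in zip(net[1], h)))
    return v, h, x


def nudge(net, state, target):
    """Adjust the weights so the estimate for `state` moves a little toward `target`."""
    w_hidden, w_out = net
    v, h, x = evaluate(net, state)
    delta = ALPHA * (target - v) * v * (1.0 - v)
    for j in range(HIDDEN):
        back = delta * w_out[j] * h[j] * (1.0 - h[j])  # computed before w_out[j] changes
        w_out[j] += delta * h[j]
        for i in range(len(x)):
            w_hidden[j][i] += back * x[i]


def move(state, die):
    """The side to move advances by `die`; then the turn passes."""
    left0, left1, turn = state
    if turn == 0:
        return (max(left0 - die, 0), left1, 1)
    return (left0, max(left1 - die, 0), 0)


def outcome(state):
    """1.0 if side 0 has reached home, 0.0 if side 1 has, None if still playing."""
    if state[0] == 0:
        return 1.0
    if state[1] == 0:
        return 0.0
    return None


def score(net, state):
    """Known result for a finished game, otherwise the network's estimate."""
    result = outcome(state)
    return result if result is not None else evaluate(net, state)[0]


def self_play_game(net, rng):
    """The network plays both sides of one game, learning after every move."""
    state = (TRACK, TRACK, 0)
    while True:
        options = [move(state, rng.randint(1, 3)) for _ in range(2)]  # roll two dice
        best = max if state[2] == 0 else min   # side 0 wants a high value, side 1 a low one
        nxt = best(options, key=lambda s: score(net, s))
        nudge(net, state, score(net, nxt))     # temporal-difference step
        if outcome(nxt) is not None:
            return outcome(nxt)
        state = nxt


def train(games, seed=0):
    rng = random.Random(seed)
    net = new_network(rng)
    for _ in range(games):
        self_play_game(net, rng)
    return net


net = train(5000)
ahead = evaluate(net, (1, 4, 0))[0]    # side 0 needs 1 more square and is on move
behind = evaluate(net, (3, 1, 1))[0]   # side 1 needs 1 more square and is on move
print(f"side 0 ahead: {ahead:.2f}, side 0 behind: {behind:.2f}")
```

Nobody tells the network which moves are good. The only hard information it ever receives is who won each game, and the temporal-difference step passes that information back to earlier positions one move at a time.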
This learning process is repeated hundreds of thousands of times. Using an
RS/6000 computer, the learning actually took about two weeks. Gerry and his
colleagues were amazed at how well the neural network learned to play
backgammon. The system kept getting better and better until it reached world
class master status. Actually, the neural network could improve its play even
more with further training and a larger network.
Although playing backgammon against a computer that plays as well as a world class
master seems somewhat awe-inspiring, you can work up to it. The game comes with
five skill settings, where each higher setting uses an increasingly larger and
more complex neural network as its underlying engine. If you want to use
TD-Gammon to improve your backgammon skills, it is quite good as a
learning device. Not only do you get feedback from the results of your playing,
but the system is also quite supportive of you. It gives a modest 'I win' message
when you lose and a hearty 'Congratulations, you win!' when the computer loses.
To turn this expert backgammon-playing neural network into an OS/2 game,
IBM Research hired Keith Weiner, a professional PC game developer, to add a
front end written for OS/2's Presentation Manager. TD-Gammon is fully
32-bit and takes full advantage of OS/2 Warp's multi-threading capabilities.
Like all Presentation Manager programs, TD-Gammon comes with a settings
notebook where you can set things such as the background color and the animation
speed.
Given the success of the TD-Gammon game, I asked Gerry what his next
neural network game would be. He told me that researchers have used other games
such as chess, Othello, and Go with varying degrees of success to study neural
network learning. None has been as successful as backgammon. Gerry theorizes
that the stochastic element of backgammon (i.e., throwing the dice) is what makes
backgammon so useful in modeling the self-learning process. With that in mind,
Gerry's next venture into self-learning is with financial time series analysis.
If that project is as successful at learning as the backgammon game, then I'm
really looking forward to that program.
For more information on Gerry's work, you may want to read his article "Temporal Difference Learning
and TD-Gammon" published in Communications of the ACM, volume
38, number 3, pp. 58-68 (March 1995).
The above article originally appeared on IBM's Web site at www.austin.ibm.com. It is reproduced here
with their kind permission.
Stephen Turner, University of Cambridge Statistical Laboratory. E-mail: sret1@cam.ac.uk
Page last modified: 11-Nov-99