Re: computer-go: Temporal Differences again
> Cursory experiments indicate that it is better to play to the end of
> the game, then go back and teach the system that this is the expected
> result for each board position encountered along the way.
I'm not much of an expert on TDL, but here is my understanding of
things, for what it's worth. Maybe it will help, or maybe it's wrong.
(In my chess program, Don Beal implemented the TDL stuff; I only
provided the scaffolding.)
Starting at the final position and working backwards:
Generate a new training signal value. All values are expressed as
probabilities (such as win/loss/draw in chess) and converted to scores
later. At the final position, the training signal is the result of the
game; otherwise it is some fraction of the distance between the
current evaluation and the PREVIOUS training signal. The fraction is
very large, like 95% or something. Even if the evaluation is really
stupid, the training signal from the result of the game should have
heavy influence for a while.
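In rough Python, the backward pass would look something like this (a
sketch only; the names and the 0.95 are illustrative, not lifted from
my program):

def training_signals(positions, evaluate, game_result, lam=0.95):
    # Walk the game backwards, blending each position's evaluation
    # toward the training signal of the position that follows it.
    signals = [0.0] * len(positions)
    signals[-1] = game_result  # final position: the result of the game
    signal = game_result
    for i in reversed(range(len(positions) - 1)):
        value = evaluate(positions[i])
        # move lam of the way from the current evaluation toward the
        # PREVIOUS (later-in-game) training signal
        signal = value + lam * (signal - value)
        signals[i] = signal
    return signals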
I'm not sure how you are using this, but in my chess program with TDL,
a win represents an infinite score, so the evaluation will try to grow
without bound. You can limit this by making full wins impossible,
i.e. a won game is really only 98% won or something like that.
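In other words, something like this (the 0.98 cap and the names are
just illustrative):

WIN_CAP = 0.98  # a "won" game trains toward 0.98, never 1.0

def capped_result(result):
    # Map a game result to a bounded probability the evaluation can
    # actually reach, so it has no incentive to grow forever.
    return {"loss": 1.0 - WIN_CAP, "draw": 0.5, "win": WIN_CAP}[result]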
I hope that is some help.
Don
Date: Tue, 12 Aug 2003 16:08:20 -0700
From: Peter Drake <drake@xxxxxxxxxxxxxxxxx>
I'm having real trouble with my temporal difference learning. As far
as I can tell, the problem stems from the fact that, except at the end
of the game, the reinforcement signal is simply the system's own
estimate of the board value. This noise seems to overwhelm the real
signal that appears at the end of the game.
Cursory experiments indicate that it is better to play to the end of
the game, then go back and teach the system that this is the expected
result for each board position encountered along the way.
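Concretely, I mean something like the following sketch (play_game and
net.train are stand-ins, not my actual code):

def train_on_final_result(net, play_game):
    # Play one complete self-play game, then teach every position
    # along the way that the final result is its expected value.
    positions, result = play_game(net)
    for position in positions:
        net.train(position, target=result)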
Thoughts?
Peter Drake
Assistant Professor of Computer Science
Lewis & Clark College
http://www.lclark.edu/~drake/