
Re: computer-go: Temporal Differences again



> Cursory experiments indicate that it is better to play to the end of 
> the game, then go back and teach the system that this is the expected 
> result for each board position encountered along the way.

I'm not much of an expert on TDL, but here is my understanding of
things, for what it's worth.  Maybe it will help, or maybe it's
wrong.  (In my chess program, Don Beal implemented the TDL machinery;
I only provide the scaffolding around it.)

Starting at the final position and working backwards:

Generate a new training signal value.  All values are expressed as
probabilities (such as win/loss/draw in chess) and converted to
scores later.  At the final position, the training signal is the
result of the game; otherwise it moves some fraction of the distance
from the current evaluation toward the PREVIOUS training signal.  The
fraction is very large, something like 95%.  Even if the evaluation
is really stupid, the training signal from the result of the game
should have heavy influence for a while.
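
To make that concrete, here is a minimal sketch of the backward pass
as I understand it.  The function name, the list-of-probabilities
representation, and the exact 0.95 value are my own illustration, not
code from any actual program:

    def training_targets(evals, result, lam=0.95):
        """Walk the game backwards, producing a training target for
        each position.  evals holds the program's own evaluations
        (as win probabilities) for each position; result is the
        final game outcome (e.g. 1.0 win, 0.5 draw, 0.0 loss)."""
        targets = [0.0] * len(evals)
        signal = result          # final position: the game result itself
        targets[-1] = signal
        for i in range(len(evals) - 2, -1, -1):
            # Move 95% of the way from this position's evaluation
            # toward the training signal of the position after it.
            signal = evals[i] + lam * (signal - evals[i])
            targets[i] = signal
        return targets

With lam that close to 1, every target is pulled almost all the way
to the final result, which sounds like the behaviour Peter found
helpful in his experiments.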

I'm not sure how you are using this, but in my chess program with
TDL, a win represents an infinite score, so the evaluation will try
to grow without bound.  You can limit this by making full wins
impossible, i.e. a won game is really only 98% won, or something like
that.
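
The message doesn't say how the probabilities are converted to
scores; assuming a logistic mapping (a common choice, and purely my
assumption here), the problem with full wins is easy to see: a
probability of 1.0 maps to an infinite score, while a clamped 0.98
stays finite.

    import math

    def prob_to_score(p, scale=1.0, max_p=0.98):
        """Convert a win probability to an evaluation score via the
        inverse logistic (logit).  The scale factor and the 0.98
        clamp are illustrative values, not from any real program."""
        p = min(max(p, 1.0 - max_p), max_p)  # a won game is only 98% won
        return scale * math.log(p / (1.0 - p))

    # prob_to_score(0.98) is about 3.9; an unclamped p of 1.0 would
    # mean an infinite score (and a division by zero here).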

I hope that is of some help.

Don




   Date: Tue, 12 Aug 2003 16:08:20 -0700
   From: Peter Drake <drake@xxxxxxxxxxxxxxxxx>

   I'm having real trouble with my temporal difference learning.  As far 
   as I can tell, the problem stems from the fact that, except at the end 
   of the game, the reinforcement signal is simply the system's own 
   estimate of the board value.  This noise seems to overwhelm the real 
   signal that appears at the end of the game.

   Cursory experiments indicate that it is better to play to the end of 
   the game, then go back and teach the system that this is the expected 
   result for each board position encountered along the way.

   Thoughts?

   Peter Drake
   Assistant Professor of Computer Science
   Lewis & Clark College
   http://www.lclark.edu/~drake/