
Re: computer-go: Temporal Difference Learning



At 03:31 16-7-2003 -0400, Don Dailey wrote:
>
>Hi Vincent,
>
>>  I do not share your TD learning experiences in chess. Neither does the
>>  general consensus in computer chess. All the strong chess programmers
>>  of today I know who have toyed with TD learning came to the
>>  conclusion that ...
>
>Am I really at odds with the other computer chess experts?  
>
>Here is how I summarized my experience with TDLeaf:
>
>  "I  can only say  I really  liked the  way it  played and  that this
>  seemed to be a better way  to choose weights than what I was capable
>  of doing on my own."
>
>
>So nothing you observed was contrary to my own observations.  

My conclusion, and that of everyone I spoke with, was that in a single
sequential pass through the evaluation function I can pick better values
than TD learning will learn in 2 years on a 500-processor machine. This
assumes the evaluation function is that of a chess program of reasonable
strength without too many bugs (and because of that first condition it
trivially has several thousand parameters or more). What does it take me
to go through the parameters of a chess program in one sequential pass?
3 hours, or 6 hours at most for a big evaluation function.

Setting up the TD learning experiment already takes longer than that...
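
For reference, the update rule under discussion is ordinary TD(lambda)
applied to the evaluation weights. A minimal sketch in Python, assuming a
linear evaluation eval(pos) = w . features(pos); this is illustrative
only, and all the names are mine, taken from no real program:

  import numpy as np

  def td_lambda_update(w, feature_seq, rewards, alpha=0.01, lam=0.7,
                       gamma=1.0):
      """One update pass over one self-play game.

      feature_seq: one feature vector per position in the game.
      rewards: zero everywhere except the game result at the end.
      """
      trace = np.zeros_like(w)
      for t in range(len(feature_seq) - 1):
          v_t  = w @ feature_seq[t]
          v_t1 = w @ feature_seq[t + 1]
          delta = rewards[t] + gamma * v_t1 - v_t   # TD error
          # The eligibility trace spreads credit for the error back
          # over earlier positions; this is the only link between the
          # result of a game and the parameters that get changed.
          trace = gamma * lam * trace + feature_seq[t]
          w = w + alpha * delta * trace
      return w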

>Don

>   Date: Wed, 16 Jul 2003 03:38:24 +0100
>   From: Vincent Diepeveen <diep@xxxxxxxxxxxxxxxxx>
>   Cc: computer-go@xxxxxxxxxxxxxxxxx
>
>   Don,
>
>   I do not share your TD learning experiences in chess. Neither does the
>   general consensus in computer chess. All the strong chess programmers
>   of today I know who have toyed with TD learning came to the conclusion
>   that:
>
>     a) It tunes more and more aggressively until the values go
>   completely crazy, and the learning doesn't realize it. Some have very
>   complicated theories why; I'll skip mentioning them. In the end a
>   piece which for sure is worth more than 3 pawns (in some programs
>   that's 3.6 pawns, in others closer to 4) gets tuned down to 2 pawns or
>   so, just to play an attack. The comparison in go would be sacrificing
>   about 10 stones in order to save a group whose territory delivers 2
>   points.
>     b) Complex evaluation functions you can forget about in advance;
>   they do not get tuned properly.
>     c) The accuracy at which it tunes is not good enough. What I mean is
>   that in positions where the correct value of some parameter (say a
>   doubled-pawn penalty) lies somewhere between 0.1 and 0.3 pawns, it
>   picks some crazy value like a bonus of 0.5 pawns. Again some have
>   crazy theories why it does that, but my only theory is that it is
>   randomly flipping a few parameters, basically because there is no
>   software that can lay a relationship between why the program lost and
>   which parameters to change. For other parameters, where it guessed the
>   + and - correctly, it forgets that strong moves sometimes get found at
>   a 0.032 pawn difference, and it is not even tuning correctly within a
>   domain of 1 pawn.
>
>   The original program on which TD learning was tested, KnightCap,
>   showed all these behaviours. That's why the stronger programs always
>   ended up completely annihilating KnightCap after a while.
>
>   Please do not take me wrong: I find the experiments as conducted by
>   the KnightCap programmer a hell of an achievement, because the average
>   'learning' expert in AI doesn't get further than writing a 10-page
>   story on how he imagines learning in software actually works (without
>   any practical experience or any practical proof), because he has heard
>   from other persons, who also only work on paper, that it is maybe
>   possible to make something.
>
>   When KnightCap was really doing too poorly, its programmer usually
>   reset the learning experiment as it had gotten out of hand. Any 2
>   minutes of tuning by my hand (and I'm for sure not the best tuner)
>   trivially produces a KnightCap that beats any learned KnightCap
>   version.
>
>   A basic problem KnightCap had was that it was just too weak to
>   actually do the experiments with. If you make an incredibly weak go
>   program that lays down stones almost at random, so strength 200 kyu,
>   then of course any new algorithm is going to work.
>
>   Basically that is a big problem for scientists, because making a
>   strong program is not easy. That applies to your 4-ply experiments
>   too; those play at utter beginner level, of course.
>
>   The strong and also the commercial programmers make up about 50% of
>   the programmers joining the 2003 world championship. They all drew the
>   same conclusions as above. The rest don't spend time on learning
>   experiments other than book learning.
>
>   At 19:25 15-7-2003 -0400, Don Dailey wrote:
>   >
>   >Markus,
>   >
>   >I don't understand why you say TDLeaf is exponentially slow.
>   >Generating a principal variation from a search is almost free, and
>   >that is what you train against: the final node position of the
>   >principal variation. Are you comparing it to something that doesn't
>   >need to do a search, such as getting positions from game records?
>   >
>   >Don Beal and I did something like this with chess too. We took a
>   >fairly complicated evaluation function and tuned the weights using
>   >TDLeaf. We actually did 4-ply searches and played hundreds of
>   >thousands of games over a several-week period. To save time, we
>   >pre-tuned the weights with 2-ply games until they got fairly stable
>   >and then went from there with the 4-ply games. I also did some things
>   >to optimize the search time of very shallow searches to speed things
>   >up.
>   >
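>   >In code, the update is roughly the following. This is a simplified
>   >sketch, not our actual program; search, evaluate and grad_evaluate
>   >stand in for the engine's own search, evaluation and gradient
>   >routines:
>   >
>   >  # TDLeaf(lambda): temporal differences are taken between the
>   >  # evaluations of successive PV-leaf positions, and the gradient
>   >  # is taken at the leaf rather than at the root position.
>   >  def tdleaf_game_update(w, positions, result, depth=4,
>   >                         alpha=1e-3, lam=0.7):
>   >      # leaf of the principal variation for each root position
>   >      leaves = [search(p, depth).pv[-1] for p in positions]
>   >      values = [evaluate(w, leaf) for leaf in leaves]
>   >      values.append(result)          # terminal value = game result
>   >      for t in range(len(leaves)):
>   >          # lambda-weighted sum of future temporal differences
>   >          td = sum((lam ** (j - t)) * (values[j + 1] - values[j])
>   >                   for j in range(t, len(leaves)))
>   >          w = w + alpha * td * grad_evaluate(w, leaves[t])
>   >      return w
>   >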
>   >From time to time we would play some matches with the hand-tuned
>   >version of the program, and we watched it improve over the weeks. We
>   >were fairly surprised by the results. When we stopped we had a
>   >program that could beat our standard program almost 70% of the time.
>   >When we looked at the weights the program chose, many of them seemed
>   >odd, but the program was indeed better. The best surprise was that it
>   >played much more interesting chess.
>   >
>   >I think one advantage of this kind of thing is that the algorithm is
>   >immune from the fears and prejudices that a human will impose upon it
>   >when engineering the weights manually. In our case, the program was
>   >not afraid to play what seemed like much riskier moves, such as
>   >sacrifices, moves it would never have tried before. But from the
>   >point of view of the algorithm this new style wasn't risky; it was
>   >the surest path to success as measured by TDLeaf.
>   >
>   >One very desirable characteristic of the new evaluation weights was a
>   >de-emphasis on material values. It seems that the values of the
>   >pieces had more to do with their total positional value and less to
>   >do with the static fixed values that we usually assign to pieces.
>   >
>   >It is very hard to claim success despite what I just related, because
>   >it is not clear how good the initial hand-tuned weights actually
>   >were. I can only say I really liked the way it played and that this
>   >seemed to be a better way to choose weights than what I was capable
>   >of doing on my own.
>   >
>   >Unfortunately, the program was in heavy development during the weeks
>   >it was being tuned by TDLeaf. The evaluation changed significantly
>   >and the new weights were out of date. We never actually got to
>   >benefit from the technique, since we did not have the time to start
>   >over.
>   >
>   >
>   >
>   >Don
>   >
>   >
>   >
>   >
>   >
>   >
>   >   Date: Tue, 15 Jul 2003 12:24:19 -0600
>   >   From: Markus Enzenberger <compgo@xxxxxxxxxxxxxxxxx>
>   >
>   >   > > there is an algorithm called TDLeaf, but I am not
>   >   > > convinced that it is useful.
>   >   >
>   >   > A quick web search found a paper by Baxter, Tridgell, and
>   >   > Weaver.  Is this the canonical one?
>   >
>   >   yes.
>   >
>   >   > Also, can you say why you're not convinced this is
>   >   > useful?
>   >
>   >   it was used for training evaluation functions in chess that
>   >   used the material value of the position as input. Then you
>   >   have the disadvantage that the material value can change at
>   >   every move during an exchange of pieces, which would give you
>   >   horrible training patterns. TDLeaf avoids this by using search
>   >   to get more appropriate target positions for training (e.g.
>   >   after the exchange has happened).
>   >   But you pay a very high price for it, because move selection
>   >   during self-play is now exponentially slower. IMHO it would
>   >   have been better to do a quiescence search to determine the
>   >   material value of a position used as input for the evaluation
>   >   function, and to choose the moves during self-play by 1-ply
>   >   look-ahead, as sketched below. However, I haven't performed
>   >   any experiments, and the neural network in NeuroGo is much
>   >   too slow to use TDLeaf.
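>   >
>   >   A rough sketch of what I mean, in Python (untested; legal_moves,
>   >   play, qsearch_material and net are placeholders for the
>   >   program's own routines):
>   >
>   >     def choose_move(pos, net):
>   >         best_move, best_value = None, float('-inf')
>   >         for move in legal_moves(pos):
>   >             child = play(pos, move)
>   >             # a quiescence search resolves pending captures, so
>   >             # the material input to the evaluation is a quiet,
>   >             # stable value instead of a mid-exchange one
>   >             material = qsearch_material(child)
>   >             value = net.evaluate(child, material)
>   >             if value > best_value:
>   >                 best_move, best_value = move, value
>   >         return best_move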
>   >
>   >   > > NeuroGo in its most recent version uses local
>   >   > > connectivity and single-point eyes as additional
>   >   > > outputs that are trained with TD. I will present a
>   >   > > paper about this at ACG2003 which takes place together
>   >   > > with the Computer Olympiad in Graz/Austria in November.
>   >   >
>   >   > So when and how do those of us stuck stateside get ahold
>   >   > of it?  :-)
>   >
>   >   I'll put the paper online when the final version is ready.
>   >
>   >   - Markus
>   >
>   >
>   >
>
>
>