
Re: computer-go: Temporal Difference Learning



Don,

I do not share your positive experiences with TD learning in chess, and
neither does the general consensus in computer chess. All the strong chess
programmers I know who have experimented with TD learning came to the
following conclusions:

  a) The tuning becomes more and more aggressive until it goes completely
crazy, and the learning never realizes it. Some have very complicated
theories as to why; I'll skip those. In the end, a piece which is certainly
worth more than 3 pawns (in some programs 3.6 pawns, in others closer to 4)
gets pushed ever more aggressively down to 2 pawns or so, just to play an
attack. The Go equivalent would be sacrificing about 10 stones to save a
group occupying a territory worth 2 points.
  b) For complex evaluation functions, you can forget in advance that they
will ever get tuned properly.
  c) The accuracy at which it tunes is not good enough. What I mean is that
where the correct value of some parameter (say a doubled-pawn penalty) lies
somewhere between 0.1 and 0.3 pawns, it ends up at some crazy value like a
bonus of 0.5 pawns. Again, some have elaborate theories about why it does
that, but my only theory is that it is flipping a few parameters essentially
at random, because no software can establish the relationship between why
the program lost and which parameters to change (see the sketch after this
list). Even for parameters where it guesses the sign correctly, it forgets
that strong moves are sometimes found on a difference of 0.032 pawns, while
it is not even tuning correctly within a range of 1 pawn.
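
To make concrete what kind of tuning is being criticized here, below is a
minimal sketch of a TD(lambda)-style weight update for a linear evaluation
function. All names, the feature representation, and the learning constants
are illustrative assumptions only, not taken from any particular engine; the
point is that credit for a game result gets spread blindly over every
parameter sitting in the eligibility trace, with no model of why the game
was actually lost.

    # Minimal TD(lambda) sketch for tuning a linear evaluation
    #   eval(pos) = tanh(dot(weights, features(pos)))
    # Feature vectors (material balance, doubled pawns, ...) and all
    # names here are illustrative only.
    import numpy as np

    def td_lambda_update(weights, feature_vectors, outcome,
                         alpha=0.01, lam=0.7):
        """One learning pass over a finished game.

        feature_vectors: one feature array per evaluated position
        outcome:         final result from the learner's point of view
                         (+1 win, 0 draw, -1 loss)
        """
        values = [float(np.tanh(np.dot(weights, f))) for f in feature_vectors]
        # The game result serves as the "prediction" after the last position.
        targets = values[1:] + [float(outcome)]
        trace = np.zeros_like(weights)
        for f, v, target in zip(feature_vectors, values, targets):
            grad = (1.0 - v * v) * f      # derivative of tanh(w.f) w.r.t. w
            trace = lam * trace + grad    # eligibility trace
            # Every parameter in the trace gets nudged, whether or not it
            # had anything to do with this temporal-difference error.
            weights = weights + alpha * (target - v) * trace
        return weights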

KnightCap, the original program on which TD learning was tested, showed all
of these behaviours. That is why, after a while, the stronger programs
always completely annihilated KnightCap.

Don't get me wrong: I consider the experiments conducted by the KnightCap
programmer a hell of an achievement, because the average 'learning' expert
in AI gets no further than writing a 10-page story on how he imagines
learning in software actually works (without any practical experience or
practical proof), because he has heard from other people who also only work
on paper that it might be possible to make something.

When KnightCap was really doing too poorly, its programmer usually reset the
learning experiment because it had gotten out of hand. Two minutes of hand
tuning by me (and I am certainly not the best tuner) trivially produces a
KnightCap that beats any TD-tuned KnightCap version.

A basic problem with KnightCap was that it was simply too weak to do these
experiments with in the first place. If you make an incredibly weak Go
program that places stones on the board nearly at random, say at 200-kyu
strength, then of course any new algorithm is going to work.

Basically that is a big problem for scientists, because making a strong
program is not easy. The same applies to your 4-ply experiments; those play
at an utter beginner level, of course.

The strong programmers, including many commercial ones, make up about 50% of
the programmers joining the 2003 world championship. They all drew the same
conclusions as above. The rest do not waste time on learning experiments
other than book learning.

At 19:25 15-7-2003 -0400, Don Dailey wrote:
>
>Markus,
>
>I  don't  understand  why   you  say  TDLeaf  is  exponentially  slow.
>Generating a principal variation from a search is almost free and that
>is what  you train against, the  final node position  of the principal
>variation.  Are you  comparing to something that doesn't  need to do a
>search such as getting positions from game records?
>
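
A minimal sketch of the TDLeaf idea described here, under the assumption of
a hypothetical engine interface (initial_position, search, features,
make_move, and result are placeholder names): the search already yields a
principal variation, and it is the leaf of that variation, rather than the
root position, that is fed to the TD update.

    def collect_tdleaf_leaves(engine, weights, depth=4):
        """Play one self-play game, recording the principal-variation leaves."""
        leaf_features = []
        pos = engine.initial_position()
        while not engine.game_over(pos):
            # The search returns both the chosen move and the principal
            # variation, so extracting the PV leaf costs essentially nothing.
            move, pv = engine.search(pos, weights, depth)
            leaf_features.append(engine.features(pv[-1]))
            pos = engine.make_move(pos, move)
        # The leaf positions, not the root positions, are then fed to a TD
        # update such as the td_lambda_update sketch earlier in this message.
        return leaf_features, engine.result(pos)
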
>Don Beal and I did something like  this also with chess.   We took a
>fairly  complicated evaluation  function and  tuned the  weights using
>TDLeaf.   We  actually did  4-ply  searches  and played  hundreds  of
>thousands of games over a period of several weeks.  To save time, we
>pre-tuned the  weights with 2-ply  games until they  got fairly stable
>and then went from there with the 4-ply games.  I also did some things
>to optimize the  search time of very shallow  searches to speed things
>up.
>
>From  time to  time we  would play  some matches  with the  hand tuned
>version of the  program and we watched it improve  over the weeks.  We
>were  fairly surprised  by  the results.   When  we stopped  we had  a
>program that could  beat our standard program almost  70% of the time.
>When we looked  at the weights the program chose,  many of them seemed
>odd, but the program was indeed better.  The best surprise was that it
>played much more interesting chess.
>
>I think one  advantage of this kind of thing is  that the algorithm is
>immune from the fears and prejudices that a human will impose upon it
>when engineering the  weights manually.  In our case,  the program was
>not  afraid  to play  what  seemed like  much  riskier  moves such  as
>sacrifices,  moves it  would never  have tried  before.  But  this new
>style from the point of view of the algorithm wasn't risky, it was the
>surest path to success as measured by TDLeaf.
>
>One very desirable characteristic of  the new evaluation weights was a
>de-emphasis  on material  values.  It  seems  that the  values of  the
>pieces had  more to do with  their total positional value  and less
>with the static fixed values that we usually assign to pieces.
>
>It is very  hard to claim success despite what  I just related because
>it is not clear how good  the initial hand-tuned weights actually were.
>I can only say  I really liked the way it played  and that this seemed
>to be a better way to choose  weights than what I was capable of doing
>on my own.
>
>Unfortunately, the  program was in heavy development  during the weeks
>it was  being tuned by  TDLeaf.  The evaluation  changed significantly
>and  the new  weights were  out  of date.   We never  actually got  to
>benefit from  the technique since  we did not  have the time  to start
>over.
>
>
>
>Don
>
>
>
>
>
>
>   Date: Tue, 15 Jul 2003 12:24:19 -0600
>   From: Markus Enzenberger <compgo@xxxxxxxxxxxxxxxxx>
>
>   > > there is an algorithm called TDLeaf, but I am not
>   > > convinced that it is useful.
>   >
>   > A quick web search found a paper by Baxter, Tridgell, and
>   > Weaver.  Is this the canonical one?
>
>   yes.
>
>   > Also, can you say why you're not convinced this is
>   > useful?
>
>   it was used for training evaluation functions in chess that
>   used the material value of the position as input.
>   Then you have the disadvantage that the material value
>   can change at every move during an exchange of pieces,
>   which would give you horrible training patterns.
>   TDLeaf avoids this by using search to get more appropriate 
>   target positions for training (e.g. after the exchange has 
>   happened).
>   But you pay a very high price for it, because move 
>   generation during self-play is now exponentially slower.
>   IMHO it would have been better to do a quiescence search to 
>   determine the material value of a position used as input 
>   for the evaluation function, and to choose the moves during 
>   self-play by 1-ply look-ahead.
>   However, I haven't performed any experiments, and the neural 
>   network in NeuroGo is much too slow to use TDLeaf.
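
A rough sketch of the alternative suggested here, again with a hypothetical
engine interface (quiescence_search, legal_moves, make_move, and features
are placeholder names): captures are first resolved by a quiescence search
so that the material input to the evaluation stays stable, and self-play
moves are chosen by a plain 1-ply look-ahead rather than a deeper TDLeaf
search.

    def quiet_material(engine, pos):
        # Material balance after the quiescence search has resolved the
        # captures, so the training input does not flip mid-exchange.
        return engine.quiescence_search(pos).material

    def choose_move_1ply(engine, pos, evaluate):
        # Plain 1-ply look-ahead: evaluate every successor once, pick the
        # best.  evaluate() is assumed to score the child position from
        # the point of view of the side that just moved.
        best_move, best_value = None, float("-inf")
        for move in engine.legal_moves(pos):
            child = engine.make_move(pos, move)
            value = evaluate(quiet_material(engine, child),
                             engine.features(child))
            if value > best_value:
                best_move, best_value = move, value
        return best_move
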
>
>   > > NeuroGo in its most recent version uses local
>   > > connectivity and single-point eyes as additional
>   > > outputs that are trained with TD. I will present a
>   > > paper about this at ACG2003 which takes place together
>   > > with the Computer Olympiad in Graz/Austria in November.
>   >
>   > So when and how do those of us stuck stateside get ahold
>   > of it?  :-)
>
>   I'll put the paper online when the final version is ready.
>
>   - Markus
>
>
>