Re: computer-go: Temporal Difference Learning
Don,
I do not share your TD learning experiences in chess, and neither does the
general consensus in computer chess. All the strong chess programmers of
today I know who have toyed with TD learning came to these conclusions:
a) It keeps tuning more and more aggressively until it goes completely
crazy, and the learning doesn't realize it. Some have very complicated
theories why; I'll skip mentioning them. In the end a piece which is for
sure worth more than 3 pawns (in some programs that's 3.6 pawns, in others
closer to 4 pawns) gets pushed down to 2 pawns or so, just so the program
can play an attack. The go equivalent would be sacrificing about 10 stones
in order to save a group whose territory is worth 2 points.
b) For complex evaluation functions you can forget in advance about getting
them tuned this way.
c) The accuracy at which it tunes is not good enough. What I mean is that
when the correct value of some parameter (for example the doubled-pawn
penalty) lies somewhere between 0.1 and 0.3 pawns, it ends up with a crazy
value such as a bonus of 0.5 pawns. Again some have crazy theories why it
does that, but my only theory is that it is flipping a few parameters more
or less at random, because there is no software that can relate why the
program lost to which parameters should change. Even for parameters where
it guessed the sign correctly, it forgets that strong moves are sometimes
found on a 0.032-pawn difference, while it is not even tuning correctly
within a margin of 1 pawn.
Also the original program on which TD learning was tested, KnightCap,
showed all these behaviours. That's why all the stronger programs always
ended up completely annihilating KnightCap after a while.
Please don't take me wrong: I find the experiments conducted by the
KnightCap programmer a hell of an achievement, because the average
'learning' expert in AI doesn't get further than writing a 10-page story on
how he imagines learning in software actually works (without any practical
experience or any practical proof), because he has heard from other people
who also only work on paper that it might be possible to make something.
When KnightCap was really doing too poorly, its programmer usually reset
the learning experiment because it had gotten out of hand. With any
2-minute hand tuning from me (and I'm for sure not the best tuner),
KnightCap trivially beats any learned KnightCap version.
A basic problem with KnightCap was that it was simply too weak to do the
experiments with in the first place. If you make an incredibly weak go
program that lays stones down almost at random, say 200 kyu in strength,
then of course any new algorithm is going to work.
Basically that is a big problem for scientists, because making a strong
program is not easy. That applies to your 4-ply experiments too; those of
course play at utter beginner level.
The strong and also the commercial programmers make up about 50% of the
programmers joining the 2003 world championship. They all drew the same
conclusions as above. The rest don't waste time on learning experiments
other than book learning.
At 19:25 15-7-2003 -0400, Don Dailey wrote:
>
>Markus,
>
>I don't understand why you say TDLeaf is exponentially slow.
>Generating a principal variation from a search is almost free and that
>is what you train against, the final node position of the principal
>variation. Are you comparing to something that doesn't need to do a
>search such as getting positions from game records?
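>
>In case it helps, here is roughly the shape of the update for a linear
>evaluation (a simplified Python sketch, not our actual code; the helpers
>features(pos) and pv_leaf(pos, depth) are placeholders for whatever the
>engine provides):
>
>    import numpy as np
>
>    def tdleaf_update(w, positions, features, pv_leaf,
>                      depth=4, alpha=1e-3, lam=0.7):
>        # Evaluate the leaf of the principal variation of each position.
>        xs = [np.asarray(features(pv_leaf(p, depth))) for p in positions]
>        evals = [float(np.dot(w, x)) for x in xs]
>        # Temporal differences between successive leaf evaluations.
>        deltas = [evals[t + 1] - evals[t] for t in range(len(evals) - 1)]
>        # TDLeaf(lambda): push each leaf evaluation toward the discounted
>        # sum of the later temporal differences.
>        for t in range(len(deltas)):
>            target = sum(lam ** (j - t) * deltas[j]
>                         for j in range(t, len(deltas)))
>            w = w + alpha * target * xs[t]  # gradient of a linear eval
>        return w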
>
>Don Beal and I also did something like this with chess. We took a
>fairly complicated evaluation function and tuned the weights using
>TDLeaf. We actually did 4-ply searches and played hundreds of
>thousands of games over a period of several weeks. To save time, we
>pre-tuned the weights with 2-ply games until they got fairly stable
>and then went from there with the 4-ply games. I also did some things
>to optimize the search time of very shallow searches to speed things
>up.
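>
>The overall schedule was roughly the sketch below, reusing the
>tdleaf_update sketch above (illustrative only; new_game_positions(w, depth)
>is a hypothetical self-play generator and the game counts are not our
>exact numbers):
>
>    def train(w, new_game_positions, features, pv_leaf):
>        # Phase 1: cheap 2-ply games until the weights settle.
>        for _ in range(100000):
>            w = tdleaf_update(w, new_game_positions(w, 2),
>                              features, pv_leaf, depth=2)
>        # Phase 2: continue from there with 4-ply games.
>        for _ in range(100000):
>            w = tdleaf_update(w, new_game_positions(w, 4),
>                              features, pv_leaf, depth=4)
>        return w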
>
>From time to time we would play some matches with the hand tuned
>version of the program and we watched it improve over the weeks. We
>were fairly surprised by the results. When we stopped we had a
>program that could beat our standard program almost 70% of the time.
>When we looked at the weights the program chose, many of them seemed
>odd, but the program was indeed better. The best surprise was that it
>played much more interesting chess.
>
>I think one advantage of this kind of thing is that the algorithm is
>immune to the fears and prejudices that a human will impose upon it
>when engineering the weights manually. In our case, the program was
>not afraid to play what seemed like much riskier moves such as
>sacrifices, moves it would never have tried before. But from the
>algorithm's point of view this new style wasn't risky; it was the
>surest path to success as measured by TDLeaf.
>
>One very desirable characteristic of the new evaluation weights was a
>de-emphasis on material values. It seems that the values of the
>pieces had more to do with their total positional value and less with
>the static fixed values that we usually assign to pieces.
>
>Despite what I just related, it is very hard to claim success, because
>it is not clear how good the initial hand-tuned weights actually were.
>I can only say I really liked the way it played and that this seemed
>to be a better way to choose weights than what I was capable of doing
>on my own.
>
>Unfortunately, the program was in heavy development during the weeks
>it was being tuned by TDLeaf. The evaluation changed significantly
>and the new weights were out of date. We never actually got to
>benefit from the technique since we did not have the time to start
>over.
>
>
>
>Don
>
> Date: Tue, 15 Jul 2003 12:24:19 -0600
> From: Markus Enzenberger <compgo@xxxxxxxxxxxxxxxxx>
>
> > > there is an algorithm called TDLeaf, but I am not
> > > convinced that it is useful.
> >
> > A quick web search found a paper by Baxter, Tridgell, and
> > Weaver. Is this the canonical one?
>
> yes.
>
> > Also, can you say why you're not convinced this is
> > useful?
>
> it was used for training evaluation functions in chess that
> use the material value of the position as input.
> Then you have the disadvantage that the material value
> can change at every move during an exchange of pieces,
> which would give you horrible training patterns.
> TDLeaf avoids this by using search to get more appropriate
> target positions for training (e.g. after the exchange has
> happened).
> But you pay a very high price for it, because move
> selection during self-play is now exponentially slower.
> IMHO it would have been better to do a quiescence search
> to determine the material value of a position used as input
> to the evaluation function, and to choose the moves during
> self-play by 1-ply look-ahead.
> However I haven't performed any experiments and the neural
> network in NeuroGo is much too slow to use TDLeaf.
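>
> Something like this sketch is what I have in mind (untested, just to
> illustrate the idea; legal_moves, play, quiescence_material and
> other_features are hypothetical helpers):
>
>     def evaluate(pos, net, quiescence_material, other_features):
>         # Resolve captures first so the material input is stable,
>         # then let the (slow) network judge the quiet position.
>         material = quiescence_material(pos)
>         return net(material, other_features(pos))
>
>     def choose_move(pos, net, legal_moves, play,
>                     quiescence_material, other_features):
>         # 1-ply look-ahead: evaluate every successor position and take
>         # the best one from the mover's point of view.
>         return max(legal_moves(pos),
>                    key=lambda m: -evaluate(play(pos, m), net,
>                                            quiescence_material,
>                                            other_features))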
>
> > > NeuroGo in its most recent version uses local
> > > connectivity and single-point eyes as additional
> > > outputs that are trained with TD. I will present a
> > > paper about this at ACG2003 which takes place together
> > > with the Computer Olympiad in Graz/Austria in November.
> >
> > So when and how do those of us stuck stateside get ahold
> > of it? :-)
>
> I'll put the paper online when the final version is ready.
>
> - Markus
>
>
>