Re: [computer-go] SlugGo vs Many Faces, newest data
Handicap = 2
42 games
Black won 11 (by: 11, 3, 45, 66, 31, 36, 3, 17, 27, 7, 3)
White won 31 (by: 33, 31, 40, 2, 25, 23, 9, 56, 57, 17, 14, 21,
65, 37, 77, 16, 91, 51, 78, 15, 4, 2, 74, 6, 35, 23, 30, 18, 9, 31, 51)

Handicap = 3
6 games
Black won 4 (by: 25, 135, 30, 107)
White won 2 (by: 21, 51)

Handicap = 4
18 games
Black won 8 (by: 11, 36, 58, 15, 12, 6, 13, 56)
White won 9 (by: 11, 104, 79, 49, 20, 72, 51, 24, 10)
One tie
I feel that these results are very good evidence that the new programme is
at least two stones stronger than Many Faces, and probably better than
that. I think the 31-11 result alone gives a confidence level of about
98.5% that it is two stones stronger. The confidence intervals most often
used in my experience are the 95% ones.
To take a Bayesian approach, one needs to specify priors (that is, the
probability that SlugGo is better than Many Faces, estimated before they
have been tested). This has to involve a lot of subjective opinion, I
think, but the fact that people were prepared to put in the effort to code
up the experiment suggests to me that they thought it would probably give a
reasonable improvement, and I can think of no reason to disagree with
them. For this reason, I think the uniform (50-50) prior is
sensible. Accepting this, there is a 98.5% chance that SlugGo is two stones
stronger than MF, and the other test data only go to make this more convincing.
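The calculation behind that figure isn't shown above; as a rough sketch, one
could model each game as an independent coin flip and compute an exact
one-sided binomial tail for 31 or more wins out of 42, under the null
hypothesis that the two programmes are evenly matched at this handicap. (The
function name below is mine, and this simple model is only one plausible
choice; the exact figure depends on the model chosen.)

```python
from math import comb

def one_sided_binomial_tail(wins, games, p_null=0.5):
    """P(at least `wins` wins in `games` games) under the null win rate."""
    return sum(
        comb(games, k) * p_null**k * (1 - p_null)**(games - k)
        for k in range(wins, games + 1)
    )

# White (SlugGo giving two stones) won 31 of the 42 handicap-2 games.
tail = one_sided_binomial_tail(31, 42)
confidence = 1 - tail
print(f"one-sided tail = {tail:.4f}, confidence = {confidence:.2%}")
```

Under this coin-flip model the confidence comes out somewhat above 99%, so
whatever method produced the 98.5% figure was evidently a little more
conservative.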
I disagree with the person who was suggesting we should use very strong
confidence limits. They are only appropriate if you are trying to prove
something which a priori seems very unlikely, or if a false positive has
very bad consequences.
Don, I think your observations about new ideas giving odd results at first
must be due to some psychological factor. Perhaps you tend to remember the
surprising results, or perhaps you test a lot of ideas that don't lead to
improved performance, but quickly dismiss those that don't get off to a
good start in the tests.
There is an area of statistics called sequential testing, which is about
making a decision based on as few tests as possible. The method you use
(waiting for one programme or the other to take a lead of n games) is
precisely what sequential testing proves is the optimal test for deciding
which of the programmes is stronger (the lead each programme needs to be
declared the winner can be varied independently, to account for a greater
cost of incorrectly passing one programme, or for some a priori
information).
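As a toy illustration (not from the discussion above), the lead-of-n rule
can be simulated to see how often it picks the genuinely stronger
programme; the 60% win probability and lead of 10 below are arbitrary
choices of mine.

```python
import random

def lead_of_n_test(p_win, lead, rng):
    """Play games until one side leads by `lead`; return True if side A
    is declared stronger. `p_win` is A's true per-game win probability."""
    diff = 0  # A's wins minus B's wins
    while abs(diff) < lead:
        diff += 1 if rng.random() < p_win else -1
    return diff > 0

rng = random.Random(0)
trials = 2000
# If A truly wins 60% of games and we require a lead of 10,
# how often does the test correctly declare A the stronger programme?
correct = sum(lead_of_n_test(0.6, 10, rng) for _ in range(trials)) / trials
print(f"correct decision rate: {correct:.1%}")
```

By a gambler's-ruin argument the error probability of this rule is
r^lead / (1 + r^lead) with r = (1 - p_win) / p_win, which for these numbers
is under 2%, in line with what the simulation reports.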
I agree with Mark that these results seem to show a big step in go
performance, and I'd love to see SlugGo compete in the computer go
ladder. Is it too slow at present?
Tom
_______________________________________________
computer-go mailing list
computer-go@xxxxxxxxxxxxxxxxx
http://www.computer-go.org/mailman/listinfo/computer-go/