
Re: [computer-go] SlugGo vs Many Faces, newest data

Handicap = 2
42 games
Black won 11 (by: 11, 3, 45, 66, 31, 36, 3, 17, 27, 7, 3)
White won 31 (by: 33, 31, 40, 2, 25, 23, 9, 56, 57, 17, 14, 21, 65, 37, 77, 16, 91, 51, 78, 15, 4, 2, 74, 6, 35, 23, 30, 18, 9, 31, 51)

Handicap = 3
6 games
Black won 4 (by: 25, 135, 30, 107)
White won 2 (by: 21, 51)

Handicap = 4
18 games
Black won 8 (by: 11, 36, 58, 15, 12, 6, 13, 56)
White won 9 (by: 11, 104, 79, 49, 20, 72, 51, 24, 10)
One tie

I feel that these results are very good evidence that the new programme is at least two stones
stronger than Many Faces, and probably better than that. I think the 31-11 result alone gives
a confidence level of about 98.5% that it is two stones stronger. The confidence levels most
often used, in my experience, are the 95% ones.
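As a sanity check on figures like this, the one-sided binomial tail can be computed directly. This is only a sketch, under the simplifying assumption that each game is an independent coin flip with some fixed per-game win probability, and that "evenly matched at this handicap" means p = 0.5:

```python
from math import comb

def confidence_stronger(wins, losses):
    """One-sided binomial test: returns 1 minus the probability of
    seeing at least `wins` wins in wins+losses games if the two
    programs were in fact evenly matched (p = 0.5 per game)."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return 1.0 - tail

print(confidence_stronger(31, 11))
```

On the 31-11 result this comes out comfortably above the 98.5% quoted above; the exact value depends on how one treats ties and the one- versus two-sided choice.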

To take a Bayesian approach, one needs to specify a prior (that is, the probability, estimated before any testing, that SlugGo is better than Many Faces). This inevitably involves a lot of subjective opinion, but the fact that people were prepared to put in the effort to code up the experiment suggests to me that they thought it would probably give a reasonable improvement, and I can think of no reason to disagree with them. For this reason, I think the uniform (50-50) prior is sensible. Accepting this, there is a 98.5% chance that SlugGo is two stones stronger than Many Faces, and the other test data only make this more convincing.
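The uniform-prior calculation has a closed form: with a Beta(1,1) prior on the per-game win probability p, the posterior after 31 wins and 11 losses is Beta(32, 12), and the quantity of interest is P(p > 1/2). A sketch using only the standard library (the identity I_x(a, b) = P(Binomial(a+b-1, x) >= a), valid for integer a and b, avoids needing scipy):

```python
from math import comb

def posterior_prob_stronger(wins, losses):
    """Posterior P(p > 1/2) under a uniform Beta(1,1) prior on the
    per-game win probability p, after observing `wins` wins and
    `losses` losses.  The Beta CDF at 1/2 is evaluated via the
    binomial identity above, so only the stdlib is needed."""
    a, b = wins + 1, losses + 1        # Beta(a, b) posterior
    n = a + b - 1
    p_le_half = sum(comb(n, j) for j in range(a, n + 1)) / 2 ** n
    return 1.0 - p_le_half

print(posterior_prob_stronger(31, 11))
```

Again this lands above 98.5% for the 31-11 data; the same caveats about independence and ties apply.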

I disagree with the person who was suggesting we should use very strong confidence limits. They are only appropriate if you are trying to prove something which a priori seems very unlikely, or if a false positive has very bad consequences.

Don, I think your observations about new ideas giving odd results at first must be due to some psychological factor. Perhaps you tend to remember the surprising results, or perhaps you test a lot of ideas that don't lead to improved performance, and quickly dismiss those that don't get off to a good start in the tests.

There is an area of statistics called sequential testing, which is about making a decision from as few trials as possible. The method you use (waiting for one programme or the other to take a lead of n games) is precisely the test that sequential analysis shows to be optimal for deciding which programme is stronger. (The lead required in each direction can be set independently, to account for a greater cost of incorrectly passing one programme, or for a priori information.)
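The lead-by-n rule is a random walk on the score difference, stopped at +/-n, so its reliability has a closed form (the gambler's-ruin probability). A hypothetical sketch, again treating games as independent with a fixed win probability p and ignoring ties and asymmetric thresholds:

```python
def prob_correct_decision(p, lead):
    """Probability that a 'first to lead by `lead` games' match
    declares the genuinely stronger program the winner, when that
    program wins each game independently with probability p > 1/2.
    Gambler's-ruin formula for a walk stopped at +lead or -lead."""
    r = (1 - p) / p                    # odds of losing a single game
    return 1.0 / (1.0 + r ** lead)

# Doubling the required lead squares the odds of a wrong decision:
for lead in (2, 4, 8):
    print(lead, prob_correct_decision(0.6, lead))
```

This makes the trade-off explicit: a larger required lead buys reliability at the cost of more games, and the closer p is to 1/2, the longer the walk takes to terminate.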

I agree with Mark that these results seem to show a big step in go performance, and I'd love to see SlugGo compete in the computer go ladder. Is it too slow at present?

Tom
_______________________________________________
computer-go mailing list
computer-go@xxxxxxxxxxxxxxxxx
http://www.computer-go.org/mailman/listinfo/computer-go/