[computer-go] Statistical significance (was: SlugGo vs Many Faces, newest data)
On Tue, 7 Sep 2004, Tom Cooper wrote:
> I disagree with the person who was suggesting we should use very strong
> confidence limits. They are only appropriate if you are trying to prove
> something which a priori seems very unlikely, or if a false positive has
> very bad consequences.
I think it is a priori very unlikely that a modification will improve the
performance of gnugo. :-) Seriously, of course all depends on what you
want to use the statistical test for - weak evidence may be all that's
required to answer questions such as "Should I continue this line of
work?" or "Should I try to duplicate this approach?" We should not,
however, harbour illusions about the statistical strength of evidence.
One problem with confidence limits is that to non-experts they tend
to look much stronger than to statisticians. If you're analyzing the
data available in some limited realm - say, cancer cases - and get 97%
confidence that one of a few plausible factors affects the outcome,
that may be strong enough to make major public policy decisions.
In algorithm development, however, where there's an infinite supply
of plausible ideas and data, 97% is not strong at all - for instance,
assuming that just 15 computer go ideas are developed and tested each
year, every other year one of them would appear to be a significant
advance at 97% confidence even if there was no real progress at all.
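To make that arithmetic concrete, here is a back-of-the-envelope
sketch in Python (the 3% false-positive rate and the 15 ideas per year
are just the assumed numbers from above):

    # Repeated testing at 97% confidence: how often does a useless
    # idea slip through?
    alpha = 0.03           # 1 - 0.97: chance a useless idea looks significant
    ideas_per_year = 15    # assumed number of ideas tested per year

    print(ideas_per_year * alpha)             # ~0.45 false positives/year,
                                              # i.e. one every other year
    print(1 - (1 - alpha) ** ideas_per_year)  # ~0.37: chance of at least one
                                              # false positive in a given year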
In my work (machine learning algorithm development), if I can't quickly
reach 3-5 "nines" of confidence that some new idea is better on a simple
benchmark, it's not worth pursuing. (If I do reach that confidence level,
the real work starts.) I'd say computer go lies somewhere in between.
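For a rough sense of what those "nines" cost in games, here is a
minimal sketch under the simplest assumption I can think of: a
one-sided binomial test against a 50% baseline, with the new idea
winning every single game (any losses push the counts up considerably):

    import math

    # Straight wins needed before a naive binomial test reaches k nines
    # of confidence that the new idea beats a 50/50 baseline.
    for nines in (3, 4, 5):
        alpha = 10.0 ** -nines        # e.g. 3 nines -> p < 0.001
        games = math.ceil(math.log(alpha) / math.log(0.5))
        print(nines, games)           # 3 -> 10, 4 -> 14, 5 -> 17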
Even worse, all these tests rest on the assumption that the data samples
are statistically independent, which is patently false in computer go
unless one of the programs plays essentially at random. To give an extreme
example, if two programs are fully deterministic, one will keep winning,
and after a few games you'll claim with very high confidence that that
one is better - when in fact you've just played the same single game
over and over.
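In code, the trap looks like this (a toy sketch of the naive binomial
test that such win/loss comparisons rest on):

    # Two fully deterministic programs replay the same single game n times.
    n = 10                   # games played
    wins = n                 # deterministic: the same program wins every time
    p_value = 0.5 ** wins    # one-sided binomial test under "equally strong"
    print(1 - p_value)       # ~0.999 "confidence", extracted from what is
                             # really a single independent data point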
Yes, go programs are randomized, but they do tend to repeat the same
patterns from game to game. This means you need to play many more
games than independence-based statistical tests suggest before you can
be really confident about which program is better. A better remedy
would be to introduce more independence between games, e.g., by playing
against a variety
of opponents. (Although the existence of this mailing list suggests
that the different go programs are not completely independent of each
other either!)
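For a rough quantitative handle on this, here is a sketch borrowing
the "design effect" from survey sampling - under the admittedly strong
assumption that every pair of game outcomes is equally correlated with
coefficient rho, the variance of the observed win rate is inflated by
1 + (n - 1) * rho, so n correlated games carry only the information of
n / (1 + (n - 1) * rho) independent ones:

    # Effective number of independent games under an assumed pairwise
    # correlation rho between game outcomes (equicorrelation model).
    def effective_games(n, rho):
        return n / (1 + (n - 1) * rho)

    print(effective_games(100, 0.0))  # 100.0: independent games count in full
    print(effective_games(100, 0.1))  # ~9.2: mild correlation wastes most games

Even a small rho makes hundreds of games behave like a handful, which
is the sense in which far more games are needed.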
NB, I'm not trying to belittle these particular results at all - they
do look good, if tentative. What I am cautioning against is confusing
statistical statements based on questionable independence assumptions
with some Platonic truth. Caveat emptor. Lies, damned lies, etc.
Regards,
- nic
--
Dr. Nicol N. Schraudolph http://n.schraudolph.org/
Sonnenkopfweg 17
D-87527 Sonthofen, Germany