
[computer-go] Re: Statistical Significance (was: SlugGo v.s. ManyFaces, newest data)




On Mon, 6 Sep 2004, David G Doshay wrote:

> On Sep 6, 2004, at 8:38 PM, Douglas Ridgway wrote:

> > I wrote something for gnugo-devel on March 2, 2004 on essentially this.
> > Fortunately, neither 10/13 nor 11/14 is quite long enough to meet a 5%
> > test, so even if you got only one of these streaks then stopped, standard
> > statistics avoids being misled.

> The paper "Statistical Significance of a Match" by Rémi Coulom, posted
> here on 4 Sept and available at
> 	http://remi.coulom.free.fr/WhoIsBest.zip
> contains a table claiming that with 3 losses and 10 wins the confidence
> level is 97% that the winner of 10 is "better." This result is based upon
> a formula from Bayes.

I don't think Bayes comes into it: this paper sets up a uniform prior for
the null and alternative hypotheses, which is, I think, the standard
incantation for reproducing conventional hypothesis test results in a
Bayesian framework. I could be confused, though: Bayesian stuff usually
makes my head spin.

As for why the numbers are different, this paper appears to be using a
one-sided test, i.e. "What is the likelihood that the first program is
better than the second one?" (the paper's second sentence), rather than
"is one program better than the other one". While there are times a
one-sided test is appropriate, I don't think this is one of them, because
it's also possible that the second program is better than the first. In
particular, if you're going to apply a one-sided test, you're *not*
allowed to pick which side you apply it to after you see the data. Do
that, and you double your type I error rate, which is, I think, what's
effectively happening here. Throw in rounding, and the 97% column in the
table ends up listing results which are significant in a two-sided test at
about the 93% confidence level, which most people would consider pretty 
marginal.
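
To put numbers on the factor of two, a quick check in R on the 3-losses,
10-wins entry (the 0.092 figure reappears below):

pbinom(9, 13, 0.5, lower.tail = FALSE)      # one-sided, P(>= 10 of 13): ~0.046
2 * pbinom(9, 13, 0.5, lower.tail = FALSE)  # two-sided: ~0.092, i.e. ~91% confidence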

The issue can be read directly off the table. Flip a coin twice. Were the
results the same? If so, the table seems to invite us to consider the coin
as biased with 80% confidence. But this happens fully 50% of the time. If
we are doing a one-tailed test, i.e. is the coin biased to heads, we get
HH only 25% of the time, so perhaps we could identify this as biased if
we're willing to get the right answer on unbiased noise only 75% of the
time, which is (almost) the 80% in the table. The downside is that no
matter how many tails we see in a row, we can never identify the coin as
biased to tails.
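
The same arithmetic in R, for the two-toss case:

dbinom(2, 2, 0.5) + dbinom(0, 2, 0.5)  # P(HH or TT), fair coin: 0.5
dbinom(2, 2, 0.5)                      # P(HH) alone, the one-tailed version: 0.25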

> > show a difference for a player who wins 60% (see the Mar 2 email), which
> > is at the outside of the likely differential at 4 stones. For 3 or 5
> > stones, it'd be okay.
> 
> His table also claims 97% Cl at 40 losses and 59 wins, essentially the 60%
> you mention.

Okay, 59 successes, 99 trials, p = 0.0699, which would be significant if
we're willing to accept finding an effect in pure unbiased noise 7% of the
time. Now, assume that the true win rate is 60%. What fraction of the time
will our trial of 99 games be significant at our chosen 93% confidence
level? Our cutoff for significance is 59 successes, which, as the
expectation value, is right in the middle of the distribution of trials.
Half the time it'll be okay, but the other half the time it will be low,
and the trial will be nonsignificant even though the effect is real. This
is type II error, and to reduce it, the sample size needs to be increased
further.
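
In R, both numbers are one-liners (59 wins is the cutoff discussed above,
0.6 the assumed true win rate):

2 * pbinom(58, 99, 0.5, lower.tail = FALSE)  # 59/99 under a fair coin: ~0.07
pbinom(58, 99, 0.6, lower.tail = FALSE)      # a 60% winner reaches 59 wins: ~half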

More examples. Nici quotes David MacKay's analysis of 140 heads in 250
tosses. In this case, p = 0.066, not even marginally significant for most
people. With sufficient external reasons to believe in bias, you might
feel encouraged to take more data, but that's about as far as you could
go. Don suggests not losing sleep over 9 wins in 10 games. p = 0.021 <
0.05, which is significant, albeit not overwhelmingly. Being a 5% kind of
guy, I'd feel duty bound to at least consider the possibility that this is
not due to chance alone, although significance tests by themselves can
never prove anything conclusively. 10/13: p = 0.092, not significant.
11/14: p = 0.057, almost, but not quite. Moreover, these were selected by
hand out of a larger dataset precisely because they were long streaks, so
it doesn't really make sense to analyze them out of that context.

Don also suggests "keep testing until one version [is] ahead by N games,
say 20 or 30." The distance travelled by a random walk goes like sqrt(t),
so this will eventually test positive, regardless of whether there is bias
or not.
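
A small simulation makes the point (a sketch in R; the 20-game margin and the
million-game cap are just illustrative choices):

set.seed(1)
games <- sample(c(-1, 1), 1e6, replace = TRUE)  # fair coin: +1 for a win, -1 for a loss
lead <- cumsum(games)                           # running lead
which(abs(lead) >= 20)[1]  # first game at which one side is 20 ahead;
                           # typically a few hundred games, even with a fair coin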

Here's the code I use to get p-values:

In Octave:
p = 2*(1 - binomial_cdf (max(wins,losses)-1, wins+losses, 0.5))
(Except when wins=losses, in which case p=1. In Matlab, or in recent Octave
with the statistics package, replace binomial_cdf with binocdf.)

In R:
binom.test(wins, wins+losses)
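
For example, the p-values quoted above come straight out of this:

binom.test(140, 250)$p.value  # 0.066
binom.test(9, 10)$p.value     # 0.021
binom.test(10, 13)$p.value    # 0.092
binom.test(11, 14)$p.value    # 0.057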

Happy coin tossing!

doug.


