Re: [computer-go] Statistical Significance (was: SlugGo v.s. ManyFaces, newest data)

To: computer-go@xxxxxxxxxxxxxxx
Subject: Re: [computer-go] Statistical Significance (was: SlugGo v.s. ManyFaces, newest data)
From: Don Dailey <drd@xxxxxxx>
Date: Tue, 7 Sep 2004 10:49:57 -0400
Cc: ridgway@xxxxxxxxxxxx,computer-go@xxxxxxxxxxxxxxx

   > contains a table claiming that with 3 losses and 10 wins the confidence
   > level is 97% that the winner of 10 is "better." This result is based 
   > upon a formula from Bayes. I am not enough of a statistician to know.

I have an interesting slant  on this based on a practical observation.

If you try a new idea or make an experimental change to an existing
program and it tests extremely well after 10 or 20 games, you should
trust the results even less than the purely statistical considerations
would tell you. There is a great deal of "empirical" evidence that
any given NEW idea will, if anything, test slightly negative. Even
though it's difficult to scientifically factor this observation into
the statistics, it's a very real phenomenon. So if I do something to
my program and it tests 9-1, for instance, I just laugh.

This would not be a consideration given two arbitrarily chosen
programs with no previous knowledge of either.

If I started from scratch and wrote a new program that used entirely
different ideas and techniques, I would get more excited about a 9-1
result, still keeping in mind that even 9-1 out of 10 samples isn't
yet anything worth losing sleep over.
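
One way to put a number on this skepticism is to fold it into a
Bayesian prior. The sketch below is only an illustration: it assumes
independent games with no draws, and the "skeptical" prior
Beta(20, 22), centered slightly below 50%, is a hypothetical choice
and not something measured. The function name prob_better is likewise
made up for this example.

```python
from math import comb

def prob_better(wins, losses, prior_a=1, prior_b=1):
    """P(true win rate > 0.5) under a Beta(prior_a, prior_b) prior.

    Integer parameters only, so the Beta tail is an exact binomial sum.
    prior_a = prior_b = 1 is the uniform ("no previous knowledge") prior.
    """
    a = prior_a + wins       # posterior is Beta(a, b)
    b = prior_b + losses
    n = a + b - 1
    # For integer a, b: P(p > 1/2) = P(Binomial(n, 1/2) <= a - 1).
    return sum(comb(n, j) for j in range(a)) / 2 ** n

flat = prob_better(9, 1)                                # no prior knowledge
skeptical = prob_better(9, 1, prior_a=20, prior_b=22)   # hypothetical prior
print(f"uniform prior:   {flat:.3f}")
print(f"skeptical prior: {skeptical:.3f}")
```

With a uniform prior the 9-1 result looks close to certain; with the
assumed skeptical prior the same ten games leave the probability of a
real improvement at roughly 0.8, which is closer to how such a result
should probably be read.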

Of  course I'm  not being  critical of  the results  reported  so far,
especially  since  they  seem  to  be  reported  objectively  with  no
extravagant claims being made yet.   I think the results as they stand
give a lot of reason for optimism.   One thing you can say for sure is
that it is much more likely this is an improvement than it is not.

There is something I often do when deciding whether to keep a version
I suspect is stronger. I use a quick and dirty rule, which is to
keep testing until one version is ahead by N games, say 20 or 30.
Let's say 20 for example. If a version is significantly stronger it
will get this lead quickly and you can have some confidence that it is
probably not weaker! If the version is only slightly better, it's
unlikely to get behind by 20 games and will eventually get ahead. If
it takes a lot of games to get ahead by 20 games, you can be sure that
at worst it can't be significantly weaker and more than likely is at
least slightly stronger. You can do this more scientifically with
statistics, but this works pretty well in practice.
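
The "ahead by N games" rule can be checked against the classic
gambler's-ruin formulas. This is a rough sketch under assumptions the
rule itself does not state: every game is independent, the candidate
wins each game with a fixed probability p, and draws are ignored. The
function name lead_rule is made up for this example.

```python
def lead_rule(p, lead):
    """Test until one version is `lead` games ahead.

    p is the candidate's (assumed fixed) win probability per game.
    Returns (chance the candidate is the one that gets ahead,
             expected number of games until the test stops).
    """
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        # Evenly matched: a fair random walk between -lead and +lead.
        return 0.5, float(lead * lead)
    r = (q / p) ** lead
    p_ahead = 1.0 / (1.0 + r)
    # Wald's identity: E[games] = E[final lead] / (p - q).
    games = lead * (1.0 - r) / ((1.0 + r) * (p - q))
    return p_ahead, games

for p in (0.45, 0.50, 0.52, 0.55, 0.60):
    chance, games = lead_rule(p, 20)
    print(f"p={p:.2f}  candidate ahead: {chance:.3f}  about {games:.0f} games")
```

Under these assumptions a version that wins 55% of its games reaches a
20-game lead about 98% of the time and needs a bit under 200 games on
average, while two equal versions need about 400 games and end up on
either side with equal probability.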


- Don






   Date: Tue, 7 Sep 2004 14:09:28 +0200 (CEST)
   From: "Nicol N. Schraudolph" <compgo@xxxxxxxxxxxxxxxxx>

   > 	http://remi.coulom.free.fr/WhoIsBest.zip
   >
   > contains a table claiming that with 3 losses and 10 wins the confidence
   > level is 97% that the winner of 10 is "better." This result is based 
   > upon a formula from Bayes. I am not enough of a statistician to know.

   While the 97% confidence level from the above paper is technically correct,
   it is very easy to overinterpret such numbers.  A Bayesian analysis that
   leads to more conservative results that (to me, unlike confidence levels)
   intuitively feel "right" is in terms of likelihood ratios, as can be seen
   in the one-page paper

       http://www.cs.toronto.edu/~mackay/euro.pdf

   The author (a well-known Bayesian statistician) analyses coin tosses that
   appear "suspicious" at a 93% confidence level in terms of likelihood
   ratios.  Replace "coin" with "go", "toss" with "game", "heads" with
   "win", and "tails" with "loss", and you can directly apply his analysis.
   What I get for the numbers above (with a uniform prior) is a likelihood
   ratio of just 2:1 in favor of the hypothesis "one program is better"
   ("the coin is biased") over "both programs are equally strong" ("the
   coin is fair").

   This should be considered *very weak* evidence in favor of one program
   being better.  For a convincing result one would like to see a likelihood
   ratio at least an order of magnitude higher.  So don't put too much
   confidence in confidence levels!
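
The two numbers discussed above can be reproduced in a few lines. This
sketch assumes independent games and a uniform prior on the win rate
under the "one program is better" hypothesis, as in the coin-toss
analysis; whether the 97% table was computed exactly this way is an
assumption. The function names are made up for this example.

```python
from math import comb

def likelihood_ratio(wins, losses):
    """Evidence for "win rate unknown (uniform prior)" over "win rate = 1/2"."""
    n = wins + losses
    # Probability of the observed win/loss sequence when the win rate is
    # integrated over a uniform prior: wins! * losses! / (n + 1)!
    biased = 1 / ((n + 1) * comb(n, wins))
    # Probability of the same sequence for two equally strong programs.
    fair = 0.5 ** n
    return biased / fair

def confidence_better(wins, losses):
    """P(win rate > 1/2 | data) under a uniform prior (exact binomial sum)."""
    n = wins + losses + 1
    return sum(comb(n, j) for j in range(wins + 1)) / 2 ** n

print(f"likelihood ratio: {likelihood_ratio(10, 3):.2f}")   # about 2
print(f"confidence level: {confidence_better(10, 3):.3f}")  # about 0.97
```

The same 10-3 record gives both a confidence level near 97% and a
likelihood ratio of only about 2:1, which is the gap between the two
readings described above.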

   Regards,

   - nic

   -- 
       Dr. Nicol N. Schraudolph                 http://n.schraudolph.org/
       Sonnenkopfweg 17
       D-87527 Sonthofen, Germany

   _______________________________________________
   computer-go mailing list
   computer-go@xxxxxxxxxxxxxxxxx
   http://www.computer-go.org/mailman/listinfo/computer-go/
