Yes, go programs are randomized, but they do tend to repeat the same
patterns from game to game. This means you need to play a lot more
games than the independence-based statistical tests would suggest to
be really confident about which is better.
Which of course is a problem. My program is extremely unlikely to
repeat a position even after the first 4 moves (two for each side).
But as you imply, that doesn't mean the same ideas and themes are
not being repeated. To attempt to give the program new things to chew
on, I play the first few moves randomly. I avoid moves to the edges
when doing this in an attempt to avoid some of the weakest moves.
The problem with this is that it tends to equalize opponents. If
version B is stronger than version A, random starting positions will
tend to give a small advantage to version A, because version A will
now sometimes get starting positions that are heavily in its favor.
(Version B gets lucky positions just as often, but presumably it would
have won most of those games anyway.) In practice I'm not sure this
hurts the testing very much, but it is an issue. I think this becomes a bigger
concern as my program gets stronger. I don't use this technique if
I'm testing against a foreign program in order to measure progress.
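
A minimal Python sketch of the random-opening idea above, for
illustration only. The function name random_opening, the 9x9 default
board size, the 0-based coordinates, and the choice of 4 random moves
are all assumptions made for the example, not details of the program
being discussed. It does not check legality, which is harmless for a
handful of interior stones.

```python
import random

def random_opening(size=9, n_moves=4, rng=None):
    """Return n_moves random opening moves, none on the edge of the board.

    Hypothetical helper: board size and move count are assumptions.
    """
    rng = rng or random.Random()
    # Interior points only: coordinates 1 .. size-2 skip the first line.
    interior = [(x, y) for x in range(1, size - 1) for y in range(1, size - 1)]
    # sample() draws without replacement, so no point is played twice.
    points = rng.sample(interior, n_moves)
    # Alternate colors, Black first.
    return [("B" if i % 2 == 0 else "W", p) for i, p in enumerate(points)]
```

Both versions would then start searching from the resulting position.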
I considered making a database of a few thousand random starting
positions and using only those, keeping statistics to determine which
of those positions are grossly unfair and culling them out. I may do
this eventually.
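
One way the bookkeeping for such a database might look, as a rough
Python sketch. Everything here is an assumption for illustration: the
class name OpeningStats, tracking results by color rather than by
program version, and the thresholds (at least 20 games, one color
winning 90% or more) used to call a position "grossly unfair".

```python
from collections import defaultdict

class OpeningStats:
    """Per-position win counts, used to cull lopsided starting positions."""

    def __init__(self, min_games=20, max_bias=0.9):
        self.min_games = min_games   # hypothetical: games needed before judging
        self.max_bias = max_bias     # hypothetical: win share that counts as unfair
        self.games = defaultdict(int)
        self.black_wins = defaultdict(int)

    def record(self, position_id, black_won):
        """Store the result of one game started from position_id."""
        self.games[position_id] += 1
        if black_won:
            self.black_wins[position_id] += 1

    def is_unfair(self, position_id):
        """True if one color has won at least max_bias of enough games."""
        n = self.games[position_id]
        if n < self.min_games:
            return False  # too few games to judge
        b = self.black_wins[position_id]
        return max(b, n - b) / n >= self.max_bias

    def fair_positions(self, position_ids):
        """Filter a list of position ids down to those not flagged unfair."""
        return [p for p in position_ids if not self.is_unfair(p)]
```

Judging fairness by color only makes sense if each position is played
with both versions taking each side; otherwise a lopsided count may
just reflect the strength difference being measured.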
There would seem to be an even better way of solving this problem: