Another idea is to symmetrize the training data by presenting each
pattern under all reflections and rotations. This might eventually lead
to a better evaluation function, but in practice it mainly slows learning
down quite a bit. We didn't look into it very closely, but our hypothesis
was that the symmetric data makes it much harder for the hidden units to
break symmetry, something which they have to do in the learning process.
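As a concrete sketch of this kind of symmetrization (our own illustration,
assuming square positions stored as NumPy arrays; the `symmetrize` helper is
hypothetical, not part of the original setup), each pattern can be expanded
into its eight dihedral images, the four rotations each with and without a
mirror reflection:

```python
import numpy as np

def symmetrize(board: np.ndarray) -> list[np.ndarray]:
    """Return the 8 symmetry images of a square board:
    four rotations, each with and without a reflection."""
    images = []
    for k in range(4):                     # rotate by 0, 90, 180, 270 degrees
        rotated = np.rot90(board, k)
        images.append(rotated)
        images.append(np.fliplr(rotated))  # mirror image of this rotation
    return images

# Example: one 3x3 pattern is expanded into 8 training patterns.
pattern = np.arange(9).reshape(3, 3)
augmented = symmetrize(pattern)
assert len(augmented) == 8
```

Presenting all eight images of every pattern multiplies the training set
eightfold, which is consistent with the slowdown noted above.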
Backprop leads to a local optimum, and which one it reaches may depend on
the order in which the samples are presented. Even in the methodology
proposed above it may happen that different corners of the table reach
different weight values, either because they converge to different points
or simply because they approach the same peak at different speeds or from
different directions.
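This order dependence shows up even in a toy setting. The following sketch
(our own illustration, not from the original experiments) runs per-sample
backprop on the same small sigmoid network twice, from identical initial
weights and on identical samples, differing only in presentation order; the
final weights typically end up different:

```python
import numpy as np

def train(order: list[int], epochs: int = 2000, lr: float = 0.5) -> np.ndarray:
    """Per-sample backprop on XOR with a 2-2-1 sigmoid network,
    visiting the four samples in the given fixed order."""
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0.0, 1.0, 1.0, 0.0])
    rng = np.random.default_rng(0)          # same initial weights for both runs
    W1 = rng.normal(0, 0.5, (2, 2)); b1 = np.zeros(2)
    W2 = rng.normal(0, 0.5, 2);      b2 = 0.0
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for i in order:                      # fixed order: no shuffling,
            h = sig(X[i] @ W1 + b1)          # since the order is the point
            out = sig(h @ W2 + b2)
            d_out = (out - y[i]) * out * (1 - out)
            d_h = d_out * W2 * h * (1 - h)
            W2 -= lr * d_out * h; b2 -= lr * d_out
            W1 -= lr * np.outer(X[i], d_h); b1 -= lr * d_h
    return np.concatenate([W1.ravel(), b1, W2, [b2]])

w_a = train([0, 1, 2, 3])
w_b = train([3, 2, 1, 0])
print(np.max(np.abs(w_a - w_b)))  # typically nonzero: same data, different weights
```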