[computer-go] A New Razor
>
>>> There are experiments which consistently show that even when a simple
>>> algorithm such as a neural net or decision tree can perfectly learn a
>>> concept on the training data, aggregated versions of the same
>>> algorithm will nonetheless do better on the test set.
>
>>> There are proofs that "explain" these results by claiming that a
>>> representation exists for the neural network and/or decision tree
>>> which represents the concept learned by the aggregated version, but
>>> that representation is computationally infeasible to find.
>
>>> In any case the actual, more complicated explanation fits the data as
>>> well as the actual simple version, but the more complicated one
>>> generalizes better and is therefore preferred. If Occam's Razor
>>> cannot explain this result without a great deal of extra explanation,
>>> then we need a new theory.
>
> Without citations I'm not 100% sure what you are referring to, but here
> is a suggestion. How complex an explanation is, is not obvious -- which is
> one of the reasons it is useful to study the problem
> formally. You may say Newton's laws are simple and explain a vast
> array of phenomena, but how do you quantify that? They appeal
> to other concepts in the mind.
>
> Likewise, if you set out to train a million weight neural net,
> but use a training algorithm that will basically only explore a
> tiny fraction of the space, you may effectively have little more
> flexibility in the hypotheses you will produce than a linear fit,
> so your explanation, while at first appearing complex, will
> be very simple and generalize well. Chapter 6 of What is Thought?
> discusses this phenomenon, referring to research that indicates
> this is why large dumb neural nets often generalize remarkably
> well. A 1000 weight neural net trained by backprop
> may have vastly less effective expressive power
> (and thus more generalization ability on small data sets)
> than a 10 weight polynomial trained exactly using linear
> programming on the coefficients.
>
> Likewise, the vote of many slightly randomized predictors may have much
> less complexity than even a single such predictor, by washing over much
> of the expressive power of the predictors.
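
That point above about effective expressive power is easy to see in a toy
experiment. Here is a rough sketch of my own (nothing from the book): ten
polynomial coefficients fitted exactly by least squares, rather than
linear programming, against a 1000-unit net trained briefly with
scikit-learn's MLPRegressor on a small noisy sine sample. The numbers
vary with the seed and the data, but the point is that the briefly
trained net can reach far fewer hypotheses than its weight count suggests.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

# Small noisy training sample, larger clean test grid.
x_train = rng.uniform(0.0, 1.0, 15)
y_train = target(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = target(x_test)

# Ten coefficients (a degree-9 polynomial) fitted exactly by least squares.
coeffs = np.polyfit(x_train, y_train, deg=9)
poly_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# A much wider net, trained only briefly by backprop (adam here); the
# hypotheses it actually reaches are far fewer than its weight count.
net = MLPRegressor(hidden_layer_sizes=(1000,), max_iter=300, random_state=0)
net.fit(x_train.reshape(-1, 1), y_train)
net_mse = np.mean((net.predict(x_test.reshape(-1, 1)) - y_test) ** 2)

print("degree-9 polynomial test MSE:", round(poly_mse, 3))
print("1000-unit MLP test MSE:      ", round(net_mse, 3))
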
I hesitate to give the citation, because I read the article, looked at
their data, and came to different conclusions; however, I have seen
presentations on boosting and aggregation (bagging) which showed the same
improvements.
"Boosting the Margin A New Explanation for the Effectiveness of Voting
Methods", Robert E Schapire, Yoav Freund, Peter Bartlett, Wee Sun Lee. If
you would like more references, I will find some more.
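
For flavour, here is a toy version of the kind of comparison those
presentations make -- just an illustrative sketch with scikit-learn on
synthetic data, not the experiment from the paper: a single unpruned
decision tree fits the training set perfectly, yet a bagged vote of such
trees usually does better on the held-out half.

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class problem; half for training, half held out.
X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

# One unpruned tree: typically 100% accurate on the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A bagged vote of 100 such trees, each grown on a bootstrap resample.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=0).fit(X_tr, y_tr)

print("single tree, train accuracy:", tree.score(X_tr, y_tr))
print("single tree, test accuracy: ", tree.score(X_te, y_te))
print("bagged vote, test accuracy: ", bag.score(X_te, y_te))
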
What you are referring to above is the ability of a machine to "shatter" a
set of examples. Given an arbitrary set of data points, can the machine
realize every possible assignment of classifications to those points? If
so, the machine shatters that set. The size of the largest set that can
be shattered by a particular machine is called its Vapnik-Chervonenkis
dimension (VC dimension). One cannot apply this analysis to neural
networks or Bayesian methods, because these algorithms have an infinite
VC dimension: they can potentially represent an infinite number of
hypotheses, and yet still might not represent the hypothesis that is
actually in the data.
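
To pin the definition down, here is a small sketch of my own using the
simplest textbook class, intervals [a, b] on the real line (VC dimension
2); it just enumerates every labelling of a point set and asks whether
some interval realizes it:

from itertools import product

def interval_realizes(points, labels):
    """True if some interval [a, b] labels exactly the positive points."""
    positives = [x for x, lab in zip(points, labels) if lab == 1]
    if not positives:
        return True                     # an empty interval works
    lo, hi = min(positives), max(positives)
    # Realizable iff no negative point lies between the extreme positives.
    return all(not (lo <= x <= hi)
               for x, lab in zip(points, labels) if lab == 0)

def shattered(points):
    """True if every labelling of the points is realized by some interval."""
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([0.2, 0.8]))       # True: any two distinct points
print(shattered([0.2, 0.5, 0.8]))  # False: the labelling (+, -, +) fails
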
Furthermore, the margin is a term from the support vector machine
community which has been generalized by the machine learning community to
mean the space between data points that have been classified into opposing
categories. The idea is that one wants to draw the separating hyperplane
as close to the halfway point, in Euclidean terms, as possible. However,
inductive bias being what it is, there is no guarantee that this is the
best approximation of the target hypothesis. Neural networks are usually
satisfied with a hyperplane that lies anywhere in the margin; by bagging
these algorithms one can move the resulting boundary closer to halfway
between the opposing classifications. SVMs (often with Gaussian kernels)
place the separating hyperplane as close to that halfway point as
possible.
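
As a rough illustration -- again only a toy sketch, and with a linear
kernel rather than a Gaussian one -- on a separable two-blob data set a
perceptron stops at whatever separating hyperplane it happens to find,
while a nearly hard-margin linear SVM pushes the hyperplane toward the
middle of the gap, which shows up as a larger geometric margin:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

# Two well-separated blobs, labelled 0/1.
X, y = make_blobs(n_samples=100, centers=[[-4, -4], [4, 4]],
                  cluster_std=1.0, random_state=0)

def geometric_margin(w, b, X, y):
    # Smallest signed distance from any point to the hyperplane w.x + b = 0
    # (positive only if every point is on its correct side).
    signs = np.where(y == 1, 1.0, -1.0)
    return float(np.min(signs * (X @ w + b) / np.linalg.norm(w)))

percep = Perceptron(random_state=0).fit(X, y)
svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ a hard margin

print("perceptron margin:",
      geometric_margin(percep.coef_[0], percep.intercept_[0], X, y))
print("linear SVM margin:",
      geometric_margin(svm.coef_[0], svm.intercept_[0], X, y))
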
Sincerely,
Robin Kramer
_______________________________________________
computer-go mailing list
computer-go@xxxxxxxxxxxxxxxxx
http://computer-go.org/mailman/listinfo/computer-go