There is still a lingering belief from classical machine learning that bigger models overfit and therefore generalize poorly, as described by the bias-variance trade-off. In the modern deep learning regime this no longer holds: phenomena like double descent show empirically that higher-capacity models can generalize better than lower-capacity ones. Why this happens remains counterintuitive to many people, so I aim to address it here:
1. In an overparameterized model, many different parameter configurations achieve essentially the same loss, so the model can keep rearranging its internal structure without its loss changing.
2. In general, the more dimensions you add, the more likely it is that an apparent local minimum is not a true local minimum: there will usually be some dimension that slopes downward, allowing gradient descent to escape to a lower minimum (see the numerical sketch right after this list).
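To make point 2 concrete, here is a minimal sketch of my own (not from the original post): model the Hessian at a random critical point as a random symmetric matrix and estimate how often all of its eigenvalues are positive, which is the condition for a true local minimum. The dimensions and trial counts below are arbitrary illustration choices.

```python
# Minimal sketch: estimate how often a random critical point is a true local
# minimum, by checking whether all eigenvalues of a random symmetric
# (Hessian-like) matrix are positive.
import numpy as np

rng = np.random.default_rng(0)

def fraction_local_minima(dim: int, trials: int = 2000) -> float:
    """Estimate P(all Hessian eigenvalues > 0) for random symmetric matrices."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2  # symmetrize to get a valid Hessian model
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for dim in [1, 2, 4, 8, 16]:
    print(f"dim={dim:>2}  fraction that are minima: {fraction_local_minima(dim):.4f}")
# The fraction drops rapidly with dimension: almost every critical point in a
# high-dimensional landscape has at least one downward-sloping direction.
```

Under this toy model, the chance that a critical point has no escape direction shrinks quickly as dimension grows, which is the intuition behind point 2.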
Now, points 1 and 2 are not disconnected; they are two sides of the same coin. While the model tries out different structures that do not affect its loss (point 1), gradient descent roams around the local minimum without changing the loss (point 2). At some point it may find a path out by discovering a dimension that slopes downward, a 'dimensional alleyway' out of the local minimum, so to speak. This traversal out of the local minimum to a lower point corresponds to the model finding a simpler solution, i.e., the generalized structure.
(Even though the generalized structure might not reduce the raw training loss directly, the regularization penalty added on top of the loss surface ensures that the generalized structure has a lower total loss than memorization.)
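As a rough illustration of that parenthetical (the numbers and names here are my own, not the author's): if a memorizing solution and a generalizing solution fit the training data equally well, a weight-decay penalty breaks the tie in favour of the lower-norm, simpler one.

```python
# Toy sketch: two hypothetical solutions with identical raw training loss,
# separated only by the L2 (weight decay) penalty on their parameters.
import numpy as np

def total_loss(weights: np.ndarray, data_loss: float, weight_decay: float = 1e-2) -> float:
    """Regularized objective: raw training loss plus an L2 penalty on the weights."""
    return data_loss + weight_decay * np.sum(weights ** 2)

# Both solutions fit the training set perfectly (data loss 0), but the
# 'memorizing' one uses much larger weights than the 'generalizing' one.
memorizing = np.full(100, 3.0)     # large-norm solution that memorizes
generalizing = np.full(100, 0.3)   # small-norm, simpler solution

print("memorizing  :", total_loss(memorizing, data_loss=0.0))   # 9.0
print("generalizing:", total_loss(generalizing, data_loss=0.0)) # 0.09
# With identical data loss, the regularized total is strictly lower for the
# simpler solution, so once gradient descent finds a path to it, it stays there.
```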
My apologies if the text is a bit hard to read. Let me know if there is demand for a video that explains this topic more clearly. I would upload it to https://www.youtube.com/@paperstoAGI
submitted by /u/PianistWinter8293