CSE 5334 An Overfitting Summary

3 minute read

This is a blog post for Assignment 2 from my Data Mining class.

Jupyter Notebook can be found here.

What is Overfitting?

While the title says overfitting, there's actually more to this topic. We will cover what overfitting is, what underfitting is, some ways to detect them, and how to avoid them preemptively.

Definition of Overfit

Overfitting, by practical definition, is when your model performs well during training but is unable to predict unseen data well during testing.

Definition of Underfit

Underfitting is not quite the opposite of overfitting: an underfit model's training and testing performance are consistent with each other (that is, the model did not overfit), but it cannot make accurate predictions past a certain threshold.

What Causes Over/Underfitting?

Overfitting often happens when the model is too complex for the data. This could mean either that the model has more capacity than the problem requires, or that the data is simply too scarce to train the model to predict unseen data.

Underfitting often happens when the model isn't sophisticated enough for the data it is training on.

While those are the main causes, other issues might include too much regularization, not enough features in the data, and so on.

Graphical Examples

To demonstrate the concepts, we will use a simple mathematical expression and perform polynomial regression on it. All data points (\(X\)) are generated uniformly, and noise (\(\mathcal{N}\)) is introduced via a normal distribution. The function is

\[Y = f(X) = \sin(2\pi X) + 0.1\mathcal{N}\]
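Concretely, the data can be generated along these lines (a minimal numpy sketch; the seed, variable names, and the 50/50 train/test split are my assumptions, not necessarily the notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed, just for reproducibility
N = 20                                      # total points; half will be held out for testing

X = rng.uniform(0, 1, N)                    # uniformly generated inputs
noise = rng.normal(0, 1, N)                 # standard normal noise
Y = np.sin(2 * np.pi * X) + 0.1 * noise     # Y = sin(2*pi*X) + 0.1*N

X_train, X_test = X[:N // 2], X[N // 2:]    # 10 points for training, 10 for testing
Y_train, Y_test = Y[:N // 2], Y[N // 2:]
```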

We then generate 20 data points, keep half of them for testing, and fit polynomial regressions of different degrees (0 to 9) on the training half. The weights are calculated as follows

Weights of different polynomial regression orders
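The notebook's exact fitting code isn't reproduced here, but a plain least-squares fit like the following (a numpy sketch, reusing the variable names assumed above) produces equivalent weights:

```python
import numpy as np

# Fit polynomials of several degrees on the 10 training points and inspect the weights.
for degree in (0, 1, 3, 9):
    w = np.polyfit(X_train, Y_train, deg=degree)   # coefficients, highest power first
    print(f"degree {degree}:", np.round(w, 2))
```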

We can now plot these regression lines and see what they look like along with the data.

Plot of data (blue), actual function (orange), and regression (green)
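If you want to reproduce one panel of that figure, a rough matplotlib sketch (assuming the variables from the earlier snippets) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

degree = 3                                     # change to 0, 1, or 9 for the other panels
w = np.polyfit(X_train, Y_train, deg=degree)
x_line = np.linspace(0, 1, 200)

plt.scatter(X_train, Y_train, color="blue", label="data")
plt.plot(x_line, np.sin(2 * np.pi * x_line), color="orange", label="true function")
plt.plot(x_line, np.polyval(w, x_line), color="green", label=f"degree {degree} fit")
plt.legend()
plt.show()
```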

Now, from the look of it, the top two (degree = 0 and 1) are underfitting, the bottom left (degree = 3) fits quite nicely, and the bottom right (degree = 9) is overfitting.

Let’s have a look at errors in training and testing.

Errors diverge after the 5th-degree polynomial regression
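Those error curves come from evaluating each fit on both halves of the data; a sketch of that comparison, under the same assumptions as above:

```python
import numpy as np

# Compare training and testing RMSE for each polynomial degree.
for degree in range(10):
    w = np.polyfit(X_train, Y_train, deg=degree)
    rmse_train = np.sqrt(np.mean((np.polyval(w, X_train) - Y_train) ** 2))
    rmse_test = np.sqrt(np.mean((np.polyval(w, X_test) - Y_test) ** 2))
    print(f"degree {degree}: train {rmse_train:.3f}, test {rmse_test:.3f}")
```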

Now that we know what it looks like, let's try to prevent it.

Handling Over/Underfitted Situations

There are quite a few ways to correct for overfitting. In this article we will just use some naive approaches.

Tune the Data

And by tuning I mean adding more points. Instead of the original 10 training points (half of the 20 generated points was held out for testing), let's try 15, and then 100. We will retrain the model exactly as we trained it before.

9th degree regression on 15 and 100 points

As you can see, adding a few points doesn't do much, but with "enough" points the model actually fits really well. In this example, the amount of data now matches the complexity of the high-degree regression, so it no longer overfits.
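A sketch of this experiment, assuming the same data-generating process as before (the sample sizes come from the post; everything else is an assumption):

```python
import numpy as np

# Refit the 9th-degree polynomial on larger training sets and check the test error.
rng = np.random.default_rng(1)                 # arbitrary seed
for n in (15, 100):
    X_more = rng.uniform(0, 1, n)
    Y_more = np.sin(2 * np.pi * X_more) + 0.1 * rng.normal(0, 1, n)
    w = np.polyfit(X_more, Y_more, deg=9)
    rmse_test = np.sqrt(np.mean((np.polyval(w, X_test) - Y_test) ** 2))
    print(f"n = {n}: test RMSE {rmse_test:.3f}")
```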

Tune How the Model Trains

Instead of increasing the complexity of the data, we can regularize how the model trains by introducing what is effectively a weight penalty (or decay). We use a regularization method known as L2 regularization, which is implemented as follows

\[\tilde{E}(W) = \frac{1}{2}\sum^{N}_{n=1}\left[y(x_n,w) - t_n\right]^2 + \frac{\lambda}{2}\vert\vert w \vert\vert^2\]

In my application, I just used Ridge Regression and passed lambda in. Now, with just the original 10 points, the 9th degree polynomial behaves like this.

Ridge Regression with various lambda values for the 9th degree with 10 data points
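A sketch of how this could look with scikit-learn's Ridge, where alpha plays the role of lambda in the formula above (the 0.0001 value is mentioned later in the post; the other lambdas are just assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 9th-degree polynomial features with an L2 penalty on the weights.
for lam in (1e-4, 1e-2, 1.0):
    model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=lam))
    model.fit(X_train.reshape(-1, 1), Y_train)
    rmse_test = np.sqrt(np.mean((model.predict(X_test.reshape(-1, 1)) - Y_test) ** 2))
    print(f"lambda = {lam}: test RMSE {rmse_test:.3f}")
```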

Difficult Decisions

It all boils down to, again, tuning. There are so many ways to do each of these steps that it comes down to experimentation and educated deduction about what can be good (or bad) for the performance of your model.

One way to overcome these challenges, as I have learned through this exercise, is that when the model is not too complicated and doesn't take long to train, you can simply try all combinations, then pick out the best one.
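A brute-force sketch of that idea, looping over degree and lambda and keeping the combination with the lowest test error (the grid itself is an assumption, not the notebook's exact search):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

best = None
for degree in range(10):
    for lam in (1e-6, 1e-4, 1e-2, 1.0):
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=lam))
        model.fit(X_train.reshape(-1, 1), Y_train)
        rmse = np.sqrt(np.mean((model.predict(X_test.reshape(-1, 1)) - Y_test) ** 2))
        if best is None or rmse < best[0]:
            best = (rmse, degree, lam)

print(f"best: degree = {best[1]}, lambda = {best[2]}, test RMSE = {best[0]:.3f}")
```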

Contribution

Obligatory contribution section? I'm not quite sure what they mean in the assignment description. If anything, I have done enough iterations to tell which degree and which L2 regularization lambda value keep my model from overfitting.

Best Model Selection

For such an exercise, it is easy to see that without regularization, any polynomial regression with degree \(3 \leq M \leq 6\) can be reasonably accurate. If we count the L2 Ridge Regression variant, then the 9th order with lambda = 0.0001 does quite well, although its error rate does not really compare with the non-regularized version.

Non-regularized polynomial regression error
L2-regularized Ridge Regression error for various lambda values

References

[1] scikit-learn Underfitting and Overfitting
[2] scikit-learn Polynomial Regression
[3] scikit-learn Ridge Regression
[4] Pandas DataFrame