# Overfitting

# Errors

### Measuring Errors

• Goal of regression: find $g(\theta, X) : X \rightarrow Y$ based on $\left\{(x^1,y^1), ..., (x^n,y^n)\right\}$ to predict $y$ for any example in $\mathcal{U}$
• Training error: measured on the training set
  • Not a good indicator of the error on new examples

### Measure error outside the training set

• Split the data into a training set and a test set; the test set is used to estimate the true error
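
The gap between the two errors can be seen in a minimal sketch. It assumes hypothetical synthetic data (a noisy sine curve), a 40/20 split and polynomial models; none of these come from the original example. The training error can only go down as the degree grows, while the error on the held-out points typically does not follow it.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical synthetic data: noisy samples of a smooth function on [0, 1]
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=60)

# Random split: 40 points to fit the parameters, 20 held out for testing
order = rng.permutation(x.size)
train_idx, test_idx = order[:40], order[40:]

def mse(coefs, xs, ys):
    """Mean squared error of the polynomial `coefs` on (xs, ys)."""
    return np.mean((ys - np.polyval(coefs, xs)) ** 2)

train_errs, test_errs = {}, {}
for degree in (1, 3, 9):
    # Parameters are fitted on the training points only
    coefs = np.polyfit(x[train_idx], y[train_idx], degree)
    train_errs[degree] = mse(coefs, x[train_idx], y[train_idx])
    test_errs[degree] = mse(coefs, x[test_idx], y[test_idx])
    print(degree, train_errs[degree], test_errs[degree])
```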

### Measurable Errors

• Training error: error measured on the training data, which is also used to fit the parameters.
  • Also called empirical error or sample error
• Test error: error measured on the test data, used to estimate the true error.

### Unmeasurable Errors

• True error: expected error over all of $\mathcal{U}$.
• Generalization error: difference between the true error and the training error
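
As a sketch, these quantities can be written out under the assumption of a squared-error loss (the loss used in the ridge regression cost in this section). The symbols $E_{train}$, $E_{test}$, $E_{true}$ and $E_{gen}$, the test examples $(\tilde{x}^t, \tilde{y}^t)$ and the test set size $n'$ are notation introduced here for illustration, and treating $\mathcal{U}$ as a distribution over examples is also an assumption.

$$E_{train}(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y^t - g(x^t|\theta) \right]^2$$

$$E_{test}(\theta) = \frac{1}{n'} \sum_{t=1}^{n'} \left[ \tilde{y}^t - g(\tilde{x}^t|\theta) \right]^2$$

$$E_{true}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{U}} \left[ \left( y - g(x|\theta) \right)^2 \right]$$

$$E_{gen}(\theta) = E_{true}(\theta) - E_{train}(\theta)$$

Only the first two can be computed from data; $E_{test}$ serves as the estimate of $E_{true}$.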

# Validation and Selection

### Choosing the best test error makes the estimate biased

• We cannot use the test error to choose the best model: the lowest of several test errors is an optimistic estimate of the true error
• Solution: we need three sets
  • Training set to fit the parameters and choose the best hypothesis in each hypothesis class.
  • Validation set to choose the best hypothesis class (model).
  • Test set to estimate the true error of the final hypothesis.
• Note: we'll see better ways of doing this. For now it's the idea that matters.

# Regularization

### Two ways of dealing with overfitting:

• Model Selection: Pick model that best predicts outside training set.
• Regularization: Change learning algorithm to avoid overfitting.

### Regularization: modify training to reduce overfitting

• Example: add a penalty that grows with the magnitude of the parameters.
• Ridge regression, where $\lambda \geq 0$ sets the strength of the penalty: $$J(\theta) = \sum_{t=1}^{n} \left[ y^t - g(x^t|\theta) \right]^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$
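
For a model that is linear in the parameters, this cost has a closed-form minimiser, $\theta = (X^T X + \lambda I)^{-1} X^T y$. The sketch below is one way to compute it with NumPy, assuming no intercept term and hypothetical synthetic data; the helper name `ridge_fit` is introduced here for illustration only. It prints how the squared magnitude of the parameters shrinks as $\lambda$ grows.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise sum((y - X @ theta)**2) + lam * sum(theta**2), no intercept."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(1)

# Hypothetical data: a straight line plus noise, fitted with powers x..x^5
x = rng.uniform(0.0, 1.0, 30)
X = np.column_stack([x ** p for p in range(1, 6)])
y = 2.0 * x + rng.normal(0.0, 0.1, 30)

lams = [1e-4, 1e-2, 1.0]
thetas = [ridge_fit(X, y, lam) for lam in lams]
sq_norms = [float(np.sum(theta ** 2)) for theta in thetas]

for lam, sq_norm in zip(lams, sq_norms):
    # A larger lambda penalises large parameters more strongly
    print(lam, sq_norm)
```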

# Example

### Data on Life Expectancy vs GDP for 2003

• http://www.indexmundi.com

GDP	Life expectancy (years)
1200	31.3
9000	32.26
800	35.25
3000	36.94
1900	36.96
600	37.98
...


### Numerical problems

• Large differences in value magnitudes (or even just large values) can cause problems:
  • Numerical instability
  • Difficulty finding the parameters
  • Poor reproducibility of the approach

### One simple solution:

• Rescale values to 0..1
• (We'll see more on this later)
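
A minimal sketch of why rescaling helps, assuming a cubic polynomial model and using only the six GDP values from the sample rows above. It compares the condition number of the design matrix before and after min-max rescaling; a very large condition number signals that solving for the parameters is numerically fragile. The helper name `design_matrix` is hypothetical.

```python
import numpy as np

# The six GDP values shown in the sample rows
x = np.array([1200.0, 9000.0, 800.0, 3000.0, 1900.0, 600.0])

def design_matrix(xs, degree):
    """Matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(xs, degree + 1, increasing=True)

# Min-max rescaling maps the smallest value to 0 and the largest to 1
x_scaled = (x - x.min()) / (x.max() - x.min())

cond_raw = np.linalg.cond(design_matrix(x, 3))
cond_scaled = np.linalg.cond(design_matrix(x_scaled, 3))
print(cond_raw, cond_scaled)
```

Dividing each column by its maximum, as the code in this example does, is a simpler variant that maps positive values into the same range without shifting them.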

### Prepare data

• Split the data.

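    import numpy as np

    def random_split(data, test_points):
        """Return two matrices, splitting the rows of data at random.

        The second matrix gets test_points rows, the first gets the rest.
        """
        # Shuffled ranks 0..n-1: each row receives a random rank
        ranks = np.arange(data.shape[0])
        np.random.shuffle(ranks)
        train = data[ranks >= test_points, :]
        test = data[ranks < test_points, :]
        return train, test

    # data is assumed to be already loaded: X (GDP) in column 0,
    # Y (life expectancy) in column 1
    # Rescale each column by its maximum
    scale = np.max(data, axis=0)
    data = data / scale
    # Hold out 90 points, then split them: 45 for validation, 45 for testing
    train, temp = random_split(data, 90)
    valid, test = random_split(temp, 45)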


### Find the best model (hypothesis class)


    def mean_square_error(data, coefs):
        """Return the mean squared error of the polynomial coefs.

        X in the first column, Y in the second column.
        """
        pred = np.polyval(coefs, data[:, 0])
        error = np.mean((data[:, 1] - pred) ** 2)
        return error

    best_err = np.inf  # any validation error will be smaller
    for degree in range(1, 9):
        # Fit on the training set, compare degrees on the validation set
        coefs = np.polyfit(train[:, 0], train[:, 1], degree)
        valid_error = mean_square_error(valid, coefs)
        if valid_error < best_err:
            best_err = valid_error
            best_coef = coefs
            best_degree = degree

    # The test set is used only once, for the selected model
    test_error = mean_square_error(test, best_coef)
    print(best_degree, test_error)


### Selecting the best hypothesis:

• Degree 3, validation error: 0.0150
• Note: error estimates depend on the (random) split
• Maybe we should average them? Later...
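
How much the estimate moves with the split can be checked with a short sketch. It assumes hypothetical synthetic data, a degree 3 polynomial and 20 repeated random splits with 25 validation points each; the helper name `validation_error` is introduced here for illustration.

```python
import numpy as np

def validation_error(x, y, degree, n_valid, rng):
    """Fit on a random training subset, return the MSE on the held-out points."""
    order = rng.permutation(x.size)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    coefs = np.polyfit(x[train_idx], y[train_idx], degree)
    pred = np.polyval(coefs, x[valid_idx])
    return np.mean((y[valid_idx] - pred) ** 2)

rng = np.random.default_rng(42)

# Hypothetical data: noisy square root on [0, 1]
x = rng.uniform(0.0, 1.0, 100)
y = np.sqrt(x) + rng.normal(0.0, 0.1, 100)

# Same data and same model, 20 different random splits
errors = np.array([validation_error(x, y, 3, 25, rng) for _ in range(20)])
print(errors.min(), errors.max(), errors.mean())
```

The spread between the smallest and largest value shows how far a single split can be from the mean over splits.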

### Regularization

• Different approach: use a high-degree polynomial (degree 10)
• But regularize the fit (with ridge regression):
$$J(\theta) = \sum_{t=1}^{n} \left[ y^t - g(x^t|\theta) \right]^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

### In Practice

• The scikit-learn library provides a Ridge class for ridge regression
• It is a linear model, but we can still fit a polynomial by adding the powers of $x$ as extra columns in our data

### Loading and expanding the data to $x^{10}$


    def expand(data, degree):
        """Expand the data to a polynomial of the specified degree.

        Columns 0..degree-1 hold x, x^2, ..., x^degree; the last column holds y.
        """
        expanded = np.zeros((data.shape[0], degree + 1))
        expanded[:, 0] = data[:, 0]
        expanded[:, -1] = data[:, -1]
        for power in range(2, degree + 1):
            expanded[:, power - 1] = data[:, 0] ** power
        return expanded

    # orig_data is assumed to be already loaded: X in column 0, Y in column 1
    scale = np.max(orig_data, axis=0)
    orig_data = orig_data / scale
    data = expand(orig_data, 10)
    train, temp = random_split(data, 90)
    valid, test = random_split(temp, 45)


### Finding the best $\lambda$


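    from sklearn.linear_model import Ridge

    lambs = np.linspace(0.01, 0.2)  # 50 candidate values of lambda
    best_err = np.inf
    for lamb in lambs:
        solver = Ridge(alpha=lamb, solver='cholesky', tol=0.00001)
        # Features are all columns but the last; the last column is y
        solver.fit(train[:, :-1], train[:, -1])
        ys = solver.predict(valid[:, :-1])
        valid_err = np.mean((ys - valid[:, -1]) ** 2)
        if valid_err < best_err:
            # keep the best
            best_err = valid_err
            best_lamb = lamb
            best_solver = solver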
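
Once $\lambda$ has been chosen on the validation set, the test set gives the final error estimate. The following self-contained sketch runs the whole procedure on hypothetical synthetic data standing in for the rescaled data set; the data, the 60/45/45 split and the variable names are assumptions made for illustration, not the original results.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)

# Hypothetical stand-in for the rescaled data: x in [0, 1], noisy y
x = rng.uniform(0.0, 1.0, 150)
y = np.sqrt(x) + rng.normal(0.0, 0.05, 150)

# Expand to x, x^2, ..., x^10 (y is kept in a separate array here)
X = np.column_stack([x ** p for p in range(1, 11)])

# 60 training, 45 validation and 45 test points
order = rng.permutation(x.size)
tr, va, te = order[:60], order[60:105], order[105:]

best_err, best_lamb, best_model = np.inf, None, None
for lamb in np.linspace(0.01, 0.2):
    model = Ridge(alpha=lamb, solver='cholesky')
    model.fit(X[tr], y[tr])
    valid_err = np.mean((model.predict(X[va]) - y[va]) ** 2)
    if valid_err < best_err:
        best_err, best_lamb, best_model = valid_err, lamb, model

# The test set is touched only once, after lambda has been chosen
test_err = np.mean((best_model.predict(X[te]) - y[te]) ** 2)
print(best_lamb, best_err, test_err)
```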


# Summary

• Error estimates (stochastic)
  • Training error
  • Test error: unbiased estimate of the true error
  • Validation error: monitor overfitting
• Model selection and regularization