Aprendizagem Automática
Overfitting and Classification
Ludwig Krippahl
Overfitting and Classification
Summary
 Scoring classifiers
 Cross-validation
 Logistic regression: model selection and regularization
Overfitting and Classification
Scoring classifiers
Scoring classifiers
Scoring binary classifiers
 For regression, the sum of squared distances to the line was useful:
$$E(\theta|\mathcal{X})=\sum_{t=1}^{n} \left[y^t - g(x^t|\theta)\right]^2$$
 With Logistic Regression the function to minimize is:
$$E(\widetilde{w}) = -\sum_{n=1}^{N} \left[ t_n \ln g_n + (1-t_n) \ln (1-g_n) \right]$$
$$g_n = \frac{1}{1+e^{-(\vec{w}^T\vec{x_n}+w_0)}}$$
 (The regression in logistic regression)
Scoring classifiers
Scoring binary classifiers
 How good is the hypothesis?
 The logistic loss or cross entropy function
$$L(\widetilde{w}) = \frac{1}{N} \sum_{n=1}^{N} H(p_n,q_n) = -\frac{1}{N}\sum_{n=1}^{N} \left[t_n \ln g_n + (1-t_n) \ln (1-g_n) \right]$$

Brier score: average quadratic error between the predicted probability and the class.
$$E(\widetilde{w})=\frac{1}{N}\sum_{n=1}^{N} [t_n - g_n]^2$$
 Note: this is not a quadratic error to the discriminant but to the predicted probability.
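 A minimal sketch of both scores with NumPy, assuming t holds the 0/1 class labels and g the predicted probabilities (both illustrative names):
import numpy as np
def cross_entropy(t, g):
    # Average logistic loss between 0/1 labels t and predicted probabilities g
    return -np.mean(t * np.log(g) + (1 - t) * np.log(1 - g))
def brier_score(t, g):
    # Average quadratic error between 0/1 labels t and predicted probabilities g
    return np.mean((t - g) ** 2)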
Scoring classifiers
Scoring binary classifiers
 Brier score: error between $g_n$ and $t_n$
Scoring classifiers
Scoring binary classifiers
 Confusion matrix (rows: predicted class, columns: example class):

 Prediction \ Class    Class 1           Class 0
 Class 1               True Positive     False Positive
 Class 0               False Negative    True Negative
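 A possible way to obtain these counts with scikit-learn (note its convention is transposed: rows are true classes, columns are predictions), assuming Y_t holds the true classes and pred the predicted classes (illustrative names):
from sklearn.metrics import confusion_matrix
# For binary 0/1 labels, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(Y_t, pred).ravel()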
Scoring classifiers
Scoring binary classifiers
$$ accuracy = \frac{true\ positives + true\ negatives}{N}$$
$$precision = \frac{true\ positives}{true\ positives + false\ positives}$$
$$recall = \frac{true\ positives}{true\ positives + false\ negatives}$$
Scoring classifiers
Scoring binary classifiers
 F1 Score: harmonic mean of precision and recall
$$F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$
$$F_1 = \frac{2 \times true\ positives}{2 \times true\ pos. + false\ pos. + false\ neg.}$$
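 These scores are also available in scikit-learn; a brief sketch, assuming Y_t are the true classes and pred the predicted classes (illustrative names):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('accuracy :', accuracy_score(Y_t, pred))
print('precision:', precision_score(Y_t, pred))
print('recall   :', recall_score(Y_t, pred))
print('F1       :', f1_score(Y_t, pred))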
Scoring classifiers
Scoring binary classifiers
 Typically, consider $g_n\geq 0.5$ for predicting Class 1
 More generally, consider $g_n\geq \alpha,\ \alpha \in [0,1]$
 As $\alpha$ changes, the number of false and true positives also changes
Scoring classifiers
Receiver Operating Characteristic
 Plot the true positive rate vs. the false positive rate for different values of $\alpha$
 We can use the area under the ROC curve (AUC) as a possible score for a binary classifier.
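 A minimal sketch with scikit-learn, assuming Y_t holds the true classes and prob the predicted probabilities for Class 1 (illustrative names):
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(Y_t, prob)  # one point per threshold alpha
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
print('AUC:', roc_auc_score(Y_t, prob))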
Scoring classifiers
Scoring binary classifiers
 Logistic Regression: Loss function, Brier score
 Precision, Recall, Accuracy
 Confusion Matrix
 Receiver Operating Characteristic curve
Classifiers in Scikit-Learn include a score method
 Returns accuracy, the fraction of correctly classified examples
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
reg.fit(X_r,Y_r)
test_error = 1 - reg.score(X_t,Y_t)
Overfitting and Classification
Cross-Validation
Cross-Validation
Train, Validate, Test
 Previously, we saw:
 Minimize Training Error to generate hypothesis
 Measure Validation Error on hypothesis to select
 Measure Test Error to estimate True Error
 Problem: select model based on one estimate (one hypothesis)
Cross-Validation
 Each train and validation split is one sample
Cross-Validation
Average hypotheses from the same class (model)
 Split data into Training Set and Test Set
 The Test Set is for an unbiased estimate of the true error of the final hypothesis
 Validate with average of N hypotheses
 Split Training Set into N folds
 Average N validations, training each time on the remaining N-1 folds (a sketch follows below)
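 A minimal sketch of this averaging with scikit-learn's cross_val_score (here scoring accuracy rather than the Brier error used later in the example), assuming X_r, Y_r are the training data:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# One validation score per fold; their mean is the cross-validation estimate
scores = cross_val_score(LogisticRegression(), X_r, Y_r, cv=5)
print('mean validation accuracy:', scores.mean())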
Cross-Validation
K-fold Cross-Validation
 Split points into K folds
 With one fold per data point (K = N): leave-one-out cross-validation (see the sketch below).
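 For reference, a minimal sketch of leave-one-out splitting on a toy list:
from sklearn.model_selection import LeaveOneOut
x = [0, 1, 2, 3]
for train, valid in LeaveOneOut().split(x):
    print(train, valid)  # each example is the validation set exactly once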
Cross-Validation and Testing
 Possible problem: class imbalance.
 Unless the data set is large, class proportions may differ between folds, or between training and test sets.
 Stratified sampling: random sampling that preserves proportions.
Overfitting and Classification
Example
Example
Find the best model for classifying these data:
Example
Load, shuffle, standardize features:
import numpy as np
from sklearn.utils import shuffle
mat = np.loadtxt('dataset_90.txt',delimiter=',')
data = shuffle(mat)
Ys = data[:,0]
Xs = data[:,1:]
means = np.mean(Xs,axis=0)
stdevs = np.std(Xs,axis=0)
Xs = (Xs-means)/stdevs
Example
We want to select the best model expanding the data:
 $\{x_1, x_2\}$ (original data)
 $\{x_1, x_2, x_1\times x_2\}$
 $\{x_1, x_2, x_1\times x_2, x_1^2\}$
 ...
def poly_16features(X):
    """Expand data polynomially"""
    X_exp = np.zeros((X.shape[0],X.shape[1]+14))
    X_exp[:,:2] = X
    X_exp[:,2] = X[:,0]*X[:,1]
    X_exp[:,3] = X[:,0]**2
    X_exp[:,4] = X[:,1]**2
    #... rest of the expansion here
    return X_exp
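 As an aside, a similar expansion could be obtained with scikit-learn's PolynomialFeatures; a sketch only, since the exact set of 16 features used here may differ:
from sklearn.preprocessing import PolynomialFeatures
# Polynomial terms of the two original features, without the constant column
poly = PolynomialFeatures(degree=5, include_bias=False)
X_poly = poly.fit_transform(Xs)
print(X_poly.shape)  # the number of generated features depends on the degree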
Example
Expand and split (stratified):
from sklearn.model_selection import train_test_split
Xs=poly_16features(Xs)
X_r,X_t,Y_r,Y_t = train_test_split(Xs, Ys, test_size=0.33, stratify = Ys)
Generate the fold indexes with KFold (example)
from sklearn.model_selection import KFold
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
kf = KFold(n_splits=4)
for train, valid in kf.split(x):
    print(train, valid)
[ 3 4 5 6 7 8 9 10 11] [0 1 2]
[ 0 1 2 6 7 8 9 10 11] [3 4 5]
[ 0 1 2 3 4 5 9 10 11] [6 7 8]
[ 0 1 2 3 4 5 6 7 8] [9 10 11]
Example
Stratified k-folds:
 Tries to keep the same distribution of classes in each fold
from sklearn.model_selection import StratifiedKFold
ys = [1,1,0,1,0,1,0,1,0,1,1,1]
kf = StratifiedKFold(n_splits=4)
for train, valid in kf.split(ys, ys):
    print(train, valid)
[ 3 4 5 6 7 8 9 10 11] [0 1 2]
[ 0 1 2 6 7 8 9 10 11] [3 4 5]
[ 0 1 2 3 4 5 8 10 11] [6 7 9]
[ 0 1 2 3 4 5 6 7 9] [8 10 11]
Example
 Compute the training and validation Brier error (assumes classes 0 and 1):
from sklearn.linear_model import LogisticRegression

def calc_fold(feats, X, Y, train_ix, valid_ix, C=1e12):
    """return error for train and validation sets"""
    reg = LogisticRegression(C=C, tol=1e-10)
    reg.fit(X[train_ix,:feats], Y[train_ix])
    prob = reg.predict_proba(X[:,:feats])[:,1]
    squares = (prob - Y)**2
    return np.mean(squares[train_ix]), np.mean(squares[valid_ix])

folds = 10
kf = StratifiedKFold(n_splits=folds)
for feats in range(2,16):
    tr_err = va_err = 0
    for tr_ix, va_ix in kf.split(Y_r, Y_r):
        r, v = calc_fold(feats, X_r, Y_r, tr_ix, va_ix)
        tr_err += r
        va_err += v
    print(feats, ':', tr_err/folds, va_err/folds)
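 To pick the model, one possible extension of this loop keeps the number of features with the lowest average validation error (best_feats and best_va_err are illustrative names):
best_feats, best_va_err = None, float('inf')
for feats in range(2,16):
    va_err = 0
    for tr_ix, va_ix in kf.split(Y_r, Y_r):
        _, v = calc_fold(feats, X_r, Y_r, tr_ix, va_ix)
        va_err += v
    if va_err/folds < best_va_err:
        best_feats, best_va_err = feats, va_err/folds
print('best number of features:', best_feats)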
Example
 Cross-validation error (red); best model: 9 features
Example
 Retrain on the whole training set and measure the test error.
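 A minimal sketch of this final step, assuming best_feats is the number of features selected by cross-validation (as in the sketch above):
from sklearn.linear_model import LogisticRegression
# Retrain on the full training set with the selected features
reg = LogisticRegression(C=1e12, tol=1e-10)
reg.fit(X_r[:,:best_feats], Y_r)
# Unbiased estimate of the true error on the held-out test set
test_error = 1 - reg.score(X_t[:,:best_feats], Y_t)
print('test error:', test_error)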
Example
Model selection with cross-validation
 Preprocess: shuffle; normalize or standardize
 Reserve a test set (stratified, if needed)
 Partition the training set into n folds (stratified, if needed)
 Cross-validation to select the best model
 Train the best model on the whole training set
 Estimate the true error of the final hypothesis on the test set
 Why?
 The cross-validation error is unbiased and could be a good estimate of the true error
 But it becomes biased once we use it (and those examples) to select the best model
Overfitting and Classification
Regularization
Regularization
Logistic regression can be regularized
 For example, with L2 norm
$$\lambda \sum_{j=1}^{m} w_j^2$$
 LogisticRegression class offers a $C$ parameter:
$$\frac{1}{C} \sum_{j=1}^{m} w_j^2$$
Regularization
Ridge regression is intuitive:
$$y = \theta_1 x^2 + \theta_2 x + \theta_3$$
$$J(\theta) = \sum_{t=1}^{n} \left[ y^t - g(x^t|\theta) \right]^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$
In logistic regression this seems less intuitive:
$$g(\vec{x},\widetilde{w}) = \frac{1}{1+e^{-\widetilde{w}^T\vec{x}}}$$
$$\frac{1}{C} \sum_{j=1}^{m} w_j^2$$
Regularization
 The decision boundary is where
$$\widetilde{w}^T \vec{x} = 0$$
Regularization
 Large $\widetilde{w}$, steep logistic function
$$g(\vec{x},\widetilde{w}) = \frac{1}{1+e^{-\widetilde{w}^T\vec{x}}}$$
Regularization
 A small $\widetilde{w}$ lowers the slope
$$g(\vec{x},\widetilde{w}) = \frac{1}{1+e^{-\widetilde{w}^T\vec{x}}}$$
Regularization
 No regularization, overfitting
 Regularization, large margins
Regularization
 We'll use the 15-feature model (overfitting)
Regularization
 Cross-validation error vs $\log_{10}(C)$
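 One possible way to compute the values for such a curve, reusing calc_fold and the stratified folds from the example (the grid of C values is illustrative):
feats = 15  # the overfitting model
for exp in range(-2, 5):
    C = 10.0**exp
    va_err = 0
    for tr_ix, va_ix in kf.split(Y_r, Y_r):
        _, v = calc_fold(feats, X_r, Y_r, tr_ix, va_ix, C=C)
        va_err += v
    print('log10(C) =', exp, 'validation error:', va_err/folds)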
Overfitting and Classification
Summary
Overfitting and Classification
Summary
 Scoring and evaluating classifiers
 Cross-validation and model selection
 Cross-validation and regularization
 Stratified sampling: train, folds, test
Further reading