# Multiclass, Bias and Variance

## Multiclass, Bias and Variance

### Summary

• Multiclass classification
• Bias: deviation from expected value
• Variance: scattering of predictions
• Computing Bias and Variance with bootstrapping
• Relation to underfitting and overfitting

# Multiclass

## Multiclass Classification

### Some classifiers naturally handle multiple classes

• k-NN: output majority class from k neighbours.

• Naïve Bayes: output class with greatest joint probability
• $$C^{Naïve Bayes} = \underset{k \in \{1,...,K\}} {\mathrm{argmax}} \ln p(C_k)+\sum \limits_{j=1}^{N}\ln p(x_j|C_k)$$
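• As a minimal sketch (using scikit-learn on the Iris data introduced below; the specific settings are assumptions, not code from these slides), both classifiers accept three classes directly:

```python
# k-NN and Gaussian Naive Bayes handle the 3 Iris classes with no special adaptation.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)              # 3 classes, 4 attributes

knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)
gnb = GaussianNB().fit(X, y)

print(knn.predict(X[:3]))                      # majority class among the 15 neighbours
print(gnb.predict(X[:3]))                      # class with greatest joint probability
```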

## Multiclass Classification

• 3 classes:
• Setosa
• Versicolor
• Virginica
• Attributes:
• Sepal length
• Sepal width
• Petal length
• Petal width

## Multiclass Classification

• Example: Iris dataset

• (Images: Setosa CC BY-SA Szczecinkowaty; Versicolor CC BY-SA Gordon, Robertson; Virginica CC BY-SA Mayfield)

## Multiclass Classification

• k-NN classification (k = 15)
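• A rough sketch of how such a plot can be produced (the choice of sepal features and the plotting details are assumptions, not the original figure's code):

```python
# k-NN (k = 15) decision regions on two Iris features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # sepal length, sepal width
knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Predict over a grid to colour the region assigned to each class
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.xlabel('Sepal length'); plt.ylabel('Sepal width')
plt.show()
```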

## Multiclass Classification

### Linear discriminant classifiers need some adaptation

$$\vec{w}^T \vec{x} + w_0 = 0$$
• (Logistic Regression, SVM, MLP...)

### Generic solutions for binary classifiers:

• One versus the rest: K-1 classifiers
• One versus one: K(K-1)/2 classifiers
• One versus the rest: K classifiers, assign by the maximum of the decision function

## Multiclass Classification

### Multiclass: one vs the rest (K-1)

• Create K-1 classifiers
• For each k in {1 ... K-1}, set class k as 1 and others as 0
• Assign class {1 ... K-1} according to which classifier returns 1, or class K if none.
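• A minimal sketch of this scheme, assuming binary logistic regression as the base classifier (the dataset and classifier choice are assumptions):

```python
# One vs the rest with K-1 binary classifiers; class K is the default when none fires.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
K = len(np.unique(y))

# One binary classifier per class k in {1 ... K-1}: class k versus everything else
classifiers = [LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
               for k in range(K - 1)]

def predict(x):
    for k, clf in enumerate(classifiers):
        if clf.predict(x.reshape(1, -1))[0] == 1:
            return k                           # first classifier that returns 1 (may be ambiguous)
    return K - 1                               # none returned 1: assign the remaining class

print([predict(x) for x in X[:5]])
```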

## Multiclass Classification

• One vs the rest (K-1), Classifier for Setosa

## Multiclass Classification

• One vs the rest (K-1), Classifier for Versicolor

## Multiclass Classification

• One vs the rest (K-1), Final classifier (ambiguous areas)

## Multiclass Classification

### Multiclass: one vs one K(K-1)/2

• Build classifiers for all pairs.
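• scikit-learn automates the pairing; a minimal sketch (the SVC base classifier is an assumption):

```python
# One vs one: K(K-1)/2 pairwise classifiers, combined by voting.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)
print(len(ovo.estimators_))                    # 3 classes -> 3 pairwise classifiers
print(ovo.predict(X[:5]))
```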

## Multiclass Classification

• One vs one K(K-1)/2, Setosa vs Versicolor

## Multiclass Classification

• One vs one K(K-1)/2, Setosa vs Virginica

## Multiclass Classification

• One vs one K(K-1)/2, Versicolor vs Virginica

## Multiclass Classification

• One vs one K(K-1)/2, Final: also ambiguous areas

## Multiclass Classification

### Multiclass: one vs rest K (or one vs all)

• Better: build K OvR classifiers, one per class

## Multiclass Classification

• One vs rest: Classify using max of decision function

## Multiclass Classification

### Pros of multiclass classification with one vs rest:

• Classify using max of decision function
• Helps avoid ambiguous classifications (depending on decision function)

### Cons of multiclass classification with one vs rest:

• Each classifier is trained on unbalanced data (many more negative than positive examples)
• The confidence values of the decision function may not be directly comparable
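• A minimal sketch of the max-of-decision-function rule with K separate classifiers (LinearSVC as the base classifier is an assumption; OneVsRestClassifier, shown later, automates exactly this):

```python
# One vs rest with K classifiers, resolved by the maximum of the decision function.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
K = len(np.unique(y))

# One binary classifier per class (class k versus the rest)
classifiers = [LinearSVC(max_iter=10000).fit(X, (y == k).astype(int)) for k in range(K)]

# Stack the K decision functions and take the argmax for each point
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
predictions = np.argmax(scores, axis=1)
print(np.mean(predictions == y))               # training-set accuracy
```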

## Multiclass Classification

• Specific solutions for some classifiers
• Logistic Regression, extend cross-entropy error function:
• $$p(T|w_1,...,w_K) = \prod\limits_{n=1}^{N} \prod \limits_{k=1}^{K}p(C_k|\phi_n)^{t_{nk}} = \prod\limits_{n=1}^{N} \prod \limits_{k=1}^{K} y_{nk}^{t_{nk}}$$ $$E(w_1,...,w_K) = - \sum\limits_{n=1}^{N} \sum \limits_{k=1}^{K} {t_{nk}} \ln y_{nk}$$
• Where $t_{nk}$ is 1 if point $n$ is in class $k$, 0 otherwise

## Multiclass Classification

• Logistic regression, Scikit-Learn

```python
from sklearn.linear_model import LogisticRegression

# One versus rest, assign by maximum of the decision function
logreg = LogisticRegression(C=1e5, multi_class='ovr')
logreg.fit(X, Y)

# Cross entropy (multinomial)
logreg = LogisticRegression(C=1e5, multi_class='multinomial')
logreg.fit(X, Y)
```

## Multiclass Classification

• Multilayer Perceptron
• For 2 classes, need only one output neuron
• For K classes, use K output neurons
• Train with a target of 1 for the correct class and 0 for all the others.
• Classify into the class whose output neuron has the highest activation.
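• A minimal sketch with scikit-learn's MLPClassifier, which handles the one-hot targets and the argmax internally (the layer sizes are assumptions):

```python
# MLP for K classes: K output neurons (softmax), argmax at prediction time.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)
print(mlp.predict(X[:5]))                      # class of the highest output neuron
print(mlp.predict_proba(X[:2]))                # one output per class
```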

## Multiclass Classification

• OneVsRestClassifier
• Fits one classifier per class, calling fit()
• Predicts from the maximum of decision_function()

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr = OneVsRestClassifier(SVC(kernel='rbf', gamma=0.7, C=10))
ovr.fit(X, Y)
ovr.predict(test_set)
```


## Multiclass Classification

• OneVsRestClassifier: automates OvR with the maximum of the decision function

# Bias

## Bias

• Suppose the model cannot fit the data

## Bias

• Even averaging over samples...

## Bias

• ...there is a large bias in the prediction

## Bias

• Bias: deviation of the average estimate from the target value

## Bias

• Bias: deviation of the average estimate from the target value
• The bias for example $n$ is the squared error between the true value for $n$ and the average of the predictions for $n$, over all hypotheses trained on different samples:
• $$bias_n = \left( \bar{y}(x_n) - t_n \right)^2$$
• The bias for the model is the average bias for all examples:
• $$bias = \frac{1}{N} \sum \limits_{n=1}^{N} \left( \bar{y}(x_n) - t_n \right)^2$$
• Note: the bias is often written as $bias^2$, but this is just to denote the squared error.

# Variance

## Variance

• If the model is overfitting, it adjusts the training data too much

## Variance

• The fitted hypothesis varies over different training sets

## Variance

• Variance: dispersion of predictions around their average

## Variance

• Variance: dispersion of predictions around their average
• Variance of predictions for point $n$, over all hypotheses:
• $$var_n = \frac{1}{M} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$
• (The average squared distance to the average prediction)
• Variance for the model is the average over all points
• $$var = \frac{1}{NM} \sum \limits_{n=1}^{N} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$

## Bias and Variance

### Bias: squared deviation from true value

$$bias = \frac{1}{N} \sum \limits_{n=1}^{N}\left( \bar{y}(x_n) - t_n \right)^2$$

### Variance: squared deviation from mean prediction

$$var = \frac{1}{NM} \sum \limits_{n=1}^{N} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$
• Note: Bias and Variance depend on what the model does on average and not on any single hypothesis
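• A minimal sketch of both formulas, assuming an M×N matrix of predictions (one row per hypothesis, one column per example) and a vector of true values (the variable names are assumptions):

```python
# Bias and variance from an M x N prediction matrix (M hypotheses, N examples).
import numpy as np

def bias_variance(predictions, targets):
    mean_pred = predictions.mean(axis=0)           # average prediction for each example
    bias = np.mean((mean_pred - targets) ** 2)     # squared deviation from the true values
    var = np.mean((predictions - mean_pred) ** 2)  # squared deviation from the mean prediction
    return bias, var
```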

# Bias-variance decomposition

## Bias-variance decomposition

• If we have a squared loss function, then the expected error is:
• $$E\left( (y-t)^2 \right) = \left(E(y)-E(t)\right)^2 + E\left( \left(y-E(y)\right)^2 \right) + E\left( \left(t-E(t)\right)^2 \right)$$

$$bias + var + noise$$
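• A quick numerical check of this identity, assuming independent predictions and targets (the distributions are arbitrary, purely illustrative):

```python
# Check E[(y-t)^2] = bias + variance + noise with independent y and t.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=1_000_000)       # predictions: mean 2, variance 1
t = rng.normal(1.5, 0.5, size=1_000_000)       # targets: mean 1.5, noise variance 0.25

lhs = np.mean((y - t) ** 2)
rhs = (y.mean() - t.mean()) ** 2 + y.var() + t.var()
print(lhs, rhs)                                # both close to 0.25 + 1.0 + 0.25 = 1.5
```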

## Bias-variance decomposition

• How to compute: average over different training sets
• But we have only one training set
• We need a resampling method

## Bias-variance decomposition

### Bootstrapping

• Method of generating random samples from one data set
• From the training set, sample at random with replacement
• Generate M replicas of N points each
• Measure over the replicas

### We also want an error estimate

• Bias and variance measured inside the training set are not the same as outside it
• To get unbiased error estimates, we need to measure them outside the training set.
• So, for each replica, randomly choose the points used for training (~2/3) and use those left out for testing (~1/3)

## Bias-variance decomposition

• Bootstrapping

```python
import numpy as np

def bootstrap(replicas, data):
    train_sets = np.zeros((replicas, data.shape[0], data.shape[1]))
    test_sets = []
    for replica in range(replicas):
        # Sample N indices with replacement to build each training replica
        ix = np.random.randint(data.shape[0], size=data.shape[0])
        train_sets[replica, :] = data[ix, :]
        # Points never sampled (out-of-bag) form the test set for this replica
        in_train = np.unique(ix)
        test_sets.append(np.setdiff1d(np.arange(data.shape[0]), in_train))
    return train_sets, test_sets
```

• Training sets all have the same size, N, sampled with replacement
• The test sets are just for estimating the error, not for the Bias-Variance decomposition

## Bias-variance decomposition

• Decomposing, polynomial

```python
def bv_poly(degree, train_sets, test_sets, data):
    replicas = train_sets.shape[0]
    predicts = np.zeros((replicas, data.shape[0]))
    # Fit one polynomial per bootstrap replica and predict on all points
    for ix in range(replicas):
        coefs = np.polyfit(train_sets[ix, :, 0], train_sets[ix, :, 1], degree)
        predicts[ix, :] = np.polyval(coefs, data[:, 0])
    counts = np.zeros(data.shape[0])
    sum_preds = np.zeros(data.shape[0])
    sum_squares = np.zeros(data.shape[0])
    # Accumulate predictions only where each point is out-of-bag (in the test set)
    for ix, test in enumerate(test_sets):
        counts[test] += 1
        preds = predicts[ix, test]
        sum_preds[test] = sum_preds[test] + preds
        sum_squares[test] = sum_squares[test] + preds ** 2
    # Variance via E[(x - mean)^2] = E[x^2] - E[x]^2; bias from the mean prediction
    var_point = sum_squares / counts - (sum_preds / counts) ** 2
    mean_point = sum_preds / counts
    bias = np.mean((data[:, -1] - mean_point) ** 2, axis=None)
    var = np.mean(var_point, axis=None)
    return bias, var
```

$$E\left( (x-\bar{x})^2 \right) = \bar{x^2}-\bar{x}^2$$

## Bias-variance decomposition

• Bias-variance decomposition, polynomial regression

## Bias-variance decomposition

• Lowest total error

## Bias-variance decomposition

### B-V with classifiers

• With a 0/1 loss function, the main prediction for point $i$ is the mode
• Assuming there is no noise, the Bias for point $i$ is:
• $$bias_i = L (Mo(y_{i,m}), t_i) \qquad bias_i \in \{0,1\}$$
• And the Variance is:
• $$var_i = E \left(L(Mo(y_{i,m}),y_{i,m})\right)$$
• To compute total error, we must consider that:
• If $bias_i = 0$, $var_i$ increases error.
• If $bias_i = 1$, $var_i$ decreases error.
• So for the error we must add or subtract the variances accordingly
• $$E \left(L(t,y)\right)= E \left(B(i)\right) + E \left(V_{unb.}(i)\right) - E \left(V_{biased}(i)\right)$$

## Bias-variance decomposition

• SVM, with RBF kernel. Optimize $\gamma$

## Bias-variance decomposition

• Bias-variance decomposition with SVM

```python
from sklearn import svm

def bv_svm(gamma, C, train_sets, test_sets, data):
    replicas = train_sets.shape[0]
    predicts = np.zeros((replicas, data.shape[0]))
    # Fit one SVM per bootstrap replica and predict the class of every point
    for ix in range(replicas):
        sv = svm.SVC(kernel='rbf', gamma=gamma, C=C)
        sv.fit(train_sets[ix, :, :-1], train_sets[ix, :, -1])
        predicts[ix, :] = sv.predict(data[:, :-1])
    counts = np.zeros(data.shape[0])
    sum_preds = np.zeros(data.shape[0])
    # Accumulate predictions only where each point is out-of-bag
    for ix, test in enumerate(test_sets):
        counts[test] += 1
        sum_preds[test] = sum_preds[test] + predicts[ix, test]
    # Main prediction is the mode (0/1 classes, so rounding the mean)
    main_preds = np.round(sum_preds / counts)
    bias_point = np.abs(data[:, -1] - main_preds)
    var_point = np.abs(sum_preds - counts * main_preds) / counts
    # Unbiased variance adds to the error, biased variance subtracts from it
    u_var = np.sum(var_point[bias_point == 0]) / data.shape[0]
    b_var = np.sum(var_point[bias_point == 1]) / data.shape[0]
    bias = np.mean(bias_point)
    return bias, u_var - b_var
```


## Bias-variance decomposition

• Bias-variance decomposition with SVM

## Bias-variance decomposition

• Bias-variance SVM, lowest error (apart from noise)

## Bias-variance decomposition

• Reducing $bias$ increases $variance$

• Note: Bias-Variance decomposition is useful for understanding the components of the error but, in practice, it is easier to use cross-validation and just consider the total error.

# Summary

## Bias-variance decomposition

### Summary

• Bias: average deviation from true value
• Variance: dispersion around the average prediction
• Classification: variance increases or decreases error depending on bias
• Bias and variance related to underfitting and overfitting