Machine Learning

Multiclass, Bias and Variance

Ludwig Krippahl

Multiclass, Bias and Variance

Summary

  • Multiclass classification
  • Bias: deviation from expected value
  • Variance: scattering of predictions
  • Computing Bias and Variance with bootstrapping
  • Relation to underfitting and overfitting

Multiclass, Bias and Variance

Multiclass

Multiclass Classification

Some classifiers naturally handle multiple classes

  • k-NN: output majority class from k neighbours.

  • Naïve Bayes: output the class with the greatest joint probability
  • $$C^{Naïve Bayes} = \underset{k \in \{1,...,K\}} {\mathrm{argmax}} \ln p(C_k)+\sum \limits_{j=1}^{N}\ln p(x_j|C_k)$$
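  • A minimal sketch of this decision rule, assuming the log-priors and per-class log-likelihoods were already estimated (the function and argument names below are illustrative, not from the original code):

import numpy as np

def naive_bayes_predict(x, log_priors, log_likelihoods):
    # log_priors: array of shape (K,) with ln p(C_k)
    # log_likelihoods[k](x): array of per-feature log-likelihoods ln p(x_j|C_k)
    scores = [log_priors[k] + np.sum(log_likelihoods[k](x))
              for k in range(len(log_priors))]
    return int(np.argmax(scores))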

Multiclass Classification

CC BY-SA: Gordon, Robertson
  • 3 classes:
    • Setosa
    • Versicolor
    • Virginica
  • Attributes:
    • Sepal length
    • Sepal width
    • Petal length
    • Petal width

Multiclass Classification

  • Example: Iris dataset

CC BY-SA Setosa: Szczecinkowaty; Versicolor: Gordon, Robertson; Virginica: Mayfield

Multiclass Classification

  • Example: Iris dataset

Multiclass Classification

  • k-NN classification (k = 15)

Multiclass Classification

Linear discriminant classifiers need some adaptation

    $$\vec{w}^T \vec{x} + w_0 = 0$$
  • (Logistic Regression, SVM, MLP...)

Generic solutions for binary classifiers:

  • One versus the rest: K-1 classifiers
  • One versus one: K(K-1)/2 classifiers
  • One versus the rest: K classifiers, assign by maximum of the decision function

Multiclass Classification

Multiclass: one vs the rest (K-1)

  • Create K-1 classifiers
  • For each k in {1 ... K-1}, set class k as 1 and others as 0
  • Assign class {1 ... K-1} according to which classifier returns 1, or class K if none.
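  • A minimal sketch of this scheme, using logistic regression as the binary classifier (an arbitrary choice for illustration; ties are resolved here by the lowest class index):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ovr_k_minus_1(X, y, K):
    # One binary classifier per class k in {1,...,K-1}: class k vs. all others
    return [LogisticRegression().fit(X, (y == k).astype(int))
            for k in range(1, K)]

def predict_ovr_k_minus_1(classifiers, X, K):
    # Assign the lowest class whose classifier returns 1; class K if none fires
    labels = np.full(X.shape[0], K)
    for k, clf in enumerate(classifiers, start=1):
        fires = (clf.predict(X) == 1) & (labels == K)
        labels[fires] = k
    return labels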

Multiclass Classification

  • One vs the rest (K-1), Classifier for Setosa

Multiclass Classification

  • One vs the rest (K-1), Classifier for Versicolor

Multiclass Classification

  • One vs the rest (K-1), Final classifier (ambiguous areas)

Multiclass Classification

Multiclass: one vs one K(K-1)/2

  • Build classifiers for all pairs.
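  • In Scikit-Learn, the pairwise scheme can be automated with OneVsOneClassifier, which fits the K(K-1)/2 binary classifiers and predicts by voting (a minimal sketch; the SVC parameters are arbitrary, and X, Y, test_set are assumed to exist as in the other examples):

    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import SVC

    ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma=0.7, C=10))
    ovo.fit(X, Y)
    ovo.predict(test_set)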

Multiclass Classification

  • One vs one K(K-1)/2, Setosa vs Versicolor

Multiclass Classification

  • One vs one K(K-1)/2, Setosa vs Virginica

Multiclass Classification

  • One vs one K(K-1)/2, Versicolor vs Virginica

Multiclass Classification

  • One vs one K(K-1)/2, Final: also ambiguous areas

Multiclass Classification

Multiclass: one vs rest K (or one vs all)

  • Better: K OvR classifiers

Multiclass Classification

  • One vs rest: Classify using max of decision function

Multiclass Classification

    Pros of multiclass classification with one vs rest:

    • Classify using max of decision function
    • Helps avoid ambiguous classifications (depending on decision function)

    Cons of multiclass classification with one vs rest:

    • Classifiers may be unbalanced (more negative than positive)
    • The confidence values of the decision function may not be directly comparable

Multiclass Classification

  • Specific solutions for some classifiers
  • Logistic Regression, extend cross-entropy error function:
  • $$p(T|w_1,...,w_K) = \prod\limits_{n=1}^{N} \prod \limits_{k=1}^{K}p(C_k|\phi_n)^{t_{nk}} = \prod\limits_{n=1}^{N} \prod \limits_{k=1}^{K} y_{nk}^{t_{nk}} $$ $$E(w_1,...,w_K) = - \sum\limits_{n=1}^{N} \sum \limits_{k=1}^{K} {t_{nk}} \ln y_{nk} $$
  • Where $t_{nk}$ is 1 if point $n$ is in class $k$, 0 otherwise

Multiclass Classification

  • Logistic regression, Scikit-Learn

    from sklearn.linear_model import LogisticRegression

    # One versus the rest, assign by maximum
    logreg = LogisticRegression(C=1e5, multi_class='ovr')
    logreg.fit(X, Y)

    # Multinomial, cross-entropy over all classes
    logreg = LogisticRegression(C=1e5, multi_class='multinomial')
    logreg.fit(X, Y)

Multiclass Classification

  • Multilayer Perceptron
  • For 2 classes, need only one output neuron
  • For K classes, use K output neurons
  • Train for 1 in one class and zeros for all the rest.
  • Classify in class corresponding to highest output neuron.
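  • A minimal sketch of this encoding, with hypothetical network outputs just to illustrate the final argmax step:

import numpy as np

# One-hot targets: row n has a 1 in column k if example n belongs to class k
y = np.array([0, 2, 1, 2])              # labels for 4 examples, K = 3 classes
targets = np.zeros((y.size, 3))
targets[np.arange(y.size), y] = 1

# After training, classify by the output neuron with the largest activation
outputs = np.array([[0.9, 0.1, 0.2],    # hypothetical outputs of the K neurons
                    [0.2, 0.3, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.4, 0.2, 0.6]])
predicted = np.argmax(outputs, axis=1)  # [0, 2, 1, 2]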

Multiclass Classification

  • OneVsRestClassifier
  • Fits one classifier per class, calling fit()
  • Predicts from the maximum of decision_function()

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    ovr = OneVsRestClassifier(
            SVC(kernel='rbf', gamma=0.7, C=10))
    ovr.fit(X, Y)
    ovr.predict(test_set)

Multiclass Classification

  • OneVsRestClassifier: automate OvR, max. decision

Bias and Variance

Bias

Bias

  • Suppose the model cannot fit the data

Bias

  • Even averaging over samples...

Bias

  • ...there is a large bias in the prediction

Bias

  • Bias: deviation of the average estimate from the target value

Bias

  • Bias: deviation of the average estimate from the target value
  • The bias for example $n$ is the squared error between the true value for $n$ and the average of the predictions for $n$, over all hypotheses trained on different samples:
  • $$bias_n = \left( \bar{y}(x_n) - t_n \right)^2$$
  • The bias for the model is the average bias for all examples:
  • $$bias = \frac{1}{N} \sum \limits_{n=1}^{N} \left( \bar{y}(x_n) - t_n \right)^2$$
  • Note: the bias is often written as $bias^2$, but this is just to denote the squared error.

Bias and Variance

Variance

Variance

  • If the model is overfitting, it adjusts the training data too much

Variance

  • Varies over different training sets

Variance

  • Variance: dispersion of predictions around their average

Variance

  • Variance: dispersion of predictions around their average
  • Variance of predictions for point $n$, over all hypotheses:
  • $$var_n = \frac{1}{M} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$
    • (the average squared distance to the average prediction)
  • Variance for the model is the average over all points
  • $$var = \frac{1}{NM} \sum \limits_{n=1}^{N} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$

Bias and Variance

Bias: squared deviation from true value

$$bias = \frac{1}{N} \sum \limits_{n=1}^{N}\left( \bar{y}(x_n) - t_n \right)^2$$

Variance: squared deviation from mean prediction

$$var = \frac{1}{NM} \sum \limits_{n=1}^{N} \sum \limits_{m=1}^{M} \left( \bar{y}(x_n) - y_m(x_n) \right)^2$$
  • Note: Bias and Variance depend on what the model does on average and not on any single hypothesis
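  • Both quantities can be computed directly from a matrix of predictions, one row per hypothesis trained on a different sample (a minimal sketch; predictions and t are illustrative placeholders):

import numpy as np

def bias_variance(predictions, t):
    # predictions: (M, N) array, prediction of hypothesis m for example n
    # t: (N,) array of true values
    mean_pred = predictions.mean(axis=0)           # \bar{y}(x_n)
    bias = np.mean((mean_pred - t) ** 2)           # squared deviation from targets
    var = np.mean((predictions - mean_pred) ** 2)  # squared deviation from the mean prediction
    return bias, var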

Bias and Variance

Bias-variance decomposition

Bias-variance decomposition

  • If we have a squared loss function, then the expected error is:
  • $$E\left( (y-t)^2 \right) = \left(E(y)-E(t)\right)^2 + E\left( \left(y-E(y)\right)^2 \right) + E\left( \left(t-E(t)\right)^2 \right)$$

    $$bias + var + noise$$
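  • A short sketch of this identity, assuming the prediction $y$ and the noisy target $t$ are independent and writing $\bar{y}=E(y)$, $\bar{t}=E(t)$:
  • $$E\left((y-t)^2\right) = E\left(\left((y-\bar{y}) + (\bar{y}-\bar{t}) - (t-\bar{t})\right)^2\right) = \left(\bar{y}-\bar{t}\right)^2 + E\left((y-\bar{y})^2\right) + E\left((t-\bar{t})^2\right)$$
  • The cross terms vanish because $E(y-\bar{y}) = E(t-\bar{t}) = 0$ and $y$ and $t$ are independent.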

Bias-variance decomposition

  • How to compute: average over different training sets
  • But we have only one training set
  • We need a resampling method

Bias-variance decomposition

Bootstrapping

  • Method of generating random samples from one data set
  • From the training set, sample at random with replacement
  • Generate M replicas of N points each
  • Measure over the replicas

We also want an error estimate

  • Bias and variance measured on the training set are not the same as on data outside it
  • To get unbiased estimates of the errors, we need to measure it outside the training set.
  • So, for each replica, randomly choose the points used for training (~2/3) and use those left out for testing (~1/3)

Bias-variance decomposition

  • Bootstrapping

import numpy as np

def bootstrap(replicas, data):
    # Generate bootstrap replicas and the corresponding out-of-bag test indices
    train_sets = np.zeros((replicas, data.shape[0], data.shape[1]))
    test_sets = []
    for replica in range(replicas):
        # Sample N indices with replacement to build one replica
        ix = np.random.randint(data.shape[0], size=data.shape[0])
        train_sets[replica, :] = data[ix, :]
        # Points never sampled form the test (out-of-bag) set for this replica
        in_train = np.unique(ix)
        mask = np.ones(data.shape[0], dtype=bool)
        mask[in_train] = False
        test_sets.append(np.arange(data.shape[0])[mask])
    return train_sets, test_sets
  • Training sets all have the same size, N, sampled with replacement
  • The test sets are just for estimating the error and not for Bias-Variance decomposition

Bias-variance decomposition

  • Decomposing: polynomial regression

def bv_poly(degree, train_sets, test_sets, data):
    # Estimate bias and variance of a polynomial of the given degree
    replicas = train_sets.shape[0]
    predicts = np.zeros((replicas, data.shape[0]))
    for ix in range(replicas):
        # Fit the polynomial on each replica and predict for all points
        coefs = np.polyfit(train_sets[ix, :, 0],
                           train_sets[ix, :, 1], degree)
        predicts[ix, :] = np.polyval(coefs, data[:, 0])
    counts = np.zeros(data.shape[0])
    sum_preds = np.zeros(data.shape[0])
    sum_squares = np.zeros(data.shape[0])
    for ix, test in enumerate(test_sets):
        # Accumulate only out-of-bag predictions for each point
        counts[test] += 1
        preds = predicts[ix, test]
        sum_preds[test] = sum_preds[test] + preds
        sum_squares[test] = sum_squares[test] + preds ** 2
    # Per-point variance from E[x^2] - E[x]^2 (see the identity below)
    var_point = sum_squares / counts - (sum_preds / counts) ** 2
    mean_point = sum_preds / counts
    bias = np.mean((data[:, -1] - mean_point) ** 2, axis=None)
    var = np.mean(var_point, axis=None)
    return bias, var
$$E\left( (x-\bar{x})^2 \right) = \bar{x^2}-\bar{x}^2$$
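  • A possible way of putting bootstrap and bv_poly together, sweeping over polynomial degrees (a sketch; the data file name and the number of replicas are placeholders):

import numpy as np

# data: (N, 2) array, feature in column 0, target in column 1
data = np.loadtxt('data.txt')                 # placeholder file name
train_sets, test_sets = bootstrap(200, data)  # 200 bootstrap replicas

for degree in range(1, 10):
    bias, var = bv_poly(degree, train_sets, test_sets, data)
    print(f'degree {degree}: bias {bias:.4f}, variance {var:.4f}')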

Bias-variance decomposition

  • Bias-variance decomposition, polynomial regression

Bias-variance decomposition

  • Lowest total error

Bias-variance decomposition

B-V with classifiers

  • With a 0/1 loss function, the main prediction for point $i$ is the mode
  • Assuming there is no noise, the Bias for point $i$ is:
  • $$bias_i = L (Mo(y_{i,m}), t_i) \qquad bias_i \in \{0,1\}$$
  • And the Variance is:
  • $$var_i = E \left(L(Mo(y_{i,m}),y_{i,m})\right)$$
  • To compute total error, we must consider that:
    • If $bias_i = 0$, $var_i$ increases error.
    • If $bias_i = 1$, $var_i$ decreases error.
  • So for the error we must add or subtract the variances accordingly
  • $$E \left(L(t,y)\right)= E \left(B(i)\right) + E \left(V_{unb.}(i)\right) - E \left(V_{biased}(i)\right)$$

Bias-variance decomposition

  • SVM, with RBF kernel. Optimize $\gamma$

Bias-variance decomposition

  • Bias-variance decomposition with SVM

from sklearn import svm

def bv_svm(gamma, C, train_sets, test_sets, data):
    # Estimate bias and net variance for an RBF-kernel SVM (binary 0/1 labels)
    replicas = train_sets.shape[0]
    predicts = np.zeros((replicas, data.shape[0]))
    for ix in range(replicas):
        # Train on each replica (last column holds the class label)
        sv = svm.SVC(kernel='rbf', gamma=gamma, C=C)
        sv.fit(train_sets[ix, :, :-1], train_sets[ix, :, -1])
        predicts[ix, :] = sv.predict(data[:, :-1])
    counts = np.zeros(data.shape[0])
    sum_preds = np.zeros(data.shape[0])
    for ix, test in enumerate(test_sets):
        # Accumulate only out-of-bag predictions
        counts[test] += 1
        sum_preds[test] = sum_preds[test] + predicts[ix, test]
    # Main prediction: the mode of the 0/1 predictions for each point
    main_preds = np.round(sum_preds / counts)
    bias_point = np.abs(data[:, -1] - main_preds)
    # Fraction of predictions that differ from the main prediction
    var_point = np.abs(sum_preds - counts * main_preds) / counts
    u_var = np.sum(var_point[bias_point == 0]) / data.shape[0]
    b_var = np.sum(var_point[bias_point == 1]) / data.shape[0]
    bias = np.mean(bias_point)
    return bias, u_var - b_var

Bias-variance decomposition

  • Bias-variance decomposition with SVM

Bias-variance decomposition

  • Bias-variance SVM, lowest error (apart from noise)

Bias-variance decomposition

Bias-variance tradeoff

  • Reducing $bias$ increases $variance$

  • Note: Bias-Variance decomposition is useful for understanding the components of the error but, in practice, it is easier to use cross-validation and just consider the total error.
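  • For reference, a minimal cross-validation sketch of the kind suggested here, picking the SVM $\gamma$ by total error alone (X, Y and the candidate values are placeholders):

from sklearn import svm
from sklearn.model_selection import cross_val_score

# Estimate total error by 5-fold cross-validation for each candidate gamma
for gamma in [0.01, 0.1, 1, 10]:
    scores = cross_val_score(svm.SVC(kernel='rbf', gamma=gamma, C=10), X, Y, cv=5)
    print(f'gamma {gamma}: error {1 - scores.mean():.3f}')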

Bias-variance decomposition

Summary

Bias-variance decomposition

Summary

  • Bias: average deviation from true value
  • Variance: dispersion around the average prediction
    • Classification: variance increases or decreases error depending on bias
  • Bias and variance related to underfitting and overfitting
  • Bias-variance tradeoff

Further reading

  • Alpaydin, Section 4.3
  • Bishop, Sections 4.1.2, 4.3.4, 7.1.3
  • Optional:

Valentini, Giorgio, and Thomas G. Dietterich. "Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods." Journal of Machine Learning Research 5 (2004): 725-775.
Domingos, Pedro. "A unified bias-variance decomposition." Proceedings of the 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann, 2000.