Aprendizagem Automática

Ensemble Methods

Ludwig Krippahl

Ensemble methods

Summary

  • Ensemble methods
    • Bagging and bragging
    • Boosting and stumping

Ensemble methods


  • Combining groups of classifiers to improve classification

  • We'll focus on two different approaches:

  • Bootstrap aggregating: bootstrapping to train, combine predictions to reduce variance
  • Boosting: training a linear combination of weak classifiers (mainly) to reduce bias

Ensemble methods

Bagging

  • Bootstrap aggregating
  • Use bootstrapping to generate replicas of training set
  • Train one model per replica
  • Aggregate the output of the models. Example: for regression, average the predictions

Ensemble methods: Bagging

  • Example: regression

Ensemble methods: Bagging

  • Example: regression, mean

import numpy as np

def bootstrap(replicas, data):
    # generate bootstrap replicas of the training set; also return, for
    # each replica, the indices of the out-of-bag points (not sampled)
    train_sets = np.zeros((replicas, data.shape[0], data.shape[1]))
    test_sets = []
    for replica in range(replicas):
        # sample with replacement, same size as the original data
        ix = np.random.randint(data.shape[0], size=data.shape[0])
        train_sets[replica, :] = data[ix, :]
        # out-of-bag indices: points never drawn into this replica
        in_train = np.unique(ix)
        mask = np.ones(data.shape[0], dtype=bool)
        mask[in_train] = False
        test_sets.append(np.arange(data.shape[0])[mask])
    return train_sets, test_sets

# fit one polynomial per replica and average the predictions over a grid
# (replicas, data, ax_lims, points and degree are defined elsewhere)
train_sets, _ = bootstrap(replicas, data)
px = np.linspace(ax_lims[0], ax_lims[1], points)
preds = np.zeros((replicas, points))
for ix in range(replicas):
    coefs = np.polyfit(train_sets[ix, :, 0],
                       train_sets[ix, :, 1], degree)
    preds[ix, :] = np.polyval(coefs, px)
mean = np.mean(preds, axis=0)

Ensemble methods: Bagging

  • Example: regression, mean

Ensemble methods: Bragging

  • Variation: median instead of mean
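  • Reusing the preds array from the regression listing above, only the aggregation step changes; a minimal sketch:

    # bragging: aggregate with the (more robust) median instead of the mean
    median = np.median(preds, axis=0)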

Ensemble methods: Bagging

  • Classification example: SVM

Ensemble methods: Bagging

  • Classification example: SVM (majority vote)
    from sklearn import svm

    # train one RBF SVM per bootstrap replica and average the predicted
    # classes (0/1) over a grid; rounding the average gives the majority
    # vote at each grid point
    train_sets, _ = bootstrap(replicas, data)
    gamma = 2
    C = 10000
    svs = []
    pX, pY = np.meshgrid(pxs, pys)
    pZ = np.zeros(pX.shape)
    for ix in range(replicas):
        sv = svm.SVC(kernel='rbf', gamma=gamma, C=C)
        sv.fit(train_sets[ix, :, :-1], train_sets[ix, :, -1])
        svs.append(sv)
        preds = sv.predict(np.c_[pX.ravel(), pY.ravel()]).reshape(pZ.shape)
        pZ = pZ + preds
    pZ = np.round(pZ / float(replicas))

Ensemble methods: Bagging

  • 50 SVMs, trained with bootstrapping

Ensemble methods: Bagging

  • Majority class of the 50 SVMs
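  • For reference, the same bagged SVM ensemble can be built with scikit-learn's BaggingClassifier meta-estimator; a minimal sketch, assuming the same data layout (features in data[:,:-1], labels in data[:,-1]) and plotting grid pX, pY as above:

    # bagging 50 RBF SVMs with scikit-learn's meta-estimator
    import numpy as np
    from sklearn import svm
    from sklearn.ensemble import BaggingClassifier

    bag = BaggingClassifier(svm.SVC(kernel='rbf', gamma=2, C=10000),
                            n_estimators=50)
    bag.fit(data[:, :-1], data[:, -1])
    # majority-vote class over the plotting grid
    grid_preds = bag.predict(np.c_[pX.ravel(), pY.ravel()]).reshape(pX.shape)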

Ensemble methods: Bagging

Bagging (Bootstrap aggregating)

  • Averaging reduces variance and overfitting: with $T$ independent classifiers, each correct with probability $p > 0.5$, the probability that the majority vote is correct increases with $T$:
$$\sum \limits_{k = T/2+1}^{T} \binom {T} {k} p^k (1-p)^{T-k}$$
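  • A quick numeric check of this expression (a self-contained sketch; odd $T$ avoids ties):

    from math import comb

    def majority_correct(T, p):
        # probability that the majority of T independent classifiers,
        # each correct with probability p, gives the right answer
        return sum(comb(T, k) * p**k * (1 - p)**(T - k)
                   for k in range(T // 2 + 1, T + 1))

    print(majority_correct(1, 0.6))   # 0.6
    print(majority_correct(11, 0.6))  # about 0.75
    print(majority_correct(51, 0.6))  # about 0.93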

Ensemble methods: Bagging

Bagging (Bootstrap aggregating)

  • Probability that the majority of $T$ independent classifiers, each correct with probability $p$, is correct: $$\sum \limits_{k = T/2+1}^{T} \binom {T} {k} p^k (1-p)^{T-k}$$

  • This assumes classifiers are independent

  • If classifiers are correlated, this does not work so well

  • Bagging is best for unstable algorithms
    (susceptible to input variations)

Aprendizagem Automática

Boosting

Ensemble methods: Boosting

  • Learn a linear combination of weak classifiers
  • Classifiers must have an error rate below 0.5 (better than random guessing on a binary problem)
  • The combination has lower bias and better classification power than the individual weak classifiers

Ensemble methods: Boosting

AdaBoost

  • Initialize sample weights: $w_1^n = 1/N$
  • Fit classifier $y_m(x)$ by minimizing weighted error $$J_m = \sum \limits_{n=1}^{N}w_m^n I(y_m(x^n)\neq t^n)$$
  • Compute weighted error on training set: $$\epsilon_m = \frac{\sum \limits_{n=1}^{N}w_m^n I(y_m(x^n)\neq t^n)} {\sum \limits_{n=1}^{N}w_m^n}$$

Ensemble methods: Boosting

  • Compute classifier weight: $$\alpha_{m} = \ln \frac{1-\epsilon_m}{\epsilon_m}$$
    • Original (Freund and Schapire, 1997): $\alpha_{m} = \frac{1}{2} \ln \frac{1-\epsilon_m}{\epsilon_m}$
  • Compute new sample weights (and normalize): $$w_{m+1}^n = w_m^n \exp \left(\alpha_m I(y_m(x^n)\neq t^n)\right)$$
    • Increases weight of misclassified points
  • Stop when $\epsilon_m$ is zero or reaches $0.5$ or more
  • Output of the boosted classifier is the sign of the weighted sum of the base classifiers: $$f(x) = \operatorname{sign} \left( \sum \limits_{m=1}^{M} \alpha_m y_m(x) \right)$$
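  • A small numeric check of these updates (illustrative values): with $N = 4$ equal weights $w_m^n = 0.25$ and one misclassified point, $\epsilon_m = 0.25$ and $\alpha_m = \ln \frac{0.75}{0.25} = \ln 3 \approx 1.1$; the misclassified point's weight is multiplied by $e^{\alpha_m} = 3$ and, after normalization, becomes $0.75/1.5 = 0.5$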

Ensemble methods: Boosting

  • Stumping: AdaBoost with decision stumps (level 1 decision tree)
    • Choose one feature, split at one point
  • Use DecisionTreeClassifier

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# AdaBoost with decision stumps; class labels in data[:,-1] are -1/+1
hyps = []      # base classifiers (stumps)
hyp_ws = []    # classifier weights (alphas)
point_ws = np.ones(data.shape[0]) / float(data.shape[0])
max_hyp = 50
for ix in range(max_hyp):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(data[:, :-1], data[:, -1], sample_weight=point_ws)
    pred = stump.predict(data[:, :-1])
    errs = (pred != data[:, -1]).astype(int)
    err = np.sum(errs * point_ws)          # weighted error
    alpha = np.log((1 - err) / err)        # classifier weight
    hyps.append(stump)
    hyp_ws.append(alpha)
    # increase the weight of misclassified points and renormalize
    point_ws = point_ws * np.exp(alpha * errs)
    point_ws = point_ws / np.sum(point_ws)
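  • The same scheme is also available ready-made: scikit-learn's AdaBoostClassifier boosts depth-1 decision trees by default. A minimal sketch, assuming the same data layout as above:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# boost 50 decision stumps (the default base estimator is a depth-1 tree)
ada = AdaBoostClassifier(n_estimators=50)
ada.fit(data[:, :-1], data[:, -1])
# count misclassified training points
train_errors = np.sum(ada.predict(data[:, :-1]) != data[:, -1])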
    				

Ensemble methods: Boosting

  • Stumping: AdaBoost with decision stumps (level 1 decision tree)
  • Classifying data and computing error
    # weighted vote of all stumps, then take the sign
    preds = np.zeros(data.shape[0])
    net_pred = np.zeros(data.shape[0])
    for ix in range(len(hyps)):
        pred_n = hyps[ix].predict(data[:, :-1])
        preds = preds + pred_n * hyp_ws[ix]
    net_pred[preds < 0] = -1
    net_pred[preds >= 0] = 1
    # number of misclassified training points
    errors = np.sum((net_pred != data[:, -1]).astype(int))
       				

Ensemble methods: Boosting

AdaBoost, derivation

  • We can see AdaBoost as a sequential minimization of the exponential error function: $$E = \sum \limits_{n=1}^{N} \exp \left( -t_n f_m(x_n) \right)$$
  • Where $f_m(x)$ is the weighted combination of the first $m$ classifiers: $$f_m(x) = \sum \limits_{l=1}^{m} \alpha_l y_l(x)$$
  • And the previously fitted classifiers $y_1, \ldots, y_{m-1}$ and their weights $\alpha_1, \ldots, \alpha_{m-1}$ are kept fixed, so at step $m$ only $\alpha_m$ and $y_m$ are optimized

Ensemble methods: Boosting

AdaBoost, derivation

  • We can decompose the error into the correctly classified points ($\mathcal{T}$) and the misclassified points ($\mathcal{M}$):
  • $$E = \sum \limits_{n=1}^{N} w_m^n \exp \left( -t_n\alpha_m y_m(x_n) \right)=e^{-\alpha_m} \sum \limits_{n \in \mathcal{T}} w_m^n + e^{\alpha_m} \sum \limits_{n \in \mathcal{M}} w_m^n$$ $$=e^{-\alpha_m} \sum \limits_{n=1}^{N} w_m^n + (e^{\alpha_m}-e^{-\alpha_m}) \sum \limits_{n=1}^{N} w_m^n I(y_m(x^n)\neq t^n)$$
  • Minimizing for $y_m$: $J_m = \sum \limits_{n=1}^{N}w_m^n I(y_m(x^n)\neq t^n)$
  • Minimizing for $\alpha_m$: $\alpha_{m} = \ln \frac{1-\epsilon_m}{\epsilon_m}$ (step sketched below)
  • AdaBoost minimizes the exponential error of the linear combination of the base classifiers with a sequential optimization.
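  • Filling in the $\alpha_m$ step sketched above: setting $\frac{\partial E}{\partial \alpha_m} = 0$ in the decomposition gives $$-e^{-\alpha_m} \sum \limits_{n \in \mathcal{T}} w_m^n + e^{\alpha_m} \sum \limits_{n \in \mathcal{M}} w_m^n = 0 \Rightarrow e^{2\alpha_m} = \frac{\sum_{n \in \mathcal{T}} w_m^n}{\sum_{n \in \mathcal{M}} w_m^n} = \frac{1-\epsilon_m}{\epsilon_m}$$ which recovers the original $\alpha_m = \frac{1}{2} \ln \frac{1-\epsilon_m}{\epsilon_m}$; dropping the constant factor $\frac{1}{2}$, as in these slides, rescales all $\alpha_m$ equally and does not change the sign of the combination $\sum_m \alpha_m y_m(x)$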

Aprendizagem Automática

First test

First Test


  • Covers lectures 1-12 (this one included).
  • The next session (lectures 13 and 14) is not covered by the first test.
  • The session of October 30 is for revisions.
  • You may consult one handwritten A4 sheet (with identification).
  • Note: the exam will be scored in two independent parts.

Aprendizagem Automática

Summary

Ensemble methods

Summary

  • Bagging: reduce variance by averaging
    • Useful for models with large variance
    • Useful for unstable models; with stable models the bootstrapped classifiers are too correlated
  • Boosting: reduce bias with a linear combination of weak classifiers
    • Useful for combining weak classifiers (large bias)
    • Note: the base classifier must support weighting the training samples

Further reading

  • Alpaydin, Sections 17.6, 17.7
  • Marsland, Chapter 7
  • Bishop, Sections 14.2, 14.3