7. Naïve Bayes classifier

Naïve Bayes

Summary

• Bayes Rule
• Bayes Classifier
• Naïve Bayes classifier
• Comparing classifiers

Bayes' Rule

Bayes' Rule

Sum rule and marginal probability

• Given two random variables $X$ and $Y$
• $$p(X=x_i) = \sum\limits_{j=1}^{N}p(X=x_i,Y=y_j)$$
• This is the sum rule, giving the marginal probability

Bayes' Rule

Product rule and joint probability

• The conditional probability of $Y = y_j$ given that $X = x_i$ is the fraction of $p(X = x_i,Y=y_j)$ in $p(X=x_i)$
• $$p(Y = y_j | X = x_i) = \frac{p(X = x_i,Y=y_j)}{p(X=x_i)}$$ $$p(X = x_i,Y=y_j) = p(Y = y_j | X = x_i)p(X=x_i)$$

Summarizing:

$$\begin{aligned}\text{sum rule:} \qquad & p(X) = \sum\limits_{j=1}^{N}p(X,Y_j)\\ \text{product rule:} \qquad & p(X, Y) = p(Y | X)p(X)\end{aligned}$$

Bayes' Rule

Bayes' rule

• Since $p(Y,X)=p(X,Y)$, applying the product rule
• $$\begin{aligned} p(Y,X)&=p(X,Y) \Leftrightarrow \\ p(Y | X)p(X)&=p(X | Y)p(Y) \Leftrightarrow \\ p(Y|X) &= \frac{p(Y)p(X|Y)}{p(X)} \end{aligned}$$

Bayes' Rule

Frequentist interpretation

• Probability is frequency over infinite trials
• Bayes rule simply relates different frequency estimates:
• $$p(Y|X) = \frac{p(Y)p(X|Y)}{p(X)}$$

Bayesian interpretation

• Probability measures (rational) certainty about truth of hypotheses
• Bayes' rule gives posterior probability from prior and evidence:
• $$p(H|E) = \frac{p(H)p(E|H)}{p(E)}$$

Bayes' Rule

For classification

• We can consider the conditional probability of an example with features $x$ belonging to class $c$ to be: $$p(C=c|X=x)=\frac{p(C=c)p(X=x|C=c)}{p(X=x)}$$
• Frequentist interpretation:
• Frequencies over infinite trials
• Bayesian interpretation:
• Confidence on hypothesis given the data
• Either way, we have to estimate these probabilities from the data
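
• As a small illustrative calculation (the numbers are made up): if 10% of examples belong to class $c$, and the feature value $x$ occurs in 80% of the examples of class $c$ but only in 30% of the remaining examples, then
• $$p(C=c|X=x)=\frac{0.1\times 0.8}{0.1\times 0.8+0.9\times 0.3}=\frac{0.08}{0.35}\approx 0.23$$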

Bayes Classifier

Bayes Classifier

To classify, find most probable class:

$$p(C=c|X=x)=\frac{p(C=c)p(X=x|C=c)}{p(X=x)}$$
• (probable in the Bayesian sense)
• Since $p(X=x)$ is independent of $c$, simplify to:
• $$p(C=c |X=x)\propto{p(C=c)p(X=x|C=c)}$$
• And since
• $$p(C=c)p(X=x|C=c)=p(C=c,X=x)$$
• all we need is the joint distribution of the classes and the feature value combinations.

Bayes Classifier

To classify, find most probable class:

$$C^{Bayes} = \underset{c \in \{0,1,...,N\}} {\mathrm{argmax}} p(C=c, X=x)$$
• The Bayes Classifier is the ideal classifier
• If we know the joint distribution of classes and feature combinations, this classifier minimizes the expected error.
• (That's the problem)

Bayes Classifier

Example: predict diabetes

• Questionnaires with 20 yes/no questions (food, exercise, etc..)
• Total of $2^{20}\sim$ 1 million combinations
• We need (many) millions of examples to estimate $P(C,X)$
• The full Bayes Classifier is not practical
• We never have enough data to estimate the joint probability for all combinations of features

Naïve Bayes Classifier

Naïve Bayes Classifier

Assume conditional independence given class

• Example: the times at which two people arrive home are not independent variables, because a strike affects both.
• But if we know that there was a strike, these variables become independent.
• This makes it easier to estimate the probabilities

Applying the product rule to $p(C,X)$

$$\begin{aligned} p(C_k, x_1, \ldots , x_n) = \; & p(C_k)\,p(x_1|C_k)\,p(x_2|C_k,x_1) \cdots \\ & p(x_n|C_k,x_1,x_2,\ldots,x_{n-1}) \end{aligned}$$
• But conditional independence given the class means that $$p(x_n|C_k,x_1,x_2,...,x_{n-1}) = p(x_n|C_k)$$

Naïve Bayes Classifier

So it all simplifies to:

$$p(C_k, x_1, ... , x_n) = p(C_k)\prod \limits_{j=1}^{n}p(x_j|C_k)$$
• Taking the logarithm:
• $$\ln p(C_k, x_1, ... , x_n) = \ln p(C_k)+\sum \limits_{j=1}^{n}\ln p(x_j|C_k)$$

The Naïve Bayes Classifier is:

$$C^{Naïve\ Bayes} = \underset{k \in \{0,1,...,K\}} {\mathrm{argmax}} \ln p(C_k)+\sum \limits_{j=1}^{n}\ln p(x_j|C_k)$$

Naïve Bayes Classifier

$$\underset{k \in \{0,1,...,K\}} {\mathrm{argmax}} \ln p(C_k)+\sum \limits_{j=1}^{n}\ln p(x_j|C_k)$$

To train the Naïve Bayes Classifier:

• For each class, find the log probability distribution of features
• For each class, find the log of the prior probability

To classify:

• Find the class for which the sum of the feature log probabilities, plus the log of the prior, is greatest

Naïve Bayes Classifier

Example: predict diabetes

• Questionnaires with 20 yes/no questions (food, exercise, etc..)
• Split into healthy and diabetic
• Count yes/no fraction for each question in each class
• To predict: for each class, add the log of the class fraction (the prior) to the sum of the log fractions for the observed answers, and pick the class with the larger total (see the sketch below)
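
• A minimal sketch of this procedure, assuming a binary answer matrix X (one row per person, 1 = yes) and a label vector y (1 = diabetic); the names, and the smoothing of the counts (covered later), are illustrative additions:

import numpy as np

def train_nb(X, y, alpha=1.0):
    model = {}
    for c in (0, 1):
        Xc = X[y == c]
        # smoothed fraction of "yes" answers for each question in this class
        p_yes = (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)
        model[c] = (np.log(Xc.shape[0] / X.shape[0]),   # log prior
                    np.log(p_yes), np.log(1 - p_yes))
    return model

def predict_nb(model, X):
    scores = [log_prior + X @ log_yes + (1 - X) @ log_no
              for log_prior, log_yes, log_no in model.values()]
    return np.argmax(np.vstack(scores), axis=0)   # 0 = healthy, 1 = diabetic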

NB: example 1

NB, continuous

• Data set with two continuous features and two classes

NB, continuous

Finding the distributions

• Parametric method: assume some distribution (e.g. Normal) and compute the parameters (e.g. mean and standard deviation):
• $$p(x_j|C_k) = \frac{1}{\sigma_k\sqrt{2\pi}}e^{-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}}$$
• Non-Parametric method: do not assume distribution. E.g. KDE.

Parametric vs Non-parametric

• Parametric models: completely defined by the chosen parameters
• Non-parametric models also need to keep the data itself (e.g. k-NN, KDE).
• (Naïve Bayes can be either)
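
• As a small illustration of the difference (array names and bandwidth are assumptions), both approaches estimate $p(x_j|C_k)$ from the training values of one feature in one class:

import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

def log_density_parametric(x_train, x_new):
    # parametric: keep only the mean and standard deviation of a Normal
    mu, sigma = x_train.mean(), x_train.std()
    return norm.logpdf(x_new, loc=mu, scale=sigma)

def log_density_kde(x_train, x_new, h=1.0):
    # non-parametric: the KDE must keep the training points themselves
    kde = KernelDensity(bandwidth=h).fit(x_train.reshape(-1, 1))
    return kde.score_samples(x_new.reshape(-1, 1))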

NB, continuous

• In our case, we'll choose a KDE
• KDE for one feature, both classes
• KDE for both features, both classes
• $$P(x_1,x_2|red) = P(x_1|red)\,P(x_2|red)$$
• $$P(x_1,x_2|blue) = P(x_1|blue)\,P(x_2|blue)$$
• We need to adjust the kernel width
• Cross-validation (10-fold) to choose the kernel width
• Final result ($h=1.8$)
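
• A minimal sketch of this continuous Naïve Bayes using scikit-learn's KernelDensity (array names and the bandwidth value are assumptions; the bandwidth would be chosen by the 10-fold cross-validation above):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_kde_nb(X, Y, h):
    # one KDE per (class, feature) pair, plus the log priors
    kdes, log_priors = {}, {}
    for c in np.unique(Y):
        Xc = X[Y == c]
        log_priors[c] = np.log(Xc.shape[0] / X.shape[0])
        kdes[c] = [KernelDensity(bandwidth=h).fit(Xc[:, [j]])
                   for j in range(X.shape[1])]
    return kdes, log_priors

def predict_kde_nb(kdes, log_priors, X):
    # class with the largest log prior plus sum of per-feature log densities
    labels = sorted(kdes)
    scores = np.vstack([log_priors[c] +
                        sum(k.score_samples(X[:, [j]])
                            for j, k in enumerate(kdes[c]))
                        for c in labels])
    return np.array(labels)[np.argmax(scores, axis=0)]
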
NB: example 2

NB, categorical

Mushroom edibility

• Data set with mushroom features and class (poisonous, edible)
• From http://archive.ics.uci.edu/ml/datasets/Mushroom
1. cap-shape:                bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?:                 bruises=t,no=f
5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
6. gill-attachment:          attached=a,descending=d,free=f,notched=n
7. gill-spacing:             close=c,crowded=w,distant=d
8. gill-size:                broad=b,narrow=n
9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
10. stalk-shape:              enlarging=e,tapering=t
11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
...
21. population:               abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

NB, categorical

Data set on mushroom edibility (poisonous, edible)

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
[...]
• Note: for real applications, you can use the implementation of Naïve Bayes in the Scikit-learn library
• For the assignment you will implement your NB classifier.
• For the assignment, with continuous features, use KDE
• In this example we have categorical features, so we just compute the logarithms of the fractions
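
• For real applications, a minimal sketch with scikit-learn's CategoricalNB on the integer-encoded feature matrix built in the next slides (the train/test variable names are assumptions):

from sklearn.naive_bayes import CategoricalNB

clf = CategoricalNB(alpha=1.0)      # alpha=1 is the same smoothing used later
clf.fit(feat_vals_train, classes_train)
predictions = clf.predict(feat_vals_test)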

NB, categorical

List feature values, adding "?"

def get_features():
    # lines holds the attribute descriptions (e.g. "1. cap-shape: bell=b,conical=c,...")
    features = []
    for lin in lines:
        ft_vals = '?'                      # '?' marks missing values
        fragments = lin.split('=')
        for frag in fragments[1:]:
            ft_vals = ft_vals + frag[0]    # keep only the one-letter code after each '='
        features.append(ft_vals)
    return features

1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
[...]
21. population: abundant=a,clustered=c,numerous=n, scattered=s...
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,...

NB, categorical

List feature values, adding "?"


In : get_features()
Out:
['?bcxfks', '?fgys', '?nbcgrpuewy', '?tf', '?alcyfmnps', '?adfn',
'?cwd', '?bn', '?knbhgropuewy', '?et', '?bcuezr?', '?fyks', '?fyks',
'?nbcgopewy', '?nbcgopewy', '?pu', '?nowy', '?not', '?ceflnpsz',
'?knbhrouwy', '?acnsvy', '?glmpuwd']

NB, categorical

• Load data and shuffle rows

import numpy as np

def load_data(lines, features, class_codes):
    # name and arguments assumed: lines are the data rows, features comes from
    # get_features(), and class_codes holds the class letters (e.g. 'ep')
    feat_vals = np.zeros((len(lines),22)).astype(int)   # to store value indexes
    classes = np.zeros(len(lines))
    for row,lin in enumerate(lines):
        s = lin.replace(',','').strip()
        classes[row] = class_codes.index(s[0])           # first letter is the class
        for column,fv in enumerate(s[1:]):
            feat_vals[row,column] = features[column].index(fv)
    ixs = list(range(feat_vals.shape[0]))                # shuffle rows and labels together
    np.random.shuffle(ixs)
    return feat_vals[ixs,:],classes[ixs]

• Note: take care not to scramble features and class labels

NB, categorical

• Build "histogram" with $\ln p(x_i|C_c)$
• (count frequencies)
• What to do if some feature does not appear? $p(x_i|C_c)=0$.
• $$\hat{p}(x_j = k) = \frac{count(k)+\alpha}{N+\alpha d}$$
• (we'll use $\alpha = 1$)

def make_hists(data,features):
    # per-feature log-probability tables with Laplace smoothing (alpha = 1)
    hists = []
    for feat in features:
        hists.append(np.ones(len(feat)))      # start every count at 1
    for row in range(data.shape[0]):
        for column in range(data.shape[1]):
            hists[column][data[row,column]] += 1
    for ix in range(len(hists)):
        hists[ix] = np.log(hists[ix]/float(data.shape[0]+len(features[ix])))
    return hists

NB, categorical

Split train and test sets

• We'll do stratified sampling to keep the same proportion of classes (poisonous and edible) in the training and test sets
• No need for cross-validation; we are not adjusting parameters
• We could use cross-validation instead of a separate test set if we had less data
• You can use train_test_split from sklearn.model_selection
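
• For reference, the scikit-learn equivalent would be along these lines (using the feat_vals and classes arrays loaded above); the manual stratified split used in this example follows:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    feat_vals, classes, test_size=0.5, stratify=classes)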

def split_data(features,test_fraction):
    # stratified split; uses the feat_vals and classes arrays loaded above
    edible = feat_vals[classes==0,:]
    poison = feat_vals[classes==1,:]
    e_test_points = int(test_fraction*edible.shape[0])
    e_train = edible[e_test_points:,:]
    e_test = edible[:e_test_points,:]
    p_test_points = int(test_fraction*poison.shape[0])
    p_train = poison[p_test_points:,:]
    p_test = poison[:p_test_points,:]
    return e_train,p_train,e_test,p_test

NB, categorical

Classify examples

def classify(e_class,e_log,p_class,p_log,feat_mat):
    # e_class,p_class: per-feature log-probability tables; e_log,p_log: log priors
    classes = np.zeros(feat_mat.shape[0])
    for row in range(feat_mat.shape[0]):
        e_sum = e_log
        p_sum = p_log
        for column in range(feat_mat.shape[1]):
            e_sum = e_sum + e_class[column][int(feat_mat[row,column])]
            p_sum = p_sum + p_class[column][int(feat_mat[row,column])]
        if e_sum < p_sum:
            classes[row] = 1            # predicted poisonous
    return classes

NB, categorical

• The complete function

def do_bayes():
    features = get_features()
    e_train,p_train,e_test,p_test = split_data(features,0.5)
    e_hists = make_hists(e_train,features)
    p_hists = make_hists(p_train,features)
    tot_len = e_train.shape[0]+p_train.shape[0]
    e_log = np.log(float(e_train.shape[0])/tot_len)    # log prior, edible
    p_log = np.log(float(p_train.shape[0])/tot_len)    # log prior, poisonous
    c_e = classify(e_hists,e_log,p_hists,p_log,e_test)
    c_p = classify(e_hists,e_log,p_hists,p_log,p_test)
    errors = sum(c_e)+sum(1-c_p)
    error_perc = float(errors)/(len(c_e)+len(c_p))*100
    print(f'{errors:.0f} errors; {error_perc:.2f}% error rate')
    print('\tE\tP')
    print(f'E\t{sum(1-c_e):.0f}\t{sum(1-c_p):.0f}')
    print(f'P\t{sum(c_e):.0f}\t{sum(c_p):.0f}')

191 errors; 4.70% error rate
E       P
E       2091    178
P       13      1780

NB, categorical

Confusion matrix

• Eating poisonous mushrooms is more costly...
Mushroom class
Prediction    Edible    Poisonous
Edible         2091        178
Poisonous        13       1780

Naïve Bayes

Discriminative classifiers

• Classifiers such as logistic regression compute a conditional probability for the class given the features.
• This is directly what we need to discriminate examples.

Generative classifiers

• The Naïve Bayes classifier is a generative classifier
• These compute the joint probability distribution, allowing us to generate synthetic examples from the trained hypothesis
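
• As an illustration, a minimal sketch of sampling a synthetic example from the mushroom model above (the helper name is an assumption; it reuses the per-feature log-probability tables and log priors):

import numpy as np

def generate_example(e_hists, p_hists, e_log, p_log):
    # pick the class from the prior, then each feature value from p(x_j|class)
    hists = e_hists if np.random.rand() < np.exp(e_log) else p_hists
    return [np.random.choice(len(h), p=np.exp(h)) for h in hists]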

Naïve Bayes

• Learning from the points on the left, we can generate the points on the right

Comparing classifiers

Comparing classifiers

• Logistic Regression, k-NN, Naïve Bayes
• Test set, 45 points
• Logistic Regression, 10 errors; k-NN, 6 errors; Naïve Bayes, 1 error

Comparing classifiers

Approximate normal test

• Assume errors normally distributed around true error rate:
• $$\frac{X - Np_0}{\sqrt{Np_0(1-p_0)}}\approx Z$$
• Test if intervals are disjoint with some confidence level (e.g. 95%)
• $$X-1.96\sigma \le Np_0 \le X+1.96\sigma \quad \sigma \approx \sqrt{N \frac{X}{N}(1-\frac{X}{N})}$$
• Logistic Regression: $10\pm 5.4$
• k-NN: $6\pm 3.5$
• Naïve Bayes: $1\pm 1.9$
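
• A small sketch of this computation (the helper name is an assumption; N is the test set size and X the observed number of errors):

import numpy as np

def error_interval(X, N, z=1.96):
    # approximate 95% interval for the expected number of errors N*p0
    sigma = np.sqrt(N * (X / N) * (1 - X / N))
    return X - z * sigma, X + z * sigma

print(error_interval(10, 45))   # Logistic Regression (10 errors in 45)
print(error_interval(1, 45))    # Naïve Bayes (1 error in 45)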

Comparing classifiers

Better alternative: McNemar's test

• $e_{01}$ and $e_{10}$: counts of examples wrong in one classifier and correct in the other
• $$\frac{(|e_{01}-e_{10}|-1)^2}{e_{01}+e_{10}}\approx \chi^2_1$$
• McNemar's test, 95% confidence, significant difference if:
• $$\frac{(|e_{01}-e_{10}|-1)^2}{e_{01}+e_{10}}\ge 3.84$$
• Logistic Regression vs kNN = 0.8
• kNN vs Naïve Bayes = 2.3
• Naïve Bayes vs Logistic Regression = 7.11
• Conclusion: Naïve Bayes likely better than Logistic Regression
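
• A small sketch of McNemar's test from two prediction vectors and the true labels (names are assumptions):

import numpy as np

def mcnemar(pred_a, pred_b, y):
    e01 = np.sum((pred_a != y) & (pred_b == y))   # wrong in A, correct in B
    e10 = np.sum((pred_a == y) & (pred_b != y))   # correct in A, wrong in B
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

# difference significant at 95% confidence if mcnemar(...) >= 3.84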

Processing Data

Processing Data

Data can come in many forms

• Target values: Classification vs Regression
• Continuous or discrete features
• Different distributions

Main goal: predict for new examples

• Train on some data, validate and test on other data

• Should we compute scaling parameters (normalization, standardization) only on the training set or training folds, or once on the whole data set?

Processing Data

Rescaling parameters

• In theory: these are parameters computed from data
• We should compute on training data and apply to validation or test
• However, statistics like the mean converge rapidly
• In practice it makes little difference, so we do it only once
• (In general...)
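
• A minimal sketch of that procedure (array names are assumptions): compute the statistics on the training data and apply them to validation or test data:

means = X_train.mean(axis=0)
stdevs = X_train.std(axis=0)
X_train_std = (X_train - means) / stdevs
X_test_std = (X_test - means) / stdevs   # reuse the training statistics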

Processing Data

Rescaling parameters

• Example: 2000 points, split 1000/1000, compute the mean of each half
• The means differ by about 0.014 of a standard deviation
• Computing on each training fold or on the whole set gives nearly the same result

Processing Data

Splitting the data

• But things may be more complicated.
• Example: 2000 values, money spent on transportation

Processing Data

Splitting the data

• Data comes from 6 different cities
• Example: 2000 values, money spent on transportation
• How should we split the data and compute scaling parameters?

Processing Data

• Case 1:
• Examples can come from any city
• We want to predict for new examples in these cities
• We can shuffle the data and, if the data set is large enough, rescale everything once
• For cross-validation or testing, it is better to stratify
• Keep the same proportion of data from each city

Processing Data

• Case 2:
• We know the city each example comes from
• We want to predict for new examples in other cities
• Different cities have different distributions
• We need to consider the generalization from some cities to others
• We should split by city (and eventually shuffle before training)
• Rescaling should be done on the training data only, because the statistics differ between cities
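
• A minimal sketch of a by-city split (the arrays X, y and city are assumptions), using scikit-learn's GroupShuffleSplit so that no city appears in both training and test sets:

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_ix, test_ix = next(splitter.split(X, y, groups=city))
X_train, X_test = X[train_ix], X[test_ix]
y_train, y_test = y[train_ix], y[test_ix]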

Processing Data

What to take from this

• Do not memorize recipes
• This may lead you to incorrect results
• Understand what you are doing
• Main focus of this course

Summary


Summary

• Bayes Classifier and Naïve Bayes Classifier
• Parametric and non-parametric models
• Generative and Discriminative classifiers
• Comparing classifiers
• Processing data: understand what you are dealing with