Aprendizagem Automática
Decisions
Ludwig Krippahl
Decisions
Summary
 Bayesian Learning
 Maximum Likelihood vs Maximum A Posteriori
 Decisions and costs
Decisions
Bayesian Learning
Bayesian Learning
Bayesian vs Frequentist probabilities
 To find parameters in some cases (E.g. regression, logistic regression) we maximized the likelihood:
$$\hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(x^t,y^t;\theta)$$
 Rewriting as conditional probabilities, and since $p(x^t)$ is constant:
$$\prod_{t=1}^{n}p(x^t,y^t) = \prod_{t=1}^{n}p(y^tx^t)\times\prod_{t=1}^{n}p(x^t) \qquad \hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(y^tx^t;\theta)$$
 Under a frequentist interpretation, probability is the frequency in the limit of infinite trials.
 Vector $\theta$ is unknown but not a random variable.
Bayesian Learning
Bayesian vs Frequentist probabilities
 Under a bayesian interpretation, probability is a measure of knowledge and uncertainty and $\theta$ can be seen as another random variable with its own probability distribution
 Given prior $p(\theta)$ and sample $S$, update posterior $p(\thetaS)$:
$$p(\thetaS) = \frac{p(S\theta)p(\theta)}{p(S)}$$
 where $p(S)$ is the marginal probability of $S$ (the evidence ) and $p(S\theta)$ is the likelihood of $\theta$
$$p(\thetaS) = \frac{p(S\theta)p(\theta)}{p(S)} \Leftrightarrow p(\thetaS) = \frac{\prod \limits_{t=1}^{n}p(y^tx^t,\theta)p(\theta)}{p(S)}$$
Bayesian Learning
Bayesian vs Frequentist probabilities
 Since $p(S)$ is generally unknown and constant, we approximate the posterior with the Maximum A Posteriori (MAP) estimate:
$$\hat{\theta}_{MAP} = \underset{\theta} {\mathrm{arg\ max}}\ \prod \limits_{t=1}^{n} p(y^tx^t,\theta)p(\theta)$$
 ML and MAP are similar but with a significant difference:
$$\hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(y^tx^t;\theta)$$
 Treating the parameters as a probability distribution leads naturally to regularization due to the inclusion of the prior probability distribution of the parameters $p(\theta)$
 (e.g. Bayesian logistic regression)
Bayesian Learning
Computing priors
 Uninformative Priors: the prior probability has little impact on the posterior, and MAP becomes similar to ML
 In some cases, a uniform distribution can suffice.
 In other cases, we need different distributions. E.g. line slope on linear regression
 We may also want to include prior information about the parameters
 Often results in probability distributions for which we have no analytical expression for expected values
 Bayesian learning generally requires numerical sampling methods (Monte Carlo), which can make it computationally more demanding
 But we can explicitly use prior probability distributions instead of adhoc regularization
Decisions
Decisions and costs
Decisions and costs
Measuring error
 So far, the loss functions we used were all measures or error
 But sometimes, the error may not be the best loss function
Loss functions
 Suppose we have the joint probability distributions $P(x,C_1)$ and $P(x,C_2)$
 We also have a classifier that classifies an example as $C_2$ if $x>\hat{x}$ or $C_1$ otherwise
Decisions and costs
 Errors depend on the choice of $\hat{x}$
Decisions and costs
 Red and green: $C_2$ misclassified; Blue: $C_1$ misclassified
Decisions and costs
 Minimizing the misclassification rate is equivalent to maximizing the probability of x corresponding to the predicted class
 This can be done by choosing $\hat{x}$ such that
$$P(C_1x)>P(C_2x) \ \ for\ \ x<\hat{x}$$
$$P(C_2x)>P(C_1x) \ \ for\ \ x>\hat{x}$$
Decisions and costs
 Minimizing classification error:
Decisions and costs
 Suppose $C_1$ is cancer patient and $C_2$ is healthy. It may be more costly to mistake $C_1$ for $C_2$ than viceversa.
 We can consider the following loss matrix:

Predict cancer 
Predict healthy 
Has cancer 
0 
5 
Is healthy 
1 
0 
 Now we classify minimizing this loss function:
$$\sum \limits_{k} L_{k,j} p(C_kx)$$
Decisions and costs
 Minimizing classification error:
Decisions and costs
 Taking loss into account:
Decisions and costs
 Intuition:Multiplying by misclassification cost:
Decisions and costs
Utility and Loss
 Utility: decision literature often mentions a utility function instead of a loss function
 The idea is the same, but maximize instead of minimize
 We do not want to simply minimize the probability of being wrong
 We want to minimize the loss (or maximize the utility) of the decision
 This requires taking into account costs and benefits of alternatives
Decisions and costs
Decision confidence
 Rejection option
 We may prefer not to risk a mistake and postpone decision in some cases
 Misclassification often occurs when probabilities are similar
 We can reject classification in those cases (e.g. warn user)
$$p(C_kx)\leq \phi \qquad \forall k$$
Decisions and costs
 Rejecting classification below 0.7
Decisions
Summary
 Bayesian interpretation
 MAP vs ML: importancen of priors
 Decision: misclassification, cost, rejection
Further reading
 Alpaydin, Chapter 3 up to 3.5
 Bishop, Section 1.5