- Bayesian Learning
- Maximum Likelihood vs Maximum A Posteriori
- Monte Carlo and computing prior probability distributions
- Decisions and costs

- To find parameters in some cases (e.g. regression, logistic regression) we maximized the likelihood: $$\hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(x^t,y^t;\theta)$$
- Rewriting as conditional probabilities, and since $p(x^t)$ does not depend on $\theta$: $$\prod_{t=1}^{n}p(x^t,y^t) = \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t) \qquad \hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(y^t|x^t;\theta)$$
- Under a frequentist interpretation, probability is the frequency in the limit of infinite trials.
- So $\theta$ is unknown but not a random variable.
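As a minimal sketch of maximum likelihood (using a hypothetical coin-flip dataset, not an example from the notes): for a Bernoulli likelihood $\prod_t \theta^{x^t}(1-\theta)^{1-x^t}$, setting the derivative of the log-likelihood to zero gives the sample mean as the ML estimate.

```python
import numpy as np

# Hypothetical data: 8 coin flips, 1 = heads.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# The log-likelihood of a Bernoulli parameter theta is
# sum_t [x_t log(theta) + (1 - x_t) log(1 - theta)];
# its maximizer is the fraction of heads.
theta_ml = flips.mean()
print(theta_ml)  # 0.75
```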

- Under a Bayesian interpretation, probability is a measure of knowledge and uncertainty, and $\theta$ can be seen as another random variable with its own probability distribution
- Given a prior $p(\theta)$ and a sample $S$, we update the posterior $p(\theta|S)$:
$$p(\theta|S) = \frac{p(S|\theta)p(\theta)}{p(S)}$$
- where $p(S)$ is the marginal probability of $S$ (the evidence) and $p(S|\theta)$ is the likelihood of $\theta$

- Since $p(S)$ is generally unknown and constant in $\theta$, we approximate the posterior with the Maximum A Posteriori (MAP) estimate:
$$\hat{\theta}_{MAP} = \underset{\theta} {\mathrm{arg\ max}}\ \prod \limits_{t=1}^{n} p(y^t|x^t,\theta)p(\theta)$$
- ML and MAP are similar but with a significant difference: $$\hat{\theta}_{ML} = \underset{\theta} {\mathrm{arg\ max}}\ \prod_{t=1}^{n}p(y^t|x^t;\theta)$$
- Treating the parameters as a probability distribution leads naturally to regularization due to the inclusion of the prior probability distribution of the parameters $p(\theta)$
- (e.g. Bayesian logistic regression)
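To make the ML/MAP difference concrete, here is a sketch using the same hypothetical coin-flip data with an assumed Beta(2, 2) prior (the Beta prior is a standard conjugate choice for a Bernoulli likelihood, not something prescribed by the notes). The prior acts as a regularizer, pulling the estimate toward 0.5.

```python
heads, n = 6, 8          # hypothetical data: 6 heads in 8 flips
a, b = 2.0, 2.0          # assumed Beta(a, b) prior, favoring theta near 0.5

# ML: mode of the likelihood alone.
theta_ml = heads / n                               # 0.75

# MAP: mode of likelihood x prior. For Beta(a, b) the posterior is
# Beta(heads + a, n - heads + b), whose mode is:
theta_map = (heads + a - 1) / (n + a + b - 2)      # 0.70
```

The MAP estimate (0.70) sits between the ML estimate (0.75) and the prior mode (0.5); with more data the likelihood dominates and the two estimates converge.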

- Uninformative priors: the prior probability has little impact on the posterior, and MAP becomes similar to ML
- In some cases, a uniform distribution can suffice.
- In other cases, we need different distributions (e.g. the slope of the line in linear regression)
- We may also want to include prior information about the parameters
- Often results in probability distributions for which we have no analytical expression for expected values
- Bayesian learning generally requires numerical sampling methods (Monte Carlo), which can make it computationally more demanding
- but we can explicitly use prior probability distributions instead of ad-hoc regularization
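As a minimal sketch of such a sampling method (a random-walk Metropolis sampler, one of the simplest Monte Carlo approaches; the target and prior are the hypothetical coin-flip posterior used above, with a Beta(2, 2) prior):

```python
import numpy as np

rng = np.random.default_rng(0)
heads, n = 6, 8  # hypothetical data

def log_post(theta):
    # Log posterior up to a constant: Bernoulli likelihood x Beta(2, 2) prior.
    if not 0 < theta < 1:
        return -np.inf
    return (heads * np.log(theta) + (n - heads) * np.log(1 - theta)
            + np.log(theta) + np.log(1 - theta))

# Random-walk Metropolis: propose a small step, accept with
# probability min(1, posterior ratio).
samples, theta = [], 0.5
for _ in range(20000):
    prop = theta + rng.normal(0, 0.1)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

# Posterior mean estimated from the samples (after burn-in);
# the exact Beta(8, 4) posterior mean is 8/12 = 0.667.
post_mean = np.mean(samples[2000:])
```

Here the posterior happens to have a closed form (Beta), which makes it easy to check the sampler; the point of Monte Carlo is that the same loop works when no analytical expression exists.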

- So far, the loss functions we used were all measures of error
- But sometimes, the error may not be the best loss function

- Suppose we have the joint probability distributions $P(x,C_1)$ and $P(x,C_2)$
- We also have a classifier that classifies an example as $C_2$ if $x>\hat{x}$ and as $C_1$ otherwise

- Errors depend on the choice of $\hat{x}$

- Red and green: $C_2$ misclassified; Blue: $C_1$ misclassified

- Minimizing the misclassification rate is equivalent to maximizing the probability of x corresponding to the predicted class
- This can be done by choosing $\hat{x}$ such that $$P(C_1|x)>P(C_2|x) \ \ for\ \ x<\hat{x}$$ $$P(C_2|x)>P(C_1|x) \ \ for\ \ x>\hat{x}$$
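A sketch of this criterion with hypothetical numbers: two 1-D Gaussian class-conditional densities with equal priors (the notes do not specify the distributions; these are illustrative assumptions). Where the posteriors cross is exactly where the joint densities cross, which gives the boundary $\hat{x}$.

```python
import numpy as np

# Assumed class-conditional densities: Gaussians at -1 and +1, equal priors.
def pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5, 5, 10001)
p1 = 0.5 * pdf(xs, -1.0, 1.0)   # p(x, C1)
p2 = 0.5 * pdf(xs, 1.0, 1.0)    # p(x, C2)

# P(C1|x) > P(C2|x) iff p(x, C1) > p(x, C2) (same denominator p(x)),
# so the boundary is where the two joints cross; here, by symmetry, x_hat = 0.
x_hat = xs[np.argmin(np.abs(p1 - p2))]
```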

- Minimizing classification error:

- Suppose $C_1$ is cancer patient and $C_2$ is healthy. It may be more costly to mistake $C_1$ for $C_2$ than vice-versa.
- We can consider the following loss matrix:

|            | Predict cancer | Predict healthy |
|------------|----------------|-----------------|
| Is cancer  | 0              | 5               |
| Is healthy | 1              | 0               |

- Now we classify by minimizing this loss function:

- Minimizing classification error:

- Taking loss into account:

- Intuition: multiplying by the misclassification cost:

- Utility: the decision literature often mentions a utility function instead of a loss function
- The idea is the same, but we maximize instead of minimize

- Rejection option: misclassification often occurs when class probabilities are similar
- We can reject classification in those cases (e.g. warn the user)

- Reject when no class is confident enough: $$p(C_k|x)\leq \phi \qquad \forall k$$

- Example: rejecting classification when all posteriors are below 0.7
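The rejection rule above can be sketched as a small helper (the function name and return convention are illustrative):

```python
import numpy as np

def classify_with_reject(posteriors, phi=0.7):
    """Return the index of the predicted class, or None to reject.

    Rejects when p(C_k|x) <= phi for every class k, i.e. when
    even the most probable class is not confident enough.
    """
    k = int(np.argmax(posteriors))
    return k if posteriors[k] > phi else None

print(classify_with_reject([0.9, 0.1]))    # 0 (confident)
print(classify_with_reject([0.55, 0.45]))  # None (rejected)
```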

- Bayesian interpretation
- MAP vs ML: importance of priors
- Decision: misclassification, cost, rejection

- Alpaydin, Chapter 3 up to 3.5
- Bishop, Section 1.5