Aprendizagem Automática

2. Introduction to Supervised Learning

Ludwig Krippahl

Supervised Learning

Summary

  • Supervised learning, basic concepts
  • Regression and classification
  • Fitting curves with Least Mean Squares

Supervised Learning

Basic concepts

Supervised Learning

Basic idea

  • We have a set of labelled data $$\left\{(x^1,y^1), ..., (x^n,y^n)\right\}$$
  • We assume there is some function $$F(X) : X \rightarrow Y$$
  • The goal of Supervised Learning is to find (from the examples) $$g(\theta, X) : X \rightarrow Y$$ such that $g(\theta, X)$ approximates $F(X)$
  • Supervised because we can compare $g(\theta, X)$ to $Y$
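A minimal sketch of this setup in code (the target function, noise level, and sample size below are illustrative assumptions, not part of the lecture):

    import numpy as np

    rng = np.random.default_rng(0)

    # A hypothetical "true" function F(x), unknown to the learner
    def F(x):
        return np.sin(2 * np.pi * x)

    # A labelled training set {(x^1, y^1), ..., (x^n, y^n)}, observed with noise
    x = rng.uniform(0, 1, size=20)
    y = F(x) + rng.normal(0, 0.1, size=20)

    # A parametric hypothesis g(theta, x): here a straight line, theta = (slope, intercept)
    def g(theta, x):
        return theta[0] * x + theta[1]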

Supervised Learning

Training (Supervised learning)

  • Ideally, we want to approximate $F(X) : X \rightarrow Y$ for all $X$
  • But, for now, we'll consider only our Training Set
  • $$\left\{(x^1,y^1), ..., (x^n,y^n)\right\}$$
  • Training Set
    The data we use to adjust the parameters $\theta$ in our model.
    • More generally: data used to choose a hypothesis
  • Training Error or Empirical Error
    The error on the training set for a given choice of $\theta$.
    • (Sample Error in Mitchell 1997)

Supervised Learning

Our ML problem for today:

  • Goal: Predict the $Y$ values in our training set
  • Performance: minimise training error
  • Data: $\left\{(x^1,y^1), ..., (x^n,y^n)\right\}$

Classification and Regression

  • In Classification $Y$ is discrete.
    • Examples: SPAM detection, predict if mushrooms are poisonous
    • Find a function that splits the data into different sets
  • In Regression $Y$ is continuous.
    • Examples: predicting trends, prices, purchase probabilities
    • Find a function that approximates $Y$

Supervised Learning

Regression

Regression

Regression example

  • Polynomial fitting: a simple example of linear regression.
  • $$y = \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n + \theta_{n+1}$$
  • Example: we have a set of $(x,y)$ points and want to fit the best line: $y = \theta_1 x + \theta_2$
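A minimal sketch of such a line fit, on hypothetical $(x, y)$ values; np.polyfit with degree 1 returns the least-squares estimates of $\theta_1$ (slope) and $\theta_2$ (intercept):

    import numpy as np

    # Hypothetical (x, y) points, roughly on a line
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

    # Degree-1 least-squares fit: returns [theta_1, theta_2] for y = theta_1*x + theta_2
    theta_1, theta_2 = np.polyfit(x, y, 1)
    print(theta_1, theta_2)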

Regression

How to find the best line?

Regression

Finding the best line

  • Assume $y$ is a function of $x$ plus some error: $$ y = F(x) + \epsilon $$
  • We want to approximate $F(x)$ with some $g(x,\theta)$.
  • Assuming $\epsilon \sim N(0,\sigma^2)$ and $g(x,\theta) \approx F(x)$, then: $$p(y|x)\sim\mathcal{N}(g(x,\theta),\sigma^2) $$
  • Given $\mathcal{X}=\{ x^t,y^t \}_{t=1}^{n}$ and knowing that $p(x,y)=p(y|x)p(x)$

$$p(X,Y)=\prod_{t=1}^{n}p(x^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$
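A small simulation of this assumed generative model (the function $F$, the noise level $\sigma$, and the sample size are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 0.3   # assumed noise level

    # Hypothetical true function F(x)
    def F(x):
        return 2.0 * x + 1.0

    # Sample the assumed generative model: y = F(x) + eps, with eps ~ N(0, sigma^2)
    x = rng.uniform(0, 1, size=1000)
    y = F(x) + rng.normal(0, sigma, size=1000)

    # If g(x, theta) matches F(x), the residuals y - g(x, theta) follow N(0, sigma^2)
    residuals = y - F(x)
    print(residuals.mean(), residuals.std())   # close to 0 and sigma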

Regression

  • The probability of $(X,Y)$ given some $g(x,\theta)$ is the likelihood of parameters $\theta$:
$$l(\theta|\mathcal{X})=\prod_{t=1}^{n}p(x^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$

Likelihood

  • $(x,y)$ are randomly sampled from all possible values.
  • But $\theta$ is not a random variable.
  • Find the $\theta$ for which the data is most probable
  • In other words, find the $\theta$ of maximum likelihood

Regression

Maximum likelihood

$$l(\theta|\mathcal{X})=\prod_{t=1}^{n}p(x^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$
  • First, take the logarithm (same maximum)
$$\mathcal{L}(\theta|\mathcal{X})=log\left(\prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)\right)$$
  • We ignore $p(X)$, since it's independent of $\theta$
  • $$\mathcal{L}(\theta|\mathcal{X}) \propto log\left(\prod_{t=1}^{n}p(y^t|x^t)\right)$$
  • Replace the expression for the normal:
$$\mathcal{L}(\theta|\mathcal{X})\propto log\prod_{t=1}^{n}\frac{1}{\sigma \sqrt {2\pi } } e^{- [y^t - g(x^t|\theta)]^2 /2\sigma^2 }$$

Regression

Maximum likelihood

  • Replace the expression for the normal:
  • $$\mathcal{L}(\theta|\mathcal{X})\propto log\prod_{t=1}^{n}\frac{1}{\sigma \sqrt {2\pi } } e^{- [y^t - g(x^t|\theta)]^2 /2\sigma^2 }$$
  • Simplify: the log turns the product into a sum, and terms that do not depend on $\theta$ can be dropped, since they do not change the maximum:
  • $$\mathcal{L}(\theta|\mathcal{X})\propto \sum_{t=1}^{n}\left(-log\left(\sigma \sqrt{2\pi}\right) - \frac{[y^t - g(x^t|\theta)]^2}{2\sigma^2}\right)$$ $$\mathcal{L}(\theta|\mathcal{X})\propto -\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$

Regression

Maximum likelihood

$$\mathcal{L}(\theta|\mathcal{X})\propto -\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$
  • Max(likelihood) = Min(squared error):
$$E(\theta|\mathcal{X})=\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$
  • Note: the squared error is often written
$$E(\theta|\mathcal{X})=\frac{1}{2}\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$
  • But this is just for convenience: the factor $\frac{1}{2}$ cancels the 2 that appears when differentiating, and does not change where the minimum is.
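A quick numerical check of this equivalence, on hypothetical data: for a few candidate values of $\theta$, a higher Gaussian log-likelihood always pairs with a lower squared error (the data, noise level, and candidate $\theta$ values are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    sigma = 0.2                                   # assumed noise level
    x = rng.uniform(0, 1, size=50)                # hypothetical training data
    y = 2.0 * x + 1.0 + rng.normal(0, sigma, size=50)

    def g(x, theta):                              # linear model g(x|theta)
        return theta[0] * x + theta[1]

    def log_likelihood(theta):                    # Gaussian log-likelihood of theta
        r = y - g(x, theta)
        return np.sum(-0.5 * (r / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

    def squared_error(theta):                     # E(theta|X)
        return np.sum((y - g(x, theta)) ** 2)

    # Higher log-likelihood always comes with lower squared error
    for theta in [(1.5, 0.5), (2.0, 1.0), (2.5, 1.5)]:
        print(theta, log_likelihood(theta), squared_error(theta))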

Supervised Learning

Least Mean Squares Minimization

LMS

How to find the best line?

LMS

How to find the best line?

  • We find the parameters for $$g(x) = \theta_1 x + \theta_2$$ that minimise the squared error $$E(\theta|\mathcal{X})=\sum_{t=1}^{n} [y^t - g(x^t)]^2$$

Let's visualise this surface with respect to $\theta$
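A minimal sketch of that visualisation (the data, the grid ranges, and the output file name are assumptions for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, size=30)                 # hypothetical data
    y = 2.0 * x + 1.0 + rng.normal(0, 0.2, size=30)

    # Squared error E(theta) for the line g(x) = theta_1*x + theta_2,
    # evaluated on a grid of (theta_1, theta_2) values
    t1, t2 = np.meshgrid(np.linspace(0, 4, 100), np.linspace(-1, 3, 100))
    err = ((y[:, None, None] - (t1 * x[:, None, None] + t2)) ** 2).sum(axis=0)

    plt.contourf(t1, t2, err, levels=30)
    plt.xlabel('theta_1')
    plt.ylabel('theta_2')
    plt.colorbar(label='squared error')
    plt.savefig('error_surface.png')   # hypothetical output file
    plt.close()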

LMS

(Plots of the squared-error surface as a function of $\theta_1, \theta_2$.)

LMS

  • This allows us to find the best $\theta_1,\theta_2$ (not a very good model...)

Supervised Learning

Curves

Curves

Linear Regression

  • How to fit curves with something straight?
  • We can change the data:

    $\mathcal{X}_2=\{ x_1^t,x_2^t,y^t \}$, where $x_1 = x^2$ and $x_2 = x$

Curves

Linear Regression

  • Now we fit our new data set
  • $$\mathcal{X}_2=\{ x_1^t,x_2^t,y^t \}$$
  • With the (linear) model
  • $$y = \theta_1 x_1 + \theta_2 x_2 + \theta_3$$
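A minimal sketch of this transformation and fit, on hypothetical data, using ordinary least squares (np.linalg.lstsq) on the transformed columns:

    import numpy as np

    # Hypothetical (x, y) points lying roughly on a parabola
    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, size=30)
    y = 0.5 * x ** 2 - x + 0.2 + rng.normal(0, 0.05, size=30)

    # Transformed data set X2: columns x1 = x^2 and x2 = x, plus a column of
    # ones for the intercept theta_3
    X2 = np.column_stack([x ** 2, x, np.ones_like(x)])

    # Ordinary least squares on the transformed data:
    # y = theta_1*x1 + theta_2*x2 + theta_3
    theta, *_ = np.linalg.lstsq(X2, y, rcond=None)
    print(theta)   # should be close to [0.5, -1.0, 0.2]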

Curves

(Plot of the linear fit on the transformed data set.)

Curves

  • Then we project it back using $x_1 = x^2$ and $x_2 = x$

Curves

(Plot of the fitted curve projected back onto the original data.)

Curves

Linear Regression

  • This is equivalent to fitting a second-degree polynomial
  • $$y = \theta_1 x^2 + \theta_2 x + \theta_3$$
    
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Load the (x, y) points from the semicolon-separated file
    mat = np.loadtxt('polydata.csv', delimiter=';')
    x, y = (mat[:, 0], mat[:, 1])
    
    # Least-squares fit of a degree-2 polynomial to the data
    coefs = np.polyfit(x, y, 2)
    
    # Evaluate the fitted polynomial on a fine grid for plotting
    pxs = np.linspace(0, max(x), 100)
    poly = np.polyval(coefs, pxs)
    
    plt.figure(1, figsize=(12, 8), frameon=False)
    plt.plot(x, y, 'or')               # data points
    plt.plot(pxs, poly, '-')           # fitted curve
    plt.axis([0, max(x), -1.5, 1.5])
    plt.title('Degree: 2')
    plt.savefig('testplot.png')
    plt.close()
    

Curves

(Plot of the data and the degree-2 fit produced by the code above.)

Curves

Linear Regression

  • How to fit curves with something straight?
  • Important idea: we use something straight in higher dimensions

Supervised Learning

Curve More!

Curve more

Improving the fit

  • We can go to higher order polynomials. E.g. third degree:
$$y = \theta_1 x^3 + \theta_2 x^2 + \theta_3 x + \theta_4$$

Curve more

Improving the fit

  • E.g. fifth degree:
$$y = \theta_1 x^5 + \theta_2 x^4 + ... + \theta_5 x + \theta_6$$

Curve more

Improving the fit

  • E.g. degree 15:
$$y = \theta_1 x^{15} + \theta_2 x^{14} + ... + \theta_{15} x + \theta_{16}$$
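A minimal sketch of this progression, on hypothetical data: fitting increasing polynomial degrees and printing the training error, which keeps shrinking even when the model stops being reasonable (the data and the list of degrees are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 1, size=20)                       # hypothetical data
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=20)

    # The training (squared) error keeps decreasing as the degree grows,
    # even when the higher-degree fit is no longer a sensible model
    # (np.polyfit may warn that the degree-15 fit is poorly conditioned)
    for degree in (1, 2, 3, 5, 15):
        coefs = np.polyfit(x, y, degree)
        error = np.sum((y - np.polyval(coefs, x)) ** 2)
        print(degree, error)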

Curve more

Improving the fit

  • Degree 15 is probably not a good idea...

Curve more

Improving the fit

  • Degree 15 is probably not a good idea...
  • If we overfit our data, we increase the error outside the training set
  • How can we prevent getting carried away?

Next lecture: overfitting

Supervised Learning

Summary

Supervised Learning

Summary

  • Supervised learning: Classification and Regression
  • Linear regression: maximum likelihood and least mean squares
  • Polynomial regression is linear regression
    • (nonlinear transformation to higher dimensions)
  • But it can be too much of a good thing

Further reading

  • Bishop, Chapter 1
  • Alpaydin, Section 2.6
  • Marsland, Sections 1.4 and 2.4