Machine Learning

Deep Learning

Ludwig Krippahl

Deep Learning

Summary

  • Deep learning (a very brief introduction)
  • Problem of backpropagation
  • Autoencoders
  • Stacked denoising autoencoders

Deep Learning

Deep learning

Deep learning

  • Shallow networks: an ANN with a single hidden layer can approximate any function arbitrarily well
    • We can adjust the weights so that each hidden neuron fits a small segment of the function
    • Another way of thinking about this is that the hidden layer projects the input space into an arbitrarily higher-dimensional space.
    • This easily leads to overfitting, forcing us to select features to reduce the number of parameters.
  • Deep networks: with multiple hidden layers we can solve this problem (see the sketch below)
    • Each layer can transform the input space without increasing the dimension too much (or even while decreasing it)
    • Eventually, we can find a good representation for the final layer.
    • This is also how we solve complex problems in general, with layers of abstraction: e.g. hardware, drivers, kernel, OS API, applications, user interface
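  • A minimal sketch of this contrast, assuming scikit-learn is available; the toy dataset, layer sizes and iteration counts are illustrative choices, not values from the lecture:

# Sketch: shallow (one wide hidden layer) vs. deep (several narrow layers) MLPs.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Shallow: one hidden layer with many neurons (large projection, many parameters).
shallow = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=0)
# Deep: several small hidden layers, each transforming the previous representation.
deep = MLPClassifier(hidden_layer_sizes=(8, 8, 8), max_iter=2000, random_state=0)

for name, model in [("shallow", shallow), ("deep", deep)]:
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", model.score(X_te, y_te))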

Deep networks

  • Deep neural networks are also what we have in our brains


  • Retina cells find colour and brightness
  • Ganglia extract edges and orientation
  • Visual cortex matches patterns
  • Motor cortex generates screaming and running
  • Unfortunately, backpropagation on sigmoid neurons does not work well for deep networks (nor is it what the brain uses)

Machine Learning

Backpropagation

Backpropagation

  • Example: classification with one hidden layer

Backpropagation

  • Hidden layer transforms data set

Backpropagation

  • Harder problem, one hidden layer

Backpropagation

  • Cannot transform adequately

Backpropagation

  • Cannot transform adequately
  • We could solve this problem with more neurons in the hidden layer
  • However, that would increase the dimension and the risk of overfitting. So let's try adding another hidden layer of the same size (2 neurons) instead

Backpropagation

  • Two hidden layers
    • The vanishing gradient increases training time and causes instability in the minimization

Backpropagation

  • Takes much longer to train
    • The second layer must wait for the first layer

Backpropagation

  • Three hidden layers

Backpropagation

  • Backpropagation on sigmoid (or similar) neurons: gradients vanish
$$\begin{array}{rcl} \Delta w_{min}^j&=& - \eta \left( \sum\limits_{p} \frac{\partial E_{kp}^j}{\partial s_{kp}^j} \frac{\partial s_{kp}^j}{\partial net_{kp}^j}\frac{\partial net_{kp}^j}{\partial s_{in}^j} \right) \frac{\partial s_{in}^j}{\partial net_{in}^j}\frac{\partial net_{in}^j}{\partial w_{min}} \\ \\ &=& \eta \left(\sum\limits_p \delta_{kp} w_{mkp} \right) s_{in}^j(1-s_{in}^j)\, x_i^j =\eta\,\delta_{in}\, x_i^j \end{array}$$
    • The sigmoid derivative $s(1-s)$ is at most $1/4$, so each extra layer shrinks the propagated error and gradients in the early layers become very small (see the sketch below)
  • This problem is usually solved with numerical techniques:
    • Other activation functions (rectified linear units)
    • Using higher-order derivatives
    • Using momentum and good initialization
  • Or by pre-training the network layer by layer
    • Unsupervised learning is used to pre-train each layer
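  • A minimal numeric sketch of the vanishing gradient, assuming NumPy; the depth of 10 layers and the random weights are illustrative. It follows a chain of single sigmoid neurons and multiplies the per-layer factors $w\, s(1-s)$ that backpropagation accumulates:

# Sketch: the gradient factor through a chain of sigmoid neurons shrinks with depth.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
depth = 10
w = rng.normal(size=depth)   # one weight per layer: a chain of single sigmoid neurons

a = 0.5       # activation entering the chain
grad = 1.0    # gradient factor accumulated from the output side
for layer in range(depth):
    s = sigmoid(w[layer] * a)
    # Backpropagation multiplies in w * s * (1 - s) at each layer, and s(1-s) <= 0.25.
    grad *= w[layer] * s * (1.0 - s)
    a = s
    print(f"after layer {layer + 1:2d}: gradient factor = {grad:.3e}")

  • With rectified linear units the derivative of an active unit is 1, so the extra $s(1-s)\leq 1/4$ factor per layer disappears, which is one reason ReLUs alleviate the problem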

Machine Learning

Autoencoders

Autoencoders

  • An autoencoder is an ANN that receives an input $\mathbf{x}\in [0,1]^d$, outputs $\mathbf{z}\in [0,1]^d$ and encodes a hidden representation $\mathbf{y}\in [0,1]^{d'}$
    • E.g. an MLP with $d$ inputs and outputs and a hidden layer with $d'$ neurons.
  • Mitchell's autoassociator is an example of an autoencoder (sketched in code below):
    • 8 inputs
    • 3 hidden neurons
    • 8 output neurons
    • Trained to reproduce the one-hot patterns 10000000 ... 00000001
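  • A minimal sketch of the 8-3-8 autoassociator, assuming scikit-learn; using MLPRegressor as the autoencoder is an implementation choice for illustration, not the lecture's code:

# Sketch: an 8-3-8 autoencoder trained to reproduce the one-hot patterns.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.eye(8)  # the eight one-hot patterns 10000000 ... 00000001

ae = MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                  solver="lbfgs", max_iter=5000, random_state=0)
ae.fit(X, X)   # target = input: 8 patterns must be compressed into 3 hidden units

# Hidden representation y = sigmoid(x W1 + b1), computed from the learned weights.
hidden = 1.0 / (1.0 + np.exp(-(X @ ae.coefs_[0] + ae.intercepts_[0])))
print(np.round(hidden, 2))          # roughly a 3-unit code for each input
print(np.round(ae.predict(X), 2))   # reconstruction z, close to the one-hot inputs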

Autoencoders

  • Autoencoder with layers 784-1000-500-250-2 (Hinton, Salakhutdinov, 2006)
    • Figure: MNIST digits (28x28 pixels) projected to 2D; A: PCA, B: Autoencoder

Autoencoders

  • Typically, autoencoders are trained to minimize the reconstruction error, which can be defined as the squared error: $$L(\mathbf{x},\mathbf{z}) = ||\mathbf{x}-\mathbf{z}||^2$$
    • But it can be defined in other ways (e.g. cross entropy)
    • The dimension of the hidden representation can be lower than, the same as, or higher than the input
    • (overfitting is always a problem)
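  • A small sketch of the two reconstruction losses mentioned above, assuming NumPy; the example vectors are illustrative:

# Sketch: reconstruction losses for an autoencoder with x, z in [0, 1]^d.
import numpy as np

def squared_error(x, z):
    # L(x, z) = ||x - z||^2
    return np.sum((x - z) ** 2)

def cross_entropy(x, z, eps=1e-12):
    # Treats each component as a Bernoulli probability; z is clipped to avoid log(0).
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

x = np.array([1.0, 0.0, 0.0, 1.0])
z = np.array([0.9, 0.1, 0.2, 0.8])
print(squared_error(x, z), cross_entropy(x, z))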

Denoising Autoencoders

  • We can reduce overfitting by adding noise to the inputs
    • The network cannot overfit corrupted details, so it learns a more robust representation of the examples
    • E.g. set a random subset of the inputs to 0 each time an example is presented to the network (sketched below)
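  • A minimal sketch of this masking corruption, assuming NumPy; the 30% corruption level is an arbitrary illustrative value:

# Sketch: masking noise for a denoising autoencoder.
# The network is trained to reconstruct the clean x from the corrupted version.
import numpy as np

def mask_inputs(x, corruption=0.3, rng=None):
    """Set a random subset of the inputs to 0 (a fresh mask on every presentation)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) > corruption   # keep each input with prob. 1 - corruption
    return x * mask

x = np.array([0.8, 0.1, 0.9, 0.4, 0.7])
print(mask_inputs(x, corruption=0.3, rng=np.random.default_rng(0)))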

Autoencoders

Stacked Denoising Autoencoders

  • We can train a denoising autoencoder on the original data
  • Then we discard the output layer and use the hidden representation as the input to the next autoencoder
  • This way we can train each autoencoder, one at a time, with unsupervised learning; each one re-encodes the hidden representation of the previous one
  • The whole stack can then be fine-tuned as an MLP with backpropagation; this is a supervised learning step if we are dealing with classification (see the sketch below)
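  • A compact sketch of this greedy layer-wise pretraining followed by fine-tuning, assuming PyTorch (listed in the resources); layer sizes, noise level, learning rates and the random toy data are illustrative assumptions:

# Sketch: stacked denoising autoencoders, pretrained layer by layer, then fine-tuned.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(256, 64)                     # toy unlabeled data in [0, 1]^64
y = (X.mean(dim=1) > 0.5).long()            # toy labels for the fine-tuning step
sizes = [64, 32, 16]                        # input dim and hidden dims (illustrative)

encoders = []
H = X
for d_in, d_hid in zip(sizes[:-1], sizes[1:]):
    # One denoising autoencoder: corrupt the input, encode, decode, reconstruct the clean input.
    enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(d_hid, d_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(200):                              # unsupervised pretraining of this layer
        noisy = H * (torch.rand_like(H) > 0.3)        # masking noise
        loss = nn.functional.mse_loss(dec(enc(noisy)), H)   # reconstruct the clean input
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    H = enc(H).detach()                     # hidden representation feeds the next autoencoder

# Stack the pretrained encoders, add a classification head, and fine-tune with backprop.
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                        # supervised fine-tuning
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
print("train accuracy:", (model(X).argmax(dim=1) == y).float().mean().item())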

Deep learning

Advantages of Deep Learning

  • Less need for engineering features
  • Better performance (with enough data) for complex problems
  • Adaptable architecture and knowledge transfer

Disadvantages of Deep Learning

  • Requires more computation power and data than classical ML
  • Result is not easy to understand or explain

Deep learning

Deep Learning

Machine Learning

Summary

Summary

  • Deep learning: better at dealing with large data sets
  • The problem with backpropagation in deep networks stumped researchers for years
  • Autoencoders for pre-training:
    • Unsupervised learning of each layer
    • Stack them all and fine-tune
  • Modern approach: other activation functions (ReLU)

Resources (optional)

  • https://www.tensorflow.org (TensorFlow, Google)
  • https://pytorch.org/ (PyTorch, Facebook, based on Torch)

Next lectures: questions and revisions