- Deep learning (a very brief introduction)
- Problem of backpropagation
- Autoencoders
- Stacked denoising autoencoders

- Shallow networks: an ANN with a single hidden layer can approximate any function arbitrarily well
- We can adjust the weights of each hidden neuron to fit a small segment of the function
- Another way of thinking about this is that the hidden layer projects the input space into an arbitrarily high-dimensional space
- This easily leads to overfitting, forcing us to select features to reduce the number of parameters
- Deep networks: with multiple hidden layers we can solve this problem
- Each layer can transform the input space without increasing the dimensions too much (or even decreasing)
- Eventually, we can find a good representation for the final layer.
- This is also how we solve complex problems, with layers of abstraction. E.g. hardware, drivers, kernel, OS API, applications, user interface

- Deep neural networks are also what we have in our brains

- Retina cells find colour and brightness
- Ganglia extract edges and orientation
- Visual cortex matches patterns
- Motor cortex generates screaming and running

- Unfortunately, backpropagation on sigmoid neurons does not work well in deep networks (nor is it what brains use)

- Example: classify with one hidden layer

- Hidden layer transforms data set

- Harder problem, one hidden layer

- Cannot transform adequately

- We could solve this problem with more neurons on the hidden layer
- However, that would increase the dimensionality and the overfitting. So let's try adding another hidden layer of the same size (2 neurons); a minimal sketch of this architecture follows below
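- A minimal PyTorch sketch of the two-hidden-layer architecture described above. The layer sizes follow the 2-2-2-1 structure from the slides; the data set here is a placeholder (random 2-D points labelled by a circle), not the one in the figures:

```python
import torch
import torch.nn as nn

# Placeholder data: 2-D points, label = 1 if inside the unit circle
X = torch.randn(200, 2)
y = (X.pow(2).sum(dim=1) < 1.0).float().unsqueeze(1)

# Two hidden layers of 2 sigmoid neurons each, plus a sigmoid output neuron
model = nn.Sequential(
    nn.Linear(2, 2), nn.Sigmoid(),   # first hidden layer
    nn.Linear(2, 2), nn.Sigmoid(),   # second hidden layer, same size
    nn.Linear(2, 1), nn.Sigmoid(),   # output neuron
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.BCELoss()

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```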

- Two hidden layers
- The vanishing gradient increases training time and causes instability in the minimization

- Takes much longer to train
- The second layer must wait for the first layer

- Three hidden layers

- Backpropagation on sigmoid (or similar) neurons: gradients vanish (illustrated in the sketch after the list below)

- This problem is usually solved with numerical techniques:
- Other activation functions (rectified linear units)
- Use higher order derivatives
- Use momentum and good initialization
- Or pre-training the network layer by layer
- Unsupervised learning for pre-training the network
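- A small NumPy sketch illustrating the vanishing gradient (the layer width, depth, and random weights are arbitrary, chosen only for the demonstration): each sigmoid layer scales the backpropagated gradient by $\sigma'(z) \le 0.25$, so its norm typically shrinks geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 8, 10
weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
           for _ in range(n_layers)]

# Forward pass through the sigmoid layers
activations = [rng.normal(size=width)]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass: watch the gradient norm shrink layer by layer
grad = np.ones(width)  # pretend dLoss/dOutput = 1
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1.0 - a))   # chain rule through one sigmoid layer
    print(np.linalg.norm(grad))
```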

- An autoencoder is an ANN that receives an input $\mathbf{x}\in [0,1]^d$, outputs $\mathbf{z}\in [0,1]^d$ and encodes a hidden representation $\mathbf{y}\in [0,1]^{d'}$
- E.g. an MLP with $d$ inputs and outputs and a hidden layer with $d'$ neurons.
- Mitchell's autoassociator is an example of an autoencoder.
- 8 inputs
- 3 hidden neurons
- 8 output neurons
- Trained to reproduce the eight one-hot patterns: 10000000 ... 00000001

- Autoencoder, 784-1000-500-250-2 (Hinton, Salakhutdinov, 2006)
- MNIST, 28x28 pixels, A: PCA, B: Autoencoder

- Typically, autoencoders are trained to minimize the reconstruction error, which can be defined as the squared error: $$L(\mathbf{x}, \mathbf{z}) = ||\mathbf{x}-\mathbf{z}||^2$$
- But can be defined in other ways (e.g. cross entropy)
- The dimension of the hidden representation can be lower than, the same as, or higher than the input's
- Overfitting is always a problem; a minimal training sketch follows below
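- A minimal PyTorch sketch of an autoencoder matching the definition above ($d$ inputs and outputs, $d'$ hidden neurons, squared reconstruction error). The sizes mirror Mitchell's 8-3-8 autoassociator; the optimizer and number of epochs are just illustrative choices:

```python
import torch
import torch.nn as nn

d, d_hidden = 8, 3                      # as in Mitchell's 8-3-8 autoassociator

encoder = nn.Sequential(nn.Linear(d, d_hidden), nn.Sigmoid())   # x -> y
decoder = nn.Sequential(nn.Linear(d_hidden, d), nn.Sigmoid())   # y -> z

X = torch.eye(d)                        # the eight one-hot patterns 10000000 ... 00000001

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)
loss_fn = nn.MSELoss()                  # squared reconstruction error ||x - z||^2

for epoch in range(5000):
    optimizer.zero_grad()
    z = decoder(encoder(X))
    loss = loss_fn(z, X)
    loss.backward()
    optimizer.step()
```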

- We can reduce overfitting by adding noise
- The noise discourages the network from memorizing fine details, giving a more robust representation of the examples
- E.g. set a random subset of inputs to 0 every time an example is presented to the network (sketched below)
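- A minimal sketch of this masking noise (the 30% corruption level is an arbitrary choice for illustration):

```python
import torch

def corrupt(x, p=0.3):
    """Set a random subset of the inputs to 0, drawn anew for each presentation."""
    mask = (torch.rand_like(x) > p).float()
    return x * mask
```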

- We can train a denoising autoencoder using the original data
- Then we discard the output layer, and use the hidden representation as input to the next autoencoder
- This way we can train each autoencoder, one at a time, with unsupervised learning. Each one re-encodes the hidden representation of the previous one
- The network can then be fine-tuned as an MLP with backpropagation. This can be a supervised-learning step if we are dealing with classification (a compact sketch of the whole procedure follows below)
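- A compact PyTorch sketch of this layer-wise procedure: greedy unsupervised pre-training of one denoising autoencoder per layer, then stacking the encoders and fine-tuning with a supervised head. The data, labels, layer widths, corruption level, and training lengths are all placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def corrupt(x, p=0.3):
    return x * (torch.rand_like(x) > p).float()      # masking noise

X = torch.rand(1000, 784)            # placeholder data (e.g. flattened 28x28 images)
y = torch.randint(0, 10, (1000,))    # placeholder class labels

sizes = [784, 500, 250, 30]          # illustrative layer widths
encoders = []
inputs = X

# 1) Greedy layer-wise pre-training: one denoising autoencoder per layer
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(d_out, d_in), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for epoch in range(50):
        opt.zero_grad()
        loss = F.mse_loss(dec(enc(corrupt(inputs))), inputs)  # reconstruct the clean input
        loss.backward()
        opt.step()
    encoders.append(enc)             # keep the encoder, discard the decoder
    inputs = enc(inputs).detach()    # hidden representation feeds the next autoencoder

# 2) Stack the encoders, add a supervised head, and fine-tune with backpropagation
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(50):
    opt.zero_grad()
    loss = F.cross_entropy(model(X), y)
    loss.backward()
    opt.step()
```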

- Less need for engineering features
- Better performance (with enough data) for complex problems
- Adaptable architecture and knowledge transfer

- Requires more computation power and data than classical ML
- Result is not easy to understand or explain

- It can also be a bit disconcerting...

Source: Social Media Today

- Deep learning: better at dealing with large data sets
- The problem with backpropagation stumped researchers for years
- Autoencoders for pre-training:
- Unsupervised learning of each layer
- Stack them all and fine-tune
- Modern approach: other activation functions (ReLU)

- https://www.tensorflow.org (TensorFlow, Google)
- https://pytorch.org/ (PyTorch, Facebook, based on Torch)