Deep Learning with Python

Book by the author of Keras, F. Chollet, about what deep learning is and how to apply it to common tasks. Very well written and easy to follow, it focuses on practical teaching rather than theory. I read it two years ago and, as the field is moving very fast, the content is a bit dated; best to wait for the second edition in Q2 2020.
95%
Intro
- By 2016 Deep Learning had replaced Support Vector Machines and other methods for most perceptual problems
- Deep Learning needs nearly no feature engineering anymore
- Kaggle competitions are now won with Deep Learning for perception problems and gradient boosting (XGBoost) for structured data
Basics
- The gradient is the generalization of the derivative to functions of multi-dimensional inputs
- Tensors are often 2D (ID, features), 4D for images (ID, H, W, channels), 5D for video (ID, frames, H, W, channels)
- Learning backpropagation by hand is unnecessary now that TensorFlow can compute the gradient function directly by symbolic differentiation
- Main goal is to minimize the loss function with an optimizer. For each mini-batch, the optimizer computes new weights from the gradient of the current loss.
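A minimal sketch of the mini-batch loop described above, assuming a one-feature linear model and mean squared error. The function name, learning rate and toy data are illustrative choices, not taken from the book; a real framework computes the gradients automatically.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch loss with respect to w and b (written by hand here).
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Optimizer step: move the weights against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(20)]      # inputs in [0, 1.9]
ys = [3.0 * x + 1.0 for x in xs]      # noiseless targets from y = 3x + 1
w, b = sgd_linear_fit(xs, ys)
print(round(w, 2), round(b, 2))       # should land close to 3 and 1
```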
MNIST can be solved in 10 lines with about 97% accuracy!
Types of layers to use
Dense - fully connected
Sequence - recurrent, LSTM
Image - convolution
Model is usually represented as a directed acyclic graph of layers
Normalize test data with statistics computed on the training data to avoid leaking information
Small dataset –> Use (iterated) K-fold validation
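A rough sketch of the last two points, assuming a single numeric feature held in a plain list. The helper names are hypothetical. The iterated variant would additionally shuffle the data and repeat the whole procedure several times, which is not shown here.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if fold == k - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx

def normalize_with_train_stats(train_values, other_values):
    """Scale both sets using the mean and std of the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant feature
    def scale(values):
        return [(v - mu) / sigma for v in values]
    return scale(train_values), scale(other_values)

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
for train_idx, val_idx in k_fold_indices(len(data), k=4):
    train = [data[i] for i in train_idx]
    val = [data[i] for i in val_idx]
    # Validation values never contribute to the normalization statistics.
    train_n, val_n = normalize_with_train_stats(train, val)
    print(val_idx, [round(v, 2) for v in val_n])
```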
Fundamentals of ML
- Supervised learning
- Most prevalent for now
- Given a set of input/output pairs, predict outputs for new inputs
- Need big annotated datasets
Where DL shines the most
Unsupervised learning
Given a set of data, find what is interesting
Bread and butter of analytics
Dimensionality reduction and clustering often used
Self-supervised learning
Special type of supervised learning where labels are generated with heuristics
Autoencoders
E.g. predicting next frame in a video, next word in a sequence
Reinforcement learning
Agent-based
Actions lead to environment changes
Reward function judges changes
Agent optimizes to get the best reward
E.g. models beating video games
How-to
- Define the problem
- Define the metric to optimize
- Define an evaluation protocol
- Prepare Data
- Pick a base model architecture that beats a trivial baseline, e.g. random guessing
- Make the model overfit
- Regularize (L1, L2, dropout) and tune hyperparameters
- Maybe try other model architectures
Computer Vision
- Skipped as nothing new on CNNs for me
Text and sequences
- 1D CNN when global order does not matter, i.e. a pattern means the same wherever it appears (translation invariance)
- RNN when order matters, e.g. time series analysis
- Bag-of-words and n-grams only for shallow models
- DL with its multiple layers can learn long sequences without the need for feature engineering
- Using pre-trained embeddings for text is rarely a good idea unless training data is scarce
- Word embeddings
- Dense vectors (256, 512 or 1,024 dimensions) are a way to condense one-hot encodings (20k+ dimensions)
- Need to be learned from data, with the aim that the distance between two words reflects how close their meanings are
- Very hard to find a universal one; Word2vec is OK, but best to learn a new embedding for each problem
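To make the one-hot vs. dense contrast concrete, here is a small sketch. The three-word vocabulary and the 2-dimensional vectors are made up for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import math

vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Sparse encoding: one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 2-dimensional embedding table; in practice these values are learned.
embedding = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors of distinct words are always orthogonal: no notion of closeness.
print(cosine(one_hot("king"), one_hot("queen")))
# Dense vectors can place related words closer together than unrelated ones.
print(cosine(embedding["king"], embedding["queen"]),
      cosine(embedding["king"], embedding["apple"]))
```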
Functional API
- Enables more topologies such as multiple inputs, multiple outputs, residual and Inception-type networks…
- The only requirement is that the layers form a directed acyclic graph (DAG)
- Residual connections help fight vanishing gradients and representational bottlenecks
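A toy illustration of the vanishing-gradient point, assuming every layer is a scalar function with the same small local derivative. This simplification is mine, not the book's. With a residual connection a layer computes x + f(x), so its derivative is 1 + f'(x) and the product over many layers no longer collapses toward zero.

```python
def gradient_through_stack(layer_derivative, depth, residual):
    """Derivative of the output w.r.t. the input of a `depth`-layer scalar stack.

    Each layer is assumed to have the same constant local derivative.
    A residual layer computes x + f(x), hence a local derivative of 1 + f'(x).
    """
    per_layer = (1.0 + layer_derivative) if residual else layer_derivative
    return per_layer ** depth

plain = gradient_through_stack(0.1, depth=10, residual=False)
skip = gradient_through_stack(0.1, depth=10, residual=True)
print(plain, skip)  # the plain stack's gradient is vanishingly small, the residual one is not
```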
Advanced ML
- Callbacks to print status, stop early, change optimizer parameters during training…
- TensorBoard to visualize model performance in detail
- Batch normalization keeps data at mean 0 and variance 1; this is usually done on input data, but nothing guarantees it after each layer –> layers.BatchNormalization(axis=a), a=-1 (the default) except for Conv2D with channels-first data where a=1
- New approach using the selu activation and the lecun_normal initializer for self-normalizing NNs, only working on Dense layers so far
- Move towards SeparableConv2D (depthwise separable convolution), as it is cheaper than and as good as Conv2D; it is the basis of the Xception model
- Hyper-parameter space is made of discrete choices, so it is not differentiable and hard to tune; Hyperopt and Hyperas for Python can help
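A bare-bones sketch of the normalization step behind the batch normalization point above, assuming a batch stored as a list of rows. The function name is my own. The actual Keras layer also learns a scale and an offset per feature and keeps moving averages for inference; none of that is shown here.

```python
from statistics import mean, pvariance

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of a batch to mean 0 and variance close to 1."""
    n_features = len(batch[0])
    columns = [[row[j] for row in batch] for j in range(n_features)]
    mus = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return [
        # eps avoids a division by zero when a feature is constant in the batch.
        [(row[j] - mus[j]) / (variances[j] + eps) ** 0.5 for j in range(n_features)]
        for row in batch
    ]

activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
for row in batch_normalize(activations):
    print([round(v, 3) for v in row])
```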
Ensembling
- Nearly always better, especially with models based on different approaches, as they capture different parts of the latent information
- Use averages weighted by each model's validation accuracy; Nelder-Mead optimisation can be used to choose the weights.
- DNN + Trees (Random forest or gradient boosting) is a great combination of Deep + Wide
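A small sketch of weighted averaging, assuming each model outputs one probability per sample and weights are simply proportional to validation accuracy. The numbers are invented; searching for better weights (e.g. with Nelder-Mead) is left out.

```python
def weighted_ensemble(predictions, val_scores):
    """Average per-model predictions, weighting each model by its validation score."""
    total = sum(val_scores)
    weights = [s / total for s in val_scores]
    n_samples = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        for i in range(n_samples)
    ]

# Hypothetical probabilities from three models on four samples.
preds = [
    [0.9, 0.2, 0.6, 0.4],   # e.g. a dense network
    [0.8, 0.1, 0.7, 0.5],   # e.g. a gradient-boosted tree model
    [0.6, 0.4, 0.5, 0.3],   # e.g. a weaker model
]
val_accuracy = [0.90, 0.88, 0.70]
print([round(p, 3) for p in weighted_ensemble(preds, val_accuracy)])
```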
Trees
- Work best on structured data
- Random forests –> an ensemble of uncorrelated weak predictors gives a strong predictor; works by resampling the input data with replacement, e.g. [1,2,3,3,4,5,5] and [1,2,2,2,4,5], and randomizing the features that each tree can choose from
- Gradient boosting –> start with one tree, compute the loss and its gradients, fit a second tree to correct them, and so on until happy with the results.
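A compact sketch of the gradient boosting loop, assuming a single numeric feature, squared loss and depth-1 trees (stumps). With squared loss the negative gradient is just the residual, so each new stump is fitted to the current residuals. All names and settings are illustrative; libraries such as XGBoost or LightGBM do far more.

```python
def fit_stump(xs, targets):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Boost stumps on squared loss; residuals play the role of negative gradients."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
print([round(model(x), 1) for x in xs])
```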
Libraries for Kaggle, the simpler the winner
- Keras now 40%
- LightGBM by Microsoft, leaf-wise tree growth, faster than XGBoost?
- XGBoost, level-wise tree growth
- TensorFlow
Generative approaches
- Text generation
- Reuse a sequence model: it gives a softmax over all possible outputs, and a temperature parameter controls the randomness of sampling, from random to predictable
Build a network whose input is the first part of a sequence and whose output is the rest
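A short sketch of how a temperature can reshape a softmax output before sampling, assuming all probabilities are strictly positive. The function and variable names are illustrative.

```python
import math

def reweight_distribution(probs, temperature=1.0):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a probability distribution."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    # Renormalize so the result sums to 1 again.
    return [e / total for e in exps]

softmax_output = [0.7, 0.2, 0.1]
print([round(p, 3) for p in reweight_distribution(softmax_output, 0.5)])  # more predictable
print([round(p, 3) for p in reweight_distribution(softmax_output, 2.0)])  # more random
```

Sampling the next token from the reweighted distribution, rather than always taking the most likely one, is what lets the generated text vary.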
Style transfer
3 different losses (content, style, total variation) combined as a weighted average
Use optimisation over the pixels: iterate on the input image by gradient descent to change the output
DNN is used to compute the loss here
Good results when the style is space-invariant (texture-like) and the input has simple shapes
Ultimately, one could generate lots of good examples and learn the function as a filter with a fairly simple CNN
Deep Dreams
Same idea as style transfer, but the image is changed by running a CNN backward (gradient ascent on the input) to maximize activations of what it has learned before, such as cats, dogs, buildings and anything else in ImageNet
VAE
Learn a latent space from some inputs
Then pick a new point in that space and generate the corresponding output
Tends to work best when the axes are well-defined parameters, with continuous change along them
A basic encoder-decoder does not lead to interesting results; VAEs work on distributions instead
The VAE encoder maps each input to the mean and variance of a distribution; a latent point is sampled from that distribution and the decoder turns it into the output
z = mean + small_epsilon * exp(log var), with small_epsilon a random tensor of small values
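A minimal sketch of the sampling step, assuming the encoder outputs a mean and a log-variance per latent dimension. Note one deliberate difference from the formula in the notes above: the standard deviation of a distribution with log-variance `lv` is `exp(0.5 * lv)`, which is the commonly used form and the one shown here. Names are illustrative.

```python
import math
import random

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + std * epsilon, with epsilon ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(z_mean, z_log_var)
    ]

rng = random.Random(0)
# Hypothetical encoder outputs for a 2-D latent space.
z = sample_latent([0.5, -1.0], [math.log(0.04), math.log(0.04)], rng)
print([round(v, 3) for v in z])  # a point near the mean, std 0.2 per dimension
```

Keeping the randomness in `epsilon` is what lets gradients flow through the mean and variance during training.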
GAN
Similar base idea to VAE
Much more unpredictable output than VAE, no continuous structure of latent space!
Use of generators
Notoriously hard to train
Generator and discriminator battle each other; training seeks an equilibrium rather than an optimal solution
Lots of gotchas, see code in book e.g. learning rate decay, gradient clipping
Future of DL
- DNNs can only do very local generalization for now
- DNNs are often trained from scratch today; they will become reusable parts of larger programs
- Merging of DNNs and program synthesizers, with use of non-differentiable networks with loops
- Algo (formal, reasoning) vs. Geometric (informal, pattern matching) modules
- Extreme generalization: learn from very little or no data at all, like humans
- How to progress: arXiv, Kaggle, practice, practice, practice