Deep Learning Tutorial
Hung-yi Lee (李宏毅)

Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. (Deep learning trends at Google; source: SIGMOD / Jeff Dean.) This talk focuses on the basic techniques.

Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave

Lecture I: Introduction of Deep Learning

Outline of Lecture I
• Introduction of Deep Learning (let's start with general machine learning)
• Why Deep?
• "Hello World" for Deep Learning

Machine Learning ≈ Looking for a Function
• Speech Recognition: f(audio) = "How are you"
• Image Recognition: f(image) = "Cat"
• Playing Go: f(board position) = "5-5" (next move)
• Dialogue System: f("Hi") = "Hello" (what the user said → system response)

Image Recognition: Framework
A model is a set of candidate functions f1, f2, …. For a cat image, f1 may output "cat" while f2 outputs "monkey"; for a dog image, f1 may output "dog" while f2 outputs "snake". We therefore need a measure of the goodness of a function f.
Supervised learning: the training data gives function inputs (images) together with the desired function outputs (labels such as "monkey", "cat", "dog").
Step 1: define a set of functions (the model). Step 2: measure the goodness of each function on the training data. Step 3: pick the "best" function f*. Training uses the labelled data; testing applies f* to new images.

Three Steps for Deep Learning
Step 1: define a set of functions (a neural network); Step 2: goodness of function; Step 3: pick the best function. Deep Learning is so simple ……

Human Brains → Neural Network

Neuron: a simple function
z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b,  a = σ(z)
The weights w1 … wK and the bias b are the parameters; σ is the activation function. A common choice is the sigmoid function σ(z) = 1 / (1 + e^(−z)).

Neural Network
Different connections lead to different network structures. Each neuron can have its own weights and bias; all the weights and biases together are the network parameters θ.

Fully Connect Feedforward Network
Worked example: with input (1, −1), the first-layer neuron with weights (1, −2) and bias 1 gets z = 1·1 + (−1)·(−2) + 1 = 4, so σ(4) ≈ 0.98; the neuron with weights (−1, 1) and bias 0 gets z = −2, so σ(−2) ≈ 0.12. Propagating further gives (0.86, 0.11) and finally (0.62, 0.83). With input (0, 0), the layers give (0.73, 0.5), then (0.72, 0.12), then (0.51, 0.85).
Given parameters θ, the network is a function: f([1, −1]) = [0.62, 0.83] and f([0, 0]) = [0.51, 0.85]; both the input and the output are vectors. Given only the network structure, we have defined a function set.
Structure: input layer (x1 … xN) → Layer 1 → Layer 2 → ⋯ → Layer L → output layer (y1 … yM). The layers in between are hidden layers; "deep" means many hidden layers.

Output Layer (Option)
With an ordinary output layer yi = σ(zi), the outputs of the network can be any value and may not be easy to interpret. A softmax output layer instead computes
yi = e^(zi) / Σj e^(zj),
so 1 > yi > 0 and Σi yi = 1: the outputs form a probability distribution. Example: z = (3, 1, −3) gives e^z ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, ≈0).

Example Application
Handwriting digit recognition. Input: a 16 × 16 image, i.e. a 256-dim vector (ink → 1, no ink → 0). Output: a 10-dim vector in which each dimension represents the confidence of a digit; e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0") means the image is "2".
Example Application
• Handwriting Digit Recognition: the neural network is the machine that maps an image to "2". What is needed is a function whose input is a 256-dim vector and whose output is a 10-dim vector.
The network structure (input layer x1 … x256, hidden layers, output layer y1 "is 1" … y10 "is 0") is a function set containing the candidates for handwriting digit recognition. You need to decide the network structure so that a good function is inside your function set.

FAQ
• Q: How many layers? How many neurons for each layer? — Trial and error + intuition.
• Q: Can the structure be automatically determined?

Three Steps for Deep Learning
Step 1: define a set of functions (a neural network); Step 2: goodness of function; Step 3: pick the best function.

Training Data
• Preparing training data: images and their labels, e.g. "5", "0", "4", "1", "9", "2", "1", "3". The learning target is defined on the training data.

Learning Target
For an input image of "1", the target is that y1 has the maximum value; for an input image of "2", the target is that y2 has the maximum value.

Loss
Given a set of parameters, the loss of one example can be the distance between the network output and the target: for an image of "1" the target is (1, 0, …, 0), and the output (y1, …, y10) should be as close to it as possible. A good function should make the loss of all examples as small as possible.

Total Loss
For all R training examples, the total loss is L = Σ_{r=1}^{R} l_r, where l_r is the loss of the r-th example. Find the function in the function set that minimizes the total loss L; equivalently, find the network parameters θ* that minimize L.

How to pick the best function
Enumerating all possible parameter values is hopeless: θ = {w1, w2, …, b1, b2, …} contains millions of parameters (e.g. for speech recognition, 8 layers with 1000 neurons each already mean 1000 × 1000 = 10^6 weights between two adjacent layers).

Gradient Descent (one parameter w for illustration)
• Pick an initial value for w (randomly, or with RBM pre-training; random is usually good enough).
• Compute ∂L/∂w: if it is negative, increase w; if it is positive, decrease w.
• Update w ← w − η ∂L/∂w, where η is called the "learning rate".
• Repeat until ∂L/∂w is approximately zero, i.e. the update becomes tiny.

Gradient Descent (many parameters)
θ = {w1, w2, …}. At every step, compute the partial derivatives ∂L/∂w1, ∂L/∂w2, … at the current parameter values and update each parameter by −η times its own partial derivative. The vector of all these partial derivatives is the gradient.
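To make the update rule concrete, here is a minimal sketch of gradient descent on a single parameter in plain Python/NumPy (not the toolkit code used later in the tutorial); the toy loss function and learning rate are assumptions chosen only for illustration.

```python
import numpy as np

# Toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):            # dL/dw, computed analytically for this toy loss
    return 2.0 * (w - 3.0)

w = np.random.randn()   # pick an initial value for w (randomly)
eta = 0.1               # learning rate

for step in range(100):
    g = grad(w)
    w = w - eta * g     # w <- w - eta * dL/dw
    if abs(g) < 1e-6:   # stop when the gradient is approximately zero
        break

print(w)                # close to 3.0
```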
Each iteration recomputes every partial derivative at the new parameter values and applies the same update, so the whole parameter vector keeps moving along the negative gradient.

Gradient Descent (two-parameter picture)
Color: value of the total loss L over the (w1, w2) plane. Randomly pick a starting point, compute (∂L/∂w1, ∂L/∂w2), and move by (−η ∂L/∂w1, −η ∂L/∂w2). Hopefully we would reach a minimum ……

Difficulty
• Gradient descent never guarantees the global minimum: different initial points reach different minima, so you can get different results. There are some tips to help you avoid bad local minima.
Gradient descent is like playing Age of Empires: you cannot see the whole map, only the terrain right around you.

This is the "learning" of machines in deep learning …… Even AlphaGo uses this approach. People imagine something fancier; actually it is just gradient descent — I hope you are not too disappointed :p

Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w.
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html (demo developed by NTU student 周伯威)
Don't worry about ∂L/∂w — the toolkits will handle it.

Concluding Remarks
Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function. Deep Learning is so simple ……

Outline of Lecture I
Introduction of Deep Learning / Why Deep? / "Hello World" for Deep Learning

Deeper is Better?
Word error rates (%) from Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011:

Layers × Size  WER (%)     Layers × Size  WER (%)
1 × 2k         24.2
2 × 2k         20.4
3 × 2k         18.4
4 × 2k         17.8
5 × 2k         17.2        1 × 3772       22.5
7 × 2k         17.1        1 × 4634       22.6
                           1 × 16k        22.1

Not surprising: more parameters, better performance.

Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Then why a "deep" neural network rather than a "fat" one?

Fat + Short v.s. Thin + Tall
With the same number of parameters, which is better — a shallow, wide network or a deep, thin one? The table above answers it: the deep 5 × 2k and 7 × 2k networks beat single-hidden-layer networks of comparable or even larger size (1 × 3772, 1 × 4634, 1 × 16k). Why?

Analogy (this page is for readers with an EE background)
• Logic circuits consist of gates; a two-layer circuit of logic gates can represent any Boolean function, but using multiple layers of gates to build some functions is much simpler — fewer gates are needed.
• A neural network consists of neurons; a one-hidden-layer network can represent any continuous function, but using multiple layers of neurons to represent some functions is much simpler — fewer parameters, and hence less data(?).

Modularization
• Deep → modularization.
Suppose we train four image classifiers directly: girls with long hair, boys with long hair, girls with short hair, boys with short hair. The "boys with long hair" classifier is weak because it has only a few training examples.
• Deep → modularization: instead, first train basic classifiers for the attributes — boy or girl? long or short hair? Each basic classifier can have sufficient training examples.
Modularization can be trained by little data
• Deep → modularization: the "boy or girl?" and "long or short hair?" basic classifiers are shared by the following classifiers as modules, so the four fine-grained classifiers (girls/boys with long/short hair) can each be trained with little data.

Modularization
• Deep → modularization → less training data?
The modularization is automatically learned from data: the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build more complex classifiers, the next layer uses the second layer as modules, and so on.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833).

Outline of Lecture I
Introduction of Deep Learning / Why Deep? / "Hello World" for Deep Learning

Keras
If you want to learn Theano:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
TensorFlow or Theano: very flexible, but they need some effort to learn. Keras is an interface of TensorFlow or Theano: easy to learn and use, while still having some flexibility — and you can modify it if you can write TensorFlow or Theano.

Keras
• François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
• "Keras" means horn in Greek.
• Documentation: http://keras.io/
• Examples: https://github.com/fchollet/keras/tree/master/examples
(Thanks to 沈昇勳 for providing the figures.)

Example Application
• Handwriting Digit Recognition: the machine reads a 28 × 28 image and outputs "1". MNIST data: http://yann.lecun.com/exdb/mnist/ — the "hello world" of deep learning. Keras provides a data-set loading function: http://keras.io/datasets/

Keras network for this task: input 28 × 28 = 784 → fully connected layer of 500 neurons → fully connected layer of 500 neurons → softmax output y1 … y10.

Keras Step 3.1: Configuration — choose how to update the parameters, w ← w − η ∂L/∂w, e.g. with learning rate 0.1.
Keras Step 3.2: Find the optimal network parameters. Training data are numpy arrays: images of shape (number of training examples, 28 × 28 = 784) and labels of shape (number of training examples, 10).
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html

Keras: save and load models — http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing): case 1, score the model on a labelled test set; case 2, predict outputs for new inputs.

Keras
• Using GPU to speed up training
• Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
  import os
  os.environ["THEANO_FLAGS"] = "device=gpu0"

Live Demo
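Putting the pieces above together, here is a minimal "hello world" sketch in Keras. Layer and argument names follow the current Keras API and may differ slightly in older versions (e.g. nb_epoch instead of epochs); the preprocessing details are illustrative assumptions, not the exact demo code.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.datasets import mnist
from keras.utils import to_categorical

# Step 1: define a set of functions (the 784 -> 500 -> 500 -> 10 structure).
model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=784))
model.add(Dense(500, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

# Step 2 + Step 3.1: goodness of function and how to update the parameters.
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])

# Step 3.2: find the optimal network parameters from the MNIST data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255
x_test = x_test.reshape(-1, 784).astype('float32') / 255
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model.fit(x_train, y_train, batch_size=100, epochs=20)

# Testing: case 1, evaluate on labelled data; case 2, predict for new inputs.
score = model.evaluate(x_test, y_test)
pred = model.predict(x_test)
```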
Lecture II: Tips for Training DNN

Recipe of Deep Learning
Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function → a neural network. Then check: good results on training data? If not, go back and improve the training. If yes, check: good results on testing data? If not, that is overfitting; if yes, done.
Do not always blame overfitting: when a deeper model is worse on the testing data, it may simply be not well trained — check the training-data performance first.
Different approaches target different problems, e.g. dropout is for getting good results on testing data.
For good results on training data: choosing proper loss, mini-batch, new activation function, adaptive learning rate, momentum.

Choosing Proper Loss
For an input image of "1", the target is (1, 0, …, 0). Which loss is better for the 10-dim softmax output?
• Square error: Σ_{i=1}^{10} (yi − ŷi)²
• Cross entropy: −Σ_{i=1}^{10} ŷi ln yi
Both are 0 when the output equals the target.
Let's try it — testing accuracy: square error 0.11, cross entropy 0.84. The training curves show cross entropy decreasing steadily while square error gets stuck.
When using a softmax output layer, choose cross entropy: its total-loss surface is steep even far from the minimum, so gradient descent makes progress, whereas the square-error surface is very flat there.
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

Mini-batch
We do not really minimize the total loss!
• Randomly initialize the network parameters.
• Pick the 1st mini-batch (say x1, x31, …), compute L′ = l1 + l31 + ⋯, and update the parameters once.
• Pick the 2nd mini-batch (say x2, x16, …), compute L″ = l2 + l16 + ⋯, and update the parameters once.
• Continue until all mini-batches have been picked: that is one epoch. Then repeat the whole process.
With 100 examples per mini-batch and 20 mini-batches, the parameters are updated 20 times in one epoch. Note that the loss being minimized is different at each update, so the trajectory is unstable compared with full gradient descent (the colors in the slide represent the total loss).

Mini-batch is faster
Original gradient descent updates once after seeing all examples. The claim "mini-batch is faster per update" is not always the full story: with parallel computing, seeing all examples can take about the same time as seeing only one batch (for data sets that are not super large). Still, in one epoch mini-batch performs many updates instead of one, and empirically mini-batch has better performance.
Testing accuracy: mini-batch 0.84, no batch (full-batch gradient descent) 0.12 after the same number of epochs.
Shuffle the training examples for each epoch, so the mini-batches differ between epochs. Don't worry — this is the default in Keras.
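To connect these two tips to the Keras sketch above, here is how the loss choice and the mini-batch size appear in the code; `model`, `x_train` and `y_train` are assumed to be the ones defined in the earlier sketch.

```python
# Cross entropy works much better than square error with a softmax output:
model.compile(loss='categorical_crossentropy',   # rather than loss='mse'
              optimizer='sgd', metrics=['accuracy'])

# batch_size controls the mini-batch size; epochs is the number of passes
# over all mini-batches. Keras shuffles the examples each epoch by default.
model.fit(x_train, y_train, batch_size=100, epochs=20)

# batch_size=1 updates once per example (stochastic gradient descent);
# batch_size=len(x_train) is full-batch gradient descent ("no batch").
```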
Hard to get the power of Deep …
With naive training, deeper does not always give better results, even on the training data. Testing accuracy: 3 layers 0.84, 9 layers 0.11.

Vanishing Gradient Problem
In a deep sigmoid network, the parameters near the input have much smaller gradients: they learn very slowly and stay almost random, while the parameters near the output have larger gradients, learn fast, and converge — on top of essentially random lower layers.
Intuition (an informal way to think about the derivative): ∂l/∂w ≈ Δl/Δw. A large change Δw in an early layer is squashed by every sigmoid it passes through, so it produces only a small change at the output and in the loss; hence the gradient for early-layer weights is small.
In 2006, people used RBM pre-training to get around this. In 2015, people use ReLU.

ReLU
• Rectified Linear Unit (ReLU): a = z when z > 0, a = 0 when z ≤ 0.
Reasons: 1. fast to compute; 2. biological reason; 3. it behaves like an infinite number of sigmoids with different biases; 4. it handles the vanishing gradient problem [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15].
With ReLU, neurons whose input is negative output exactly 0 and can be removed, leaving a thinner, effectively linear network in which the gradients do not keep shrinking from layer to layer.
Let's try it — testing accuracy with 9 layers: sigmoid 0.11, ReLU 0.96.

ReLU – variants
• Leaky ReLU: a = z for z > 0, a = 0.01 z otherwise.
• Parametric ReLU: a = z for z > 0, a = αz otherwise, where α is also learned by gradient descent.

Maxout
• ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML'13]: the pre-activations of a group of units are computed as usual and the maximum of the group is the output. Example with two elements per group: a layer computes (5, 7, −1, 1) and outputs max(5, 7) = 7 and max(−1, 1) = 1; the next layer computes (1, 2, 4, 3) and outputs 2 and 4. You can have more than 2 elements in a group.
• The activation function of a maxout network can be any piecewise linear convex function; the number of pieces depends on how many elements are in a group (2 elements → 2 pieces, 3 elements → 3 pieces).

Learning Rates
Set the learning rate η carefully: if it is too large, the total loss may not decrease after each update; if it is too small, training is too slow.
• Popular & simple idea: reduce the learning rate by some factor every few epochs. At the beginning we are far from the destination, so use a larger learning rate; after several epochs we are close, so reduce it — e.g. 1/t decay, η^t = η / √(t + 1).
• A single learning rate cannot be one-size-fits-all: give different parameters different learning rates.

Adagrad
Original: w ← w − η ∂L/∂w.
Adagrad: w ← w − η_w ∂L/∂w, with the parameter-dependent learning rate
η_w = η / √( Σ_{i=0}^{t} (g^i)² ),
where g^i is the ∂L/∂w obtained at the i-th update: the denominator is the summation of the squares of the previous derivatives.
Example: if w1's past derivatives are 0.1, 0.2, … its learning rate is η/√(0.1²), then η/√(0.1² + 0.2²), …; if w2's are 20, 10, … its learning rate is the much smaller η/√(20²), then η/√(20² + 10²), ….
Observations: 1. the learning rate becomes smaller and smaller for all parameters as training proceeds; 2. parameters with smaller derivatives get a larger learning rate, and vice versa.

Not the whole story ……
• Adagrad [John Duchi, JMLR'11]
• RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv'12]
• "No more pesky learning rates" [Tom Schaul, arXiv'12]
• AdaSecant [Caglar Gulcehre, arXiv'14]
• Adam [Diederik P. Kingma, ICLR'15]
• Nadam: http://cs229.stanford.edu/proj2015/054_report.pdf
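A minimal NumPy sketch of the Adagrad update described above, on a toy two-parameter loss; the loss, the learning rate and the epsilon term are assumptions chosen only to illustrate the formula.

```python
import numpy as np

# Toy loss L(w) = w1^2 + 10 * w2^2, so the two parameters see very
# different gradient magnitudes.
def gradient(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

eta = 0.5
w = np.array([1.0, 1.0])
sum_sq_grad = np.zeros(2)     # running sum of g^2, one entry per parameter
eps = 1e-8                    # avoids division by zero at the first update

for t in range(100):
    g = gradient(w)
    sum_sq_grad += g ** 2
    # Adagrad: w <- w - ( eta / sqrt(sum of squared past derivatives) ) * g
    w -= eta / (np.sqrt(sum_sq_grad) + eps) * g

print(w)   # both parameters move toward 0; the one with larger derivatives
           # automatically received a smaller effective learning rate
```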
Momentum
Hard to find the optimal network parameters: gradient descent is very slow on a plateau (∂L/∂w ≈ 0), and can get stuck at a saddle point or a local minimum (∂L/∂w = 0).
In the physical world a rolling ball has momentum — how about putting this phenomenon into gradient descent? Movement = negative of ∂L/∂w + momentum (the accumulated previous movement), so the real movement can carry the parameters across plateaus and shallow dips even where ∂L/∂w = 0. This still does not guarantee reaching the global minimum, but it gives some hope.

Adam = RMSProp (an advanced Adagrad) + momentum.
Let's try it (ReLU, 3 layers) — testing accuracy: original 0.96, Adam 0.97.

Recipe of Deep Learning — for good results on testing data: early stopping, regularization (weight decay), dropout, network structure.

Why Overfitting?
• Training data and testing data can be different. The learning target is defined by the training data, so the parameters achieving that target do not necessarily give good results on the testing data.

Panacea for Overfitting
• Have more training data.
• Create more training data (?) — e.g. for handwriting recognition, shift the original training images by 15° to create additional training images.

Why Overfitting? (our experiments)
For the following experiments we added some noise to the testing data. Testing accuracy: clean 0.97, noisy 0.50; training is not influenced.

Early Stopping
As training proceeds, the training loss keeps decreasing, but the loss on a validation/testing set eventually starts increasing — stop at that point.
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore

Weight Decay
• Our brain prunes out the useless links between neurons; doing the same thing to the machine's "brain" improves performance. Weight decay is one kind of regularization: useless weights shrink towards zero.
• Implementation — original update: w ← w − η ∂L/∂w; with weight decay: w ← 0.99 w − η ∂L/∂w, so every weight is also multiplied by a factor slightly below 1 at each update and becomes smaller and smaller unless the gradient keeps it alive.
Keras: http://keras.io/regularizers/
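A minimal Keras sketch of early stopping and weight decay (L2 regularization) as described above; the penalty strength, patience and validation split are illustrative assumptions, and the argument name kernel_regularizer follows the current Keras API (older versions use a different name).

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from keras.callbacks import EarlyStopping

model = Sequential()
# kernel_regularizer adds a penalty on the weights to the loss, which
# pushes useless weights towards zero (weight decay).
model.add(Dense(500, activation='relu', input_dim=784,
                kernel_regularizer=l2(0.01)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Early stopping: stop when the validation loss stops improving, instead of
# training until the training loss is minimal.  x_train / y_train are assumed
# to be the MNIST arrays from the earlier sketch.
stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(x_train, y_train, validation_split=0.1,
          batch_size=100, epochs=100, callbacks=[stop])
```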
Dropout
Training: each time before updating the parameters, each neuron has p% chance to drop out. The structure of the network is changed — it becomes thinner — and the new, thinner network is used for training. For each mini-batch, we resample the dropout neurons.
Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 − p)% for testing; e.g. with a 50% dropout rate, a weight w = 1 obtained by training is used as w = 0.5 at test time.

Dropout – Intuitive Reason
When teaming up, if you know your partner will slack off, you will work harder ("my partner will slack off, so I have to do a good job"). At testing time nobody actually drops out, so with everyone doing their best the results are eventually good. (If instead everyone expected the partner to do the work, nothing would get done.)
Why multiply the weights by (1 − p)% at testing? With a 50% dropout rate, roughly half of the inputs to a neuron are dropped during training; at test time all inputs are present, so without rescaling the pre-activation would be about z′ ≈ 2z. Multiplying the trained weights by 0.5 keeps z′ ≈ z.

Dropout is a kind of ensemble
• Ensemble: train a bunch of networks with different structures on different subsets of the training data; at testing, feed the input to all of them and average the outputs y1, y2, y3, y4, ….
• Dropout trains one thinner network per mini-batch; with M neurons there are 2^M possible thinned networks, and their parameters are shared.
• At testing, using all the weights multiplied by (1 − p)% gives approximately the same result as averaging the outputs of all those thinned networks.

More about dropout
• More references: [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]: dropout deletes neurons; dropconnect deletes the connections between neurons.
• Annealed dropout [S.J. Rennie, SLT'14]: the dropout rate decreases over epochs.
• Standout [J. Ba, NIPS'13]: each neuron has its own dropout rate.

Let's try it
Add model.add( Dropout(0.8) ) after each 500-neuron hidden layer, before the softmax output y1 … y10.
Testing accuracy on the noisy test set: no dropout 0.50, + dropout 0.63 (the training accuracy drops slightly, as expected).

Recipe of Deep Learning — for good results on testing data: early stopping, regularization, dropout, network structure. CNN is a very good example of designing the network structure (next lecture).

Concluding Remarks of Lecture II
Check the training-data results first, then apply the matching remedy: for training problems, proper loss, mini-batch, new activation functions, adaptive learning rates, momentum; for overfitting, early stopping, regularization, dropout, network structure.

Let's try another task: Document Classification
Classify documents into categories such as politics (政治), economics (經濟), sports (體育) and finance (財經) from features such as "stock" in document and "president" in document. (Data: http://top-breaking-news.com/)
Results (accuracy): MSE 0.36 → cross entropy 0.55 → + ReLU 0.75 → + Adam (adaptive learning rate) 0.77 → + dropout 0.79.

Lecture III: Variants of Neural Networks
• Convolutional Neural Network (CNN) — widely used in image processing
• Recurrent Neural Network (RNN)

Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large: a 100 × 100 × 3 image connected to 1000 neurons already needs 3 × 10^7 weights. Can the fully connected network be simplified by considering the properties of image recognition?
• Property 1: some patterns are much smaller than the whole image. A neuron does not have to see the whole image to discover the pattern; connecting to a small region needs fewer parameters (e.g. a "beak" detector).
• Property 2: the same patterns appear in different regions. An "upper-left beak" detector and a "middle beak" detector do almost the same thing, so they can use the same set of parameters.
• Property 3: subsampling the pixels does not change the object — a subsampled bird is still a bird — so we can make the image smaller and give the network fewer parameters to process.

Three Steps for Deep Learning
Step 1: define a set of functions — a Convolutional Neural Network; Step 2: goodness of function; Step 3: pick the best function. Deep Learning is so simple ……

The whole CNN
Input image → Convolution → Max Pooling → Convolution → Max Pooling (this pair can repeat many times) → Flatten → fully connected feedforward network → output ("cat", "dog", …).
Properties 1 and 2 are handled by convolution; Property 3 is handled by max pooling.
CNN – Convolution
The filter entries are the network parameters to be learned. Example: a 6 × 6 binary image

1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

and two 3 × 3 filters:

Filter 1:   1 -1 -1        Filter 2:  -1  1 -1
           -1  1 -1                   -1  1 -1
           -1 -1  1                   -1  1 -1

Each filter detects a small 3 × 3 pattern (Property 1). Slide the filter over the image with stride 1 and take the inner product at each position: for Filter 1 the top-left position gives 3 and the next position gives −1; with stride 2 the first row would give 3 and −3. We use stride 1 below. The full 4 × 4 feature map for Filter 1 is

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

The value 3 appears wherever the diagonal pattern occurs, no matter where in the image it is (Property 2). Doing the same for Filter 2 gives another 4 × 4 feature map:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

CNN – Zero Padding
Padding the border of the 6 × 6 image with zeros before convolving lets you get a 6 × 6 output instead of 4 × 4.

CNN – Colorful image
A color image has three channels (a 6 × 6 × 3 array), and each filter is correspondingly a 3 × 3 × 3 cube of weights.

CNN – Max Pooling
Group the feature-map values into 2 × 2 blocks and keep only the maximum of each block. For the two feature maps above this gives

Filter 1:  3 0        Filter 2: -1 1
           3 1                   0 3

The result is a new, smaller image in which each filter contributes one channel; the number of channels equals the number of filters. Convolution + max pooling can be repeated many times, each time producing an even smaller image.

Flatten
Finally, flatten the last set of feature maps (here a 2 × 2 × 2 block of values) into a single vector and feed it to a fully connected feedforward network, which produces the final answer ("cat", "dog", …).
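To make the convolution and pooling operations above concrete, here is a small NumPy sketch that slides the first 3 × 3 filter over the 6 × 6 image with stride 1 and then applies 2 × 2 max pooling; it reproduces the 4 × 4 feature map and the 2 × 2 pooled map listed above.

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])

filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

def convolve(img, filt, stride=1):
    """Valid 'convolution' (really cross-correlation, as in CNNs)."""
    k = filt.shape[0]
    out = (img.shape[0] - k) // stride + 1
    fmap = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * filt)   # inner product with the filter
    return fmap

def max_pool(fmap, size=2):
    out = fmap.shape[0] // size
    pooled = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            pooled[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return pooled

fmap = convolve(image, filter1)   # 4 x 4 feature map, first row [3, -1, -3, -1]
print(fmap)
print(max_pool(fmap))             # 2 x 2 pooled map: [[3, 0], [3, 1]]
```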
Convolution v.s. Fully Connected
Convolution can be viewed as a special fully connected layer. Flatten the 6 × 6 image into a 36-dim vector (pixels numbered 1–36). The neuron that computes the top-left feature-map value 3 is connected to only 9 of the 36 inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), with the filter entries as its weights — fewer parameters than full connection. The neuron that computes the next value −1 uses the same 9 weights on a shifted set of pixels: the weights are shared, so there are even fewer parameters.
A max-pooling unit behaves like a maxout group over the convolution outputs: it simply outputs the maximum of its inputs (ignoring the non-linear activation after the convolution in this illustration).
How many parameters are saved? The 6 × 6 input (36 dimensions) maps to 2 feature maps of 4 × 4 (32 dimensions). A fully connected layer would need 36 × 32 = 1152 parameters; the convolutional layer needs only 9 × 2 = 18.

Convolutional Neural Network — the three steps
Step 1: define the function set as a CNN; Step 2: goodness of function, with targets such as "monkey" 0, "cat" 1, "dog" 0; Step 3: pick the best function. Learning is nothing special — convolution, max pooling and the fully connected part are all trained jointly by gradient descent.

Playing Go
Network input: the 19 × 19 board encoded as a 19 × 19 matrix (or a 361-dim vector) with black = 1, white = −1, none = 0. Output: a 19 × 19 vector scoring each position as the next move. A fully connected feedforward network can be used, but a CNN performs much better.
Training: records of previous plays, e.g. from the game 進藤光 v.s. 社清春 — black plays the 5-5 point, white plays tengen (天元, the centre point), black plays the other 5-5 point. For each position, the target sets the dimension of the move actually played (e.g. "天元" or "五之5") to 1 and all others to 0.

Why CNN for playing Go?
• Some patterns are much smaller than the whole board (AlphaGo uses 5 × 5 filters for its first layer), and the same patterns appear in different regions.
• But subsampling would change a Go position, so how is max pooling justified here? It isn't — AlphaGo does not use max pooling ……
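Before moving on to recurrent networks, here is a minimal Keras sketch of the image pipeline described above (convolution → max pooling → repeat → flatten → fully connected) for 28 × 28 digits. The numbers of filters and layer sizes are illustrative assumptions, not the exact architecture of the lecture demo, and the layer names follow the current Keras API.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# Property 1: each 3x3 filter looks only at a small patch of the image.
# Property 2: the same filter (shared weights) is slid over every region.
model.add(Conv2D(25, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Property 3: subsampling keeps the detected patterns but shrinks the image.
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(50, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
# Flatten the final feature maps and finish with a fully connected network.
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```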
Variants of Neural Networks: Recurrent Neural Network (RNN) — a neural network with memory.

Example Application
• Slot Filling: "I would like to arrive Taipei on November 2nd." A ticket booking system should fill the slots Destination: Taipei and time of arrival: November 2nd.
Solving slot filling with a feedforward network? Input: a word represented as a vector; output: the probability distribution that the input word belongs to each slot.

1-of-N encoding
How to represent each word as a vector? The vector has the lexicon size; each dimension corresponds to a word in the lexicon; the dimension for the word is 1 and the others are 0. E.g. lexicon = {apple, bag, cat, dog, elephant}: apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1].
Beyond 1-of-N encoding: add a dimension for "other" (so unseen words like "Gandalf" or "Sauron" map to it), or use word hashing over character trigrams (26 × 26 × 26 dimensions; "apple" turns on a-p-p, p-p-l, p-l-e).

The problem: in "arrive Taipei on November 2nd", Taipei is the destination; in "leave Taipei on November 2nd", Taipei is the place of departure. The same input word must produce different outputs depending on context — the neural network needs memory!

Three Steps for Deep Learning
Step 1: define a set of functions — a Recurrent Neural Network; Step 2: goodness of function; Step 3: pick the best function. Deep Learning is so simple ……

Recurrent Neural Network (RNN)
The outputs of the hidden layer are stored in the memory, and the memory can be considered as another input. The same network is used again and again: for "arrive Taipei on November 2nd", x1 = arrive produces hidden values a1 (stored), then x2 = Taipei is processed together with a1 to give a2 and the output y2 (the probability of "Taipei" in each slot), and so on. Because the values stored in the memory differ, "Taipei" in "leave Taipei" and in "arrive Taipei" produces different outputs even though the word is the same.
Of course the RNN can be deep (several hidden layers per time step), and it can be bidirectional: one RNN reads the sequence forwards and another backwards, and their hidden states are combined to produce each yt.

Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output: the signal from the rest of the network, plus signals that control the input gate, the forget gate and the output gate of a memory cell. The gate activations f(z_i), f(z_f), f(z_o) use a sigmoid f, so they lie between 0 and 1 and mimic an open or closed gate. The cell is updated as
c′ = g(z) · f(z_i) + c · f(z_f)
and the neuron outputs
a = h(c′) · f(z_o).
The slides step through a numerical example in which the gate values (≈ 0 or ≈ 1) decide whether the input is written into the cell, whether the stored value is kept, and whether it is released as output.
In vector form, the input xt (together with the previous state) produces four vectors z_f, z_i, z, z_o that drive all the cells of a layer; the "peephole" extension also feeds the previous cell values c_{t−1} and outputs h_{t−1} back into the gates. Multiple LSTM layers can be stacked. Don't worry if you cannot follow the full diagram — this is quite standard now, and Keras supports the "LSTM", "GRU" and "SimpleRNN" layers.

Learning
Learning target for slot filling: at each time step the reference label (other / dest / time, …) is a 1-of-N vector that the RNN output should match; e.g. for "arrive Taipei on November 2nd" the targets are other, dest, other, time, time. Training uses backpropagation through time (BPTT), again w ← w − η ∂L/∂w.

Unfortunately, RNN-based networks are not always easy to learn (thanks to 曾柏翔 for the experimental results): in real experiments on language modelling the total loss sometimes jumps around wildly, and only sometimes are we lucky. The error surface is rough — either very flat or very steep — so clipping the gradient is commonly used [Razvan Pascanu, ICML'13].

Why? A toy example: a linear RNN with 1000 time steps, recurrent weight w, input 1 at the first step and 0 afterwards, so y^1000 = w^999.
• w = 1 gives y^1000 = 1, but w = 1.01 gives y^1000 ≈ 20000: a tiny change in w produces a huge change in the output, i.e. a large gradient, which calls for a small learning rate.
• w = 0.99 and w = 0.01 both give y^1000 ≈ 0: here the gradient is tiny, which calls for a large learning rate.
The same parameter needs very different learning rates in different regions, which makes training hard.

Helpful Techniques
• Long Short-term Memory (LSTM) can deal with gradient vanishing (not gradient explosion): memory and input are added, so the influence never disappears unless the forget gate is closed — no gradient vanishing if the forget gate is opened. The Gated Recurrent Unit (GRU) is simpler than LSTM [Cho, EMNLP'14].
• Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15] and Clockwise RNN [Jan Koutnik, JMLR'14].
• A vanilla RNN initialized with the identity matrix plus a ReLU activation function [Quoc V. Le, arXiv'15] can outperform or be comparable with LSTM on 4 different tasks.
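Since Keras supports the LSTM, GRU and SimpleRNN layers mentioned above, a minimal sketch of a slot-filling-style sequence model might look like this. The vocabulary size, embedding size and number of slot labels are assumptions for illustration, not the lecture's actual configuration.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size = 10000    # assumed size of the input lexicon (1-of-N encoding)
num_slots  = 5        # assumed number of slot labels (dest, time, other, ...)

model = Sequential()
# Map each word index to a dense vector instead of a huge 1-of-N vector.
model.add(Embedding(vocab_size, 128))
# The recurrent layer keeps a memory of the words seen so far; swapping
# LSTM for GRU or SimpleRNN changes only this line.
model.add(LSTM(128, return_sequences=True))   # one output per input word
# Slot filling is many-to-many: predict a slot label for every word.
model.add(TimeDistributed(Dense(num_slots, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```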
More Applications ……
In slot filling, the input and output are both sequences of the same length, but RNNs can do more than that.

Many to one
• The input is a vector sequence, but the output is a single vector — e.g. sentiment analysis of movie reviews (on PTT, ratings range from 超好雷 "super positive" through 好雷, 普雷 and 負雷 to 超負雷 "super negative"): "看了這部電影覺得很高興" (I was very happy watching this movie) → positive; "這部電影太糟了" (this movie is terrible) → negative; "這部電影很棒" (this movie is great) → positive. The RNN reads the word sequence (我 覺 得 …… 太 糟 了) and its final state is used for the prediction.
Keras example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

Many to Many (output is shorter)
• Both input and output are sequences, but the output is shorter — e.g. speech recognition: the input is a sequence of acoustic vectors and the output is a character sequence such as "好棒" ("great"). Simply trimming repeated outputs (好好好棒棒棒棒棒 → 好棒) cannot distinguish "好棒" from "好棒棒" (a sarcastic "great"). Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06] [Alex Graves, ICML'14] [Haşim Sak, Interspeech'15] [Jie Li, Interspeech'15] [Andrew Senior, ASRU'15] adds an extra symbol "φ" representing "null": 好 φ φ 棒 φ φ φ φ decodes to "好棒", while 好 φ φ 棒 φ 棒 φ φ decodes to "好棒棒".

Many to Many (no limitation)
• Both input and output are sequences of different lengths — sequence-to-sequence learning, e.g. machine translation: "machine learning" → 機器學習. An RNN encoder reads the input, and its final state contains all the information about the input sequence; an RNN decoder then generates 機, 器, 學, 習, … — but without a stopping mechanism it does not know when to stop (機 器 學 習 慣 性 ……, like the endless 推 tlkagk: =========斷========== reply chains on PTT; ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87, 鄉民百科). Adding a symbol "===" (斷, "stop") lets the decoder end the sequence. [Ilya Sutskever, NIPS'14] [Dzmitry Bahdanau, arXiv'15]

One to Many
• Input an image, output a sequence of words — caption generation: a CNN turns the whole image into a single vector, which conditions an RNN that generates "a woman is …". [Kelvin Xu, arXiv'15] [Li Yao, ICCV'15]

Application: Video Caption Generation
Can a machine describe what it sees in a video? Examples: "A girl is running." "A group of people is knocked by a tree." "A group of people is walking in the forest." Demo by 曾柏翔、吳柏瑜、盧宏宗.

Concluding Remarks of Lecture III: Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN).

Lecture IV: Next Wave

Outline
Supervised Learning: • Ultra Deep Network (new network structure) • Attention Model
Reinforcement Learning
Unsupervised Learning: • Image: realizing what the world looks like • Text: understanding the meaning of words • Audio: learning human language without supervision

Skyscraper (the Burj Khalifa height chart — networks, too, keep getting taller): https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/media/File:BurjDubaiHeight.svg
Ultra Deep Network Worry about overfitting? 152 layers Worry about training first! This ultra deep network 3. Ultra Deep Network • Ultra deep network is the ensemble of many networks with different depth. 6 layers Ensemble 4 layers 2 layers . Ultra Deep Network • FractalNet Resnet in Resnet Good Initialization? . Ultra Deep Network • • + copy Gate controller copy . output layer output layer output layer Highway Network automatically determines the layers needed! Input layer Input layer Input layer . Outline Supervised Learning • Ultra Deep Network New network structure • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision . Attention-based Model What you learned Lunch today in these lectures What is deep learning? summer vacation 10 Answer Organize years ago http://henrylo1605.html .tw/2015/05/blog-post_56.blogspot. tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).html .ntu.edu.mp4/index.ee.e cm.Attention-based Model Input DNN/RNN output Reading Head Controller Reading Head …… …… Machine’s Memory Ref: http://speech. Attention-based Model v2 Input DNN/RNN output Reading Head Writing Head Controller Controller Writing Head Reading Head …… …… Machine’s Memory Neural Turing Machine . .Reading Comprehension Query DNN/RNN answer Reading Head Controller Semantic Analysis …… …… Each sentence becomes a vector. 2015. R. Fergus. Weston.com/fchollet/keras/blob/master/examples/ba bi_memnn. Sukhbaatar.Reading Comprehension • End-To-End Memory Networks. S.py . A. Szlam. The position of reading head: Keras has example: https://github. NIPS. J. Visual Question Answering source: http://visualqa.org/ . Visual Question Answering Query DNN/RNN answer Reading Head Controller CNN A vector for each region Visual Question Answering • Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015 Speech Question Answering • TOEFL Listening Comprehension Test by Machine • Example: Audio Story: (The original story is 5 min long.) Question: “ What is a possible origin of Venus’ clouds? ” Choices: (A) gases released as a result of volcanic activity (B) chemical reactions caused by high surface temperatures (C) bursts of radio energy from the plane's surface (D) strong winds that blow dust into the atmosphere Experimental setup: Simple Baselines 717 for training, 124 for validation, 122 for testing (2) select the shortest (4) the choice with semantic choice as answer most similar to others Accuracy (%) random (1) (2) (3) (4) (5) (6) (7) Naive Approaches Everything is learned Model Architecture from training examples …… It be quite possible that this be Answer due to volcanic eruption because volcanic eruption often emit gas. If Attention that be the case volcanism could very Select the choice most well be the root cause of Venus 's thick similar to the answer cloud cover. And also we have observe burst of radio energy from the planet Attention Question 's surface. These burst be similar to what we see when volcano erupt on Semantics earth …… Semantic Speech Semantic Analysis Recognition Analysis Question: “what is a possible Audio Story: origin of Venus‘ clouds?" Model Architecture Word-based Attention Model Architecture Sentence-based Attention . (A) (A) (A) (A) (A) (B) (B) (B) . 
With attention over the audio story, the machine clearly beats the naive approaches on the TOEFL task:
• Memory Network (proposed by the FB AI group): 39.2% accuracy.
• Word-based attention [Tseng & Lee, Interspeech 16] [Fang & Hsu & Lee, SLT 16]: 48.8% accuracy.

Outline — Reinforcement Learning

Scenario of Reinforcement Learning
An agent observes the environment, takes an action that changes the environment, and receives a reward (e.g. a scolding "Don't do that", or a "Thank you" — http://www.sznews.com/news/content/2013-11/26/content_8800180.htm). The agent learns to take actions that maximize the expected reward.

Supervised v.s. Reinforcement
• Supervised: learning from a teacher — "Hello" → say "Hi"; "Bye bye" → say "Good bye".
• Reinforcement: learning from critics — after a whole dialogue ("Hello" …… "Bad"), the agent only knows the outcome was bad.
For playing Go: the observation is the board, the action is the next move, and the reward is +1 for a win, −1 for a loss, 0 otherwise. Supervised learning: "seeing this board, the next move is 5-5 (or 3-3)", learned from human games. Reinforcement learning: play a first move, then many moves, and finally win or lose. AlphaGo is supervised learning + reinforcement learning.

Difficulties of Reinforcement Learning
• It may be better to sacrifice immediate reward to gain more long-term reward (e.g. in Go).
• The agent's actions affect the subsequent data it receives (e.g. exploration).

Deep Reinforcement Learning
A DNN is the function from observation (input) to action (output); the reward is used to pick the best function.

Application: Interactive Retrieval [Wu & Lee, INTERSPEECH 16]
Interactive retrieval is helpful: the user queries "Deep Learning", and the system may ask back "Is 'Deep Learning' related to Machine Learning?" or "Is 'Deep Learning' related to Education?". With deep reinforcement learning — and some depth is needed, since the task cannot be addressed by a linear model — the system achieves a better trade-off between retrieval performance and user labor (the amount of interaction).

More applications
• AlphaGo, playing video games, dialogue.
• Flying helicopter: https://www.youtube.com/watch?v=0JL04JJjocc
• Driving: https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Google cuts its giant electricity bill with DeepMind-powered AI: http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai

To learn deep reinforcement learning ……
• Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, 1:30 each)
• Deep Reinforcement Learning: http://videolectures.net/rldm2015_silver_reinforcement_learning/
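To make "learning to take actions that maximize expected reward" slightly more concrete, here is a tiny tabular Q-learning sketch on a made-up 5-state chain environment. This is a generic illustration of reinforcement learning, not the method used by AlphaGo or by the retrieval system above, and the environment, rewards and hyperparameters are all assumptions.

```python
import numpy as np

# Toy environment: states 0..4 in a chain; action 1 moves right, action 0
# moves left. Reaching state 4 gives reward 1; every other step gives 0.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
rng = np.random.default_rng(0)

def choose_action(q_row):
    # Exploration: act randomly sometimes (and whenever we are indifferent).
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return int(rng.integers(n_actions))
    return int(q_row.argmax())

for episode in range(500):
    s = 0
    for step in range(100):            # cap the episode length
        a = choose_action(Q[s])
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Move Q(s, a) towards the observed reward plus the discounted
        # value of the best action in the next state.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == 4:
            break

print(Q.argmax(axis=1))   # learned policy: move right in states 0..3
```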
com/ .Deep Style • Given a photo. make its style like famous paintings https://dreamscopeapp. Deep Style CNN CNN content style CNN ? . Generating Images by RNN color of color of color of 2nd pixel 3rd pixel 4th pixel color of color of color of 1st pixel 2nd pixel 3rd pixel . 06759 Real World .Generating Images by RNN • Pixel Recurrent Neural Networks • https://arxiv.org/abs/1601. Generating Images • Training a decoder to generate images is unsupervised ? code Training data is a lot of images Neural Network . Auto-encoder code NN Decoder Not state-of- the-art Learn together approach NN code Encoder As close as possible Output Layer Input Layer Layer bottle Layer Layer Layer … … Encoder Decoder Code . https://arxiv. http://arxiv.2661 code NN Decoder .Generating Images • Training a decoder to generate images is unsupervised • Variation Auto-encoder (VAE) • Ref: Auto-Encoding Variational Bayes.6114 • Generative Adversarial Network (GAN) • Ref: Generative Adversarial Networks.org/abs/1406.org/abs/1312. Which one is machine-generated? Ref: https://openai.com/blog/generative-models/ . com/mattya/chainer-DCGAN .畫漫畫!!! https://github. Outline Supervised Learning • Ultra Deep Network New network structure • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision . com/ .Machine Reading • Machine learn the meaning of words from reading a lot of documents without supervision http://top-breaking-news. Machine Reading • Machine learn the meaning of words from reading a lot of documents without supervision Word Vector / Embedding tree flower dog rabbit run jump cat . files.wordpress.Machine Reading • Generating Word Vector/Embedding is unsupervised Apple Training data is a lot of text Neural Network ? https://garavato.com/2011/11/stacksdocuments.jpg?w=490 . Machine Reading • Machine learn the meaning of words from reading a lot of documents without supervision • A word can be understood by its context You shall know a word 蔡英文、馬英九 are by the company it keeps something very similar 馬英九 520宣誓就職 蔡英文 520宣誓就職 . net/hustwj/cikm-keynotenov2014 283 .slideshare.Word Vector Source: http://www. ? ??????? Word Vector ≈ ? ?????? − ? ???? + ? ????? • Characteristics ? ℎ????? − ? ℎ?? ≈ ? ?????? − ? ??? ? ???? − ? ????? ≈ ? ?????? − ? ??????? ? ???? − ? ????? ≈ ? ????? − ? ???? • Solving analogies Rome : Italy = Berlin : ? Compute ? ?????? − ? ???? + ? ????? Find the word w with the closest V(w) 284 . Machine Reading • Machine learn the meaning of words from reading a lot of documents without supervision . Demo • Model used in demo is provided by 陳仰德 • Part of the project done by 陳仰德、林資偉 • TA: 劉元銘 • Training data is from PTT (collected by 葉青峰) 286 . Outline Supervised Learning • Ultra Deep Network New network structure • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision . Learning from Audio Book Machine does not have any prior knowledge Machine listens to lots of audio book Like an infant [Chung. Interspeech 16) . Audio Word to Vector • Audio segment corresponding to an unknown word Fixed-length vector . Audio Word to Vector • The audio segments corresponding to words with similar pronunciations are close to each other. dog never dog never dogs never ever ever . 
Sequence-to-sequence Auto-encoder
An RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment; the values left in its memory represent the whole segment — that is the vector we want. How do we train the RNN encoder? Jointly train an RNN decoder that takes the vector and tries to reproduce the acoustic features y1 y2 y3 y4 of the input; the RNN encoder and decoder are trained together.

Audio Word to Vector – Results
• Visualizing the embedding vectors of words such as fear, near, fame and name shows that changing the same initial sound shifts the vectors in a consistent direction.

WaveNet (DeepMind): https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Concluding Remarks
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave

Is AI about to take over most jobs?
• A new job in the AI age: AI trainer (machine learning expert / data scientist).
http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game

AI Trainer
Don't machines learn by themselves — why do we need AI trainers? Well, the Pokémon do the fighting, yet we still need Pokémon trainers.
• A Pokémon trainer has to choose suitable Pokémon for a battle, because Pokémon have different types; likewise, in Step 1 the AI trainer has to choose a suitable model, because different models suit different problems.
• A summoned Pokémon cannot always be controlled (like Ash's Charizard); likewise, Step 3 does not always find the best function (e.g. in deep learning).
• Both require plenty of experience.

A powerful AI owes a great deal to its AI trainer — let's set out together on the road to becoming AI trainers.
http://www.gvm.com.tw/webonly_content_10787.html