I: Introduction to Deep Learning
- Framework
- A set of functions
- Neuron: weights, bias, activation function (sigmoid)
- Layer: input, hidden, output (softmax)
- Goodness of function f
- Loss: the distance between the network output and the target
- Find the network parameters that minimize the total loss
- Pick the best function
- Gradient Descent, with Backpropagation: an efficient way to compute the gradients; in effect it deduplicates the partial-derivative paths so shared terms are computed only once (a small sketch follows at the end of this section)
- Why deep?
- more parameters, better performance
- any function can be realized by a network with a single hidden layer (universality theorem)
- deep → modularization → less training data needed
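A minimal NumPy sketch of the gradient-descent update from step 3 above; the toy loss and the `grad_loss` name are illustrative assumptions, not from the original notes:

```python
import numpy as np

def gradient_descent_step(params, grad_loss, lr=0.1):
    """One gradient-descent update: move the parameters against the gradient.
    In a real network, backpropagation computes `grad_loss` efficiently by
    reusing shared partial-derivative paths."""
    return params - lr * grad_loss(params)

# Toy example: minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
w = np.array([0.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda p: 2 * (p - 3), lr=0.1)
print(w)  # close to the minimizer 3
```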
II: Tips for Training DNN
Choosing a proper loss
- squared error vs. cross entropy
- when using a softmax output layer, choose cross entropy
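A small sketch contrasting the two losses on a softmax output; the helper names and logits are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, target_idx):
    return -np.log(y_hat[target_idx])

def squared_error(y_hat, target_idx):
    target = np.zeros_like(y_hat)
    target[target_idx] = 1.0
    return ((y_hat - target) ** 2).sum()

z = np.array([2.0, 1.0, 0.1])         # logits from the output layer
y_hat = softmax(z)
print(cross_entropy(y_hat, 0))        # the usual pairing with a softmax output
print(squared_error(y_hat, 0))        # works, but its error surface is much flatter far from the target
```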
Mini-batch: faster
New activation functions
- vanishing gradient problem
- ReLU
- Maxout: ReLU is a special case of Maxout
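A sketch of why ReLU is a special case of Maxout: with two linear pieces, one of them fixed at zero, taking the max reproduces max(0, x). The shapes and weights below are arbitrary illustrative choices:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the max over k linear pieces z_j = W_j x + b_j."""
    return np.max(W @ x + b, axis=0)

# Two pieces: one learned linear piece (here the identity) and one fixed at zero.
W = np.array([[1.0],
              [0.0]])
b = np.array([0.0, 0.0])

for v in (-1.5, 2.0):
    x = np.array([v])
    print(maxout(x, W, b), np.maximum(0.0, v))  # maxout matches ReLU
```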
Adaptive learning rate
- Adagrad: w^(t+1) = w^t − η·g^t / sqrt(Σ_{i=0..t} (g^i)^2), where g^i is the gradient obtained at the i-th update
- Other adaptive methods: RMSprop, Adadelta, AdaSecant, Adam, Nadam
- Momentum
- Adam: RMSProp + Momentum
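A minimal sketch of the Adagrad update above on a toy loss (names and hyperparameters are arbitrary); RMSprop and Adam differ mainly in how the past squared gradients are averaged:

```python
import numpy as np

def adagrad_update(w, grad_sq_sum, g, lr=0.1, eps=1e-8):
    """Adagrad: divide the learning rate by the root of the accumulated
    squared gradients, so strongly updated weights slow down over time."""
    grad_sq_sum = grad_sq_sum + g ** 2
    w = w - lr * g / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum

# Toy loss L(w) = w^2, gradient 2w.
w, s = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, s = adagrad_update(w, s, 2 * w)
print(w)  # moves toward the minimum at 0, more and more slowly
```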
Handling overfitting
- more training data: create new training data, add noise
- Early stopping
- i.e., at the end of every epoch (one epoch = one full pass over all the training data), compute the accuracy on the validation data; stop training once the accuracy stops improving
- Weight Decay
- helps prevent the network from overfitting
- Dropout
- Each neuron has a p% chance of being dropped during training
- no dropout at testing time; multiply the weights by (1 − p%)
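A sketch of the dropout scheme described above, assuming dropout probability p at training time and weight scaling by (1 − p) at test time; function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    """Training: each activation is dropped (set to zero) with probability p."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test_weights(W, p=0.5):
    """Testing: no dropout; scale the weights by (1 - p) instead, so the
    expected pre-activation matches what the next layer saw in training."""
    return W * (1 - p)

a = np.ones(8)
print(dropout_train(a, p=0.5))                   # roughly half the units are zeroed
print(dropout_test_weights(np.ones((2, 2)), p=0.5))
```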
III: Variants of Neural Networks
- CNN
- each neuron connects only to a small region, with fewer parameters
- for images: the same patterns appear in different regions, and subsampling does not change the object
- steps: (convolution → max pooling)+ → flatten → fully connected feedforward network
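A minimal PyTorch sketch of the (convolution → max pooling)+ → flatten → fully-connected pipeline; the 1×28×28 input size, channel counts, and 10 output classes are illustrative assumptions:

```python
import torch
from torch import nn

# (convolution -> max pooling) repeated, then flatten, then a fully
# connected feedforward network.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                  # 16 x 14 x 14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                  # 32 x 7 x 7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                        # fully connected output
)

x = torch.randn(4, 1, 28, 28)      # a batch of 4 dummy images
print(cnn(x).shape)                # torch.Size([4, 10])
```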
- RNN
- the outputs of the hidden layer are stored in memory.
- memory can be considered as another input.
- Bidirectional RNN: uses context from both directions
- LSTM
- a CNN is a network that is deep in space, while an RNN is deep in time
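A minimal sketch of the plain RNN idea above: the hidden output is stored as memory and fed back as an extra input at the next step; the sizes and random weights are arbitrary:

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, b):
    """One RNN step: the previous hidden output (the memory) acts as
    another input alongside the current input x."""
    return np.tanh(W_x @ x + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                          # memory starts empty
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h = rnn_step(x, h, W_x, W_h, b)      # the new hidden output is stored
print(h)
```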
IV: Next Wave
- Supervised Learning
- Ultra Deep Network
- Attention model
- frequently used in natural language processing
- Reinforcement Learning
- Unsupervised Learning
- image: realizing what the world looks like
- auto-encoder (see the sketch at the end of this section)
- text: understanding the meaning of words
- word vector
- audio: learning human language without supervision
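A tiny auto-encoder sketch for the unsupervised "realizing what the world looks like" idea referenced above; the 784→32 sizes and module names are illustrative assumptions:

```python
import torch
from torch import nn

# Tiny autoencoder: compress 784-dim inputs to a 32-dim code and
# reconstruct them; training on reconstruction error needs no labels.
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

x = torch.rand(4, 784)                    # dummy unlabeled data
code = encoder(x)
x_hat = decoder(code)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error
print(code.shape, loss.item())
```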
Supplement
- Activation function types, with pros and cons
- sigmoid: saturates easily, causing vanishing gradients; output is not zero-mean
- tanh: tanh(x) = 2σ(2x) − 1; zero-mean output
- ReLU: f(x) = max(0, x); fast convergence and a cheap derivative, but units with negative inputs can "die" (zero gradient)
- Leaky ReLU: f(x) = ax for x < 0, x for x ≥ 0; no longer dies for negative inputs
- PReLU: the slope a is learnable
- softmax: for multi-class outputs, differentiable
- maxout: a more general activation function
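The activations listed above as a NumPy sketch; the leaky slope a = 0.01 is a common but arbitrary choice, and PReLU simply learns that slope:

```python
import numpy as np

def sigmoid(x):                        # saturates easily; output mean is not zero
    return 1 / (1 + np.exp(-x))

def tanh(x):                           # tanh(x) = 2*sigmoid(2x) - 1, zero-mean output
    return 2 * sigmoid(2 * x) - 1

def relu(x):                           # f(x) = max(0, x); cheap gradient, units can die
    return np.maximum(0, x)

def leaky_relu(x, a=0.01):             # keeps a small slope for x < 0 (no dead units);
    return np.where(x < 0, a * x, x)   # PReLU is the same form with a learned `a`

def softmax(x):                        # multi-class output layer, differentiable
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), leaky_relu(x))
print(np.allclose(tanh(x), np.tanh(x)))   # True: the identity above holds
print(softmax(x).sum())                   # 1.0
```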
- Choosing a loss function
- mean squared error + sigmoid: in backpropagation, every layer's recursion multiplies by σ′(z), so convergence is slow
- cross-entropy loss + sigmoid: the resulting error term δ^L no longer contains σ′(z)
- log-likelihood loss + softmax: for multi-class classification
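A numeric check of the point above for a single sigmoid output unit: with squared error the output-layer gradient carries σ′(z) and vanishes when the unit saturates, whereas with cross entropy σ′(z) cancels; the values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, y = -5.0, 1.0                     # a saturated, confidently wrong output unit
a = sigmoid(z)                       # close to 0 while the target is 1

grad_mse = (a - y) * a * (1 - a)     # dL/dz for L = (a - y)^2 / 2: contains sigma'(z)
grad_ce = a - y                      # dL/dz for cross entropy: sigma'(z) has cancelled

print(grad_mse)   # ~ -0.0066: tiny gradient, slow learning
print(grad_ce)    # ~ -0.9933: large gradient despite the saturation
```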
- Preventing overfitting
- more training data: add noise, resample
- early stopping
- dropout
- weight decay: equivalent to regularization (an L2 penalty)
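A sketch of one SGD step with weight decay (the decay coefficient is arbitrary); shrinking every weight slightly each step is what makes it equivalent to an L2 penalty on the loss:

```python
import numpy as np

def sgd_with_weight_decay(w, g, lr=0.1, decay=1e-4):
    """Shrink every weight slightly before the usual gradient step;
    this matches adding (decay/2) * ||w||^2 to the loss."""
    return (1 - lr * decay) * w - lr * g

w = np.array([2.0, -3.0])
w = sgd_with_weight_decay(w, g=np.array([0.5, -0.5]))
print(w)
```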
- Gradient descent variants
- stochastic gradient descent, mini-batch
- Newton's method: second-order, converges fast, but computing the Hessian at every step is slow
- quasi-Newton methods: approximate the inverse of the Hessian with a positive-definite matrix
- conjugate gradient
- heuristic optimization
- Lagrange multipliers
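A one-step Newton's method sketch on a toy quadratic, showing the fast second-order step and why the Hessian is needed at each update (quasi-Newton methods replace the exact inverse with an approximation); the toy problem is my own:

```python
import numpy as np

def newton_step(w, grad, hess):
    """Newton's method: use curvature (the Hessian) for a second-order step.
    Converges fast, but forming and solving with the Hessian is expensive
    for large networks."""
    return w - np.linalg.solve(hess(w), grad(w))

# Toy quadratic: L(w) = w^T A w / 2 - b^T w, minimized where A w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)
w = newton_step(w, grad=lambda w: A @ w - b, hess=lambda w: A)
print(w, np.linalg.solve(A, b))   # one step reaches the exact minimum
```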