Hyperparameters
Learning Rate
- Learning Rate Decay (e.g. step or exponential decay; sketched below)
- Adaptive Learning Rate Optimizers (e.g. Adagrad, RMSProp, Adam)
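As a rough illustration of the decay bullet above, here is a minimal, framework-agnostic sketch of step-wise exponential learning rate decay. The values of `initial_lr`, `decay_rate`, and `decay_every` are illustrative placeholders, not recommendations from this document.

```python
# Minimal sketch of step-wise exponential learning rate decay.
# All constants below are illustrative placeholders.

def decayed_lr(initial_lr, decay_rate, decay_every, step):
    """Return the learning rate to use after `step` training steps."""
    return initial_lr * (decay_rate ** (step // decay_every))

# Example: start at 0.1 and halve the learning rate every 10k steps.
for step in (0, 10_000, 20_000, 30_000):
    print(step, decayed_lr(0.1, 0.5, 10_000, step))
```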
Minibatch Size
- Systematic evaluation of CNN advances on the ImageNet, Author: Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas
Epoch
- Early Stopping
- TensorFlow has recently deprecated the Monitors API in favor of SessionRunHooks. SessionRunHook is part of tf.train and is still evolving; it is the main API for implementing Early Stopping. A framework-agnostic sketch of the early-stopping idea follows this list.
- API reference: Train Hooks
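Since the SessionRunHook API has changed across TensorFlow versions, here is a minimal, framework-agnostic sketch of patience-based early stopping. `train_one_epoch` and `validation_loss` are hypothetical placeholders for your own training and evaluation routines.

```python
# Minimal sketch of patience-based early stopping.
# `train_one_epoch` and `validation_loss` are hypothetical callables
# standing in for your own training loop and validation metric.

def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0   # reset the patience counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}: "
                      f"no improvement for {patience} epochs.")
                break
    return best_loss
```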
Number of Hidden Units / Layers
Typically, 3-layer NNs perform better than 2-layer NNs in practice, but going deeper (4-, 5-, or 6-layer NNs) rarely helps much more. This is in sharp contrast to CNNs, where depth has been found to be one of the most important components: the deeper a CNN is, the better it performs as a recognition system. (Reference: cs231n, Andrej Karpathy)
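For concreteness, a minimal sketch (assuming TensorFlow/Keras) of a 2-layer vs. a 3-layer fully connected network in the cs231n sense, where layers are counted by weight matrices (hidden layers plus the output layer). Input size and unit counts are illustrative placeholders.

```python
# Sketch contrasting a 2-layer and a 3-layer fully connected NN.
# Input size (784) and unit counts are illustrative placeholders.
import tensorflow as tf

two_layer = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation="relu"),    # 1 hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer
])

three_layer = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(100, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer
])
```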
Others
- Practical recommendations for gradient-based training of deep architectures, Author: Yoshua Bengio
- Deep Learning - 11.4 - Choosing Hyperparameters, Author: Ian Goodfellow, Yoshua Bengio, Aaron Courville
- Neural Networks and Deep Learning - Chap.3 - How to choose a neural network’s hyper-parameters, Author: Michael Nielsen
- Efficient BackProp, Author: Yann LeCun
- How to Generate a Good Word Embedding?, Author: Siwei Lai, Kang Liu, Liheng Xu, Jun Zhao
- Systematic evaluation of CNN advances on the ImageNet, Author: Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas
- Visualizing and Understanding Recurrent Networks, Author: Andrej Karpathy, Justin Johnson, Li Fei-Fei
RNN
LSTM vs. GRU
The evaluation clearly demonstrated the superiority of the gated units; both the LSTM unit and GRU, over the traditional tanh unit. This was more evident with the more challenging task of raw speech signal modeling. However, we could not make concrete conclusion on which of the two gating units was better.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Author: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio
The GRU outperformed the LSTM on all tasks with the exception of language modelling.
An Empirical Exploration of Recurrent Network Architectures, Author: Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever
Our consistent finding is that depth of at least two is beneficial. However, between two and three layers our results are mixed. Additionally, the results are mixed between the LSTM and the GRU, but both significantly outperform the RNN.
Visualizing and Understanding Recurrent Networks, Author: Andrej Karpathy, Justin Johnson, Li Fei-Fei
Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.
Understanding LSTM Networks, Author: Chris Olah
| Application | Cell | Num. of Layers | Size | Vocabulary | Embedding Size | Learning Rate | Reference |
|---|---|---|---|---|---|---|---|
| Large Vocabulary Speech Recognition | LSTM | 5, 7 | 600, 1000 | 82K, 500K | - | - | paper |
| Speech Recognition | LSTM | 1, 3, 5 | 250 | - | - | 0.001 | paper |
| seq2seq | LSTM | 4 | 1000 | Source: 160K, Target: 80K | 1000 | - | paper |
| Image Caption Generator | LSTM | - | 512 | - | 512 | (fixed) | paper |
| Image Generation | LSTM | - | 256, 400, 800 | - | - | - | paper |
| Question Answering | LSTM | 2 | 500 | - | 300 | - | paper |
| Text Summarization (seq2seq) | GRU | - | 200 | Source: 119K, Target: 68K | 100 | 0.001 | paper |
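As a minimal sketch (assuming TensorFlow/Keras) of one configuration like the Question Answering row of the table above: a 2-layer LSTM with 500 hidden units and a 300-dimensional embedding. The vocabulary size below is an illustrative placeholder, since the table does not give it. Swapping `tf.keras.layers.LSTM` for `tf.keras.layers.GRU` changes only the cell type.

```python
# Sketch of a 2-layer, 500-unit LSTM with a 300-d embedding.
# vocab_size is an illustrative placeholder, not a value from the table.
import tensorflow as tf

vocab_size = 50_000
embedding_size = 300
hidden_units = 500

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                               # variable-length token ids
    tf.keras.layers.Embedding(vocab_size, embedding_size),
    tf.keras.layers.LSTM(hidden_units, return_sequences=True),   # recurrent layer 1
    tf.keras.layers.LSTM(hidden_units),                          # recurrent layer 2
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```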