Neural Networks: Scaling & Gradient Descent Optimization
--
Introduction
In this article, I want to give you a short overview of some techniques to optimize your neural network by influencing the gradient descent and the scale of your data.
These techniques can help increase the accuracy of the network and reduce its training cost, but be aware of overfitting; I will not cover it here, but you can take a look at regularization to address that issue.
Feature Scaling
As its name suggests, this method is used to normalize the features of your data. You can use different types of feature scaling, like min-max normalization or mean normalization, but one of the most useful in machine learning is Z-score normalization.
It computes the distribution mean (μ) and standard deviation (σ) of each feature (x), so that the values of each feature end up with zero mean and unit variance. Here is the formula:
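x_scaled = (x - μ) / σ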
To use it with TensorFlow, I advise computing it directly, for example with tf.nn.moments:
mean, variance = tf.nn.moments(x, axes=[0])
x_scaled = (x - mean) / tf.sqrt(variance)
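For intuition, here is a minimal NumPy sketch (the array values are just an illustrative example):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
x_scaled = (x - x.mean()) / x.std()  # the result has zero mean and unit variance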
Batch normalization
Like feature scaling, batch normalization is a scaling method, but it is applied to the layers' inputs: it normalizes them by re-centering and re-scaling. It is a technique used for training very deep neural networks that standardizes the inputs to a layer for each mini-batch.
It therefore reduces the range of the variance and improves performance. Its main flaw is that it can cause gradient explosion at initialization in deep neural networks. Here is the formula:
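x_norm = (x - μ_B) / sqrt(σ_B² + ε)
y = γ * x_norm + β
where μ_B and σ_B² are the mean and variance of the mini-batch, ε is a small constant for numerical stability, and γ (scale) and β (offset) are learned parameters.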
To use it with TensorFlow, I advise taking a look at the following function:
tf.nn.batch_normalization(x, mean, variance, offset, scale, variance_epsilon)
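For example, here is a minimal sketch of how it can be wired together (the variable names, the axes and the epsilon value are just illustrative; it assumes x is a batch of layer inputs with a known feature dimension):
mean, variance = tf.nn.moments(x, axes=[0])          # per-feature statistics of the mini-batch
gamma = tf.Variable(tf.ones([x.shape[-1]]))          # learned scale (γ)
beta = tf.Variable(tf.zeros([x.shape[-1]]))          # learned offset (β)
y = tf.nn.batch_normalization(x, mean, variance, offset=beta, scale=gamma, variance_epsilon=1e-5)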
Mini-batch Gradient Descent
This method is a variation of the gradient descent algorithm that splits the training dataset into small batches. It seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
It is the recommended variant of gradient descent for most applications in deep learning. However, it can result in premature convergence and a less optimal result depending on the stability of the error. Here is the formula:
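θ = θ - α * ∇J(θ; x[i:i+b], y[i:i+b])
where α is the learning rate, b is the batch size and ∇J is the gradient of the loss computed on the current mini-batch only.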
To use it with tensorflow, there’s a builtin method “batch_size” that can be used to apply mini-batch in your DNN using the above formula (no magic function here).
Gradient Descent with Momentum
Another variant of gradient descent: here, SGD remembers the update of each iteration (which explains the name momentum). It refers to an analogy in physics, a particle traveling through parameter space whose inertia accelerates the descent of the loss.
It uses a moving average to "denoise" the gradients and find a better descent direction. Even though it converges better and faster, it doesn't solve every problem: the learning rate still has to be tuned manually, and even when it is low, the current gradient can cause oscillations and miss the most optimal result. Here is the formula:
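v = γ * v + α * ∇J(θ)
θ = θ - v
where γ is the momentum coefficient (typically around 0.9) and α is the learning rate.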
To use it with TensorFlow, I advise taking a look at the following function:
tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss)
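For intuition, here is a minimal Python sketch of the update rule above on the toy function f(θ) = θ² (the values of γ and α are just an example):
theta, v = 5.0, 0.0
gamma, alpha = 0.9, 0.1        # momentum coefficient and learning rate
for _ in range(200):
    grad = 2 * theta           # gradient of f(θ) = θ²
    v = gamma * v + alpha * grad
    theta = theta - v
print(theta)                   # oscillates toward the minimum at 0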
RMSProp Gradient Descent
This method is an adaptive learning rate variant of gradient descent. Its name stands for Root Mean Squared Propagation, and it builds on AdaGrad.
It is designed to accelerate the optimization process by decreasing the number of function evaluations required to reach the optimum. But its flaw is the same as momentum's: the base learning rate must still be set manually. Here is the formula:
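s = decay * s + (1 - decay) * g²
θ = θ - α * g / sqrt(s + ε)
where g is the current gradient, s is a running average of the squared gradients, decay is typically 0.9 and ε avoids division by zero.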
To use it with TensorFlow, I advise taking a look at the following function:
tf.train.RMSPropOptimizer(learning_rate, decay=decay, epsilon=epsilon).minimize(loss)
Adam Gradient Descent
This method's name stands for Adaptive Moment Estimation, and it is also an extension of gradient descent with automatic adaptation of the learning rate. It uses exponentially decaying moving averages of the gradients to accelerate the optimization process.
It’s pros compared to the others is that the learning rate is adaptive and not manually set, and it’s uses the momentum associated with the Adaptive learning rate methods. But even with that method, it’s still difficult to result in the optima, you will instead result in the approximated optima. Here is the formula :
To use it with TensorFlow, I advise taking a look at the following function:
tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(loss)
Learning Rate Decay
This technique is used when training modern neural networks: it adjusts the learning rate hyperparameter, which controls how much the model changes in response to the estimated error each time the weights are updated.
Since setting the learning rate manually is challenging and a bad choice can result in a long training process, decaying it over time is a good way to obtain a dynamic learning rate. Here is the formula:
alpha / (1 + decay_rate * np.floor(global_step / decay_step))
To use it with TensorFlow, I advise taking a look at the following function:
tf.train.inverse_time_decay(alpha, global_step, decay_step, decay_rate, staircase=True)
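For example, here is a minimal sketch of how it can be plugged into an optimizer (TensorFlow 1.x style; it assumes loss, alpha, decay_step and decay_rate are already defined):
global_step = tf.Variable(0, trainable=False)   # incremented at every update step
lr = tf.train.inverse_time_decay(alpha, global_step, decay_step, decay_rate, staircase=True)
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss, global_step=global_step)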