What Is a Learning Rate?

When training deep neural networks, it is often useful to reduce the learning rate as training progresses. This can be done by using pre-defined learning rate schedules or adaptive learning rate methods. In this article, I train a convolutional neural network on CIFAR-10 using different learning rate schedules and adaptive learning rate methods to compare their model performances.


Learning Rate Schedules

Learning rate schedules seek to adjust the learning rate during training by reducing it according to a pre-defined schedule. Common learning rate schedules include time-based decay, step decay and exponential decay. For illustrative purposes, I construct a convolutional neural network trained on CIFAR-10, using the stochastic gradient descent (SGD) optimization algorithm with different learning rate schedules to compare their performances.
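
For context, the same network is reused in every experiment below. The article does not reproduce its exact architecture, so the small model sketched here is only an assumed stand-in, built with the same older Keras API (lr, decay arguments) that the article's snippets use; the CIFAR-10 loading and preprocessing follow the standard Keras pipeline, and the epochs and batch_size values are assumptions rather than the article's settings.

import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.utils import to_categorical

# Load CIFAR-10, scale pixels to [0, 1] and one-hot encode the labels
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# An assumed small CNN; the article's exact layer stack may differ
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])

epochs = 50       # assumed values, not taken from the article
batch_size = 64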

Constant Learning Rate

Constant learning rate is the default learning rate schedule in the SGD optimizer in Keras. Momentum and decay rate are both set to zero by default. It is tricky to choose the right learning rate. By experimenting with a range of learning rates in our example, lr=0.1 shows relatively good performance to start with. This can serve as a baseline for us to experiment with different learning rate strategies.

keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
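
As a minimal usage sketch (the training settings here are assumptions, not the article's exact code), the baseline model defined earlier can be compiled and trained with this constant-rate optimizer as follows:

sgd = keras.optimizers.SGD(lr=0.1)  # momentum and decay stay at their defaults of 0.0
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=epochs, batch_size=batch_size, verbose=2)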


Fig 1 : Constant Learning Rate

Time-Based Decay

The mathematical form of time-based decay is lr = lr0 / (1 + k*t), where lr0 (the initial rate) and k are hyperparameters and t is the iteration number. Looking into the source code of Keras, the SGD optimizer takes decay and lr arguments and updates the learning rate by a decreasing factor in each epoch.

lr *= (1. / (1. + self.decay * self.iterations))

Momentum is another argument in the SGD optimizer which we could tweak to obtain faster convergence. Unlike classical SGD, the momentum method helps the parameter vector build up velocity in any direction with constant gradient descent so as to prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.

The SGD optimizer also has an argument called nesterov, which is set to False by default. Nesterov momentum is a different version of the momentum method which has stronger theoretical convergence guarantees for convex functions. In practice, it works slightly better than standard momentum.
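
Switching on Nesterov momentum is a one-argument change on the same optimizer; the momentum value of 0.9 below is just a typical choice, not a setting taken from the article:

sgd = keras.optimizers.SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=True)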

In Keras, we can implement time-based decay by setting the initial learning rate, decay rate and momentum in the SGD optimizer.

from keras.optimizers import SGD

learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)


Fig 2 : Time-based Decay Schedule

Step Decay

Step decay schedule drops the learning rate by a factor every few epochs. The mathematical form of step decay is:

lr = lr0 * drop^floor(epoch / epochs_drop)

A typical way is to drop the learning rate by half every 10 epochs. To implement this in Keras, we can define a step decay function and use the LearningRateScheduler callback to take the step decay function as argument and return the updated learning rates for use in the SGD optimizer.

import math
from keras.callbacks import LearningRateScheduler

def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

lrate = LearningRateScheduler(step_decay)

As a digression, a callback is a set of functions to be applied at given stages of the training procedure. We can use callbacks to get a view on the internal states and statistics of the model during training. In our example, we create a custom callback by extending the base class keras.callbacks.Callback to record the loss history and learning rate during the training procedure.

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
        self.lr = []

    def on_epoch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))
        self.lr.append(step_decay(len(self.losses)))

Putting everything together, we can pass a callback list consisting of the LearningRateScheduler callback and our custom callback to fit the model. We can then visualize the learning rate schedule and the loss history by accessing loss_history.lr and loss_history.losses.


loss_history = LossHistory()
lrate = LearningRateScheduler(step_decay)
callbacks_list = [loss_history, lrate]
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, callbacks=callbacks_list, verbose=2)
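
To make that inspection concrete, a short plotting sketch (matplotlib is an addition here, not shown in the article's snippets) can display the recorded schedule and loss curve:

import matplotlib.pyplot as plt

# Learning rate recorded by the custom callback at each epoch
plt.figure()
plt.plot(loss_history.lr)
plt.xlabel('epoch')
plt.ylabel('learning rate')
plt.title('Step decay schedule')

# Training loss recorded at each epoch
plt.figure()
plt.plot(loss_history.losses)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title('Loss history')

plt.show()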


Fig 3b : Step Decay Schedule

Exponential Decay

Another common schedule is exponential decay. It has the mathematical form lr = lr0 * e^(−kt), where lr0 and k are hyperparameters and t is the iteration number. Similarly, we can implement this by defining an exponential decay function and passing it to LearningRateScheduler. In fact, any custom decay schedule can be implemented in Keras using this approach. The only difference is to define a different custom decay function.

def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * math.exp(-k * epoch)
    return lrate

lrate = LearningRateScheduler(exp_decay)
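
As a quick side sketch that is not part of the article, the three decay curves can be plotted together, reusing the epochs value and hyperparameters from the earlier snippets and treating t as the epoch index for simplicity:

import math
import matplotlib.pyplot as plt

lr0 = 0.1
time_k = lr0 / epochs            # decay_rate from the time-based decay example
exp_k = 0.1                      # k from the exponential decay example
drop, epochs_drop = 0.5, 10.0    # values from the step decay example

time_based = [lr0 / (1 + time_k * t) for t in range(epochs)]
step = [lr0 * math.pow(drop, math.floor((1 + t) / epochs_drop)) for t in range(epochs)]
exponential = [lr0 * math.exp(-exp_k * t) for t in range(epochs)]

plt.plot(time_based, label='time-based decay')
plt.plot(step, label='step decay')
plt.plot(exponential, label='exponential decay')
plt.xlabel('epoch')
plt.ylabel('learning rate')
plt.legend()
plt.show()
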
Fig 4b : Exponential Decay Schedule

Let us now compare the model accuracy using different learning rate schedules in our example.


Fig 5 : Comparing Performances of Different Learning Rate Schedules

Adaptive Learning Rate Methods

The challenge of using learning rate schedules is that their hyperparameters have to be defined in advance and they depend heavily on the type of model and problem. Another problem is that the same learning rate is applied to all parameter updates. If we have sparse data, we may instead want to update the parameters to different extents.

Adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop and Adam provide an alternative to classical SGD. These per-parameter learning rate methods provide a heuristic approach without requiring the expensive work of tuning the hyperparameters of a learning rate schedule manually.

In brief, Adagrad performs larger updates for more sparse parameters and smaller updates for less sparse parameters. It performs well with sparse data and when training large-scale neural networks. However, its monotonic learning rate usually proves too aggressive and stops learning too early when training deep neural networks. Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. RMSprop adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. Adam is an update to the RMSprop optimizer which is like RMSprop with momentum.

In Keras, we can implement these adaptive learning algorithms easily using the corresponding optimizers. It is usually recommended to leave the hyperparameters of these optimizers at their default values (except lr, sometimes).

keras.optimizers.Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)
keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=0.0)
keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

Let us now look at the model performances using different adaptive learning rate methods. In our example, Adadelta gives the best model accuracy among the adaptive learning rate methods.
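
For completeness, here is a minimal sketch of how such a comparison could be wired up. It is not the article's exact code, and build_model() is a hypothetical helper that rebuilds the CNN from scratch so each optimizer starts from fresh weights:

results = {}
optimizers = {
    'Adagrad': keras.optimizers.Adagrad(lr=0.01),
    'Adadelta': keras.optimizers.Adadelta(lr=1.0),
    'RMSprop': keras.optimizers.RMSprop(lr=0.001),
    'Adam': keras.optimizers.Adam(lr=0.001),
}

for name, optimizer in optimizers.items():
    model = build_model()  # hypothetical helper returning a freshly initialised CNN
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=epochs, batch_size=batch_size, verbose=0)
    # Older Keras logs validation accuracy under 'val_acc'; newer versions use 'val_accuracy'
    results[name] = max(history.history['val_acc'])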


Fig 6 : Comparing Performances of Different Adaptive Learning Algorithms

Finally, we compare the performances of all the learning rate schedules and adaptive learning rate methods we have discussed.


Fig 7 : Comparing Performances of Different Learning Rate Schedules and Adaptive Learning Algorithms

Conclusion

In many examples I have worked on, adaptive learning rate methods demonstrate better performance than learning rate schedules, and they require much less effort in hyperparameter settings. We can also use LearningRateScheduler in Keras to create custom learning rate schedules specific to our data problem.


For further reading, Yoshua Bengio's paper provides very good practical recommendations for tuning the learning rate in deep learning, such as how to set the initial learning rate, mini-batch size, number of epochs, and the use of early stopping and momentum.