Dive Into Deep Learning — Part 2
This is part 2 of my summary of the chapters I read from the Dive into Deep Learning book.
In case you didn’t read it, this is part 1, which summarizes section 3.1 from page 82 to the beginning of page 86.
The sections that follow the analytic solution discuss how to optimize the model and how to calculate the gradients.
Minibatch stochastic gradient descent
The idea behind the gradient descent algorithm is to iteratively reduce the error by updating the parameters in a direction that incrementally lowers the loss function.
The authors then mention two extreme ways of applying the GD algorithm...
- The naive approach:
Take the derivative of the loss function, which is the average of the losses calculated on every example in the dataset. Such a full update is powerful, but it has some drawbacks…
Drawbacks:
- Can be extremely slow, as we need to pass over the entire dataset to make a single update.
- If there is a lot of redundancy in the training data, the benefit of a full update is very low.
- The extreme approach:
Consider only a single example at a time and take each update step based on that one observation. Does that remind you of something?
Yes, it’s the stochastic gradient descent algorithm or SGD.
It can be effective even on large datasets, but it also has some drawbacks…
Drawbacks:
- It can take longer to process samples one at a time than to process them as a full batch.
- Some neural network layers only work well when we process more than one observation at a time (e.g., the batch normalization layer).
If both methods have major drawbacks, then what should we do? It’s simple: we just pick a middle ground.
Minibatch stochastic gradient descent
Instead of taking the full dataset as a batch or taking only a single sample, we take a minibatch. The size of the minibatch depends on many variables:
- Memory size
- number of accelerators/GPUs
- number of layers
- dataset size
But the authors recommend a number between 32 and 256 (a multiple of a large power of 2).
Algorithm steps:
- Initialize the model parameters randomly, then repeatedly sample a random minibatch Bi of a fixed size |B| of training examples
- Compute the derivative (gradient) of the average loss on the minibatch
- Multiply the gradient by a predetermined small positive value η, the learning rate, and subtract the resulting term from the current parameter values (sketched in code below).
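Putting these steps together, the update rule from the book can be written roughly as:

```latex
(\mathbf{w}, b) \leftarrow (\mathbf{w}, b)
  - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)} \, l^{(i)}(\mathbf{w}, b)
```

And here is a minimal NumPy sketch of the loop for linear regression with squared loss. The function name, the default hyperparameters, and the toy data at the bottom are my own illustrative choices, not code from the book:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.03, batch_size=32, num_epochs=10):
    """Minibatch SGD for linear regression with squared loss (illustrative sketch)."""
    n, d = X.shape
    w = np.random.normal(0, 0.01, size=d)  # initialize the parameters randomly
    b = 0.0
    for _ in range(num_epochs):
        indices = np.random.permutation(n)             # shuffle the dataset
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]  # sample a minibatch of size |B|
            Xb, yb = X[batch], y[batch]
            err = Xb @ w + b - yb              # prediction errors on the minibatch
            grad_w = Xb.T @ err / len(batch)   # gradient of the average loss (1/2)·err² w.r.t. w
            grad_b = err.mean()                # gradient w.r.t. b
            w -= lr * grad_w                   # step against the gradient, scaled by η
            b -= lr * grad_b
    return w, b

# Toy usage: noisy labels generated from arbitrary "true" parameters.
X = np.random.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + np.random.normal(0, 0.01, size=1000)
w_hat, b_hat = minibatch_sgd(X, y)
print(w_hat, b_hat)  # should land close to [2.0, -3.4] and 4.2
```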
We are then introduced to some important concepts:
hyperparameters: tunable parameters that are not updated in the training loop, such as the batch size |B| and the learning rate η.
generalization: finding parameters that lead to accurate predictions on previously unseen data.
The Normal Distribution and Squared Loss
The normal distribution and linear regression with squared loss share a deeper connection than common parentage.
Let’s assume that observations arise from noisy measurements, where the noise is normally distributed.
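Concretely (in the book’s notation, with σ² denoting the noise variance), the assumed model is:

```latex
y = \mathbf{w}^\top \mathbf{x} + b + \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \sigma^2)
```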
Then we can write the likelihood of seeing a particular y for a given x.
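It’s the Gaussian density centered at the linear prediction (again in the book’s notation):

```latex
P(y \mid \mathbf{x}) =
  \frac{1}{\sqrt{2 \pi \sigma^2}}
  \exp\!\left( -\frac{1}{2 \sigma^2} \big( y - \mathbf{w}^\top \mathbf{x} - b \big)^2 \right)
```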
According to the principle of maximum likelihood, the best values of the parameters w and b are those that maximize the likelihood of the entire dataset.
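Assuming the examples were drawn independently, the likelihood of the entire dataset is the product of the per-example likelihoods:

```latex
P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} P\!\left( y^{(i)} \mid \mathbf{x}^{(i)} \right)
```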
Maximizing the likelihood of the data is equivalent to minimizing its negative log-likelihood, which gives the final formula.
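Written out (in the book’s notation), the log turns the product of Gaussians into a sum:

```latex
-\log P(\mathbf{y} \mid \mathbf{X}) =
  \sum_{i=1}^{n} \left[
    \frac{1}{2} \log\!\left( 2 \pi \sigma^2 \right)
    + \frac{1}{2 \sigma^2} \big( y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b \big)^2
  \right]
```

The first term and the 1/σ² factor don’t depend on w or b, so minimizing this negative log-likelihood is exactly the same as minimizing the sum of squared errors. That’s why the squared loss and Gaussian noise go hand in hand.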
Finally, linear regression can be viewed as a single-layer neural network.
And that’s all for section 3.1. It’s kinda intense, but once you break it into smaller pieces, it’s enjoyable and easy.
Thank you for making it to the end of this article, and I’ll see you in part 3!