
Commit 88e03c1

Add suggestions, use register_hook, and clean up text/tone
Used the output.register_hook method, as suggested by alban, instead of the retain_grad method. Also addressed soulitzer's comments and reworded some other phrases to better align with the PyTorch tutorial tone. Increased the DPI of the static images to 150, as they were slightly blurry on my monitor.
1 parent 9a50805 commit 88e03c1

File tree

3 files changed (+114, -123 lines)


advanced_source/visualizing_gradients_tutorial.py

Lines changed: 114 additions & 123 deletions
@@ -4,31 +4,16 @@
 
 **Author:** `Justin Silver <https://github.com/j-silv>`__
 
-When training neural networks with PyTorch, it’s possible to ignore some
-of the library’s internal mechanisms. For example, running
-backpropagation requires a simple call to ``backward()``. This tutorial
-dives into how those gradients are calculated and stored in two
-different kinds of PyTorch tensors: leaf vs. non-leaf. It will also
-cover how we can extract and visualize gradients at any layer in the
-network’s computational graph. By inspecting how information flows from
-the end of the network to the parameters we want to optimize, we can
-debug issues that occur during training such as `vanishing or exploding
-gradients <https://arxiv.org/abs/1211.5063>`__.
-
-By the end of this tutorial, you will be able to:
-
-- Differentiate leaf vs. non-leaf tensors
-- Know when to use ``requires_grad`` vs. ``retain_grad``
-- Visualize gradients after backpropagation in a neural network
-
-We will start off with a simple network to understand how PyTorch
-calculates and stores gradients. Building on this knowledge, we will
-then visualize the gradient flow of a more complicated model and see the
-effect that `batch normalization <https://arxiv.org/abs/1502.03167>`__
-has on the gradient distribution.
-
-Before starting, it is recommended to have a solid understanding of
-`tensors and how to manipulate
+This tutorial explains the subtleties of ``requires_grad``,
+``retain_grad``, leaf, and non-leaf tensors using a simple example. It
+then covers how to extract and visualize gradients at any layer in a
+neural network. By inspecting how information flows from the end of the
+network to the parameters we want to optimize, we can debug issues such
+as `vanishing or exploding
+gradients <https://arxiv.org/abs/1211.5063>`__ that occur during
+training.
+
+Before starting, make sure you understand `tensors and how to manipulate
 them <https://docs.pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html>`__.
 A basic knowledge of `how autograd
 works <https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html>`__
@@ -54,9 +39,9 @@
 
 
 ######################################################################
-# Next, we will instantiate a simple network so that we can focus on the
-# gradients. This will be an affine layer, followed by a ReLU activation,
-# and ending with a MSE loss between the prediction and label tensors.
+# Next, we instantiate a simple network to focus on the gradients. This
+# will be an affine layer, followed by a ReLU activation, and ending with
+# a MSE loss between prediction and label tensors.
 #
 # .. math::
 #
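For orientation while reading the diff, here is a minimal, self-contained sketch of the kind of network the new text describes (an affine layer, a ReLU activation, and an MSE loss built from raw tensors). It is illustrative only and not part of the commit; the names and shapes are assumed:

    import torch

    x = torch.randn(1, 3)                      # input (leaf, gradients not tracked)
    W = torch.randn(4, 3, requires_grad=True)  # weight (leaf, tracked)
    b = torch.randn(4, requires_grad=True)     # bias (leaf, tracked)
    y = torch.randn(1, 4)                      # label

    z = x @ W.T + b               # affine layer (non-leaf)
    a = torch.relu(z)             # ReLU activation (non-leaf)
    loss = ((a - y) ** 2).mean()  # MSE loss (non-leaf, scalar)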
@@ -137,8 +122,9 @@
 ######################################################################
 # The distinction between leaf and non-leaf determines whether the
 # tensor’s gradient will be stored in the ``grad`` property after the
-# backward pass, and thus be usable for gradient descent optimization.
-# We’ll cover this some more in the `following section <#retain-grad>`__.
+# backward pass, and thus be usable for `gradient
+# descent <https://en.wikipedia.org/wiki/Gradient_descent>`__. We’ll cover
+# this some more in the `following section <#retain-grad>`__.
 #
 # Let’s now investigate how PyTorch calculates and stores gradients for
 # the tensors in its computational graph.
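A quick standalone illustration of the leaf vs. non-leaf distinction referenced above (illustrative, not part of the commit):

    import torch

    a = torch.randn(3, requires_grad=True)  # created directly by the user -> leaf
    b = torch.randn(3)                      # also a leaf (gradients not tracked)
    c = a * 2                               # produced by an op on a tracked tensor -> non-leaf

    print(a.is_leaf, b.is_leaf, c.is_leaf)  # True True False
    print(c.grad_fn)                        # <MulBackward0 ...>; for leaf tensors grad_fn is None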
@@ -149,24 +135,24 @@
 # ``requires_grad``
 # -----------------
 #
-# To start the generation of the computational graph which can be used for
-# gradient calculation, we need to pass in the ``requires_grad=True``
-# parameter to a tensor constructor. By default, the value is ``False``,
-# and thus PyTorch does not track gradients on any created tensors. To
-# verify this, try not setting ``requires_grad``, re-run the forward pass,
-# and then run backpropagation. You will see:
+# To build the computational graph which can be used for gradient
+# calculation, we need to pass in the ``requires_grad=True`` parameter to
+# a tensor constructor. By default, the value is ``False``, and thus
+# PyTorch does not track gradients on any created tensors. To verify this,
+# try not setting ``requires_grad``, re-run the forward pass, and then run
+# backpropagation. You will see:
 #
 # ::
 #
 #    >>> loss.backward()
 #    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
 #
-# PyTorch is telling us that because the tensor is not tracking gradients,
-# autograd can’t backpropagate to any leaf tensors. If you need to change
-# the property, you can call ``requires_grad_()`` on the tensor (notice
-# the ’_’ suffix).
+# This error means that autograd can’t backpropagate to any leaf tensors
+# because ``loss`` is not tracking gradients. If you need to change the
+# property, you can call ``requires_grad_()`` on the tensor (notice the \_
+# suffix).
 #
-# We can sanity-check which nodes require gradient calculation, just like
+# We can sanity check which nodes require gradient calculation, just like
 # we did above with the ``is_leaf`` attribute:
 #
 
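A standalone example of the default ``requires_grad=False`` behavior and the in-place ``requires_grad_()`` toggle described in the revised text (illustrative, not part of the commit):

    import torch

    x = torch.randn(3)      # requires_grad defaults to False
    y = (x * 2).sum()
    print(y.requires_grad)  # False -- no graph was recorded, so y.backward() would raise

    x.requires_grad_()      # flip the flag in place (note the trailing underscore)
    y = (x * 2).sum()
    y.backward()            # now succeeds
    print(x.grad)           # tensor([2., 2., 2.])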
@@ -176,11 +162,11 @@
 
 
 ######################################################################
-# It’s useful to remember that by definition a non-leaf tensor has
-# ``requires_grad=True``. Backpropagation would fail if this wasn’t the
-# case. If the tensor is a leaf, then it will only have
+# It’s useful to remember that a non-leaf tensor has
+# ``requires_grad=True`` by definition, since backpropagation would fail
+# otherwise. If the tensor is a leaf, then it will only have
 # ``requires_grad=True`` if it was specifically set by the user. Another
-# way to phrase this is that if at least one of the inputs to the tensor
+# way to phrase this is that if at least one of the inputs to a tensor
 # requires the gradient, then it will require the gradient as well.
 #
 # There are two exceptions to this rule:
@@ -193,8 +179,8 @@
 #
 # In summary, ``requires_grad`` tells autograd which tensors need to have
 # their gradients calculated for backpropagation to work. This is
-# different from which gradients have to be stored inside the tensor,
-# which is the topic of the next section.
+# different from which tensors have their ``grad`` field populated, which
+# is the topic of the next section.
 #
 
 
@@ -210,9 +196,9 @@
 
 
 ######################################################################
-# This single function call populated the ``grad`` property of all leaf
-# tensors which had ``requires_grad=True``. The ``grad`` is the gradient
-# of the loss with respect to the tensor we are probing. Before running
+# Calling ``backward()`` populates the ``grad`` field of all leaf tensors
+# which had ``requires_grad=True``. The ``grad`` is the gradient of the
+# loss with respect to the tensor we are probing. Before running
 # ``backward()``, this attribute is set to ``None``.
 #
 
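The revised wording can be checked in isolation with a throwaway leaf tensor (illustrative, not part of the commit):

    import torch

    w = torch.randn(2, requires_grad=True)
    loss = (w ** 2).sum()
    print(w.grad)   # None -- backward() has not run yet
    loss.backward()
    print(w.grad)   # equals 2 * w, the gradient of the loss with respect to w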
@@ -242,7 +228,7 @@
 
 
 ######################################################################
-# We also get ``None`` for the gradient, but now PyTorch warns us that a
+# PyTorch returns ``None`` for the gradient and also warns us that a
 # non-leaf node’s ``grad`` attribute is being accessed. Although autograd
 # has to calculate intermediate gradients for backpropagation to work, it
 # assumes you don’t need to access the values afterwards. To change this
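A minimal sketch of ``retain_grad()`` on a non-leaf tensor, for trying the described behavior outside the tutorial (illustrative, not part of the commit):

    import torch

    w = torch.randn(2, requires_grad=True)
    z = w * 3        # non-leaf tensor
    z.retain_grad()  # ask autograd to keep z's gradient after the backward pass
    z.sum().backward()
    print(z.grad)    # tensor([1., 1.]) instead of None, and no warning is raised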
@@ -304,15 +290,21 @@
 #    >>> x.retain_grad()
 #    RuntimeError: can't retain_grad on Tensor that has requires_grad=False
 #
-# In summary, using ``retain_grad()`` and ``retains_grad`` only make sense
-# for non-leaf nodes, since the ``grad`` attribute will already be
-# populated for leaf tensors that have ``requires_grad=True``. By default,
-# these non-leaf nodes do not retain (store) their gradient after
+
+
+######################################################################
+# Summary table
+# -------------
+#
+# Using ``retain_grad()`` and ``retains_grad`` only makes sense for
+# non-leaf nodes, since the ``grad`` attribute will already be populated
+# for leaf tensors that have ``requires_grad=True``. By default, these
+# non-leaf nodes do not retain (store) their gradient after
 # backpropagation. We can change that by rerunning the forward pass,
 # telling PyTorch to store the gradients, and then performing
 # backpropagation.
 #
-# The following table can be used as a cheat-sheet which summarizes the
+# The following table can be used as a reference which summarizes the
 # above discussions. The following scenarios are the only ones that are
 # valid for PyTorch tensors.
 #
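The summary table itself lies outside this hunk; the valid scenarios it covers can be reproduced with a few illustrative tensors (not part of the commit):

    import torch

    a = torch.randn(2)                      # leaf, requires_grad=False: grad stays None
    b = torch.randn(2, requires_grad=True)  # leaf, requires_grad=True: grad is populated
    c = b * 2                               # non-leaf, not retained: grad stays None (warns if accessed)
    d = b * 3
    d.retain_grad()                         # non-leaf with retains_grad=True: grad is kept

    (c.sum() + d.sum()).backward()
    print(a.grad, b.grad, d.grad)           # None tensor([5., 5.]) tensor([1., 1.])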
@@ -344,16 +336,18 @@
 # To illustrate the importance of gradient visualization, we will
 # instantiate one version of the network with batch normalization
 # (BatchNorm), and one without it. Batch normalization is an extremely
-# effective technique to resolve the vanishing/exploding gradients issue,
-# and we will be verifying that experimentally.
-#
-# The model we will use has a specified number of repeating
-# fully-connected layers which alternate between ``nn.Linear``,
-# ``norm_layer``, and ``nn.Sigmoid``. If we apply batch normalization,
-# then ``norm_layer`` will use
+# effective technique to resolve `vanishing/exploding
+# gradients <https://arxiv.org/abs/1211.5063>`__, and we will be verifying
+# that experimentally.
+#
+# The model we use has a configurable number of repeating fully-connected
+# layers which alternate between ``nn.Linear``, ``norm_layer``, and
+# ``nn.Sigmoid``. If batch normalization is enabled, then ``norm_layer``
+# will use
 # `BatchNorm1d <https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html>`__,
-# otherwise it will use the identity transformation
-# `Identity <https://docs.pytorch.org/docs/stable/generated/torch.nn.Identity.html>`__.
+# otherwise it will use the
+# `Identity <https://docs.pytorch.org/docs/stable/generated/torch.nn.Identity.html>`__
+# transformation.
 #
 
 def fc_layer(in_size, out_size, norm_layer):
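The body of ``fc_layer`` is not shown in this hunk. A hypothetical reconstruction matching the description above could look like the following; the name ``fc_layer_sketch`` and the layer sizes are invented for illustration:

    import torch.nn as nn

    def fc_layer_sketch(in_size, out_size, norm_layer):
        """One Linear -> norm_layer -> Sigmoid block, as described in the text."""
        return nn.Sequential(nn.Linear(in_size, out_size), norm_layer(out_size), nn.Sigmoid())

    block_bn = fc_layer_sketch(32, 32, nn.BatchNorm1d)  # with batch normalization
    block_nobn = fc_layer_sketch(32, 32, nn.Identity)   # nn.Identity ignores the extra argument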
@@ -416,60 +410,60 @@ def forward(self, x):
 
 
 ######################################################################
-# Because we are using a ``nn.Module`` instead of individual tensors for
-# our forward pass, we need another method to access the intermediate
-# gradients. This is done by `registering a
-# hook <https://www.digitalocean.com/community/tutorials/pytorch-hooks-gradient-clipping-debugging>`__.
+# Because we wrapped up the logic and state of our model in a
+# ``nn.Module``, we need another method to access the intermediate
+# gradients if we want to avoid modifying the module code directly. This
+# is done by `registering a
+# hook <https://docs.pytorch.org/docs/stable/notes/autograd.html#backward-hooks-execution>`__.
 #
 # .. warning::
 #
-#    Note that using backward pass hooks to probe an intermediate nodes gradient is preferred over using `retain_grad()`.
-#    It avoids the memory retention overhead if gradients aren't needed after backpropagation.
-#    It also lets you modify and/or clamp gradients during the backward pass, so they don't vanish or explode.
-#    However, if in-place operations are performed, you cannot use the backward pass hook
-#    since it wraps the forward pass with views instead of the actual tensors. For more information
-#    please refer to https://github.com/pytorch/pytorch/issues/61519.
+#    Using backward pass hooks attached to output tensors is preferred over using ``retain_grad()`` on the tensors themselves. An alternative method is to directly attach module hooks (e.g. ``register_full_backward_hook()``) so long as the ``nn.Module`` instance does not perform any in-place operations. For more information, please refer to `this issue <https://github.com/pytorch/pytorch/issues/61519>`__.
 #
-# The following code defines our forward pass hook (notice the call to
-# ``retain_grad()``) and also gathers descriptive names for the network’s
-# layers.
+# The following code defines our hooks and gathers descriptive names for
+# the network’s layers.
 #
 
-def hook_forward_wrapper(module_name, outputs):
-    """Python function closure so we can pass args"""
+# note that wrapper functions are used for Python closure
+# so that we can pass arguments.
+
+def hook_forward_wrapper(module_name, grads):
     def hook_forward(module, args, output):
-        """Hook for forward pass which retains gradients and saves intermediate tensors"""
-        output.retain_grad()
-        outputs.append((module_name, output))
+        """Forward pass hook which attaches backward pass hooks to intermediate tensors"""
+        output.register_hook(hook_backward_wrapper(module_name, grads))
     return hook_forward
+
+def hook_backward_wrapper(module_name, grads):
+    def hook_backward(grad):
+        """Backward pass hook which appends gradients"""
+        grads.append((module_name, grad))
+    return hook_backward
 
 def get_all_layers(model, hook_fn):
-    """Register forward pass hook to all outputs in model
+    """Register forward pass hook (hook_fn) to model outputs
 
-    Returns layers, a dict with keys as layer/module and values as layer/module names
-    e.g.: layers[nn.Conv2d] = layer1.0.conv1
-
-    Returns outputs, a list of tuples with module name and tensor output. e.g.:
-    outputs[0] == (layer1.0.conv1, tensor.Torch(...))
-
-    The layer name is passed to a forward hook which will eventually go into a tuple
+    Returns:
+        - layers: a dict with keys as layer/module and values as layer/module names
+          e.g. layers[nn.Conv2d] = layer1.0.conv1
+        - grads: a list of tuples with module name and tensor output gradient
+          e.g. grads[0] == (layer1.0.conv1, tensor.Torch(...))
     """
    layers = dict()
-    outputs = []
+    grads = []
    for name, layer in model.named_modules():
+        # skip Sequential and/or wrapper modules
        if any(layer.children()) is False:
-            # skip Sequential and/or wrapper modules
            layers[layer] = name
-            layer.register_forward_hook(hook_forward_wrapper(name, outputs))
-    return layers, outputs
+            layer.register_forward_hook(hook_fn(name, grads))
+    return layers, grads
 
 # register hooks
-layers_bn, outputs_bn = get_all_layers(model_bn, hook_forward_wrapper)
-layers_nobn, outputs_nobn = get_all_layers(model_nobn, hook_forward_wrapper)
+layers_bn, grads_bn = get_all_layers(model_bn, hook_forward_wrapper)
+layers_nobn, grads_nobn = get_all_layers(model_nobn, hook_forward_wrapper)
 
 
 ######################################################################
-# Now let’s train the models for a few epochs:
+# Let’s now train the models for a few epochs:
 #
 
 epochs = 10
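For readers unfamiliar with ``Tensor.register_hook`` (the API this commit switches to), here is a standalone sketch of its behavior on a toy tensor, unrelated to the tutorial’s model:

    import torch

    grads = []
    x = torch.randn(3, requires_grad=True)
    y = x * 2  # intermediate (non-leaf) tensor

    # the hook is called with y's gradient during the backward pass,
    # without retaining that gradient in y.grad
    y.register_hook(lambda grad: grads.append(("y", grad)))

    y.sum().backward()
    print(grads)   # [('y', tensor([1., 1., 1.]))]
    print(y.grad)  # None (with a warning): the gradient was observed, not stored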
@@ -478,8 +472,8 @@ def get_all_layers(model, hook_fn):
 
     # important to clear, because we append to
     # outputs everytime we do a forward pass
-    outputs_bn.clear()
-    outputs_nobn.clear()
+    grads_bn.clear()
+    grads_nobn.clear()
 
     optimizer_bn.zero_grad()
     optimizer_nobn.zero_grad()
@@ -498,33 +492,33 @@ def get_all_layers(model, hook_fn):
 
 
 ######################################################################
-# After running the forward and backward pass, the ``grad`` values for all
-# the intermediate tensors should be present in ``outputs_bn`` and
-# ``outputs_nobn``. We reduce the gradient matrix to a single number (mean
-# absolute value) so that we can compare the two models.
+# After running the forward and backward pass, the gradients for all the
+# intermediate tensors should be present in ``grads_bn`` and
+# ``grads_nobn``. We compute the mean absolute value of each gradient
+# matrix so that we can compare the two models.
 #
 
-def get_grads(outputs):
+def get_grads(grads):
    layer_idx = []
    avg_grads = []
-    for idx, (name, output) in enumerate(outputs):
-        if output.grad is not None:
-            avg_grad = output.grad.abs().mean()
+    for idx, (name, grad) in enumerate(grads):
+        if grad is not None:
+            avg_grad = grad.abs().mean()
            avg_grads.append(avg_grad)
-            layer_idx.append(idx)
+            # idx is backwards since we appended in backward pass
+            layer_idx.append(len(grads) - 1 - idx)
    return layer_idx, avg_grads
 
-layer_idx_bn, avg_grads_bn = get_grads(outputs_bn)
-layer_idx_nobn, avg_grads_nobn = get_grads(outputs_nobn)
+layer_idx_bn, avg_grads_bn = get_grads(grads_bn)
+layer_idx_nobn, avg_grads_nobn = get_grads(grads_nobn)
 
 
 ######################################################################
-# Now that we have all our gradients stored in ``avg_grads``, we can plot
-# them and see how the average gradient values change as a function of the
-# network depth. We see that when we don’t have batch normalization, the
-# gradient values in the intermediate layers fall to zero very quickly.
-# The batch normalization model, however, maintains non-zero gradients in
-# its intermediate layers.
+# With the average gradients computed, we can now plot them and see how
+# the values change as a function of the network depth. Notice that when
+# we don’t apply batch normalization, the gradient values in the
+# intermediate layers fall to zero very quickly. The batch normalization
+# model, however, maintains non-zero gradients in its intermediate layers.
 #
 
 fig, ax = plt.subplots()
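The new ``len(grads) - 1 - idx`` expression reverses the index because backward pass hooks fire from the loss toward the input, so ``grads`` is filled in reverse layer order. A tiny illustration with made-up entries:

    # hypothetical contents: deepest layer first, since it receives its gradient first
    grads = [("layers.2", None), ("layers.1", None), ("layers.0", None)]
    for idx, (name, _) in enumerate(grads):
        print(name, "-> plotted at layer index", len(grads) - 1 - idx)
    # layers.2 -> plotted at layer index 2
    # layers.1 -> plotted at layer index 1
    # layers.0 -> plotted at layer index 0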
@@ -566,7 +560,7 @@ def get_grads(outputs):
 # - Try increasing the number of layers (``num_layers``) in our model and
 #   see what effect this has on the gradient flow graph
 # - How would you adapt the code to visualize average activations instead
-#   of average gradients? (*Hint: in the ``get_grads()`` function we have
+#   of average gradients? (*Hint: in the ``hook_forward()`` function we have
 #   access to the raw tensor output*)
 # - What are some other methods to deal with vanishing and exploding
 #   gradients?
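One possible direction for the activations exercise above, sketched with an invented ``activation_hook_wrapper`` helper (not part of the commit): record the raw forward output inside the forward hook instead of attaching a backward hook.

    def activation_hook_wrapper(module_name, activations):
        def hook_forward(module, args, output):
            # store a detached copy so the autograd graph is not kept alive
            activations.append((module_name, output.detach()))
        return hook_forward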
@@ -585,9 +579,6 @@ def get_grads(outputs):
 #   mechanics <https://docs.pytorch.org/docs/stable/notes/autograd.html>`__
 # - `Batch Normalization: Accelerating Deep Network Training by Reducing
 #   Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`__
-#
-
-
-######################################################################
-#
+# - `On the difficulty of training Recurrent Neural
+#   Networks <https://arxiv.org/abs/1211.5063>`__
 #

0 commit comments