**Author:** `Justin Silver <https://github.com/j-silv>`__
- When training neural networks with PyTorch, it’s possible to ignore some
- of the library’s internal mechanisms. For example, running
- backpropagation requires a simple call to ``backward()``. This tutorial
- dives into how those gradients are calculated and stored in two
- different kinds of PyTorch tensors: leaf vs. non-leaf. It will also
- cover how we can extract and visualize gradients at any layer in the
- network’s computational graph. By inspecting how information flows from
- the end of the network to the parameters we want to optimize, we can
- debug issues that occur during training such as `vanishing or exploding
- gradients <https://arxiv.org/abs/1211.5063>`__.
-
- By the end of this tutorial, you will be able to:
-
- - Differentiate leaf vs. non-leaf tensors
- - Know when to use ``requires_grad`` vs. ``retain_grad``
- - Visualize gradients after backpropagation in a neural network
-
- We will start off with a simple network to understand how PyTorch
- calculates and stores gradients. Building on this knowledge, we will
- then visualize the gradient flow of a more complicated model and see the
- effect that `batch normalization <https://arxiv.org/abs/1502.03167>`__
- has on the gradient distribution.
-
- Before starting, it is recommended to have a solid understanding of
- `tensors and how to manipulate
+ This tutorial explains the subtleties of ``requires_grad``,
+ ``retain_grad``, leaf, and non-leaf tensors using a simple example. It
+ then covers how to extract and visualize gradients at any layer in a
+ neural network. By inspecting how information flows from the end of the
+ network to the parameters we want to optimize, we can debug issues such
+ as `vanishing or exploding
+ gradients <https://arxiv.org/abs/1211.5063>`__ that occur during
+ training.
+
+ Before starting, make sure you understand `tensors and how to manipulate
them <https://docs.pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html>`__.
A basic knowledge of `how autograd
works <https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html>`__


######################################################################
- # Next, we will instantiate a simple network so that we can focus on the
- # gradients. This will be an affine layer, followed by a ReLU activation,
- # and ending with a MSE loss between the prediction and label tensors.
+ # Next, we instantiate a simple network to focus on the gradients. This
+ # will be an affine layer, followed by a ReLU activation, and ending with
+ # an MSE loss between prediction and label tensors.
#
# .. math::
#
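#
# As a rough sketch of the kind of forward pass described above (the
# tensor names and shapes here are illustrative assumptions, not the
# tutorial's own code):
#
# ::
#
#    x = torch.randn(1, 3)                      # input
#    y = torch.randn(1, 2)                      # label
#    W = torch.randn(2, 3, requires_grad=True)  # weight to optimize
#    b = torch.randn(2, requires_grad=True)     # bias to optimize
#
#    y_hat = torch.relu(x @ W.T + b)    # affine layer followed by ReLU
#    loss = ((y_hat - y) ** 2).mean()   # MSE loss against the label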
######################################################################
# The distinction between leaf and non-leaf determines whether the
# tensor’s gradient will be stored in the ``grad`` property after the
- # backward pass, and thus be usable for gradient descent optimization.
- # We’ll cover this some more in the `following section <#retain-grad>`__.
+ # backward pass, and thus be usable for `gradient
+ # descent <https://en.wikipedia.org/wiki/Gradient_descent>`__. We’ll cover
+ # this in more detail in the `following section <#retain-grad>`__.
#
# Let’s now investigate how PyTorch calculates and stores gradients for
# the tensors in its computational graph.
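#
# As a quick illustration of the leaf vs. non-leaf distinction (the tensor
# names here are illustrative, not the tutorial's own):
#
# ::
#
#    >>> a = torch.tensor([1.0, 2.0], requires_grad=True)  # created directly by the user -> leaf
#    >>> b = a * 2                                          # produced by an operation -> non-leaf
#    >>> a.is_leaf, b.is_leaf
#    (True, False)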
# ``requires_grad``
# -----------------
#
- # To start the generation of the computational graph which can be used for
- # gradient calculation, we need to pass in the ``requires_grad=True``
- # parameter to a tensor constructor. By default, the value is ``False``,
- # and thus PyTorch does not track gradients on any created tensors. To
- # verify this, try not setting ``requires_grad``, re-run the forward pass,
- # and then run backpropagation. You will see:
+ # To build the computational graph which can be used for gradient
+ # calculation, we need to pass in the ``requires_grad=True`` parameter to
+ # a tensor constructor. By default, the value is ``False``, and thus
+ # PyTorch does not track gradients on any created tensors. To verify this,
+ # try not setting ``requires_grad``, re-run the forward pass, and then run
+ # backpropagation. You will see:
#
# ::
#
#    >>> loss.backward()
#    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
#
- # PyTorch is telling us that because the tensor is not tracking gradients,
- # autograd can’t backpropagate to any leaf tensors. If you need to change
- # the property, you can call ``requires_grad_()`` on the tensor (notice
- # the ’_’ suffix).
+ # This error means that autograd can’t backpropagate to any leaf tensors
+ # because ``loss`` is not tracking gradients. If you need to change the
+ # property, you can call ``requires_grad_()`` on the tensor (notice the
+ # ``_`` suffix).
#
- # We can sanity- check which nodes require gradient calculation, just like
+ # We can sanity check which nodes require gradient calculation, just like
# we did above with the ``is_leaf`` attribute:
#
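#
# In the same spirit, a minimal sketch of such a check (with illustrative
# tensor names, not the tutorial's own) could look like:
#
# ::
#
#    >>> x = torch.tensor([1.0, 2.0])                       # requires_grad defaults to False
#    >>> w = torch.tensor([0.5, 0.5], requires_grad=True)
#    >>> out = (w * x).sum()
#    >>> x.requires_grad, w.requires_grad, out.requires_grad
#    (False, True, True)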


######################################################################
- # It’s useful to remember that by definition a non-leaf tensor has
- # ``requires_grad=True``. Backpropagation would fail if this wasn’t the
- # case. If the tensor is a leaf, then it will only have
+ # It’s useful to remember that a non-leaf tensor has
+ # ``requires_grad=True`` by definition, since backpropagation would fail
+ # otherwise. If the tensor is a leaf, then it will only have
# ``requires_grad=True`` if it was specifically set by the user. Another
- # way to phrase this is that if at least one of the inputs to the tensor
+ # way to phrase this is that if at least one of the inputs to a tensor
# requires the gradient, then it will require the gradient as well.
#
# There are two exceptions to this rule:
#
# In summary, ``requires_grad`` tells autograd which tensors need to have
# their gradients calculated for backpropagation to work. This is
- # different from which gradients have to be stored inside the tensor,
- # which is the topic of the next section.
+ # different from which tensors have their ``grad`` field populated, which
+ # is the topic of the next section.
#
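#
# A minimal sketch of that propagation rule (names are illustrative):
#
# ::
#
#    >>> a = torch.ones(2, requires_grad=True)
#    >>> b = torch.ones(2)              # requires_grad is False
#    >>> c = a + b                      # one input requires grad, so c does too
#    >>> c.requires_grad, c.is_leaf
#    (True, False)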


######################################################################
- # This single function call populated the ``grad`` property of all leaf
- # tensors which had ``requires_grad=True``. The ``grad`` is the gradient
- # of the loss with respect to the tensor we are probing. Before running
+ # Calling ``backward()`` populates the ``grad`` field of all leaf tensors
+ # that have ``requires_grad=True``. The ``grad`` is the gradient of the
+ # loss with respect to the tensor we are probing. Before running
# ``backward()``, this attribute is set to ``None``.
#
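#
# A minimal sketch of that behavior, using illustrative names rather than
# the tutorial's own tensors:
#
# ::
#
#    >>> w = torch.tensor([2.0], requires_grad=True)   # leaf tensor
#    >>> loss = (3 * w).sum()
#    >>> print(w.grad)
#    None
#    >>> loss.backward()
#    >>> print(w.grad)
#    tensor([3.])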


######################################################################
- # We also get ``None`` for the gradient, but now PyTorch warns us that a
+ # PyTorch returns ``None`` for the gradient and also warns us that a
# non-leaf node’s ``grad`` attribute is being accessed. Although autograd
# has to calculate intermediate gradients for backpropagation to work, it
# assumes you don’t need to access the values afterwards. To change this
# >>> x.retain_grad()
# RuntimeError: can't retain_grad on Tensor that has requires_grad=False
#
- # In summary, using ``retain_grad()`` and ``retains_grad`` only make sense
- # for non-leaf nodes, since the ``grad`` attribute will already be
- # populated for leaf tensors that have ``requires_grad=True``. By default,
- # these non-leaf nodes do not retain (store) their gradient after
+
+
+ ######################################################################
+ # Summary table
+ # -------------
+ #
+ # Using ``retain_grad()`` and ``retains_grad`` only makes sense for
+ # non-leaf nodes, since the ``grad`` attribute will already be populated
+ # for leaf tensors that have ``requires_grad=True``. By default, these
+ # non-leaf nodes do not retain (store) their gradient after
# backpropagation. We can change that by rerunning the forward pass,
# telling PyTorch to store the gradients, and then performing
# backpropagation.
#
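#
# A compact sketch of that workflow (the tensor names are illustrative,
# not the tutorial's own):
#
# ::
#
#    >>> x = torch.tensor([1.0, 2.0], requires_grad=True)
#    >>> y = x * 2                  # non-leaf tensor
#    >>> y.retain_grad()            # ask autograd to keep y.grad after backward()
#    >>> y.sum().backward()
#    >>> y.grad
#    tensor([1., 1.])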
- # The following table can be used as a cheat-sheet which summarizes the
+ # The following table can be used as a reference which summarizes the
# above discussions. The following scenarios are the only ones that are
# valid for PyTorch tensors.
#
# To illustrate the importance of gradient visualization, we will
# instantiate one version of the network with batch normalization
# (BatchNorm), and one without it. Batch normalization is an extremely
- # effective technique to resolve the vanishing/exploding gradients issue,
- # and we will be verifying that experimentally.
- #
- # The model we will use has a specified number of repeating
- # fully-connected layers which alternate between ``nn.Linear``,
- # ``norm_layer``, and ``nn.Sigmoid``. If we apply batch normalization,
- # then ``norm_layer`` will use
+ # effective technique to resolve `vanishing/exploding
+ # gradients <https://arxiv.org/abs/1211.5063>`__, and we will be verifying
+ # that experimentally.
+ #
+ # The model we use has a configurable number of repeating fully-connected
+ # layers which alternate between ``nn.Linear``, ``norm_layer``, and
+ # ``nn.Sigmoid``. If batch normalization is enabled, then ``norm_layer``
+ # will use
# `BatchNorm1d <https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html>`__,
- # otherwise it will use the identity transformation
- # `Identity <https://docs.pytorch.org/docs/stable/generated/torch.nn.Identity.html>`__.
+ # otherwise it will use the
+ # `Identity <https://docs.pytorch.org/docs/stable/generated/torch.nn.Identity.html>`__
+ # transformation.
#

def fc_layer(in_size, out_size, norm_layer):
@@ -416,60 +410,60 @@ def forward(self, x):


######################################################################
- # Because we are using a ``nn.Module`` instead of individual tensors for
- # our forward pass, we need another method to access the intermediate
- # gradients. This is done by `registering a
- # hook <https://www.digitalocean.com/community/tutorials/pytorch-hooks-gradient-clipping-debugging>`__.
+ # Because we wrapped up the logic and state of our model in a
+ # ``nn.Module``, we need another method to access the intermediate
+ # gradients if we want to avoid modifying the module code directly. This
+ # is done by `registering a
+ # hook <https://docs.pytorch.org/docs/stable/notes/autograd.html#backward-hooks-execution>`__.
#
# .. warning::
#
- #    Note that using backward pass hooks to probe an intermediate node's gradient is preferred over using `retain_grad()`.
- #    It avoids the memory retention overhead if gradients aren't needed after backpropagation.
- #    It also lets you modify and/or clamp gradients during the backward pass, so they don't vanish or explode.
- #    However, if in-place operations are performed, you cannot use the backward pass hook
- #    since it wraps the forward pass with views instead of the actual tensors. For more information
- #    please refer to https://github.com/pytorch/pytorch/issues/61519.
+ #    Using backward pass hooks attached to output tensors is preferred over using ``retain_grad()`` on the tensors themselves. An alternative method is to directly attach module hooks (e.g. ``register_full_backward_hook()``) so long as the ``nn.Module`` instance does not perform any in-place operations. For more information, please refer to `this issue <https://github.com/pytorch/pytorch/issues/61519>`__.
#
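#
# For reference, a minimal sketch of the module-hook alternative mentioned
# in the warning above (the layer, hook name, and shapes are illustrative
# assumptions; ``torch`` and ``torch.nn`` are assumed to be imported as in
# the rest of the tutorial):
#
# ::
#
#    def grad_hook(module, grad_input, grad_output):
#        # grad_output is a tuple of gradients w.r.t. the module's outputs
#        print(module, grad_output[0].abs().mean())
#
#    layer = nn.Linear(4, 4)
#    handle = layer.register_full_backward_hook(grad_hook)
#    layer(torch.randn(1, 4)).sum().backward()
#    handle.remove()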
- # The following code defines our forward pass hook (notice the call to
- # ``retain_grad()``) and also gathers descriptive names for the network’s
- # layers.
+ # The following code defines our hooks and gathers descriptive names for
+ # the network’s layers.
#

- def hook_forward_wrapper(module_name, outputs):
-     """Python function closure so we can pass args"""
+ # Note that wrapper functions are used to create Python closures
+ # so that we can pass arguments.
+
+ def hook_forward_wrapper(module_name, grads):
    def hook_forward(module, args, output):
-         """Hook for forward pass which retains gradients and saves intermediate tensors"""
-         output.retain_grad()
-         outputs.append((module_name, output))
+         """Forward pass hook which attaches backward pass hooks to intermediate tensors"""
+         output.register_hook(hook_backward_wrapper(module_name, grads))
    return hook_forward
+
+ def hook_backward_wrapper(module_name, grads):
+     def hook_backward(grad):
+         """Backward pass hook which appends gradients"""
+         grads.append((module_name, grad))
+     return hook_backward

def get_all_layers(model, hook_fn):
-     """Register forward pass hook to all outputs in model
+     """Register forward pass hook (hook_fn) to model outputs

-     Returns layers, a dict with keys as layer/module and values as layer/module names
-     e.g.: layers[nn.Conv2d] = layer1.0.conv1
-
-     Returns outputs, a list of tuples with module name and tensor output. e.g.:
-     outputs[0] == (layer1.0.conv1, tensor.Torch(...))
-
-     The layer name is passed to a forward hook which will eventually go into a tuple
+     Returns:
+         - layers: a dict with keys as layer/module and values as layer/module names
+           e.g. layers[nn.Conv2d] = layer1.0.conv1
+         - grads: a list of tuples with module name and tensor output gradient
+           e.g. grads[0] == (layer1.0.conv1, torch.Tensor(...))
    """
    layers = dict()
-     outputs = []
+     grads = []
    for name, layer in model.named_modules():
+         # skip Sequential and/or wrapper modules
        if any(layer.children()) is False:
-             # skip Sequential and/or wrapper modules
            layers[layer] = name
-             layer.register_forward_hook(hook_forward_wrapper(name, outputs))
-     return layers, outputs
+             layer.register_forward_hook(hook_fn(name, grads))
+     return layers, grads

# register hooks
- layers_bn, outputs_bn = get_all_layers(model_bn, hook_forward_wrapper)
- layers_nobn, outputs_nobn = get_all_layers(model_nobn, hook_forward_wrapper)
+ layers_bn, grads_bn = get_all_layers(model_bn, hook_forward_wrapper)
+ layers_nobn, grads_nobn = get_all_layers(model_nobn, hook_forward_wrapper)


######################################################################
- # Now let’s train the models for a few epochs:
+ # Let’s now train the models for a few epochs:
#

epochs = 10
@@ -478,8 +472,8 @@ def get_all_layers(model, hook_fn):

    # important to clear, because we append to
    # outputs every time we do a forward pass
-     outputs_bn.clear()
-     outputs_nobn.clear()
+     grads_bn.clear()
+     grads_nobn.clear()

    optimizer_bn.zero_grad()
    optimizer_nobn.zero_grad()
@@ -498,33 +492,33 @@ def get_all_layers(model, hook_fn):


######################################################################
- # After running the forward and backward pass, the ``grad`` values for all
- # the intermediate tensors should be present in ``outputs_bn`` and
- # ``outputs_nobn``. We reduce the gradient matrix to a single number (mean
- # absolute value) so that we can compare the two models.
+ # After running the forward and backward pass, the gradients for all the
+ # intermediate tensors should be present in ``grads_bn`` and
+ # ``grads_nobn``. We compute the mean absolute value of each gradient
+ # matrix so that we can compare the two models.
#

- def get_grads(outputs):
+ def get_grads(grads):
    layer_idx = []
    avg_grads = []
-     for idx, (name, output) in enumerate(outputs):
-         if output.grad is not None:
-             avg_grad = output.grad.abs().mean()
+     for idx, (name, grad) in enumerate(grads):
+         if grad is not None:
+             avg_grad = grad.abs().mean()
            avg_grads.append(avg_grad)
-             layer_idx.append(idx)
+             # idx is backwards since we appended during the backward pass
+             layer_idx.append(len(grads) - 1 - idx)
    return layer_idx, avg_grads

- layer_idx_bn, avg_grads_bn = get_grads(outputs_bn)
- layer_idx_nobn, avg_grads_nobn = get_grads(outputs_nobn)
+ layer_idx_bn, avg_grads_bn = get_grads(grads_bn)
+ layer_idx_nobn, avg_grads_nobn = get_grads(grads_nobn)


######################################################################
- # Now that we have all our gradients stored in ``avg_grads``, we can plot
- # them and see how the average gradient values change as a function of the
- # network depth. We see that when we don’t have batch normalization, the
- # gradient values in the intermediate layers fall to zero very quickly.
- # The batch normalization model, however, maintains non-zero gradients in
- # its intermediate layers.
+ # With the average gradients computed, we can now plot them and see how
+ # the values change as a function of the network depth. Notice that when
+ # we don’t apply batch normalization, the gradient values in the
+ # intermediate layers fall to zero very quickly. The batch normalization
+ # model, however, maintains non-zero gradients in its intermediate layers.
#

fig, ax = plt.subplots()
@@ -566,7 +560,7 @@ def get_grads(outputs):
# - Try increasing the number of layers (``num_layers``) in our model and
#   see what effect this has on the gradient flow graph
# - How would you adapt the code to visualize average activations instead
- #   of average gradients? (*Hint: in the ``get_grads()`` function we have
+ #   of average gradients? (*Hint: in the ``hook_forward()`` function we have
#   access to the raw tensor output*)
# - What are some other methods to deal with vanishing and exploding
#   gradients?
@@ -585,9 +579,6 @@ def get_grads(outputs):
# mechanics <https://docs.pytorch.org/docs/stable/notes/autograd.html>`__
# - `Batch Normalization: Accelerating Deep Network Training by Reducing
#   Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`__
- #
-
-
- ######################################################################
- #
+ # - `On the difficulty of training Recurrent Neural
+ #   Networks <https://arxiv.org/abs/1211.5063>`__
#