
Differentiation in TensorFlow

Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. In this guide, we will explore ways to compute gradients with TensorFlow in eager execution.

Gradients and Automatic Differentiation

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

w
<tf.Variable 'w:0' shape=(3, 2) dtype=float32, numpy=
array([[ 0.5979703 , -0.42594454],
       [ 0.44298685,  0.67796963],
       [-0.00147911, -1.0101731 ]], dtype=float32)>
b
<tf.Variable 'b:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>

TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation.
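As a minimal sketch of the mechanics before returning to w and b: record y = x**2 on a tape and ask for dy/dx = 2x afterwards (the name x_scalar is purely illustrative, chosen so the x defined above stays intact).

x_scalar = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y_scalar = x_scalar**2   # recorded on the tape

# dy/dx = 2 * x, evaluated at x = 3.0
print(tape.gradient(y_scalar, x_scalar).numpy())   # 6.0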

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

# Using lists
grad = tape.gradient(loss, [w, b])
print(f'w : {grad[0]} \n\nb : {grad[1]}\n\n')

# Using dictionaries
grad = tape.gradient(loss, {'w': w, 'b': b})
print(f'w : {grad["w"]} \n\nb : {grad["b"]}')
w : [[ 1.4795066 -2.1005244]
 [ 2.9590132 -4.201049 ]
 [ 4.43852   -6.3015733]] 

b : [ 1.4795066 -2.1005244]


w : [[ 1.4795066 -2.1005244]
 [ 2.9590132 -4.201049 ]
 [ 4.43852   -6.3015733]] 

b : [ 1.4795066 -2.1005244]
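In practice the variables usually live inside a model rather than being created by hand; tape.gradient accepts the model's trainable_variables list in the same way. A minimal sketch with a Keras Dense layer standing in for the w and b above:

layer = tf.keras.layers.Dense(2, activation='relu')

with tf.GradientTape() as tape:
    # Forward pass; the layer creates its kernel and bias on first call.
    y = layer(tf.constant(x))
    loss = tf.reduce_mean(y**2)

# One gradient per trainable variable (kernel and bias).
grad = tape.gradient(loss, layer.trainable_variables)

for var, g in zip(layer.trainable_variables, grad):
    print(f'{var.name}, shape: {g.shape}')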

# A trainable variable
x0 = tf.Variable(3.0, name='x0')

# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)

# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0

# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)
tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None

The GradientTape.watched_variables method returns the list of all variables the tape is watching.

[var.name for var in tape.watched_variables()]
['x0:0']

To disable the default behavior of watching every accessed tf.Variable, pass watch_accessed_variables=False when creating the gradient tape.

x0 = tf.Variable(0.0)
x1 = tf.Variable(10.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
    # set only x1 to be watched not x0
    tape.watch(x1)
    y0 = tf.math.sin(x0)
    y1 = tf.nn.softplus(x1)
    y = y0 + y1
    ys = tf.reduce_sum(y)

grad = tape.gradient(ys, {'x0': x0, 'x1': x1})

print('dy/dx0:', grad['x0'])
print('dy/dx1:', grad['x1'].numpy())
dy/dx0: None
dy/dx1: 0.9999546

By default, the resources held by a GradientTape are released as soon as the GradientTape.gradient method is called. To compute multiple gradients over the same computation, create a gradient tape with persistent=True. This allows multiple calls to the gradient method; the resources are then released only when the tape object is garbage collected.

x = tf.constant([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x * x * x   # y = x**4
    z = y * y * y * y   # z = y**4 = x**16

print(tape.gradient(z, x).numpy())   # dz/dx = 16 * x**15
print(tape.gradient(y, x).numpy())   # dy/dx = 4 * x**3
[1.6000000e+01 2.2958251e+08]
[  4. 108.]
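Because this tape was created with persistent=True, its resources are only released when the tape object is garbage collected; dropping the reference once you are done lets that happen promptly.

del tape   # drop the reference so the tape's resources can be released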

Notes on performance

  • There is a tiny overhead associated with doing operations inside a gradient tape context. For most eager execution this will not be a noticeable cost, but you should still use the tape context only around the areas where it is required.

  • Gradient tapes use memory to store intermediate results, including inputs and outputs, for use during the backwards pass.

  • For efficiency, some ops (like ReLU) don't need to keep their intermediate results and they are pruned during the forward pass. However, if you use persistent=True on your tape, nothing is discarded and your peak memory usage will be higher.

Control Flow

Here a different variable is used on each branch of an if. The gradient only connects to the variable that was used.

x = tf.constant(1.0)

v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        result = v0
    else:
        result = v1**2 

dv0, dv1 = tape.gradient(result, [v0, v1])

print(dv0)
print(dv1)

dx = tape.gradient(result, x)
print(dx)
tf.Tensor(1.0, shape=(), dtype=float32)
None
None

Control statements themselves are not differentiable, so they are invisible to gradient-based optimizers. Depending on the value of x in the above example, the tape either records result = v0 or result = v1**2. The gradient with respect to x is always None, because result does not depend on x through any differentiable operation; x only enters through the condition.
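To see the branch dependence concretely, re-running the same computation with a negative x takes the else branch, so only v1 receives a gradient (a minimal sketch reusing v0 and v1 from above):

x = tf.constant(-1.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        result = v0
    else:
        result = v1**2

dv0, dv1 = tape.gradient(result, [v0, v1])

print(dv0)   # None: v0 is not used on this branch
print(dv1)   # d(v1**2)/dv1 = 2 * v1 = 4.0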