I am using the Keras mixed precision API in order to fit my networks in GPU memory. In my code, the setup typically looks like this (MWE):
from tensorflow.keras.mixed_precision import experimental as mixed_precision
use_mixed_precision = True
if use_mixed_precision:
    policy_type = 'mixed_float16'
else:
    policy_type = 'float32'
policy = mixed_precision.Policy(policy_type)
mixed_precision.set_policy(policy)
This seems to have the desired effect, as, when I train my model and profile it using the TensorBoard Callback, a large chunk of my ops are run in half precision, and some of them are using TensorCore (I have a GPU with compute capability of more than 7.0).
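As a sanity check, the active policy's dtypes can also be inspected programmatically. A minimal sketch, assuming TF >= 2.4, where the API graduated from `experimental` to `tf.keras.mixed_precision`:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision  # non-experimental API in TF >= 2.4

mixed_precision.set_global_policy('mixed_float16')
policy = mixed_precision.global_policy()

# Under mixed_float16, computations run in float16 while variables stay float32.
print(policy.compute_dtype)   # float16
print(policy.variable_dtype)  # float32

# Layers created after the policy is set pick it up automatically.
layer = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
print(layer.dtype_policy.name)  # mixed_float16
```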
However, Conv2DBackpropFilter is not using TensorCore, even though, according to the TensorBoard information, it is eligible for it.
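For context, NVIDIA's guidance for fp16 Tensor Core kernels is that the relevant dimensions (for convolutions, notably the input and output channel counts) should be multiples of 8. A tiny helper to illustrate the check (`tensor_core_friendly_conv` is my own hypothetical name, not a TF API):

```python
def tensor_core_friendly_conv(in_channels, out_channels, multiple=8):
    # Heuristic only: fp16 Tensor Core convolution kernels generally want
    # both channel counts to be multiples of 8.
    return in_channels % multiple == 0 and out_channels % multiple == 0

print(tensor_core_friendly_conv(64, 64))  # True
print(tensor_core_friendly_conv(1, 8))    # False: a 1-channel input may fall back to non-TensorCore kernels
```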
I don't have a minimal reproducible example for the whole thing yet (I can work on one if needed), but I first wanted to know whether this is expected behaviour, or whether there are known gotchas, since I couldn't find any information online.
EDIT
I now have an MRE that exhibits a different behaviour but raises the same question: why is TensorCore not used, even though all the relevant dimensions are multiples of 8?
import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision
use_mixed_precision = True
if use_mixed_precision:
    policy_type = 'mixed_float16'
else:
    policy_type = 'float32'
policy = mixed_precision.Policy(policy_type)
mixed_precision.set_policy(policy)
nf = 8
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=nf, kernel_size=3, padding='same'),
    tf.keras.layers.Conv2D(filters=nf, kernel_size=3, padding='same'),
    tf.keras.layers.Conv2D(filters=nf, kernel_size=3, padding='same'),
])
model.compile(loss='mse', optimizer='sgd')
bs = 8
inputs = tf.random.normal([bs, 32, 32, 1])
outputs = tf.random.normal([bs, 32, 32, nf])
tboard_cback = tf.keras.callbacks.TensorBoard(
    profile_batch='5, 10',
    log_dir='logs',
    histogram_freq=0,
    write_graph=False,
    write_images=False,
)
model.fit(inputs, outputs, callbacks=[tboard_cback], epochs=15)
In this MRE, 64.2% of the op time is spent in half precision, so mixed precision is indeed kicking in. My logs also show the compute capability check passing:
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla V100-SXM2-32GB, compute capability 7.0
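For what it's worth, the compute capability reported in that log line can also be queried directly. A sketch, assuming TF >= 2.5 (where `tf.config.experimental.get_device_details` is available); it prints nothing on a CPU-only machine:

```python
import tensorflow as tf

# List visible GPUs and report their compute capability.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get('compute_capability'))  # e.g. (7, 0) on a V100
```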
Yet none of the ops (this time, not just Conv2DBackpropFilter) run with TensorCore, and I don't understand why.