As described in this GitHub issue (which has a very complete description), I'm trying to load an .onnx model with the OpenVINO backend. However, the BatchNorm layers are treated as being in training mode when track_running_stats=False is set.
Here is what I do before converting my torch model to ONNX:

model.eval()
for child in model.children():
    if isinstance(child, nn.BatchNorm2d):
        child.track_running_stats = False
        child.running_mean = None
        child.running_var = None
Then, I export the model to ONNX:

dummy_input = torch.randn(1, 3, 200, 200, requires_grad=True)
torch.onnx.export(
    model,
    dummy_input,
    model_path,
    export_params=True,
    opset_version=16,
    training=torch.onnx.TrainingMode.PRESERVE,
)
Finally, I get this error when loading it in OpenVINO:
Error: Check '(node.get_outputs_size() == 1)' failed at src/frontends/onnx/frontend/src/op/batch_norm.cpp:67:
While validating ONNX node '<Node(BatchNormalization): BatchNormalization_10>':
Training mode of BatchNormalization is not supported.
As mentioned in the GitHub issue, I have tried to inspect the BatchNorm inputs/outputs:

import onnx

onnx_model = onnx.load(model_path)
for node in onnx_model.graph.node:
    if any(("BatchNorm" in s or "bn" in s) for s in node.input) or any(("BatchNorm" in s or "bn" in s) for s in node.output):
        print('Node:', node.name)
        print(node)
You can see that these nodes are related to BN:
Node: ReduceMean_5
input: "onnx::ReduceMean_22"
output: "onnx::BatchNormalization_23"
name: "ReduceMean_5"
op_type: "ReduceMean"
attribute {
name: "axes"
ints: 0
ints: 1
type: INTS
}
attribute {
name: "keepdims"
i: 0
type: INT
}
Node: ReduceMean_9
input: "onnx::ReduceMean_26"
output: "onnx::BatchNormalization_27"
name: "ReduceMean_9"
op_type: "ReduceMean"
attribute {
name: "axes"
ints: 0
ints: 1
type: INTS
}
attribute {
name: "keepdims"
i: 0
type: INT
}
Node: BatchNormalization_10
input: "input"
input: "bn1.weight"
input: "bn1.bias"
input: "onnx::BatchNormalization_23"
input: "onnx::BatchNormalization_27"
output: "input.4"
output: "29"
output: "30"
name: "BatchNormalization_10"
op_type: "BatchNormalization"
attribute {
name: "epsilon"
f: 9.999999747378752e-06
type: FLOAT
}
attribute {
name: "momentum"
f: 0.8999999761581421
type: FLOAT
}
attribute {
name: "training_mode"
i: 1
type: INT
}
Node: BatchNormalization_13
input: "input.8"
input: "bn2.0.weight"
input: "bn2.0.bias"
input: "bn2.0.running_mean"
input: "bn2.0.running_var"
output: "input.12"
name: "BatchNormalization_13"
op_type: "BatchNormalization"
attribute {
name: "epsilon"
f: 9.999999747378752e-06
type: FLOAT
}
attribute {
name: "momentum"
f: 0.8999999761581421
type: FLOAT
}
attribute {
name: "training_mode"
i: 0
type: INT
}
Node: ReduceMean_18
input: "onnx::ReduceMean_37"
output: "onnx::BatchNormalization_38"
name: "ReduceMean_18"
op_type: "ReduceMean"
attribute {
name: "axes"
ints: 0
ints: 1
type: INTS
}
attribute {
name: "keepdims"
i: 0
type: INT
}
Node: ReduceMean_22
input: "onnx::ReduceMean_41"
output: "onnx::BatchNormalization_42"
name: "ReduceMean_22"
op_type: "ReduceMean"
attribute {
name: "axes"
ints: 0
ints: 1
type: INTS
}
attribute {
name: "keepdims"
i: 0
type: INT
}
Node: BatchNormalization_23
input: "input.16"
input: "bn3.weight"
input: "bn3.bias"
input: "onnx::BatchNormalization_38"
input: "onnx::BatchNormalization_42"
output: "43"
output: "44"
output: "45"
name: "BatchNormalization_23"
op_type: "BatchNormalization"
attribute {
name: "epsilon"
f: 9.999999747378752e-06
type: FLOAT
}
attribute {
name: "momentum"
f: 0.8999999761581421
type: FLOAT
}
attribute {
name: "training_mode"
i: 1
type: INT
}
You can indeed see that 2 of the 3 BN layers have training_mode = 1 (i.e. True).
How can I make ONNX consider them in eval mode while keeping track_running_stats=False?
I'm not very familiar with ONNX and am generally a beginner in DL, so I'd appreciate any advice!
I finally found a solution, based on a suggestion made in the GitHub issue linked in the question.

After setting track_running_stats to False, the BatchNormalization layers are considered to be in training mode, as you can see in the ONNX graph. I deleted the unused outputs (the batch mean and var) directly from the batch normalization nodes in the graph, then manually set each layer to eval mode (training_mode = 0). You must delete the unused outputs, not just set the training_mode attribute to 0, because otherwise the check will not pass.

After that, I'm able to run inference correctly and get the expected results.