Why isn't std::unique_ptr polymorphism optimized while std::variant polymorphism is?


I tried to compare the overhead of std::visit (std::variant polymorphism) and virtual functions (std::unique_ptr polymorphism). (Please note my question is not about overhead or performance, but about optimization.) Here is my code: https://quick-bench.com/q/pJWzmPlLdpjS5BvrtMb5hUWaPf0

#include <benchmark/benchmark.h>
#include <memory>
#include <variant>

struct Base
{
  virtual void Process() = 0;
};

struct Derived : public Base
{
  void Process() { ++a; }
  int a = 0;
};

struct VarDerived
{
  void Process() { ++a; }
  int a = 0;
};

static std::unique_ptr<Base> ptr;
static std::variant<VarDerived> var;

static void PointerPolyMorphism(benchmark::State& state)
{
  ptr = std::make_unique<Derived>();
  for (auto _ : state)
  {
    for(int i = 0; i < 1000000; ++i)
      ptr->Process();
  }
}
BENCHMARK(PointerPolyMorphism);

static void VariantPolyMorphism(benchmark::State& state)
{
  var.emplace<VarDerived>();
  for (auto _ : state)
  {
    for(int i = 0; i < 1000000; ++i)
      std::visit([](auto&& x) { x.Process();}, var);
  }
}
BENCHMARK(VariantPolyMorphism);

I know it's not a good benchmark test; it was only a draft during my testing. But I was surprised by the result. The std::visit benchmark was high (which means slow) without any optimization. But when I turn on optimization (higher than -O2), the std::visit benchmark is extremely low (which means extremely fast) while the std::unique_ptr one isn't. I'm wondering why the same optimization can't be applied to the std::unique_ptr polymorphism?


There are 2 answers

Answered by idmean (best answer)

I've compiled your code with Clang++ to LLVM IR (without your benchmarking code) with -Ofast. Here's what you get for VariantPolyMorphism, unsurprisingly:

define void @_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
  ret void
}

On the other hand, PointerPolyMorphism really does execute the loop and all the virtual calls:

define void @_Z19PointerPolyMorphismv() local_unnamed_addr #2 personality i32 (...)* @__gxx_personality_v0 {
  %1 = tail call dereferenceable(16) i8* @_Znwm(i64 16) #8, !noalias !8
  tail call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(16) %1, i8 0, i64 16, i1 false), !noalias !8
  %2 = bitcast i8* %1 to i32 (...)***
  store i32 (...)** bitcast (i8** getelementptr inbounds ({ [3 x i8*] }, { [3 x i8*] }* @_ZTV7Derived, i64 0, inrange i32 0, i64 2) to i32 (...)**), i32 (...)*** %2, align 8, !tbaa !11, !noalias !8
  %3 = getelementptr inbounds i8, i8* %1, i64 8
  %4 = bitcast i8* %3 to i32*
  store i32 0, i32* %4, align 8, !tbaa !13, !noalias !8
  %5 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* @_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
  store i8* %1, i8** bitcast ({ { %struct.Base* } }* @_ZL3ptr to i8**), align 8, !tbaa !4
  %6 = icmp eq %struct.Base* %5, null
  br i1 %6, label %7, label %8

7:                                                ; preds = %8, %0
  br label %11

8:                                                ; preds = %0
  %9 = bitcast %struct.Base* %5 to i8*
  tail call void @_ZdlPv(i8* %9) #7
  br label %7

10:                                               ; preds = %11
  ret void

11:                                               ; preds = %7, %11
  %12 = phi i32 [ %17, %11 ], [ 0, %7 ]
  %13 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* @_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
  %14 = bitcast %struct.Base* %13 to void (%struct.Base*)***
  %15 = load void (%struct.Base*)**, void (%struct.Base*)*** %14, align 8, !tbaa !11
  %16 = load void (%struct.Base*)*, void (%struct.Base*)** %15, align 8
  tail call void %16(%struct.Base* %13)
  %17 = add nuw nsw i32 %12, 1
  %18 = icmp eq i32 %17, 1000000
  br i1 %18, label %10, label %11
}

The reason for this is that both your variables are static. This allows the compiler to infer that no code outside the translation unit has access to your variant instance. Therefore your loop doesn't have any visible effect and can safely be removed. However, although your smart pointer is static, the memory it points to could still change (as a side effect of the call to Process, for example). The compiler therefore cannot easily prove that it is safe to remove the loop, and doesn't.
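To see the linkage argument in isolation, here is a minimal sketch (my own illustration, not the code compiled above; the names are made up): with internal linkage and no escaping reference, a counter's stores are dead and the optimizer is free to delete the whole loop, while an externally visible counter must at least receive its final value.

static int hidden;   // internal linkage: no other translation unit can ever read it
int visible;         // external linkage: another translation unit may read it later

void Touch()         // hypothetical function, for illustration only
{
    for (int i = 0; i < 1000000; ++i)
        ++hidden;    // no observable effect: the optimizer may remove this loop entirely
    for (int i = 0; i < 1000000; ++i)
        ++visible;   // typically folded to visible += 1000000; the final store must remain
}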

If you remove the static from both variables, for VariantPolyMorphism you get:

define void @_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
  store i32 0, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* @var, i64 0, i32 0, i32 1), align 4, !tbaa !16
  store i32 1000000, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* @var, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0), align 4, !tbaa !18
  ret void
}

Which, once again, isn't surprising. The variant can only contain VarDerived, so nothing needs to be computed at run time: the final state of the variant can already be determined at compile time. The difference now, though, is that some other translation unit might access the value of var later on, so the final value must still be written.
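Read back into C++, that IR is roughly equivalent to the following sketch (my reading of the IR, not part of the original code; the function name is hypothetical): set the active alternative and store the already-known final value of a.

#include <variant>

struct VarDerived { void Process() { ++a; } int a = 0; };

std::variant<VarDerived> var;   // no longer static: other translation units may observe it

void VariantPolyMorphismFolded()            // hypothetical name for the folded benchmark body
{
    var.emplace<VarDerived>();              // one store for the active index
    std::get<VarDerived>(var).a = 1000000;  // the whole loop folded into a single store
}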

Answered by Marek R
  1. Your variant can store only a single type, so it is the same as a single regular variable (it works more like an optional).
  2. You are running the test without optimizations enabled.
  3. The result is not protected from the optimizer, so it can throw your code away (see the sketch below).
  4. Your code does not actually exercise polymorphism; some compilers are able to figure out that there is only one implementation of the Base class and drop the virtual calls.

These versions are better but still not trustworthy: ver 1, ver 2 with arrays.
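A minimal sketch of point 3, assuming Google Benchmark (which quick-bench uses): benchmark::DoNotOptimize marks the result as observed and benchmark::ClobberMemory forces the compiler to assume memory may have been touched, so the measured work cannot simply be discarded. The benchmark name below is made up for illustration.

#include <benchmark/benchmark.h>
#include <variant>

struct VarDerived { void Process() { ++a; } int a = 0; };

static void VariantPolyMorphismGuarded(benchmark::State& state)  // illustrative name
{
  std::variant<VarDerived> var;
  for (auto _ : state)
  {
    for (int i = 0; i < 1000000; ++i)
      std::visit([](auto&& x) { x.Process(); }, var);
    benchmark::DoNotOptimize(std::get<VarDerived>(var).a);  // result counts as used
    benchmark::ClobberMemory();                             // assume memory was read/written
  }
}
BENCHMARK(VariantPolyMorphismGuarded);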

Yes, polymorphism can be expensive when used in tight loops.

Writing benchmarks for such small, extremely fast features is hard and full of pitfalls, so it must be approached with extreme caution, because you are reaching the limitations of the benchmarking tool.