My CPU is an AMD Ryzen 7 7840H, which supports the AVX-512 instruction set. When I run a .NET 8 program, Vector512.IsHardwareAccelerated is true, but System.Numerics.Vector<T> is still 256-bit; it never reaches 512 bits.
Why doesn't Vector<T> grow to 512 bits? Is this currently unsupported, or do I need to tweak some configuration?
Example code:
using System;
using System.IO;
using System.Numerics;
using System.Runtime.Intrinsics;
TextWriter writer = Console.Out;
writer.WriteLine($"Vector512.IsHardwareAccelerated:\t{Vector512.IsHardwareAccelerated}");
writer.WriteLine($"Vector.IsHardwareAccelerated:\t{Vector.IsHardwareAccelerated}");
writer.WriteLine($"Vector<byte>.Count:\t{Vector<byte>.Count}\t# {Vector<byte>.Count * 8}bit");
Test results:
Vector512.IsHardwareAccelerated: True
Vector.IsHardwareAccelerated: True
Vector<byte>.Count: 32 # 256bit
See https://github.com/dotnet/runtime/issues/92189 - for the same hardware reasons that C compilers default to -mprefer-vector-width=256 when auto-vectorizing large loops, C# doesn't automatically make all vectorized code use 512-bit vectors even when they're available. Also, for small problems, e.g. 9 floats, a wider Vector<T> could mean no vectorized iterations happen at all, just scalar fallback code.
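To make the small-problem point concrete, here's a quick sketch (my own illustration, not from the linked issue) of how many vector iterations a typical strip-mined loop gets at each width:

using System;
using System.Numerics;

class SmallProblemDemo
{
    static void Main()
    {
        // With Vector<float>.Count == 8 (256-bit): 1 vector iteration + 1 scalar element.
        // If Count were 16 (512-bit), it would be 0 vector iterations + 9 scalar elements,
        // i.e. pure fallback code.
        float[] data = new float[9];
        int vectorIters = data.Length / Vector<float>.Count;
        int scalarTail = data.Length % Vector<float>.Count;
        Console.WriteLine($"{vectorIters} vector iteration(s), {scalarTail} scalar element(s)");
    }
}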
Also, apparently some code-bases (hopefully accidentally) depend on Vector<T> not being wider than 32 bytes, so widening it would be a breaking change for them; the sketch below shows one way that assumption can bite.
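For example (a minimal sketch of my own, not code from any of those projects): the span-based Vector<T> constructor throws when the span is shorter than Vector<T>.Count, so code written against exactly-32-byte blocks would start throwing if the count doubled.

using System;
using System.Numerics;

class Assumes32ByteVectors
{
    static void Main()
    {
        // Fine today where Vector<byte>.Count == 32; would throw
        // IndexOutOfRangeException if Vector<byte> grew to 64 bytes.
        Span<byte> block = stackalloc byte[32];
        block.Fill(0xFF);
        var v = new Vector<byte>(block);
        Console.WriteLine(v);
    }
}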
I commented on the dotnet GitHub issue with some details about the CPU-hardware reasons; I'll reproduce some of that here:
On Intel CPUs, executing 512-bit instructions can lower the core's max turbo frequency, which is part of the tradeoff of -mprefer-vector-width=512 vs. 256 on Ice Lake Xeon. But again, that's tuning for LLVM auto-vectorization of scalar code, not like C# where this would only affect manually-vectorized loops, so the considerations are somewhat different from C's -mprefer-vector-width=256 default. And in a program that frequently wakes up for short bursts of computation, its AVX-512 usage will still lower turbo frequency for the core, affecting other programs.
Things are different on Zen 4; it handles 512-bit vectors by taking extra cycles in the execution units, so as long as 512-bit vectors don't require more shuffling work or some other effect that would add overhead, they're a good win for front-end throughput and for how far ahead out-of-order exec can see in terms of elements or scalar iterations (since a 512-bit uop is still only 1 uop for the front-end). GCC and Clang default to -mprefer-vector-width=512 for -march=znver4. There's no turbo penalty or other inherent downside to 512-bit vectors on Zen 4 (AFAIK; I don't know how misaligned loads perform). It's just a matter of whether software can use them efficiently, without needing more bloated code for loop prologues/epilogues, e.g. scalar cleanup if a masked final iteration doesn't Just Work. AVX-512 masked stores are efficient on Zen 4, despite the fact that AVX1/2 vmaskmovps/vpmaskmovd aren't (https://uops.info/).
For code where you have exactly 32 bytes of something, if 32-byte vectors are no longer an option then that's a loss; C#'s scalable vector-length model isn't ideal for those cases. ARM SVE and the RISC-V Vector extension are hardware ISAs designed around a variable vector length, with masking to handle vectors shorter than the HW's native length, but doing the same thing for C# Vector<> probably wouldn't work well because lots of hardware (x86 with AVX2, or AArch64 without SVE) can't efficiently support masking for arbitrary-length stuff.
I wrote more about Intel on the GitHub issue, which I'm not going to copy/paste all of here.
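In the meantime, code that wants 512-bit vectors on capable hardware can opt in explicitly with Vector512<T> instead of waiting for Vector<T> to widen. A minimal sketch (my own, assuming a simple int-sum workload), with a plain scalar tail since there's no masked final iteration:

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Explicit512
{
    // Sums a span of ints with 512-bit vectors when supported.
    // Overflow wraps, same as a plain unchecked scalar loop.
    static int Sum(ReadOnlySpan<int> data)
    {
        int i = 0, total = 0;
        if (Vector512.IsHardwareAccelerated)
        {
            var acc = Vector512<int>.Zero;
            ref int p = ref MemoryMarshal.GetReference(data);
            for (; i <= data.Length - Vector512<int>.Count; i += Vector512<int>.Count)
                acc += Vector512.LoadUnsafe(ref p, (nuint)i);  // one 64-byte load per iteration
            total = Vector512.Sum(acc);
        }
        for (; i < data.Length; i++)  // scalar cleanup for the leftover tail
            total += data[i];
        return total;
    }

    static void Main()
    {
        int[] xs = new int[100];
        for (int k = 0; k < xs.Length; k++) xs[k] = k;
        Console.WriteLine(Sum(xs));  // 4950
    }
}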
There can be significant overall throughput gains from 512-bit vectors for some workloads on Intel CPUs, but they come with downsides, like more expensive misaligned memory access.