List Question
20 TechQA 2024-03-14T13:58:49.053000
Achieving More FMA3 Performance Than The Theoretical Maximum
52 views
Asked by Anili
High Variance In Manual Vectorization Performance
47 views
Asked by Anili
Can we replace XOR with multiply-add?
235 views
Asked by Serge Rogatch
Why this AVX2 slowdown with FMA x86 MS C Compiler?
149 views
Asked by Martin Brown
Clang fused multiply-add depends on constancy of expression arguments
207 views
Asked by Fedor
v4fmaddps instructions for packed 32-bit integers
153 views
Asked by anna
GCC 12 (minGW 64): how to enable fused multiply add code generation
206 views
Asked by elena
How should I implement a generic FMA/FMAF instruction in software?
345 views
Asked by xiaohuihui
Fast fixed-size polynomial evaluation: MSVC vs GCC
107 views
Asked by pem
Deleting initialization leads to avx2 fma performance drop. Why?
123 views
Asked by tigertang
Latency and number of FMA units
257 views
Asked by Mirco Mannino
Fastest way to multiply and sum/add two arrays (dot product) - unaligned surprisingly faster than FMA
1.2k views
Asked by Peter
Terminology: why "floating multiply-add" instead of "fused multiply-add"?
135 views
Asked by pmor
vfmadd231ps Floating Point Exception c0000090
173 views
Asked by Alois Kraus
Why Fma code is performing worse than Avx?
406 views
Asked by Станислав Герасименко
Is there any better implementation for integer 'mul and add' with avx?
689 views
Asked by TimeOrange
How to find magic multipliers for divisions by constant on a GPU?
322 views
Asked by amonakov
CUDA half float operations without explicit intrinsics
738 views
Asked by Bram
incompatible types when assigning to type ‘__m256d’ from type ‘int’
1.1k views
Asked by Mehdi
How to refine floating-point division on FMA-capable GPUs?
303 views
Asked by amonakov