I'm looking for the most performant way to compute all first integer powers of floating `X`

inside SSE-128/AVX-256/AVX-512 register (128 and 256 and 512 bits), e.g. for float AVX1-256 I want to get in register `X^1, X^2, X^3, X^4, X^5, X^6, X^7, X^8`

given `X`

on input. I need code for both float-32 and double-64.

I implemented this code for AVX-256/float-32 case:

```
__m256 _mm256_powers_ps(float x) {
float const x2 = x * x, x4 = x2 * x2;
return _mm256_mul_ps(
_mm256_mul_ps(
_mm256_setr_ps( x, x2, x, x2, x, x2, x, x2),
_mm256_setr_ps( 1, 1, x2, x2, 1, 1, x2, x2)
),
_mm256_setr_ps( 1, 1, 1, 1, x4, x4, x4, x4)
);
}
```

I designed it mainly to look simple. But I think performance-wise it can be made faster, maybe one or two less multiplications. Can anybody suggest a faster version? Maybe there is even some single AVX instruction for computing these powers?

I'm interested in both float-32 and double-64 versions for all of 128-bit (4 floats or 2 doubles) and 256-bit (8 floats or 4 doubles) and 512-bit SIMD registers (16 floats or 8 doubles).

I also implemented possibly faster version, looking more complex, but that has one less multiplication, speed comparison wasn't done:

```
__m256 _mm256_powers_ps(float x) {
float const x2 = x * x;
__m256 const r = _mm256_mul_ps(
_mm256_setr_ps( x, x2, x, x2, x, x2, x, x2),
_mm256_setr_ps( 1, 1, x2, x2, 1, 1, x2, x2)
);
return _mm256_mul_ps(r,
_mm256_setr_m128(_mm_set1_ps(1),
_mm_permute_ps(_mm256_castps256_ps128(r), 0b11'11'11'11))
);
}
```

I was also thinking about that simple non-SIMD solution might be also fast, because of good parallel pipelining of many independent multiplications:

```
__m256 _mm256_powers_ps(float x) {
auto const x2 = x * x;
auto const x3 = x2 * x;
auto const x4 = x2 * x2;
return _mm256_setr_ps(
x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3, x4 * x4);
}
```

**Note**: these powers of `X`

are needed for the stage of computing float or double precision polynomial, see my other question regarding computing polynomials on SIMD. This polynomial may have different degrees, sometimes 3, sometimes 6, sometimes 9, sometimes 15, even 25 or 32 or 43 degree. This numbers are arbitrary, actually there could be used whole range of 1-48 degrees of polynomials. Coefficients of polynomials are given beforehand as constants. But value `X`

for which it should be computed is not known in advance. Of cause I will use `FMA`

module to compute poly value itself, but precomputed `X`

powers are needed for computing poly using SIMD.

Most common target CPU that will be used is Intel Xeon Gold 6230 that has AVX-512, so I need to optimize code for it.