In ARM SVE there are masked load instructions svld1
and there are also non-failing loads
svldff1(svptrue<>)
.
Questions:
- Does it make sense to do
svld1
with a mask as opppose tosvldff1
? - The behaviour of mask in
svldff1
seems confusing. Is there a practical reason to provide a not justsvptrue
mask forsvldff1
- Is there any performance difference between
svld1
andsvldff1
Both ldff1 and ld1 can be used to load a vector register. According my informal tests, on an AWS graviton processor, I find no performance difference, in the sense that both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 will read and write to the first-fault register (FFR). It implies that you cannot do more than one ldff1 at any one time within an 'FFR group', since they are order sensitive and depend crucially on the FFR.
Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, the instruction that generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost. I am assuming that the instruction in question might need to run after ldff1w, thus increasing the latency by at least a cycle. Of course, then you have to do something with the mask that rdffr produces...
Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).
"Is there a practical reason to provide a not just svptrue mask for svldff1": The documentation states that the leading inactive elements (up to the fault) are predicated to zero.