On uops.info, VRSQRTPS is listed as having a lower latency than VSQRTPS on every architecture I've checked. It also has a lower throughput, but perhaps there are fewer execution units that can handle it on most designs.
Intuitively, you would compute the reciprocal square root by finding the square root and then inverting it, so this would be slower than just finding the square root. But this is not what we see.
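For concreteness, here is a minimal C sketch of the two routes I'm comparing. The function names are made up for illustration, and I'm assuming the `_mm_sqrt_ps` and `_mm_rsqrt_ps` intrinsics compile to (V)SQRTPS and (V)RSQRTPS respectively (the V-prefixed forms when building with AVX enabled).

```c
#include <immintrin.h>  /* x86 SSE intrinsics */

/* Hypothetical helper: the "intuitive" route, a full square root
   followed by a division. */
void rsqrt_via_sqrt(const float in[4], float out[4]) {
    __m128 x = _mm_loadu_ps(in);
    __m128 r = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
    _mm_storeu_ps(out, r);
}

/* Hypothetical helper: the dedicated reciprocal square root instruction. */
void rsqrt_direct(const float in[4], float out[4]) {
    __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_rsqrt_ps(x));
}
```

The two functions are not expected to return bit-identical results: Intel documents the reciprocal square root instruction as an approximation, with a relative error of at most 1.5 × 2^-12.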
What makes computing the reciprocal square root faster?