I am exploring System.Numerics.Vector with .NET Framework 4.7.2 (the project I am working on cannot yet be migrated to .NET Core 3 to use the new Intrinsics namespace). The project processes very large CSV/TSV files, and we spend a lot of time looping through strings to find commas, quotes, etc., so I am trying to speed that up.
So far, I have been able to use Vector to determine whether a string contains a given character (using the Vector.EqualsAny method). That’s great, but I want to go a little further: I want to efficiently find the index of that character using Vector, and I do not know how. Below is the function I use to determine whether a string contains a comma.
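    using System;
    using System.Numerics;
    using System.Runtime.InteropServices;

    public static class StringExtensions
    {
        private static readonly char Comma = ',';

        // The comma broadcast to every lane, so a whole vector can be compared at once.
        private static readonly Vector<ushort> Commas = new Vector<ushort>((ushort)Comma);

        public static bool HasCommas(this string s)
        {
            if (s == null)
            {
                return false;
            }

            // Reinterpret the UTF-16 code units as whole vectors; leftover characters are handled below.
            ReadOnlySpan<char> charSpan = s.AsSpan();
            ReadOnlySpan<Vector<ushort>> charAsVectors = MemoryMarshal.Cast<char, Vector<ushort>>(charSpan);
            foreach (Vector<ushort> v in charAsVectors)
            {
                bool foundCommas = Vector.EqualsAny(v, StringExtensions.Commas);
                if (foundCommas)
                {
                    return true;
                }
            }

            // Scalar pass over the tail that did not fill a whole vector.
            int numberOfCharactersProcessedSoFar = charAsVectors.Length * Vector<ushort>.Count;
            if (s.Length > numberOfCharactersProcessedSoFar)
            {
                for (int i = numberOfCharactersProcessedSoFar; i < s.Length; i++)
                {
                    if (s[i] == Comma)
                    {
                        return true;
                    }
                }
            }
            return false;
        }
    }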
I understand that I could use the function above and scan the resulting Vector, but it would defeat the purpose of using a Vector. I heard about the new Intrinsics library that could help, but I cannot upgrade my project to .NET Core 3.
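To make that fallback concrete, here is a minimal sketch of what I mean by scanning the resulting vector. It assumes a hypothetical helper named `IndexOfInVector` (the class name and the sample data are also just for illustration, not part of my project). `Vector.Equals` produces a mask with all bits set in each matching lane, and a plain loop over the lanes then finds the first match, which is exactly the per-element work I would like to avoid.

```csharp
using System;
using System.Numerics;

public static class VectorScanSketch
{
    // Hypothetical helper: returns the lane index of the first element
    // equal to target, or -1 if no lane matches.
    public static int IndexOfInVector(Vector<ushort> v, ushort target)
    {
        // Each matching lane becomes 0xFFFF in the mask; every other lane becomes 0.
        Vector<ushort> mask = Vector.Equals(v, new Vector<ushort>(target));

        // Scalar scan over the lanes: this is the part that is not vectorized.
        for (int lane = 0; lane < Vector<ushort>.Count; lane++)
        {
            if (mask[lane] != 0)
            {
                return lane;
            }
        }
        return -1;
    }

    public static void Main()
    {
        // Vector<ushort>.Count is at least 8, so index 3 is always in range.
        ushort[] data = new ushort[Vector<ushort>.Count];
        data[3] = ',';

        Console.WriteLine(IndexOfInVector(new Vector<ushort>(data), ',')); // prints 3
    }
}
```

The index returned is relative to the vector, so in the real loop it would still have to be added to the offset of that vector within the string.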
Given a Vector, how would you efficiently find the position of a character? Is there a clever trick that I am not aware of?