I have a cuDF Series containing long strings and I would like to split each string into equal sized chunks.
My code to do this looks something like:
import cudf
s = cudf.Series(["abcdefg", "hijklmnop"])
def chunker(string):
    chunk_size = 3
    return [string[i:i+chunk_size] for i in range(0, len(string), chunk_size)]
print(s.apply(chunker))
This gives the error:
No implementation of function Function(<class 'range'>) found for signature:
>>> range(Literal[int](0), Masked(int32), Literal[int](3))
If I replace len(string) with a constant, then I get another error complaining about the indexing:
No implementation of function Function(<built-in function getitem>) found for signature:
>>> getitem(Masked(string_view), slice<a:b>)
The code works fine in regular pandas, but I was hoping to run this on some really large datasets and benefit from cuDF GPU operations.
You can use str.findall for this operation, with a regular expression that matches any character between 1 and 3 (the chunk size) times, for example s.str.findall(".{1,3}"). This will be faster in both pandas and cuDF.
You may also be interested in cudf.pandas, the zero-code-change accelerator for pandas code.
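The str.findall suggestion above can be sketched roughly as follows. The helper name chunk_pattern and the alias xdf are made up for this illustration, and the fallback from cuDF to pandas is only there so the snippet also runs on a machine without a GPU; treat it as a minimal sketch rather than the definitive way to do it.

```python
import re


def chunk_pattern(chunk_size):
    """Regex matching any character between 1 and chunk_size times (hypothetical helper)."""
    return rf".{{1,{chunk_size}}}"


pattern = chunk_pattern(3)  # ".{1,3}"

# Plain-Python check of what the pattern does to a single string.
print(re.findall(pattern, "abcdefg"))  # ['abc', 'def', 'g']

# Same pattern through the Series string accessor. Prefer cuDF when it is
# installed, otherwise fall back to pandas (assumption: the call is the same).
try:
    import cudf as xdf
except ImportError:
    try:
        import pandas as xdf
    except ImportError:
        xdf = None

if xdf is not None:
    s = xdf.Series(["abcdefg", "hijklmnop"])
    chunks = s.str.findall(pattern)  # one list of chunks per row
    print(chunks)
```

Because the quantifier is greedy, every chunk is full length except possibly the last one in each string. One caveat: in Python's re module "." does not match a newline by default, so strings containing line breaks would need different handling.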