Is there a better way to return only the elements (pl.element()) of a polars list column that match an item contained in a given list?
While the code below works, it emits the warning The predicate 'col("").is_in([Series])' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the group_by operation would. This behavior is experimental and may be subject to change, which leads me to believe there's probably a more concise/better way:
import polars as pl
terms = ['a', 'z']
(pl.LazyFrame({'a':['x y z']})
.select(pl.col('a')
.str.split(' ')
.list.eval(pl.when(pl.element().is_in(terms))
.then(pl.element())
.otherwise(None))
.list.drop_nulls()
.list.join(' ')
)
.collect()  # .fetch() is deprecated in recent polars releases
)
For posterity's sake, it replaces my previous attempt using .map_elements():
import polars as pl
import re
terms = ['a', 'z']
(pl.LazyFrame({'a':['x y z']})
.select(pl.col('a')
# no .str.split(' ') here: re.findall needs the raw string, not the list that split would produce
.map_elements(lambda x: ' '.join(list(set(re.findall('|'.join(terms), x)))),
return_dtype = pl.Utf8)
)
.collect()  # .fetch() is deprecated in recent polars releases
)
@jqurious and @Dean MacGregor were exactly right; I just wanted to post a solution that explains the differences succinctly:
Also, this closely related question adds a bit more.