Starting from a single column abc of type ARRAY<STRING> with one row:
SELECT ARRAY ['a', 'b', 'c'] AS abc
If we execute the query with pyathena using the ArrowCursor:
cursor = pyathena.connect(**AWS_PARAMETERS).cursor(ArrowCursor)
execution_result = cursor.execute(query)
Where the imports and parameters are:
import pyathena
from pyathena.arrow.cursor import ArrowCursor
AWS_PARAMETERS = {
'aws_access_key_id': ***,
'aws_secret_access_key': ***,
'region_name': ***,
's3_staging_dir': ***,
'work_group': ***,
}
Inspecting the execution_result:
execution_result.description
It seems that pyathena got the column type right as array:
[('abc', 'array', None, None, 0, 0, 'UNKNOWN')]
However, when "materializing" the pyarrow table:
table = execution_result.as_arrow()
Unfortunately, the column content becomes a plain STRING instead of an ARRAY<STRING>:
pyarrow.Table
abc: string
----
abc: [["[a, b, c]"]]
All I am trying to do here is to get that query result as a Polars (or Pandas) DataFrame with a column of type ARRAY<STRING>.
Is there an obvious way I am missing to get the type right?
Should I really parse that string into a list of strings myself, or convert the string to JSON and then to a list of strings?
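For reference, the parse-it-yourself workaround is simpler than it sounds, but note that the string shown above, "[a, b, c]", is not valid JSON (the elements are unquoted), so json.loads would fail. A minimal stdlib-only sketch, assuming Athena's textual array rendering and simple string elements without embedded commas or brackets:

```python
def parse_athena_array(value: str) -> list[str]:
    """Parse Athena's textual ARRAY rendering, e.g. "[a, b, c]" -> ["a", "b", "c"]."""
    inner = value.strip()[1:-1]  # drop the surrounding "[" and "]"
    if not inner:
        return []
    # Split on commas and trim the padding space Athena puts after each comma.
    return [item.strip() for item in inner.split(",")]

print(parse_athena_array("[a, b, c]"))  # → ['a', 'b', 'c']
```

Applied per row (e.g. via pandas Series.map or a Polars map_elements), this recovers a list-typed column, but it breaks for elements that themselves contain commas or brackets, so it is only a stopgap.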
Eight months after asking this question, the current version, pyathena 3.0.10, correctly materializes the arrow table.