How to convert numpy array to a Zarr array


Suppose I have converted a simple two-column dataframe to a numpy array:

gdf.head()
>>>

     rid    rast
0      1    01000001000761C3ECF420013F0761C3ECF42001BF7172...
1      2    01000001000761C3ECF420013F0761C3ECF42001BF64BF...
2      3    01000001000761C3ECF420013F0761C3ECF42001BF560C...
3      4    01000001000761C3ECF420013F0761C3ECF42001BF7F25...
4      5    01000001000761C3ECF420013F0761C3ECF42001BF7172...

raster_np = gdf.to_numpy()
raster_np[0]
>>> array([1, '01000001000761C3E.........], dtype=object)

I've been tasked with converting the numpy array to the Zarr file format. Because of the size of the rast values and of the dataframe, chunking and compression might be necessary, and I assume the resulting .zarr files would be better suited to S3/cloud storage. I created a simple Zarr array like so:

 z_test = z.zeros(shape=(10000, 2), chunks=(10000, 2))
 z_test
 >>> <zarr.core.Array (10000, 2) float64>

Now, how do I get the data in raster_np into z_test and retain the Zarr attributes? Simply using z_test = raster_np obviously doesn't work. Perhaps there is something I am misunderstanding about Zarr. Any suggestions?

There is 1 answer

Answer by user2653663

Since your initial array is of mixed type (object dtype), you need to create the zarr array with the correct data type and encode the data. You can use the JSON encoder from numcodecs:

import numcodecs
import zarr

# Object arrays need an object_codec; the shape must match the source array.
z_test = zarr.zeros(shape=raster_np.shape, dtype=object, object_codec=numcodecs.JSON())
z_test[:] = raster_np

You will, however, get better performance if you store the rid and rast columns as separate arrays with int and str data types respectively, or convert the hex strings to a more compact representation such as raw bytes.
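
As a rough sketch of the two suggestions above, using only the Python standard library: the rows are split into one homogeneous sequence per column, and the hex text is decoded into raw bytes. The `rows` list and the names `rids`, `rasts_hex` and `rasts_bytes` are made up for illustration, and the hex values are just the visible prefixes from the question, not complete rasters.

```python
# Hypothetical rows shaped like raster_np: (rid, hex string).
# The hex values are only the truncated prefixes shown in the question.
rows = [
    (1, "01000001000761C3ECF420013F0761C3ECF42001BF7172"),
    (2, "01000001000761C3ECF420013F0761C3ECF42001BF64BF"),
]

# Split the mixed-type rows into one homogeneous sequence per column.
rids = [int(rid) for rid, _ in rows]
rasts_hex = [rast for _, rast in rows]

# Decode the hex text into raw bytes: two hex characters become one byte.
rasts_bytes = [bytes.fromhex(h) for h in rasts_hex]

for h, b in zip(rasts_hex, rasts_bytes):
    print(len(h), "hex chars ->", len(b), "bytes")

# The conversion is lossless: encoding back gives the original text.
assert all(b.hex().upper() == h for h, b in zip(rasts_hex, rasts_bytes))
```

Each column could then go into its own Zarr array with a matching dtype; which dtype and codec to use for the variable-length values depends on the zarr version installed.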