I have a number of Parquet files in our data warehouse. About 700 of the earlier files have the schema type for one column set to string when it should have been int32. I understand that Parquet files are immutable, so I'm looking for the best way to rewrite these files with the correct column type. I am using Ruby with the red-parquet gem.
I have tried casting the column to int32 and then saving the file to a new location. It runs without raising an error, but it doesn't work. The method I'm using is outlined below. Any help would be much appreciated.
    def castCol(col = nil)
      filesWritten = 0
      getParquets.each do |file|
        table = Arrow::Table.load(file)
        if table.heading.data_type == "string"
          newFileLoc = @saveDir + File.path(file)
          puts newFileLoc
          # Create the target directory if required
          unless File.directory?(File.dirname(newFileLoc))
            FileUtils.mkdir_p(File.dirname(newFileLoc))
          end
          table.heading.cast('int32')
          table.save(newFileLoc)
          filesWritten += 1
        end
      end
      puts "Number of files written: #{filesWritten}"
    end
I've written the conversion in Python. The PyArrow libraries are better documented, which is probably to be expected.