Use Arrow schema with parquet StreamWriter


I am attempting to use the C++ StreamWriter class provided by Apache Arrow.

The only example I've found of using StreamWriter uses the low-level Parquet API, i.e.

parquet::schema::NodeVector fields;

fields.push_back(parquet::schema::PrimitiveNode::Make(
  "string_field", parquet::Repetition::OPTIONAL, parquet::Type::BYTE_ARRAY,
  parquet::ConvertedType::UTF8));

fields.push_back(parquet::schema::PrimitiveNode::Make(
   "char_field", parquet::Repetition::REQUIRED, parquet::Type::FIXED_LEN_BYTE_ARRAY,
   parquet::ConvertedType::NONE, 1));

auto node = std::static_pointer_cast<parquet::schema::GroupNode>(
   parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

The end result is a std::shared_ptr<parquet::schema::GroupNode> that can then be passed into StreamWriter.
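For context, that GroupNode is wired into StreamWriter roughly like this (a minimal sketch following the parquet-cpp StreamWriter example; the file name is arbitrary):

```cpp
#include <arrow/io/file.h>
#include <parquet/stream_writer.h>

// Open the output file; PARQUET_ASSIGN_OR_THROW unwraps the arrow::Result.
std::shared_ptr<arrow::io::FileOutputStream> outfile;
PARQUET_ASSIGN_OR_THROW(outfile,
                        arrow::io::FileOutputStream::Open("example.parquet"));

// Build a low-level file writer from the GroupNode and wrap it in a StreamWriter.
parquet::WriterProperties::Builder builder;
parquet::StreamWriter os{
    parquet::ParquetFileWriter::Open(outfile, node, builder.build())};

// Rows are written field-by-field and terminated with EndRow;
// "string_field" takes a string, "char_field" a single char (FIXED_LEN 1).
os << "hello" << 'x' << parquet::EndRow;
```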

Is it possible to construct and use "high-level" Arrow schemas with StreamWriter? They are supported by the (non-streaming) WriteTable function, but I've found no examples of their use with the streaming API.

I can, of course, resort to using the low-level API, but it is extremely verbose when creating large and complicated schemas, and I would prefer (but do not need) to use the high-level Arrow schema mechanisms.

For example,

std::shared_ptr<arrow::io::FileOutputStream> outfile_;
PARQUET_ASSIGN_OR_THROW(outfile_, arrow::io::FileOutputStream::Open("test.parquet"));

// construct an arrow schema
auto schema = arrow::schema({arrow::field("field1", arrow::int64()),
                                   arrow::field("field2", arrow::float64()),
                                   arrow::field("field3", arrow::float64())});

// build the writer properties
parquet::WriterProperties::Builder builder;
auto properties = builder.build();

// my current best attempt at converting the Arrow schema to a Parquet schema
std::shared_ptr<parquet::SchemaDescriptor> parquet_schema;
parquet::arrow::ToParquetSchema(schema.get(), *properties, &parquet_schema); // parquet_schema is now populated

// and now I try and build the writer - this fails
auto writer = parquet::ParquetFileWriter::Open(outfile_, parquet_schema->group_node(), properties);

The last line fails because parquet_schema->group_node() (the only way I'm aware of to get at the schema's GroupNode) returns a const GroupNode*, whereas ParquetFileWriter::Open() needs a std::shared_ptr<GroupNode>.

I'm not sure that casting away the constness of the returned group node and forcing it into the ::Open() call is an officially supported (or correct) use of StreamWriter.

Is what I want to do possible?

1 Answer

Answered by lambda:

It seems that you need to use the low-level API for StreamWriter.

A rather tricky workaround:

auto writer = parquet::ParquetFileWriter::Open(
    outfile_,
    std::shared_ptr<parquet::schema::GroupNode>(
        const_cast<parquet::schema::GroupNode*>(parquet_schema->group_node())),
    properties);
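A slightly safer variant (my own suggestion, not from the original answer): wrapping the raw pointer in a fresh std::shared_ptr gives it an independent owner that will eventually try to delete a node it does not own. The aliasing constructor of std::shared_ptr avoids that by tying the pointer's lifetime to the SchemaDescriptor instead:

```cpp
// The aliasing constructor shares ownership of parquet_schema (the
// SchemaDescriptor) while pointing at its group node, so the node is
// never deleted separately from the descriptor that owns it.
std::shared_ptr<parquet::schema::GroupNode> group_node(
    parquet_schema,  // object whose lifetime is shared
    const_cast<parquet::schema::GroupNode*>(
        parquet_schema->group_node()));  // pointer actually stored

auto writer = parquet::ParquetFileWriter::Open(outfile_, group_node, properties);
```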

You may need to convert the schema manually.

The source file cpp/src/parquet/arrow/schema.cc may help; it declares:

PARQUET_EXPORT
::arrow::Status ToParquetSchema(const ::arrow::Schema* arrow_schema,
                                const WriterProperties& properties,
                                std::shared_ptr<SchemaDescriptor>* out);