Store metadata in CSV file

9.7k views Asked by At

I understand that this would be a misuse of the CSV format. I do understand that more appropriate solution would be to export xls/xlsx file, or give user an OData end point and let them use power pivot to retrieve data. At this time I want to stick with CSV, because it is easier to implement.

I need to export a flat list of rows (a report) as a CSV file. At the same time I would like to include some metadata about the data. E.g. report title, generation date, and user.

Is there a neat way to accomplish both goals using CSV file format?

The intent usage for this file is to allow the user to open file in Excel and do data crunching.

One solution would be to break CSV into two sections: meta and data. E.g.

Title,Report Generation Date,Generated by User,,
Outstanding Jobs,5/5/2015,[email protected],,
,,,,
Job Started, Estimated Completion, Description, Foo, Bar
9/3/2003,1/1/2006,"do something important""""",5,7
5/4/2007,2/2/2010,"do something else""""",3,12

Or perhaps there is a "standard" extension to CSV format for this purpose?

Are there any practical issues with this approach?

5

There are 5 answers

5
Eric J. On

There is no standard extension to CSV that allows for storing of metadata.

You can certainly have a separate section in the file for metadata, but that tends to complicate processing as the CSV parser has to deal with separate headers, etc. Standard CSV parsers won't do that for you.

Consider whether you can store the metadata in a separate file, using a naming convention, e.g.

MyData123.csv
MyData123-Meta.csv

You could bundle both into a Zip archive to keep them together, until they are ready for processing.

3
BdR On

Just want to add that you could also use the schema.ini format, so with your .csv file(s) also include a separate schema.ini file which contains something like this:

[yourfilenamehere.txt]
Format=CSVDelimited
ColNameHeader=True
DateTimeFormat=d/m/yyyy
Col1="Job Started" DateTime Width 8
Col2="Estimated Completion" DateTime Width 8
Col3="Description" Text Width 24
Col4="Foo" Integer Width 1
Col5="Bar" Integer Width 2

One schema.ini file can contain the metadata of multiple files, each file metadata is a new section for example [file1.csv] and [file2.csv] etc. And afaik there isn't separate key for a description or title, but you could just add your own custom keyname (so not used by the Text driver) like this:

[yourfilenamehere.txt]
MyReportTitle=Outstanding Jobs (5/5/2015) [email protected]
Format=CSVDelimited
etc..

Or add a line starting with semi-colon which counts as a comment line in .ini files

[yourfilenamehere.txt]
; Outstanding Jobs (5/5/2015) [email protected]
Format=CSVDelimited
etc..
1
Max Murphy On

There is a W3C recommendation for how to do this.

Here is an example: (Example 21: Tab-separated file containing embedded metadata)

#   publisher   City of Palo Alto
#   updated 12/31/2010
#name   GID on_street   species trim_cycle  inventory_date
#datatype   string  string  string  string  date:M/D/YYYY
    GID On Street   Species Trim Cycle  Inventory Date
    1   ADDISON AV  Celtis australis    Large Tree Routine Prune    10/18/2010
    2   EMERSON ST  Liquidambar styraciflua Large Tree Routine Prune    6/2/2010

There is also a W3C CSV on the web community group.

2
Simon Bengtsson On

When I wanted to add some metadata to a csv file I ended up storing it as a json string as part of the last column header.

timestamp;x;y;z;heading#{"id": "-L4uNnjWXL2cLY_xpU_s"}
89318.329;0.00541;-0.00548;-0.00219;-1.0
89318.339;0.00998;0.0063;-0.00328;-1.0
89318.349;0.0043;0.01318;0.00069;-1.0
89318.359;0.00477;0.0138;0.0007;-1.0

The advantage of this compared to OP's suggestion is that it's a valid csv file in terms of all rows having equal amount of columns and each column containing only one type of data. A parser that knows about the format could split the header at the # and parse the json metadata. A parser that doesn't know of the format will simply display the last column's header with the metadata.

0
donquixote On

I recently ran into this problem, and found a "solution" that worked for me.

The idea is to encode the metadata in a singular cell in the first and/or the last row of the csv, encoded as json.

In your program, you can support this with a stack of decorator layers:

  • A standard component that reads the CSV file as a stream of raw data, that is, a stream where each row is an array of strings (the "cells"). For this layer, the first row, even if it contains the labels, is no different conceptually than all the other rows. Ideally each row would have the same number of cells, but this is not absolutely required.
  • An optional component that can look at the raw file and extract the contents of the last row, without reading the rest of the file.
  • An optional decorator/adapter that consumes the first row, parses the first cell of this row as json, and uses it as metadata.
  • An optional decorator/adapter that discards the last row, knowing that it contains metadata.
  • An optional decorator/adapter layer that consumes the first row as column labels, and then returns each row as an associative array/structure with the column labels as keys. The result of this is a stream of associative arrays, where the values are still strings.
  • An optional decorator/adapter layer that processes the cell contents. E.g. if one column contains json, it would parse those cells and turn them into objects. This layer behaves as a stream of objects, where the property values can be of various types.

I personally did this with PHP, but it should be possible in other languages / environments as well.

Notes

Why would you store some metadata at the end of the file, instead of the beginning?

A typical example would be the total row count, which might only be known after the rest of the file is completely written.

In fact this is the only use case so far where I used this solution. I named these files *.ncsv, indicating that the last row contains the row count. But still this is technically valid csv, if we accept rows with different cell count.

Of course whatever you do here, every piece of code that deals with this file will need to know about the special format you came up with.