Deserialize a YAML "Table" of data


I am using YamlDotNet and C# to deserialize a file created by a third-party application. Both of the following YAML files are valid output from that application:

#File1
Groups:
  - Name: ATeam
    FirstName, LastName, Age, Height:
      - [Joe, Soap, 21, 184]
      - [Mary, Ryan, 20, 169]
      - [Alex, Dole, 24, 174]

#File2
Groups:
  - Name: ATeam
    FirstName, LastName, Height:
      - [Joe, Soap, 184]
      - [Mary, Ryan, 169]
      - [Alex, Dole, 174]

Notice that File2 doesn't have an Age column, but the deserializer must still recognise that the third value on each line is a height rather than an age. This data is supposed to represent a table of people. In File1, for example, Mary Ryan is 20 years old and 169cm tall. The deserializer needs to work out which columns are present (File2 only has FirstName, LastName and Height) and store each value in the matching property: Mary Ryan is 169cm tall.

Similarly, the program documentation states that the order of the columns is not important, so File3 below is an equally valid way to represent the data in File2, even though Height now comes first:

#File3
Groups:
  - Name: ATeam
    Height, FirstName, LastName:
      - [184, Joe, Soap]
      - [169, Mary, Ryan]
      - [174, Alex, Dole]

I have a number of questions:

  1. Is this standard YAML? I could not find anything about putting several keys on one line, followed by a colon and lists of values, to represent tables of data.
  2. How would I use YamlDotNet to deserialize this? Are there modifications I can make to help it?
  3. If I can't use YamlDotNet, how should I go about it?

There are 3 answers

Antoine Aubry (best answer)

As other answers stated, this is valid YAML. However, the structure of the document is specific to the application, and does not use any special feature of YAML to express tables.

You can easily parse this document using YamlDotNet. However, you will run into two difficulties. The first is that, since the column names are packed into a mapping key, you will need some custom deserialization code to handle them. The second is that you will need some kind of abstraction to access the data in a tabular way.

I have put together a proof of concept that illustrates how to parse and read the data.

First, create a type to hold the information from the YAML document:

public class Document
{
    public List<Group> Groups { get; set; }
}

public class Group
{
    public string Name { get; set; }

    public IEnumerable<string> ColumnNames { get; set; }

    public IList<IList<object>> Rows { get; set; }
}

Then implement IYamlTypeConverter to parse the Group type:

public class GroupYamlConverter : IYamlTypeConverter
{
    private readonly Deserializer deserializer;

    public GroupYamlConverter(Deserializer deserializer)
    {
        this.deserializer = deserializer;
    }

    public bool Accepts(Type type)
    {
        return type == typeof(Group);
    }

    public object ReadYaml(IParser parser, Type type)
    {
        var group = new Group();

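        var reader = new EventReader(parser);
        // Walk the key/value pairs of the group mapping until its end marker.
        do
        {
            var key = reader.Expect<Scalar>();
            if (key.Value == "Name")
            {
                group.Name = reader.Expect<Scalar>().Value;
            }
            else
            {
                // Any other key is taken to be the comma-separated column names.
                group.ColumnNames = key.Value
                    .Split(',')
                    .Select(n => n.Trim())
                    .ToArray();

                // The value under that key is the sequence of rows.
                group.Rows = deserializer.Deserialize<IList<IList<object>>>(reader);
            }
        } while (!reader.Accept<MappingEnd>());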
        reader.Expect<MappingEnd>();

        return group;
    }

    public void WriteYaml(IEmitter emitter, object value, Type type)
    {
        throw new NotImplementedException("TODO");
    }
}

Finally, register the converter with the deserializer and deserialize the document:

var deserializer = new Deserializer();
deserializer.RegisterTypeConverter(new GroupYamlConverter(deserializer));

var document = deserializer.Deserialize<Document>(new StringReader(yaml));

You can test the fully working example here

This is only a proof of concept, but it should serve as a guideline for your own implementation. Things that could be improved include:

  • Checking for and handling invalid documents.
  • Improving the Group class. Maybe make it immutable, and also add an indexer.
  • Implementing the WriteYaml method if serialization support is desired.

Anthon

All of these are valid YAML files. You are, however, mistaken in reading a scalar key that contains commas as a YAML-level description of the "columns" in the sequences associated with that key.

In File 1, FirstName, LastName, Age, Height is a single string scalar key in the mapping that is the first element of the sequence that is the value for the top-level key Groups. It is a key just like Name is. You can, but in YAML don't have to, put quotes around the whole scalar.

The association you make between the string "FirstName" and "Joe" does not exist in YAML. You can make that association in the program that interprets the key (by splitting it on ", "), as you seem to be doing, but YAML has no knowledge of it.

So if you want to be smart about this, then you need to split the string "FirstName, LastName, Age, Height" yourself and use some mechanism to then use the "subkeys" to index the sequences that are associated with the key.
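A minimal sketch of that splitting and indexing step, in Python with only the standard library. The `group` dict stands in for what a YAML parser would hand back for the first group of File 1, and the helper name `rows_as_records` is made up for illustration; it assumes every key other than Name is a comma-separated column list.

```python
def rows_as_records(group):
    """Turn a parsed group mapping into a list of per-person dicts.

    Hypothetical helper: assumes every key other than "Name" is a
    comma-separated column list whose value is a list of rows.
    """
    records = []
    for key, rows in group.items():
        if key == "Name":
            continue
        # YAML sees this key as one opaque string; the split is our convention.
        columns = [name.strip() for name in key.split(",")]
        for row in rows:
            if len(row) != len(columns):
                raise ValueError(f"expected {len(columns)} values, got {len(row)}")
            records.append(dict(zip(columns, row)))
    return records


# Stand-in for the parsed first group of File 1.
group = {
    "Name": "ATeam",
    "FirstName, LastName, Age, Height": [
        ["Joe", "Soap", 21, 184],
        ["Mary", "Ryan", 20, 169],
        ["Alex", "Dole", 24, 174],
    ],
}

records = rows_as_records(group)
print(records[1]["Height"])  # prints 169, looked up by column name
```

Because the lookup is by name rather than position, the same function copes with a missing Age column or a reordered key without any changes.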

If it helps to understand all this, the following is a JSON dump of the first file's contents, where you can see clearly what the keys consist of:

{"Groups": [{"FirstName, LastName, Age, Height": [["Joe", "Soap", 21,
   184], ["Mary", "Ryan", 20, 169], ["Alex", "Dole", 24, 174]], 
   "Name": "ATeam"}]}

I used the Python-based ruamel.yaml library for this (of which I am the author), but you could also use an online converter/checker like http://yaml-online-parser.appspot.com/

A. Donda

I'm coming late to this, but I have been thinking about the same question lately.

As others have pointed out, it would be better to record the column names as values, not keys, and you can also do away with the extra Name field:

Groups:
  ATeam:
    Columns: [FirstName, LastName, Height]
    Rows:
      - [Joe, Soap, 184]
      - [Mary, Ryan, 169]
      - [Alex, Dole, 174]

Or less explicit:

Groups:
  ATeam:
    - [FirstName, LastName, Height]
    - [Joe, Soap, 184]
    - [Mary, Ryan, 169]
    - [Alex, Dole, 174]

This is basically a YAML-formatted CSV file; table rows appear as lines.

The alternative, which I think makes more sense from the semantics of YAML structures, because it associates column names directly with values, is to have table columns appear as lines:

Groups:
  ATeam:
    FirstName: [Joe, Mary, Alex]
    LastName: [Soap, Ryan, Dole]
    Height: [184, 169, 174]

This way, an extra Age column can be added by adding a single line, without changing the rest at all. Of course, adding an extra row then touches every line.
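To make the trade-off concrete, here is a small Python sketch (standard library only) that converts between the row-wise and column-wise layouts. The function names are made up for illustration, and the lists and dicts stand in for already-parsed YAML.

```python
def rows_to_columns(columns, rows):
    """Row-wise table -> mapping of column name to list of values."""
    # zip(*rows) transposes the list of rows into tuples of column values.
    return {name: list(values) for name, values in zip(columns, zip(*rows))}


def columns_to_rows(table):
    """Column-wise mapping -> (column names, list of rows)."""
    columns = list(table)
    rows = [list(row) for row in zip(*table.values())]
    return columns, rows


columns = ["FirstName", "LastName", "Height"]
rows = [["Joe", "Soap", 184], ["Mary", "Ryan", 169], ["Alex", "Dole", 174]]

table = rows_to_columns(columns, rows)
# Adding a column touches one key only; the existing entries are unchanged.
table["Age"] = [21, 20, 24]

new_columns, new_rows = columns_to_rows(table)
print(new_rows[1])  # ['Mary', 'Ryan', 169, 20]
```

Either layout carries the same information, so the choice mostly comes down to whether columns or rows are expected to change more often.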