How to parse a YAML file with multiple documents?


Here is my parsing code:

import yaml

def yaml_as_python(val):
    """Convert YAML to dict"""
    try:
        return yaml.load_all(val)
    except yaml.YAMLError as exc:
        return exc

with open('circuits-small.yaml','r') as input_file:
    results = yaml_as_python(input_file)
    print results
    for value in results:
        print value

Here is a sample of the file:

ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: SwitchBank_35496721
    attrs:
      Feeder: Line_928
      Switch.normalOpen: 'true'
      IdentifiedObject.description: SwitchBank
      IdentifiedObject.mRID: SwitchBank_35496721
      PowerSystemResource.circuit: '928'
      IdentifiedObject.name: SwitchBank_35496721
      IdentifiedObject.aliasName: SwitchBank_35496721
    loc: vector [43.05292, -76.126800000000003, 0.0]
    kind: SwitchBank
  - timestamp: 1970-01-01T00:00:00.000Z
    id: UndergroundDistributionLineSegment_34862802
    attrs:
      Feeder: Line_928
      status: de-energized
      IdentifiedObject.description: UndergroundDistributionLineSegment
      IdentifiedObject.mRID: UndergroundDistributionLineSegment_34862802
      PowerSystemResource.circuit: '928'
      IdentifiedObject.name: UndergroundDistributionLineSegment_34862802
    path:
    - vector [43.052942000000002, -76.126716000000002, 0.0]
    - vector [43.052585000000001, -76.126515999999995, 0.0]
    kind: UndergroundDistributionLineSegment
  - timestamp: 1970-01-01T00:00:00.000Z
    id: UndergroundDistributionLineSegment_34806014
    attrs:
      Feeder: Line_928
      status: de-energized
      IdentifiedObject.description: UndergroundDistributionLineSegment
      IdentifiedObject.mRID: UndergroundDistributionLineSegment_34806014
      PowerSystemResource.circuit: '928'
      IdentifiedObject.name: UndergroundDistributionLineSegment_34806014
    path:
    - vector [43.05292, -76.126800000000003, 0.0]
    - vector [43.052928999999999, -76.126766000000003, 0.0]
    - vector [43.052942000000002, -76.126716000000002, 0.0]
    kind: UndergroundDistributionLineSegment
... 
ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: OverheadDistributionLineSegment_31168454

In the traceback, note that it starts having a problem at the ...

Traceback (most recent call last):
  File "convert.py", line 29, in <module>
    for value in results:
  File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/__init__.py", line 82, in load_all
    while loader.check_data():
  File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/constructor.py", line 28, in check_data
    return self.check_node()
  File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/composer.py", line 18, in check_node
    if self.check_event(StreamStartEvent):
  File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/parser.py", line 174, in parse_document_start
    self.peek_token().start_mark)
yaml.parser.ParserError: expected '<document start>', but found '<block mapping start>'
  in "circuits-small.yaml", line 42, column 1

What I would like is for it to parse each of these documents as a separate object, perhaps all of them in the same list, or pretty much anything else that would work with the PyYAML module. I believe the ... is actually valid YAML so I am surprised that it doesn't handle it automatically.


There are 2 answers

Anthon On BEST ANSWER

The error message is quite specific: a document needs to start with a document start marker. Your first document doesn't have such a marker, although it does have a document end marker. Once you explicitly end the first document with ..., PyYAML no longer accepts a following document without boundary markers; you have to start it explicitly with ---.

The end of your file should look like:

    kind: UndergroundDistributionLineSegment
...
---
ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: OverheadDistributionLineSegment_31168454

You can leave out the explicit document start marker from the first document, but you need to include a start marker for every following document. Document end markers are optional.
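As a minimal sketch of that rule, the snippet below parses a two-document string in which only the second document carries a start marker. The YAML text is a shortened, illustrative stand-in for the file in the question, and it uses safe_load_all() rather than load_all():

```python
import yaml

# Two documents: the first has no start marker, the second begins with ---.
# The content is a trimmed-down, illustrative version of the file in the question.
TEXT = """\
ingests:
  - id: SwitchBank_35496721
    kind: SwitchBank
...
---
ingests:
  - id: OverheadDistributionLineSegment_31168454
"""

# safe_load_all() returns a lazy generator; list() forces every document to be
# parsed here, so any yaml.YAMLError would be raised on this line.
documents = list(yaml.safe_load_all(TEXT))

for doc in documents:
    print(doc["ingests"][0]["id"])
```

Because the generator is lazy, a try/except around the call to load_all() or safe_load_all() alone does not catch parse errors; they surface only when the result is iterated.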

If you don't have complete control over the input, using .load_all() is not safe. There is normally no reason to take that risk: use .safe_load_all() instead, and extend the SafeLoader to handle any specific tags that your YAML might contain.
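One way to extend the SafeLoader might look like the sketch below. The !vector tag, the VectorSafeLoader class and the construct_vector function are hypothetical names chosen for illustration; the file in the question contains no such tag (its vector values are plain strings).

```python
import yaml


class VectorSafeLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the extra constructor stays out of yaml.SafeLoader."""


def construct_vector(loader, node):
    # Hypothetical constructor: turn "!vector [x, y, z]" into a tuple of floats.
    return tuple(float(x) for x in loader.construct_sequence(node))


# Register the constructor on the subclass only.
VectorSafeLoader.add_constructor("!vector", construct_vector)

# Illustrative input using the assumed !vector tag, two documents.
TEXT = """\
loc: !vector [43.05292, -76.1268, 0.0]
---
loc: !vector [43.052942, -76.126716, 0.0]
"""

# Passing Loader= keeps the safe behaviour while recognising the one extra tag.
documents = list(yaml.load_all(TEXT, Loader=VectorSafeLoader))
for doc in documents:
    print(doc["loc"])
```

Any other unknown tag in the input still raises a constructor error, which is the point of staying on the safe loader.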

Apart from that, you should start your YAML documents with an explicit version directive before the document start indicator (which you should also add to the first document):

%YAML 1.1
---

This is for the benefit of future editors of your YAML files, because you are using PyYAML, which only supports (most of) YAML 1.1 and not the YAML 1.2 specification (from 2009). The alternative is of course to upgrade your YAML parser to e.g. ruamel.yaml, which would also have warned you about your use of the unsafe load_all() (disclaimer: I am the author of that parser). ruamel.yaml doesn't allow you to have a bare document after an explicit end-of-document marker either (which is allowed, as @flyx pointed out), which is a bug.

gipsy On

I think your YAML is invalid.

Look at the second document in the sample: it begins with ... instead of ---

... 
ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: OverheadDistributionLineSegment_31168454
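If the file cannot be edited by hand, one possible workaround is to insert the missing --- lines before handing the text to PyYAML. The helper below is only a sketch: add_missing_start_markers is a hypothetical name, and it assumes that every line consisting solely of ... is a document end marker.

```python
import yaml


def add_missing_start_markers(text):
    """Insert '---' after each '...' line that is not already followed by one.

    Hypothetical helper; assumes a line that is exactly '...' is always a
    document end marker.
    """
    lines = text.splitlines()
    fixed = []
    for i, line in enumerate(lines):
        fixed.append(line)
        if line.rstrip() == "...":
            # Look at the next non-blank line, if there is one.
            rest = [l for l in lines[i + 1:] if l.strip()]
            if rest and not rest[0].startswith(("---", "%")):
                fixed.append("---")
    return "\n".join(fixed) + "\n"


# Shortened, illustrative version of the file in the question.
TEXT = """\
ingests:
  - id: SwitchBank_35496721
    kind: SwitchBank
...
ingests:
  - id: OverheadDistributionLineSegment_31168454
"""

fixed = add_missing_start_markers(TEXT)
docs = list(yaml.safe_load_all(fixed))
print(len(docs))
```

Running the helper on text that already has the markers leaves it unchanged, so it should be harmless to apply to files that are already well formed.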