Python Raising Errors within List Comprehension (or a better alternative)


I have a nested structure read from a json string, that looks similar to the following...

[
  {
    "id": 1,
    "type": "test",
    "sub_types": [
      {
        "id": "a",
        "type": "sub-test",
        "name": "test1"
      },
      {
        "id": "b",
        "name": "test2",
        "key_value_pairs": [
          {
            "key": 0,
            "value": "Zero"
          },
          {
            "key": 1,
            "value": "One"
          }
        ]
      }
    ]
  }
]

I need to extract and pivot the data, ready to be inserted into a database...

[
  (1, "b", 0, "Zero"),
  (1, "b", 1, "One")
]

I'm doing the following...

data_list = [
  (
    type['id'],
    sub_type['id'],
    key_value_pair['key'],
    key_value_pair['value']
  )
  for type in my_parsed_json_array
  if 'sub_types' in type
  for sub_type in type['sub_types']
  if 'key_value_pairs' in sub_type
  for key_value_pair in sub_type['key_value_pairs']
]

So far, so good.

What I need to do next, however, is enforce some constraints. For example...

if type['type'] == 'test': raise ValueError('[test] types can not contain key_value_pairs.')

But I can't put that into the comprehension. And I don't want to resort to loops. My best thought so far is...

def make_row(type, sub_type, key_value_pair):
    if type['type'] == 'test': raise ValueError('sub-types of a [test] type can not contain key_value_pairs.')
    return (
        type['id'],
        sub_type['id'],
        key_value_pair['key'],
        key_value_pair['value']
    )

data_list = [
  make_row(
    type,
    sub_type,
    key_value_pair
  )
  for type in my_parsed_json_array
  if 'sub_types' in type
  for sub_type in type['sub_types']
  if 'key_value_pairs' in sub_type
  for key_value_pair in sub_type['key_value_pairs']
]

That works, but it will make the check for each and every key_value_pair, which feels redundant. (Each set of key value pairs could have thousands of pairs, and the check only needs to be made once to know that they're all fine.)

Also, there will be other rules similar to this that apply at different levels of the hierarchy, such as: "test" types can only contain "sub-test" sub_types.
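
As a concrete illustration, that second rule could be expressed as another check along the same lines (the helper name here is just illustrative):

def check_sub_type(type, sub_type):
    # Hypothetical second rule: a [test] type may only contain [sub-test] sub_types.
    if type['type'] == 'test' and sub_type.get('type') != 'sub-test':
        raise ValueError('[test] types can only contain [sub-test] sub_types.')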

What are the options other than those above?

  • More elegant?
  • More extensible?
  • More performant?
  • More "Pythonic"?

There are 3 answers

AudioBubble (BEST ANSWER)

You should read about how to validate your JSON data and specify explicit schema constraints with JSON Schema. It allows you to mark keys as required, specify default values, add type validation, etc.

Its Python implementation is the jsonschema package.

EXAMPLE:

from jsonschema import Draft6Validator

schema = {
    "$schema": "https://json-schema.org/schema#",

    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["email"]
}
Draft6Validator.check_schema(schema)
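
check_schema above only confirms that the schema itself is well formed; to enforce the constraints against the parsed data, you build a validator from the schema and run it over each document. A minimal sketch, reusing the schema above against a made-up instance:

from jsonschema import Draft6Validator
from jsonschema.exceptions import ValidationError

validator = Draft6Validator(schema)

try:
    # Raises ValidationError because the required "email" key is missing.
    validator.validate({"name": "test1"})
except ValidationError as err:
    print(err.message)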
juanpa.arrivillaga

I would just use a plain loop, but you can fold the check into one of the comprehension's conditionals if you put the statement into a function:

def type_check(type):
    if type['type'] == 'test':
        raise ValueError('sub-types of a [test] type can not contain key_value_pairs.')
    return True


data_list = [
  (
    type['id'],
    sub_type['id'],
    key_value_pair['key'],
    key_value_pair['value']
  )
  for type in my_parsed_json_array
  if 'sub_types' in type
  for sub_type in type['sub_types']
  if 'key_value_pairs' in sub_type and type_check(type)  # runs once per sub_type, not per key/value pair
  for key_value_pair in sub_type['key_value_pairs']
]
Karl Knechtel

You could try an architecture along the lines of

def validate_top(obj):
    if obj['type'] in BAD_TYPES:
        raise ValueError("oof")
    elif obj['type'] not in IRRELEVANT_TYPES: # actually need to include this
        yield obj

def validate_middle(obj):
    # similarly for the next nested level of data
    ...

# and so on

[
    make_row(r)
    for t in validate_top(my_json)
    for m in validate_middle(t)
    # etc...
    for r in validate_last(whatever)
]

The general pattern I have here is using generators (functions, not expressions) to process the data, and then comprehensions to collect it.

In simpler cases, where it isn't worthwhile to separate out multiple levels of processing (or they don't naturally exist), you can still write a single generator and just do something like list(generator(source)). This is, in my mind, still cleaner than using an ordinary function and building the list manually - it still separates 'processing' vs 'collecting' concerns.
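
Applied to the structure in the question, a single-generator version of that idea might look roughly like the following sketch; it folds the [test] rule from the question in as an example, and each check runs once per object it applies to rather than once per key/value pair:

def rows(parsed):
    # Yield (type_id, sub_type_id, key, value) tuples, validating along the way.
    for type in parsed:
        for sub_type in type.get('sub_types', []):
            if 'key_value_pairs' not in sub_type:
                continue
            # This rule is checked once per sub_type, not once per pair.
            if type['type'] == 'test':
                raise ValueError('sub-types of a [test] type can not contain key_value_pairs.')
            for pair in sub_type['key_value_pairs']:
                yield (type['id'], sub_type['id'], pair['key'], pair['value'])

data_list = list(rows(my_parsed_json_array))

Rules that apply at other levels of the hierarchy would sit next to the loop for that level, so each one is evaluated exactly once per object it governs.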