Parallel and conditional: NoneType object has no attribute '__dict__'


For more setup, see this question. I want to create lots of instances of class Toy, in parallel. Then I want to write them to an xml tree.

import itertools
import pandas as pd
import lxml.etree as et
import numpy as np
import sys
import multiprocessing as mp


def make_toys(df):
    l = []
    for index, row in df.iterrows():
        toys = [Toy(row) for _ in range(row['number'])]
        l += [x for x in toys if x is not None]
    return l


class Toy(object):
    def __new__(cls, *args, **kwargs):
        if np.random.uniform() <= 1:
            return super(Toy, cls).__new__(cls, *args, **kwargs)

    def __init__(self, row):
        self.id = None
        self.type = row['type']

    def set_id(self, x):
        self.id = x

    def write(self, tree):
        et.SubElement(tree, "toy", attrib={'id': str(self.id), 'type': self.type})


if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c', 'd'],
        'number': [5, 4, 3, 10]})

    n_cores = 2
    split_df = np.array_split(table, n_cores)

    p = mp.Pool(n_cores)
    pool_results = p.map(make_toys, split_df)
    p.close()
    p.join()
    l = [a for L in pool_results for a in L]

    box = et.Element("box")
    box_file = et.ElementTree(box)

    for i, toy in itertools.izip(range(len(l)), l):
        Toy.set_id(toy, i)

    [Toy.write(x, box) for x in l]

    box_file.write(sys.stdout, pretty_print=True)

This code runs beautifully. But I redefined the __new__ method so that there is only a random chance of instantiating the class. If I change the condition to np.random.uniform() <= 0.5, I expect to create roughly half as many instances as I asked for, randomly determined. Doing this raises the following error:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 380, in _handle_results
    task = get()
AttributeError: 'NoneType' object has no attribute '__dict__'

I don't know what this error means, or how to avoid it. If I run the process monolithically, as in l = make_toys(table), it works fine for any probability.

Another solution

By the way, I know that this can be solved by leaving the __new__ method alone and instead rewriting make_toys() as

def make_toys(df):
    l = []
    for index, row in df.iterrows():
        prob = np.random.binomial(row['number'], 0.1)
        toys = [Toy(row) for _ in range(prob)]
        l += [x for x in toys if x is not None]
    return l

But I'm trying to learn about the error.

1 Answer

unutbu (accepted answer)

I think you've uncovered a surprising "gotcha" caused by Toy instances becoming None as they are passed through the multiprocessing Pool's result Queue.

The multiprocessing.Pool uses queues to pass results from the subprocesses back to the main process.

Per the docs:

When an object is put on a queue, the object is pickled and a background thread later flushes the pickled data to an underlying pipe.

While the actual serialization details may differ, in spirit pickling an instance of Toy produces a stream of bytes such as this:

In [30]: import pickle

In [31]: pickle.dumps(Toy(table.iloc[0]))
Out[31]: "ccopy_reg\n_reconstructor\np0\n(c__main__\nToy\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'type'\np6\nS'a'\np7\nsS'id'\np8\nNsb."

Notice that the module and class of the object are recorded in the stream of bytes: __main__\nToy.

The class itself is not pickled; the stream contains only a reference to the name of the class.

When the stream of bytes is unpickled on the other side of the pipe, Toy.__new__ is called to instantiate a new instance of Toy. The new object's __dict__ is then reconstituted using unpickled data from the byte stream. When the new object is None, it has no __dict__ attribute, and hence the AttributeError is raised.
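This failure can be reproduced with a plain pickle round trip, no multiprocessing required. Below is a minimal sketch (the class Maybe and its always-None __new__ are illustrative stand-ins, not part of the original code), using pickle protocol 2, which reconstructs instances via the class's __new__:

```python
import pickle


class Maybe(object):
    """Stand-in for Toy: __new__ here always loses the coin flip."""
    def __new__(cls, *args, **kwargs):
        return None  # simulate the unlucky np.random.uniform() draw


# Build a live instance by bypassing Maybe.__new__, the way a "lucky"
# Toy would have been created inside the worker process:
obj = object.__new__(Maybe)
obj.type = 'a'
data = pickle.dumps(obj, 2)  # protocol 2 uses cls.__new__ on load

# Unpickling calls Maybe.__new__ again; it returns None, and pickle then
# fails while restoring the instance __dict__ onto that None:
try:
    pickle.loads(data)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute '__dict__'
```

The same round trip happens behind the scenes every time a result object crosses the Pool's result queue.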

Thus, as a Toy instance is passed through the Queue, it might become None on the other side.

I believe this is the reason why using

class Toy(object):
    def __new__(cls, *args, **kwargs):
        x = np.random.uniform() <= 0.5
        if x:
            return super(Toy, cls).__new__(cls, *args, **kwargs)
        logger.info('Returning None')

leads to

AttributeError: 'NoneType' object has no attribute '__dict__'

If you add logging to your script,

import itertools
import pandas as pd
import lxml.etree as et
import numpy as np
import sys
import multiprocessing as mp
import logging
logger = mp.log_to_stderr(logging.INFO)

def make_toys(df):
    result = []
    for index, row in df.iterrows():
        toys = [Toy(row) for _ in range(row['number'])]
        result += [x for x in toys if x is not None]
    return result


class Toy(object):
    def __new__(cls, *args, **kwargs):
        x = np.random.uniform() <= 0.97
        if x:
            return super(Toy, cls).__new__(cls, *args, **kwargs)
        logger.info('Returning None')

    def __init__(self, row):
        self.id = None
        self.type = row['type']

    def set_id(self, x):
        self.id = x

    def write(self, tree):
        et.SubElement(tree, "toy", attrib={'id': str(self.id), 'type': self.type})


if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c', 'd'],
        'number': [5, 4, 3, 10]})

    n_cores = 2
    split_df = np.array_split(table, n_cores)

    p = mp.Pool(n_cores)
    pool_results = p.map(make_toys, split_df)
    p.close()
    p.join()
    l = [a for L in pool_results for a in L]

    box = et.Element("box")
    box_file = et.ElementTree(box)

    for i, toy in itertools.izip(range(len(l)), l):
        toy.set_id(i)

    for x in l:
        x.write(box)

    box_file.write(sys.stdout, pretty_print=True)

you will find that the AttributeError only occurs after a logging message of the form

[INFO/MainProcess] Returning None

Notice that the logging message comes from the MainProcess, not one of the PoolWorker processes. Since the Returning None message comes from Toy.__new__, this shows that Toy.__new__ was called by the main process. This corroborates the claim that unpickling is calling Toy.__new__ and transforming instances of Toy into None.


The moral of the story is that for Toy instances to be passed through a multiprocessing Pool's Queue, Toy.__new__ must always return an instance of Toy. And as you noted, the code can be fixed by instantiating only the desired number of Toys in make_toys:

def make_toys(df):
    result = []
    for index, row in df.iterrows():
        prob = np.random.binomial(row['number'], 0.1)
        result.extend([Toy(row) for _ in range(prob)])
    return result
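Another pickle-safe variant (a sketch of my own, not from the answer; it uses the stdlib random in place of np.random for self-containment) keeps the per-instance coin flip but moves it out of __new__ and into a factory function, so unpickling can always reconstruct an instance:

```python
import random


class Toy(object):
    """__new__ is left alone, so instances always survive pickling."""
    def __init__(self, kind):
        self.id = None
        self.type = kind


def maybe_toy(kind, p=0.5):
    # The coin flip lives here, outside the class machinery,
    # so Toy.__new__ never returns None during unpickling.
    return Toy(kind) if random.random() <= p else None


# Keep only the toys that won the coin flip:
toys = [t for t in (maybe_toy('a') for _ in range(20)) if t is not None]
```

Workers can return such lists safely, since every surviving object round-trips through pickle intact.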

By the way, it is non-standard to call instance methods with Toy.write(x, box) when x is an instance of Toy. The preferred way is to use

x.write(box)

Similarly, use toy.set_id(i) instead of Toy.set_id(toy, i).
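For the index loop, enumerate is also the idiomatic replacement for itertools.izip(range(len(l)), l). A small sketch (Item is a stand-in class, not from the original code):

```python
class Item(object):
    def __init__(self):
        self.id = None

    def set_id(self, x):
        self.id = x


items = [Item() for _ in range(3)]

# enumerate pairs each element with its index directly,
# with no need for izip/range bookkeeping:
for i, item in enumerate(items):
    item.set_id(i)

print([item.id for item in items])  # [0, 1, 2]
```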