A test fixture lock for downloading and writing data with pytest-xdist


I have Python tests written with pytest. These tests download test data and cache it as locally written files.

Now I am parallelising the tests with pytest-xdist. How can I prevent parallel writes in test fixtures, which would otherwise corrupt the cached data and make tests fail?

Ideally, only one test process needs to download the data and cache it as a file.

Answered by Mikko Ohtamaa

You can use the filelock library to create a lock file for each download that happens within test fixtures or tests.

  • Create a lock file that lets only a single process read or write the cached file at a time
  • The first process to acquire the lock downloads the data and writes the file
  • Subsequent processes find the file already written and return the cached data
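
The three steps above can be sketched directly with `FileLock`, without any helper. This is a minimal sketch rather than the answer's own code: `get_or_download`, the `download` callable, and the `.lock` suffix are hypothetical names chosen for illustration. Only `FileLock(path, timeout=...)` used as a context manager comes from the filelock library.

```python
from pathlib import Path
from typing import Callable

from filelock import FileLock  # third-party: pip install filelock


def get_or_download(path: Path, download: Callable[[], bytes], timeout: float = 120) -> bytes:
    """Return the cached bytes at `path`, downloading them at most once.

    `download` is a hypothetical callable standing in for the real fetch.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = path.parent / (path.name + ".lock")

    # Step 1: only one process at a time gets past this line.
    with FileLock(str(lock_path), timeout=timeout):
        # Step 2: the first lock holder finds no cached file and writes it.
        if not path.exists():
            path.write_bytes(download())
        # Step 3: every later lock holder finds the file and only reads it.
        return path.read_bytes()
```

The existence check sits inside the locked block on purpose: checking before taking the lock would let two processes both see a missing file and both start a download.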

Here is an example function wait_other_writers() which accomplishes the goals above:

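import os
from contextlib import contextmanager
from pathlib import Path

from filelock import FileLock


@contextmanager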
def wait_other_writers(path: Path | str, timeout=120):
    """Wait for other potential writers that are writing the same file.

    - Works around issues when parallel unit tests and the like
      try to write the same file

    Example:

    .. code-block:: python

        import tempfile
        import urllib.request
        from pathlib import Path

        import pytest
        import pandas as pd

        @pytest.fixture()
        def my_cached_test_data_frame() -> pd.DataFrame:

            # All tests use a cached dataset stored in the /tmp directory
            path = Path(tempfile.gettempdir()) / "my_shared_data.parquet"

            with wait_other_writers(path):

                # Download only if a previous writer has not cached the file yet
                if not path.exists():
                    # Download and write to cache
                    urllib.request.urlretrieve("https://example.com", path)

                return pd.read_parquet(path)

    :param path:
        File that is being written

    :param timeout:
        How many seconds to wait to acquire the lock file.

        Default 2 minutes.
    """

    if isinstance(path, str):
        path = Path(path)

    assert isinstance(path, Path), f"Not Path object: {path}"

    assert path.is_absolute(), f"Did not get an absolute path: {path}\n" \
                               f"Please use absolute paths for lock files to prevent polluting the local working directory."

    # If we are writing to a new temp folder, create any parent paths
    os.makedirs(path.parent, exist_ok=True)

    # https://stackoverflow.com/a/60281933/315168
    lock_file = path.parent / (path.name + '.lock')

    lock = FileLock(lock_file, timeout=timeout)
    with lock:
        yield

For the example use of this function, see here.
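
As a rough check of the behaviour, the sketch below runs several threads as stand-ins for pytest-xdist workers (real workers are separate processes) and records how often the download happens. `load_cached`, `fake_download`, `download_log`, and the file name are hypothetical. `wait_other_writers()` is repeated in condensed form only so that the block runs on its own.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from pathlib import Path

from filelock import FileLock  # third-party: pip install filelock


@contextmanager
def wait_other_writers(path: Path, timeout: float = 120):
    """Condensed version of the helper above (Path input only)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with FileLock(str(path.parent / (path.name + ".lock")), timeout=timeout):
        yield


# One entry is appended per simulated download.
download_log: list[str] = []


def fake_download(path: Path) -> None:
    """Hypothetical stand-in for the real network download."""
    download_log.append(path.name)
    path.write_text("cached payload")


def load_cached(path: Path) -> str:
    """What a fixture body would do: download once, then read the cache."""
    with wait_other_writers(path):
        if not path.exists():
            fake_download(path)
        return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_shared_data.txt"
    # Eight concurrent callers compete for the same cache file.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(load_cached, [target] * 8))
```

Under these assumptions, `download_log` should end up with a single entry while every caller receives the same cached content. If a worker cannot acquire the lock within `timeout` seconds, filelock raises its `Timeout` exception instead of waiting forever.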