How to write to a file in batches in Python performantly? Anything else I need?

I am trying to move a large amount of data around, and I read that a few Python features can speed things up:

  1. generators
  2. itertools.islice
  3. multiprocessing

Here is some generic code that I intend to expand on. Is this the right path?

import multiprocessing
import itertools

# Define a generator function to yield lines of text
def generate_lines():
    # Replace this with your logic to generate lines of text
    for i in range(10000):
        yield f"Line {i + 1}"

# Function to write lines to a file
def write_lines(filename, lines):
    with open(filename, 'w') as file:
        for line in lines:
            file.write(line + '\n')

if __name__ == '__main__':
    chunk_size = 5000
    lines = generate_lines()  # single generator; each islice call consumes the next chunk

    # Create a pool of processes
    with multiprocessing.Pool(2) as pool:
        file_index = 1
        while True:
            # Materialize one chunk at a time: islice objects cannot be
            # pickled, so apply_async needs a plain list as its argument
            chunk = list(itertools.islice(lines, chunk_size))
            if not chunk:
                break
            pool.apply_async(write_lines, (f'file{file_index}.txt', chunk))
            file_index += 1

        # Wait for all processes to complete
        pool.close()
        pool.join()

    print("Writing completed successfully.")

I basically never want to expand the whole list in memory, and I also want to roughly double my throughput by using the pool.
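I was also wondering whether pool.imap_unordered would be a better fit than apply_async here, since it consumes its input iterable lazily as it feeds the workers. Below is a minimal sketch of what I mean; the helpers chunked() and write_chunk() are names I made up, and the file1.txt, file2.txt, ... naming just mirrors my code above:

import itertools
import multiprocessing

# Same toy generator as above
def generate_lines():
    for i in range(10000):
        yield f"Line {i + 1}"

def chunked(iterable, size):
    # Yield successive lists of at most `size` items; the input iterable
    # is never materialized as a whole
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def write_chunk(job):
    # Each job is a (file index, list of lines) pair
    index, lines = job
    with open(f'file{index}.txt', 'w') as f:
        f.writelines(line + '\n' for line in lines)

if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        jobs = enumerate(chunked(generate_lines(), 5000), start=1)
        # The pool pulls jobs from this generator as it dispatches work,
        # so the full set of chunks is never built up front
        for _ in pool.imap_unordered(write_chunk, jobs):
            pass

    print("Writing completed successfully.")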

My last issue is this: when I read from a source file instead of generating fake lines, is there any way to read lines from a large source file in generator batches as well?
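To make the question concrete, I was imagining something like the sketch below. An open file object is itself a lazy line iterator, so itertools.islice can pull the next batch without the whole file ever being loaded. The path big_input.txt and the helper name read_in_batches() are placeholders I made up:

import itertools

def read_in_batches(path, batch_size=5000):
    with open(path) as f:
        while True:
            # Take the next batch_size lines; the file object yields lines lazily
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                return
            yield batch

if __name__ == '__main__':
    for batch in read_in_batches('big_input.txt'):
        print(f"got a batch of {len(batch)} lines")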
