I am trying to move a large amount of data around, and I read that I can use a few Python features and standard-library modules to speed things up:
- generators
- itertools.islice
- multiprocessing
Here is some generic code that I intend to expand on. Is this the right path?
import multiprocessing
import itertools

# Define a generator function to yield lines of text
def generate_lines():
    # Replace this with your logic to generate lines of text
    for i in range(10000):
        yield f"Line {i + 1}"

# Function to write lines to a file
def write_lines(filename, lines):
    with open(filename, 'w') as file:
        for line in lines:
            file.write(line + '\n')

if __name__ == '__main__':
    # Create a pool of processes
    with multiprocessing.Pool(2) as pool:
        # Use itertools.islice to split the generator into chunks of 5000 lines each
        chunk_size = 5000
        for i in range(0, 10000, chunk_size):
            chunk = itertools.islice(generate_lines(), i, i + chunk_size)
            pool.apply_async(write_lines, (f'file{i // chunk_size + 1}.txt', chunk))

        # Wait for all processes to complete
        pool.close()
        pool.join()

    print("Writing completed successfully.")
I basically never want to expand the whole list in memory, and I also want to roughly double my speed by using the pool.
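For context, this is my mental model of how islice lets me build chunks lazily from a single iterator, without ever holding all the lines at once (chunked is just a helper name I made up, not something from itertools):

import itertools

def chunked(iterable, size):
    # Pull `size` items at a time from one shared iterator; only the
    # current chunk is ever materialized in memory.
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            break
        yield chunk

# e.g. for batch in chunked(generate_lines(), 5000): ...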
My last issue is this: when I read from a large source file instead of generating fake lines, is there any way to read its lines in generator batches as well?
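Something along these lines is what I am picturing for the reading side, but I am not sure it is the right or idiomatic way (read_lines and 'big_input.txt' are just placeholders I made up):

import itertools

def read_lines(path):
    # A file object is already a lazy iterator over its lines, so this
    # should never load the whole file into memory.
    with open(path, 'r') as source:
        for line in source:
            yield line.rstrip('\n')

lines = read_lines('big_input.txt')
while True:
    batch = list(itertools.islice(lines, 5000))  # next 5000 lines, or fewer at the end
    if not batch:
        break
    # ... hand `batch` off to a pool worker here ...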