I'm trying to process a very large text file (~11 GB) on a remote AWS server. The processing that needs to be done on the file is fairly complex, and with a regular single-process Python program the total run time is about a month. To reduce the runtime I'm trying to divide the work on the file between several processes. The server has 30 GB of RAM.
```python
import logging
import multiprocessing


def initiate_workers(works, num_workers, output_path):
    """
    :param works: Iterable of lists of strings (the work to be processed, divided into num_workers pieces)
    :param num_workers: Number of workers
    :return: A list of Process objects where each object is ready to process its share.
    """
    res = []
    for i in range(num_workers):
        # process_batch is the processing function
        res.append(multiprocessing.Process(target=process_batch,
                                           args=(output_path + str(i), works[i])))
    return res


def run_workers(workers):
    """
    Run the workers and wait for them to finish
    :param workers: Iterable of Process objects
    """
    logging.info("Starting multiprocessing..")
    for i in range(len(workers)):
        workers[i].start()
        logging.info("Started worker " + str(i))
    for j in range(len(workers)):
        workers[j].join()
```
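For completeness, this is roughly how those functions are driven (the file name, the chunking helper `split_into_batches`, and the worker count here are placeholders, not my exact code):

```python
def split_into_batches(lines, num_workers):
    # Naive chunking: slice the in-memory list into num_workers roughly equal pieces
    batch_size = (len(lines) + num_workers - 1) // num_workers
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


if __name__ == "__main__":
    num_workers = 6
    with open("corpus.txt", encoding="utf-8") as f:
        lines = f.readlines()  # loads the whole ~11 GB file into RAM
    works = split_into_batches(lines, num_workers)
    workers = initiate_workers(works, num_workers, "out_")
    run_workers(workers)
```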
I get the following traceback:
```
Traceback (most recent call last):
  File "w2v_process.py", line 93, in <module>
    run_workers(workers)
  File "w2v_process.py", line 58, in run_workers
    workers[i].start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
```
It doesn't matter whether num_workers is 1, 6, or 14; it always crashes.
What am I doing wrong?
Found the problem. I saw somewhere on SO that fork (the last line of the traceback) may need to reserve as much memory as the parent process is already using, effectively doubling the RAM requirement. While processing the file I had loaded it entirely into memory, which filled ~18 GB; given that the entire capacity of the RAM is 30 GB, there is indeed a memory allocation error when forking on top of that. I divided the large file into smaller files (one per worker) and gave each Process object the path to its own file. This way each process reads its data lazily, and everything works great!
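For anyone hitting the same issue, this is roughly what the fix looks like. `split_file`, the chunk naming, and `do_the_actual_processing` are illustrative placeholders; the real work happens inside `process_batch`, which now receives a path instead of a list of lines:

```python
def split_file(input_path, num_workers, chunk_prefix):
    # Split the big file into num_workers smaller files, line by line,
    # so the parent process never holds the whole corpus in memory.
    outputs = [open(chunk_prefix + str(i), "w", encoding="utf-8")
               for i in range(num_workers)]
    with open(input_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f):
            outputs[line_number % num_workers].write(line)
    for out in outputs:
        out.close()


def process_batch(output_path, input_path):
    # Each worker streams its own chunk lazily instead of receiving
    # an in-memory list of strings from the parent.
    with open(input_path, encoding="utf-8") as f, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in f:
            out.write(do_the_actual_processing(line))  # placeholder for the real processing
```

`initiate_workers` then passes each worker the path of its chunk instead of `works[i]`, so the parent stays small and `os.fork()` no longer has to account for ~18 GB of loaded data.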