How to avoid loading a large file into a python script repeatedly?


I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.

My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but loading the file takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (I test many models on the same dataset). Here is the loading code:

import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":  # skip the header row
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)

    return train_id_list, train_val_array

This returns a list of labels (col. 9) and a numpy array of the values (cols. 12 to the end) used to train the random forest.

I am going to train many different forms of my model on the same data, so I just want to load the file once and keep it available to feed into my random forest function. I want the file to be an object, I think (I am fairly new to Python).

There are 4 answers

user985366 (Best answer)

If I understand you correctly, the data set does not change, but the model parameters do, and you change them between runs.

I would put the file-loading code in one file and run it in the Python interpreter. Then the data loads once and stays in memory under whatever variable name you assign it to.

Then you can import another file with your model code, and run that with the training data as argument.

If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.

If you need to change the model code between runs, save the modified code under a new filename, import that module, and run it again, passing the same in-memory data to it.

If you don't want to save each model modification under a new filename, you may be able to use the reload functionality (`importlib.reload` on Python 3, the `reload` built-in on Python 2), but it is not recommended (see Proper way to reload a python module from the console).
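The workflow above can be sketched as follows. Here `load_train_data` and `train_model` are stand-ins defined inline so the sketch runs on its own; in practice you would import them from your own files with something like `from load_data import load_train_data`:

```python
import numpy as np

# Hypothetical stand-ins for the user's two functions.
def load_train_data(train_file=None):
    # Pretend this is the slow ~45 s load; here it just fabricates data.
    ids = ["a", "b", "c"]
    vals = np.arange(12, dtype=float).reshape(3, 4)
    return ids, vals

def train_model(ids, vals, n_estimators):
    # Stand-in for fitting a random forest with the given parameters.
    return {"n_estimators": n_estimators, "n_samples": vals.shape[0]}

# In the interpreter session: load once...
train_ids, train_vals = load_train_data()

# ...then train many variants against the same in-memory arrays.
models = [train_model(train_ids, train_vals, n_estimators=n)
          for n in (100, 300, 500)]
print([m["n_estimators"] for m in models])  # [100, 300, 500]
```

Only the two training lines at the bottom need to be re-run (or edited and re-run) between experiments; the expensive load happens once per session.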

Michael Lorton

Simplest way would be to cache the results, like so:

_train_data_cache = {}

def load_cached_train_data(train_file):
    # Parse the file only on the first request; reuse it afterwards.
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]

umbreon222

Try to learn about Python data serialization. You would basically store the parsed data as a Python-specific serialized binary object using Python's marshal module. This would drastically speed up IO of the file. See these benchmarks for performance variations. However, if these random forest models are all trained at the same time, then you could just train them against the data set you already have in memory and release the training data after completion.
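As a rough sketch of this idea, the question's parser can be wrapped so that the parsed lists are dumped to a binary cache on the first run and re-read on later runs. The cache filename here is made up, and note that marshal is a Python-internal format tied to the interpreter version, so `pickle` or numpy's own `np.save`/`np.load` are the more usual choices:

```python
import io
import marshal
import os

import numpy as np

def load_train_data_cached(train_file, cache_file="train_cache.marshal"):
    if os.path.exists(cache_file):
        # Later runs: load the pre-parsed data from the binary cache.
        with open(cache_file, "rb") as f:
            train_id_list, train_val_list = marshal.load(f)
    else:
        # First run: parse the text file as in the question...
        train_id_list, train_val_list = [], []
        with io.open(train_file) as train_f:
            for line in train_f:
                list_line = line.strip().split("\t")
                if list_line[0] != "Domain":
                    train_id_list.append(list_line[9])
                    train_val_list.append([float(x) for x in list_line[12:]])
        # ...then cache the parsed lists (marshal handles lists/str/float).
        with open(cache_file, "wb") as f:
            marshal.dump((train_id_list, train_val_list), f)
    return train_id_list, np.asarray(train_val_list)
```

The numpy array itself is rebuilt from the lists each time because marshal cannot serialize arrays; with `np.save` you could cache the array directly.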

Mehdi Rostami

Load your data in ipython.

my_data = "data.txt"

Write your codes in a python script, say example.py, which uses this data. At the top of the script example.py add these lines:

import sys

args = sys.argv

data = args[1]
...

Now run the python script in ipython:

%run example.py $my_data

Now, when running your python script, you don't need to load data multiple times.