How to split a dataset into multiple folds while keeping the ratio of an attribute fixed


Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.

I want to split this dataset into k parts and use k-fold cross-validation to learn the mapping from the input features to the output. If the dataset is imbalanced, the two output classes are not equally frequent. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.

I think that, for training to work well, each of the k folds should show the same ratio of 0s to 1s as the full dataset (the same 9 to 1 ratio). I know how to do this in Pandas, but my question is how to do it in TFX.

Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:

from tfx import v1 as tfx

# Five equal-sized splits; examples are assigned by hash, not by label.
# input_dir is the directory that holds the CSV files.
output = tfx.proto.Output(
    split_config=tfx.proto.SplitConfig(splits=[
        tfx.proto.SplitConfig.Split(name=f'fold_{i}', hash_buckets=1)
        for i in range(1, 6)
    ]))
example_gen = tfx.components.CsvExampleGen(input_base=input_dir,
                                           output_config=output)

But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?

BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless the dataset has a feature holding the fold ID of each example, which I think is not practical: I might want to change the number of folds as part of the experiment without changing the dataset.


1 answer

Mehran On

I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.

What I could come up with at this point was to split the dataset into a number of tfrecord files. I chose a highly composite number of files so that they can be grouped into (almost) any number of folds I want. I settled on 60, since it is divisible by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then, when loading the files, I have to select which ones go into each split. There are two things to consider here.

First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:

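from tfx import v1 as tfx

# Each split loads every file that matches its glob pattern.
# The variable is named input_config to avoid shadowing the built-in input().
input_config = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
    tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input_config)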

Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:

fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...

Each directory name follows this pattern: between every pair of underscores I put a divisor, a dot, and then a remainder. I include one such divisor/remainder pair for every fold count I may want to use later, when loading the dataset.

In the example above, I have the option to load the files into 3, 4, 5, 6, or 10 folds. The first file goes into split 0 whichever of those fold counts I pick, while the second file goes into split 1 of a 5-fold setup and split 6 of a 10-fold setup.
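The naming scheme can be generated mechanically. Here is a minimal sketch, assuming the files are numbered from 0 and that each remainder is simply the file number modulo the divisor. `DIVISORS` and `shard_dir_name` are hypothetical names used for illustration, not part of TFX:

```python
DIVISORS = (3, 4, 5, 6, 10)  # fold counts to support later (assumed)


def shard_dir_name(shard_index, divisors=DIVISORS):
    """Build 'fold_<d>.<shard_index % d>_...' for one shard."""
    parts = [f"{d}.{shard_index % d}" for d in divisors]
    return "fold_" + "_".join(parts)


# Under the modulo assumption, the four directory names listed above
# correspond to shard numbers 0, 36, 12, and 48.
for i in (0, 36, 12, 48):
    print(shard_dir_name(i))
```

Adding another supported fold count only means adding its divisor to the tuple before the files are written.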

And this is how I'll load them:

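from tfx import v1 as tfx

NUM_FOLDS = 5  # must be one of the divisors encoded in the directory names

# Fold i collects every directory whose name contains "<NUM_FOLDS>.<i>".
input_config = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name=f'fold_{index + 1}',
                          pattern=f'fold_*{NUM_FOLDS}.{index}*/*')
    for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input_config)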

I can change NUM_FOLDS to any of the options 3, 4, 5, 6, or 10, and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I made sure every file has the same ratio of 0s to 1s when I created it, so any combination of files has that ratio too.
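To sanity-check which directories each pattern selects, the matching can be simulated without TFX. This is only a sketch: it assumes 60 shards named by file number modulo each divisor, and it uses Python's `fnmatch` on directory names as a stand-in for the glob matching that TFX performs. `shard_dir_name` and `shards_per_fold` are hypothetical helpers:

```python
from fnmatch import fnmatch

DIVISORS = (3, 4, 5, 6, 10)  # fold counts encoded in the names (assumed)
NUM_SHARDS = 60


def shard_dir_name(i):
    """Directory name for shard i, e.g. 'fold_3.0_4.0_5.1_6.0_10.6'."""
    return "fold_" + "_".join(f"{d}.{i % d}" for d in DIVISORS)


def shards_per_fold(num_folds):
    """Group the shard directories the way the per-fold patterns would."""
    names = [shard_dir_name(i) for i in range(NUM_SHARDS)]
    folds = []
    for index in range(num_folds):
        pattern = f"fold_*{num_folds}.{index}*"
        folds.append([n for n in names if fnmatch(n, pattern)])
    return folds


for k in DIVISORS:
    print(k, [len(fold) for fold in shards_per_fold(k)])
```

One caveat of this substring-style pattern: if 12 were added as a divisor, the pattern for index 1 (`12.1`) would also match `12.10` and `12.11`, so the remainders would need zero-padding or a stricter pattern.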

Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is that you have to split the dataset manually. I did so using pandas, which meant loading the whole dataset into memory, and that might not be possible for every dataset.
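One possible way around the memory limit is to assign records to files while streaming over them, dealing each class out round-robin so that every file ends up with (nearly) the same ratio. The sketch below is not what I actually ran; `assign_shards` is a hypothetical helper, and the synthetic rows stand in for records read one at a time (for example from `csv.DictReader`):

```python
from collections import Counter, defaultdict

NUM_SHARDS = 60


def assign_shards(rows, label_key, num_shards=NUM_SHARDS):
    """Yield (shard_index, row), dealing each class round-robin over shards."""
    seen = defaultdict(int)  # number of records seen so far, per label
    for row in rows:
        label = row[label_key]
        yield seen[label] % num_shards, row
        seen[label] += 1


# Synthetic stand-in for a streamed dataset: 6000 rows, 10% labelled "1".
rows = ({"x": i, "y": "1" if i % 10 == 0 else "0"} for i in range(6000))

counts = defaultdict(Counter)
for shard, row in assign_shards(rows, "y"):
    # In a real pipeline the row would be written to that shard's file here.
    counts[shard][row["y"]] += 1

print(counts[0])  # label counts in the first shard
```

Only one counter per class is kept in memory. Per-class counts differ by at most one record between shards, but the records are not shuffled, so any ordering in the source file carries over into the shards.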