How to properly make a train/test split using `torchdata`?

I've been using the torchdata library (v0.6.0) to construct datapipes for my machine learning model, but I can't seem to figure out how torchdata expects its users to make a train/test split.

Supposing I have a datapipe dp, my first attempt was to use the Sampler datapipe along with a torch.utils.data.SubsetRandomSampler (which is what I expected from this part of the documentation), but this doesn't work how I would've thought:

>>> dp = SequenceWrapper(range(5))
>>> Sampler(dp, SubsetRandomSampler([0, 1, 2]))
Traceback (most recent call last):
...
TypeError: 'SubsetRandomSampler' object is not callable

Maybe torchdata has its own samplers I'm not familiar with.

The only other way I can think of doing this would be to use a Demultiplexer, but this feels unclean to me, because we have to enumerate then "de-enumerate":

>>> train_len = int(len(dp) * 0.8)
>>> dp1, dp2 = dp.enumerate().demux(num_instances=2, classifier_fn=lambda x: x[0] >= train_len)
>>> dp1, dp2 = (d.map(lambda x: x[1]) for d in (dp1, dp2))

Is there an "intended" way of doing this with torchdata which I'm missing?

1 Answer

Djinn (best answer):

PyTorch's tutorial on using DataPipes answers the question:

import torchdata.datapipes.iter as pipes
from torch.utils.data import DataLoader, random_split

# initialize DataPipe with dummy values
dp = pipes.IterableWrapper(range(5))

# compute train/test split sizes (assuming an 80/20 split)
train_size, test_size = int(len(dp) * 0.8), len(dp) - int(len(dp) * 0.8)

# split dataset into train/test sets
train_dataset, test_dataset = random_split(dp, [train_size, test_size])

# create batch sizes for train and test dataloaders
# (loading everything into memory, no minibatches)
batch_train, batch_test = len(train_dataset), len(test_dataset)

# create train and test dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)

# train model
for inputs, targets in train_dataloader:
    ...
    preds = model(inputs)
    loss = loss_fn(preds, targets)
    ...

If you want to use the built-in random_split() method of an Iterable-style DataPipe:

train_dataset, test_dataset = dp.random_split(total_length=len(dp), weights={"train": 0.8, "test": 0.2}, seed=42)

train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)
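
A small follow-up, based on my reading of the torchdata 0.6 docs (so treat it as an assumption rather than part of the original answer): random_split() also appears to accept a target keyword that returns only the named split instead of a tuple, which is convenient when you only want one side of the split:

# Assumption: the `target` keyword of IterDataPipe.random_split() returns just
# the split named in `weights`, rather than a tuple of all splits.
train_dataset = dp.random_split(
    total_length=len(dp), weights={"train": 0.8, "test": 0.2}, seed=42, target="train"
)
test_dataset = dp.random_split(
    total_length=len(dp), weights={"train": 0.8, "test": 0.2}, seed=42, target="test"
)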

Edit: You can directly access the underlying DataPipe from within the split dataset (this works with both IterDataPipe and MapDataPipe):

train_dp = train_dataset.dataset
test_dp = test_dataset.dataset

If you want the output of the random_split() function to be a MapDataPipe, you can always wrap the outputs in SequenceWrapper():

from torchdata.datapipes.map import SequenceWrapper

train_dataset, test_dataset = random_split(dp, [train_size, test_size])
train_mdp = SequenceWrapper(train_dataset)
test_mdp = SequenceWrapper(test_dataset)

The same idea works with IterDataPipe:

train_dataset, test_dataset = random_split(dp, [train_size, test_size])
train_idp = pipes.IterableWrapper(train_dataset)
test_idp = pipes.IterableWrapper(test_dataset)
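
Either way, the re-wrapped splits behave like any other DataPipe, so further datapipe operations can be chained onto them before they are handed to a DataLoader. A minimal sketch (the shuffle and the doubling transform are just placeholders, not part of the original answer):

# sketch: chain extra datapipe ops onto the re-wrapped training split
train_idp = train_idp.shuffle().map(lambda x: x * 2)  # placeholder transform
train_dataloader = DataLoader(train_idp, batch_size=batch_train)

for batch in train_dataloader:
    ...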