Assume that we have the following dataset, where the columns f1 to f4 hold the inputs for time steps 1 to 4.
f1 f2 f3 f4 target
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
Each sample is a sequence of 4 time steps, and the model outputs a single number (the target). In the first sample, the step 1 input is 1, the step 2 input is 2, the step 3 input is 3, and the step 4 input is 4. We will train a sequence model (an RNN, an LSTM, or similar) that should output 5 for this particular sequence. The same logic applies to the other samples.
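To make the structure of the samples concrete, here is a minimal sketch of how such a table could be produced with a sliding window. It assumes the rows come from one underlying series 1, 2, ..., 9, which the post does not state explicitly; `make_windows` and `n_steps` are hypothetical names used only for illustration.

```python
def make_windows(series, n_steps):
    """Turn a 1-D series into (inputs, target) pairs using a sliding window."""
    samples = []
    for start in range(len(series) - n_steps):
        inputs = series[start:start + n_steps]   # n_steps consecutive values
        target = series[start + n_steps]         # the value that follows them
        samples.append((inputs, target))
    return samples


# Assumption: the dataset was generated from the series 1..9.
series = list(range(1, 10))
samples = make_windows(series, n_steps=4)

for inputs, target in samples:
    print(inputs, "->", target)
```

Under that assumption, each row of the table is one window over the series, which is why consecutive rows share most of their values.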
I am concerned about how to divide such a dataset into train and dev sets. (Just ignore the test set for the time being.)
Alternative 1: Say that the first 3 samples make the train set and the following 2 samples make the dev set, as illustrated below.
Train set:
f1 f2 f3 f4 target
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
Dev set:
f1 f2 f3 f4 target
4 5 6 7 8
5 6 7 8 9
My concern is: if you look at the last train set sample ([3, 4, 5, 6], 7) and the first dev set sample ([4, 5, 6, 7], 8), you will see that three of the four input values (4, 5, 6) are identical, only shifted by one step. (The other dev set sample has a similar overlap.)
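The overlap can be checked directly. The following is a rough sketch of the Alternative 1 split on the five samples, assuming a simple chronological cut after the third sample; `shared_values` is a hypothetical helper, not part of any library.

```python
# The five samples from the table, as (inputs, target) pairs.
samples = [
    ([1, 2, 3, 4], 5),
    ([2, 3, 4, 5], 6),
    ([3, 4, 5, 6], 7),
    ([4, 5, 6, 7], 8),
    ([5, 6, 7, 8], 9),
]

# Alternative 1: the first 3 samples form the train set, the rest the dev set.
n_train = 3
train, dev = samples[:n_train], samples[n_train:]


def shared_values(a, b):
    """Values that appear in both input sequences, ignoring their positions."""
    return sorted(set(a) & set(b))


last_train_inputs = train[-1][0]
overlaps = [shared_values(last_train_inputs, dev_inputs) for dev_inputs, _ in dev]
print(overlaps)

# No dev sample is an exact copy of a train sample (same inputs and target).
has_duplicate = any(d in train for d in dev)
print(has_duplicate)
```

The check distinguishes two things: values shared between windows, which do occur here, and whole samples duplicated across the two sets, which do not.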
Q1: Is it a problem that some input values are identical across the train and dev sets? Or can we say that it should not matter, because (1) even when input values are identical, they occur at different steps of the sequence, and (2) the target values are still different for each sequence?
Q2: With respect to the problem above, how should the test set be created?
No, it is not a problem. The shared values sit at different time steps, so the sequences themselves are not identical, and they also have different targets. If you train your model well, it should learn to predict the next value in the sequence.