The problem is a variant of subarray counting. Given an array of numbers, let's say, 1,2,2,3,2,1,2,2,2,2
I look for subarrays and count the frequency of each. I start with looking from some K
length subarrays (example K = 3).
Count of subarray 1,2,2
is C1:2
.
Count of subarray 2,2,3
is 1
.
Count of subarray 2,3,2
is 1
.
and so on.
Now, I look for subarrays of length 2.
Count of subarray 1,2
is C2: 2
. But (1,2) is a subset of the subarray 1,2,2
. So, I calculate its count by subtracting C1
from C2
which gives count of 1,2
as 0. Similarly, count of 2,2
is 1
.
My problem is in handling cases where more than one parent subset exists. I don't consider the sub-arrays in my result set whose frequency comes out to be 1. Example:
1,2,3,1,2,3,1,2,2,3
Here, Count of 1,2,3
is 2
.
Count of 2,3,1
is 2
.
Now, when I look for count of 2,3
, it should be 1 as all the greater length parents have covered the occurrences. How shall I handle these cases?
The approach I thought was to mark all the pattern occurrences of the parent. In the above case, mark all the occurrences of 1,2,3
and 2,3,1
. Array looks like this:
1,2,3,1,2,3,1,2,2,3
X,X,X,X,X,X,X,2,2,3
where X denotes the marked position. Now, frequency of 2,3
we see is 1 as per the positions which are unmarked. So, basically, I mark all the pattern occurrences I find at the current step. For the next step, I start looking for patterns from the unmarked locations only to get the correct count.
I am dealing with the large data on which this seems a bit not-so-good thing to do. Also, I'm not sure if it's correct or not. Any other approaches or ideas can be of big help?
Build suffix array for given array.
To count all repeating subarrays with given length - walk through this suffix array, comparing neighbor suffixes by needed prefix length.
For your first example
With length 2 we can count two subarrays
1,2
and four subarrays2,2
If you want to count any given subarray - for example, all suffixes beginning with
(1,2)
, just use binary search to get the first and the last indexes (likestd:upperbound
andstd:lowerbound
operations in C++ STL).For the same example indexes of the first and last occurrences of
(1,2)
in suffix array are 0 and 1, so count islast-first+1=2