I saw the Wikipedia page but still am not clear with the idea.
To find the longest common substring of 2 strings (T
and S
), I've read that we must build a suffix tree for the string T($1)S($2)
, where`($1) and ($2) are special characters not part of the strings.
But the Wikipedia image for the strings ABAB
and BABA
looks like this:
Why doesn't it contain the entire string ABAB($1)BABA($2)
? Isn't it a suffix of the concatenated string?
What are those numbers on the leaves?
A generalized suffix tree is a variation on a suffix tree in which the suffixes for two (or more) distinct strings T1 and T2 are stored, not just the suffixes of one string T.
One way to build a generalized suffix tree is to start by making a suffix tree for T1$1T2$2. This resulting suffix tree will contain all the suffixes of T1 and T2, but it will also contain a lot of "spurious" suffixes that start in T1 and spread into T2. To fix this, after building the initial suffix tree, you would typically make a second pass over the tree structure and eliminate any suffixes that extend past the $1 marker. This is why, for example, the generalized suffix tree you gave above doesn't contain ABAB$1BABA$2.
As to your next question - what are the numbers in the leaves? - each leaf in a suffix tree is typically tagged with the start index of the suffix that the leaf corresponds to. In a generalized suffix tree, each leaf is tagged with two pieces of information - the start index of the suffix, and which string that suffix belongs to. The notation a:b on a leaf means "this suffix comes from string a, and it starts at index b in that string." For example, the marker 1:3 on the leftmost leaf means "this suffix comes from string 1 and it starts at position 3." You can see that this corresponds to the suffix A$1, which does indeed start at position 3 in ABAB$1, assuming 1-indexing.
Hope this helps!