Building a suffix tree by inserting each suffix in Haskell

1.1k views Asked by At

I am working with the following data type:

data SuffixTree = Leaf Int | Node [(String, SuffixTree)] 
                deriving (Eq, Show)

Each subtree has a corresponding label (string). The idea is to build the corresponding suffix tree by adding each suffix and its index into an accumulating tree (at the beginning it is Node []).

This is already defined

buildTree s
    = foldl (flip insert) (Node []) (zip (suffixes s) [0..length s-1]) 

where suffixes is correctly defined.

I've been trying to implement the insert function for a while but can't seem to succeed.

This is what I have now (the names and style are not the best since this is still work in progress):

insert :: (String, Int) -> SuffixTree -> SuffixTree
insert pair tree@(Node content) 
  = insert' pair tree content
  where
    insert' :: (String, Int) -> SuffixTree -> [(String, SuffixTree)] -> SuffixTree
    insert' (s, n) (Node []) subtrees
      = Node ((s, Leaf n) : subtrees)
    insert' (s, n) (Node content@((a, tree) : pairs)) subtrees
      | null p = insert' (s, n) (Node pairs) subtrees
      | p == a = insert' (r, n) tree subtrees
      | p /= a = Node ((p, newNode) : (subtrees \\ [(a, tree)]))
      where
        (p, r, r')  = partition s a
        newNode     = Node [(r, (Leaf n)), (r', tree)]

The partition function takes two strings and returns a tuple consisting of:

  1. The common prefix (if it exists)
  2. The first string without the prefix
  3. The second string without the prefix

I think I understand the rules needed to build the tree.

We start by comparing the label of the first subtree to the string we want to insert (say, str). If they don't have a prefix in common, we try to insert in the next subtree.

If the label is a prefix of str, we continue to look into that subtree, but instead of using str we try to insert str without the prefix.

If str is a prefix of label, then we replace the existing subtree with a new Node, having a Leaf and the old subtree. We also adjust the labels.

If we don't have a match between str and any label then we add a new Leaf to the list of subtrees.

However, the biggest problem that I have is that I need to return a new tree containing the changes, so I have to keep track of everything else in the tree (not sure how to do this or if I'm thinking correctly about this).

The code appears to be working correctly on this string: "banana":

Node [("a",Node [("",Leaf 5),("na",Node [("",Leaf 3),("na",Leaf 1)])]),
("na",Node [("",Leaf 4),("na",Leaf 2)]),("banana",Leaf 0)]

However, on this string "mississippi" I get an Exception: Non-exhaustive patterns in function insert'.

Any help or ideas are greatly appreciated!

3

There are 3 answers

3
behzad.nouri On BEST ANSWER

You are using a quadratic algorithm; whereas optimally, suffix tree can be constructed in linear time. That said, sticking with the same algorithm, a possibly better approach would be to first build the (uncompressed) suffix trie (not tree) and then compress the resulting trie.

The advantage would be that a suffix trie can be represented using Data.Map:

data SuffixTrie
  = Leaf' Int
  | Node' (Map (Maybe Char) SuffixTrie)

which makes manipulations both more efficient and easier than list of pairs. Doing so, you may also completely bypass common prefix calculations, as it comes out by itself:

import Data.List (tails)
import Data.Maybe (maybeToList)
import Control.Arrow (first, second)
import Data.Map.Strict (Map, empty, insert, insertWith, assocs)

data SuffixTree
  = Leaf Int
  | Node [(String, SuffixTree)]
  deriving Show

data SuffixTrie
  = Leaf' Int
  | Node' (Map (Maybe Char) SuffixTrie)

buildTrie :: String -> SuffixTrie
buildTrie s = foldl go (flip const) (init $ tails s) (length s) $ Node' empty
  where
  go run xs i (Node' ns) = run (i - 1) $ Node' tr
    where tr = foldr loop (insert Nothing $ Leaf' (i - 1)) xs ns
  loop x run = insertWith (+:) (Just x) . Node' $ run empty
    where _ +: Node' ns = Node' $ run ns

buildTree :: String -> SuffixTree
buildTree = loop . buildTrie
  where
  loop (Leaf' i) = Leaf i
  loop (Node' m) = Node $ con . second loop <$> assocs m
  con (Just x, Node [(xs, tr)]) = (x:xs, tr) -- compress single-child nodes
  con n = maybeToList `first` n

then:

\> buildTree "banana"
Node [("a",Node [("",Leaf 5),
                 ("na",Node [("",Leaf 3),
                             ("na",Leaf 1)])]),
      ("banana",Leaf 0),
      ("na",Node [("",Leaf 4),
                  ("na",Leaf 2)])]

similarly:

\> buildTree "mississippi"
Node [("i",Node [("",Leaf 10),
                 ("ppi",Leaf 7),
                 ("ssi",Node [("ppi",Leaf 4),
                              ("ssippi",Leaf 1)])]),
      ("mississippi",Leaf 0),
      ("p",Node [("i",Leaf 9),
                 ("pi",Leaf 8)]),
      ("s",Node [("i",Node [("ppi",Leaf 6),
                            ("ssippi",Leaf 3)]),
                 ("si",Node [("ppi",Leaf 5),
                             ("ssippi",Leaf 2)])])]
0
K. A. Buhr On

Here's how the problem is occurring.

Let's say you're processing buildTree "nanny". After you've inserted the suffixes "nanny", "anny", and "nny", your tree looks like t1 given by:

let t1 = Node t1_content
    t1_content = [("n",t2),("anny",Leaf 1)]
    t2 = Node [("ny",Leaf 2),("anny",Leaf 0)]

Next, you try to insert the prefix "ny":

insert ("ny", 3) t1
= insert' ("ny", 3) t1 t1_content
-- matches guard p == a with p="n", r="y", r'=""
= insert' ("y", 3) t2 t1_content

What you intend to do next is insert ("y", 3) into t2 to yield:

Node [("y", Leaf 3), ("ny",Leaf 2),("anny",Leaf 0)])

Instead, what happens is:

insert' ("y", 3) t2 t1_content
-- have s="y", a="ny", so p="", r="y", r'="ny"
-- which matches guard: null p
= insert' ("y", 3) (Node [("anny", Leaf 0)]) t1_content
-- have s="y", a="anny", so p="", r="y", r'="anny"
-- which matches guard: null p
= insert' ("y", 3) (Node []) t1_content
= Node [("y", Leaf 3), ("n",t2), ("anny",Leaf 1)]

and suffix "y" has been added to t1 instead of t2.

When you next try to insert suffix "y", the guard p==a case tries to insert ("y",3) into Leaf 3 and you get a pattern error.

The reason it works on banana is that you only ever insert a new node at the top level of the tree, so "adding to t2" and "adding to t1" are the same thing.

I suspect you'll need to substantially rethink the structure of your recursion to get this working.

0
David On

Looks like this code does the job, although there may still be improvements to make. I hope that it's general enough to work on any string. I also tried to avoid using ++, but it's still better than nothing.

getContent (Node listOfPairs)
  = listOfPairs

insert :: (String, Int) -> SuffixTree -> SuffixTree
insert (s, n) (Node [])
  = Node [(s, Leaf n)]
insert (s, n) (Node (pair@(a, tree) : pairs))
  | p == a   = Node ((a, insert (r, n) tree) : pairs)
  | null p   = Node (pair : (getContent (insert (r, n) (Node pairs))))
  | p /= a   = Node ([(p, Node [(r, Leaf n), (r', tree)])] ++ pairs)
  where
    (p, r, r') = partition s a