Reconstructing Huffman tree from (preorder) bitstring in Haskell

556 views Asked by At

I have the following Haskell polymorphic data type:

data Tree a = Leaf Int a | Node Int (Tree a) (Tree a)

The tree will be compressed in a bitstring of 0s and 1s. A '0' signifies a Node and it is followed by the encoding of the left subtree, then the encoding of the right subtree. A '1' signifies a Leaf and is followed by 7 bits of information (for example it might be a char). Each node/leaf is supposed to also contain the frequency of the information stored, but this is not important for this problem (so we can put anything there).

For example, starting from this encoded tree

[0,0,0,1,1,1,0,1,0,1,1,1,1,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,
 1,0,0,1,1,1,1,1,1,1,0,0,1,0,0,1,1,1,1,0,1,1,1,1,1,1,0,0,0,0,1]

it is supposed to give back something like this

Node 0 (Node 0 (Node 0 (Leaf 0 'k') (Leaf 0 't')) 
       (Node 0 (Node 0 (Leaf 0 'q') (Leaf 0 'g')) (Leaf 0 'r'))) 
(Node 0 (Leaf 0 'w') (Leaf 0 'a'))

(spacing is not important, but it did not fit on one line).

I have little experience working with trees, especially when implementing code. I have a vague idea about how I'd solve this on paper (using something similar to a stack to deal with the depth/levels) but I am still a bit lost.

Any help or ideas are appreciated!

3

There are 3 answers

3
Euge On BEST ANSWER

Ok, here's a simple (ad-hoc, but easier to understand) way.

We need to buid a function parse, with the following type:

parse  :: [Int] -> Tree Char

The approach you mentioned, with stacks, is the imperative one. Here we just lay on the recursive calls. The stack will be built by the compiler and it will just have each recursive call stored in it (At least you can imagine it that way, if you want, or just ignore all this paragraph).

So, the idea is the following: whenever you find a 0, you need to make two recursive calls to the algorithm. The first recursive call will read one branch (the left one) of the tree. The second one needs to be called with the rest of the list as argument. The rest left by the first recursive call. So, we need a auxiliar function parse' with the following type (now we return a pair, being the second value the rest of list):

parse' :: [Int] -> (Tree Char, [Int])

Next, you can see a piece of code where the 0 case is just as described before.
For the 1 case, we just need to take the next 7 numbers and make them into a char somehow (I leave the definition of toChar for you), then, just return a Leaf and the rest of the list.

parse' (0:xs) = let (l, xs')    = parse' xs
                    (r, xs'')   = parse' xs' in (Node 0 l r, xs'') --xs'' should be []
parse' (1:xs) = let w = toChar (take 7 xs) in (Leaf 0 w , drop 7 xs)

Finally, our parse function just calls the auxiliary parse one and returns the first element of the pair.

parse xs = fst $ parse' xs
1
behzad.nouri On

do a right fold:

import Data.Char (chr)

data Tree a = Leaf a | Node (Tree a) (Tree a)
  deriving Show

build :: [Int] -> [Tree Char]
build xs = foldr go (\_ _ -> []) xs 0 0
  where
  nil = Leaf '?'
  go 0 run 0 0 = case run 0 0 of
    []     -> Node nil nil:[]
    x:[]   -> Node x   nil:[]
    x:y:zs -> Node x   y  :zs

  go 1 run 0 0 = run 0 1
  go _ _   _ 0 = error "this should not happen!"
  go x run v 7 = (Leaf $ chr (v * 2 + x)): run 0 0
  go x run v k = run (v * 2 + x) (k + 1)

then:

\> head $ build [0,0,0,1,1,1,0, ...] -- the list of 01s as in the question
Node (Node (Node (Leaf 'k') (Leaf 't'))
      (Node (Node (Leaf 'q') (Leaf 'g')) (Leaf 'r')))
 (Node (Leaf 'w') (Leaf 'a'))
2
Benjamin Hodgson On

Well, you're trying to parse a tree of bytes from a bit-stream. Parsing's one of those cases where it pays to set up some structure: we're going to write a miniature parser combinator library in the style of How to Replace Failure by a List of Successes, which will allow us to write our code in an idiomatic functional style and delegate a lot of the work to the machine.

Translating the old rhyme into the language of monad transformers, and reading "string" as "bit-string", we have

newtype Parser a = Parser (StateT [Bool] [] a)
    deriving (Functor, Applicative, Monad, Alternative)

runParser :: Parser a -> [Bool] -> [(a, [Bool])]
runParser (Parser m) = runStateT m

A parser is a monadic computation which operates statefully on a stream of Booleans, yielding a collection of successfully-parsed as. GHC's GeneralizedNewtypeDeriving superpowers allow me to elide the boilerplate instances of Monad et al.

The goal, then, is to write a Parser (Tree SevenBits) - a parser which returns a tree of septuples of Booleans. (You can turn the 7 bits into a Word8 at your leisure by deriving a Functor instance for Tree and using fmap.) I'm going to use the following definition of Tree because it's simpler - I'm sure you can figure out how to adapt this code to your own ends.

data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show

type SevenBits = (Bool, Bool, Bool, Bool, Bool, Bool, Bool)

Here's a parser that attempts to consume a single bit from the input stream, failing if it's empty:

one :: Parser Bool
one = Parser $ do
    stream <- get
    case stream of
        [] -> empty
        (x:xs) -> put xs *> return x

Here's one which attempts to consume a particular bit from the input stream, failing if it doesn't match:

bit :: Bool -> Parser ()
bit b = do
    i <- one
    guard (i == b)

Here I'm pulling a sequence of seven Booleans from the input stream using replicateM and packing them into a tuple. We'll be using this to populate Leaf nodes' contents.

sevenBits :: Parser SevenBits
sevenBits = pack7 <$> replicateM 7 one
    where pack7 [a,b,c,d,e,f,g] = (a, b, c, d, e, f, g)

Now we can finally write the code which parses the tree structure itself. We'll be choosing between the Node and Leaf alternatives using <|>.

tree :: Parser (Tree SevenBits)
tree = node <|> leaf
    where node = bit False *> liftA2 Node tree tree
          leaf = bit True *> fmap Leaf sevenBits

If node succeeds in parsing a low bit from the head of the stream, it continues to recursively parse the encoding of the left subtree followed by the right subtree, sequencing the applicative actions with liftA2. The trick is that node fails if it doesn't encounter a low bit at the head of the input stream, which tells <|> to give up on node and try leaf instead.

Note how the structure of tree reflects the structure of the Tree type itself. This is applicative parsing at work. We could alternately have structured this parser monadically, first using one to parse an arbitrary bit and then using a case analysis on the bit to determine whether we should continue to parse a pair of trees or a leaf. In my opinion this version is simpler, more declarative, and less verbose.

Also compare the clarity of this code to the low-level style of @behzad.nouri's foldr-based solution. Rather than building an explicit finite-state machine which switches between parsing nodes and leaves - an imperative-flavoured idea - my design allows you to declaratively describe the grammar to the machine using standard functions like liftA2 and <|> and trust that the abstractions will do the right thing.

Anyway, here I'm parsing a simple tree consisting of a pair of Leafs containing the (binary-encoded) numbers 0 and 1. As you can see, it returns the single successful parse and an empty stream of remaining bits.

ghci> runParser tree $ map (>0) [0, 1, 0,0,0,0,0,0,0, 1, 0,0,0,0,0,0,1]
[(Node (Leaf (False, False, False, False, False, False, False)) (Leaf (False, False, False, False, False, False, True)),[])]