How to regex per line with Conduit


Based on the provided example, we can get the length of each line:

import Conduit
import Data.Text (Text, pack)
import Text.Regex.TDFA ((=~), getAllTextMatches)
import Control.Monad.IO.Class (liftIO)

wc :: IO ()
wc = runResourceT
       $ runConduit
       $ sourceFile "input.txt"
       .| decodeUtf8C
       .| peekForeverE (lineC lengthCE >>= liftIO . print)

However, how would I get all regex matches for each line, and in the end write them to a file?

regex :: IO ()
regex = runResourceT
      $ runConduit
      $ sourceFile "input.txt"
      .| decodeUtf8C
      .| do
         line <- mapCE (\l -> getAllTextMatches (l =~ "^foo") :: [Text])
         liftIO $ print $ line

Update:

I figured out there's a built-in lines function, but is there a way to print a line and pass it along without consuming it?

grep :: IO ()
grep = runResourceT
    $ runConduit
    $ yield "foo\ndoo"
    .| decodeUtf8C
    .| Data.Conduit.Text.lines
    .| mapC (\a -> a =~ ("[fd]oo" :: Text))
    .| mapM_C (liftIO . (print :: Text -> IO ()))
    .| encodeUtf8C
    .| stdoutC

The above does print each line, but nothing ever reaches stdoutC:

ghci> grep
"foo"
"doo"

Update 2: I figured out how to print in a pipeline:

grep :: IO ()
grep = runResourceT
    $ runConduit
    $ yieldMany ["foo\ndoo", "\nduh"]
    .| decodeUtf8C
    .| Data.Conduit.Text.lines
    .| mapC (\a -> a =~ ("[fd]oo" :: Text) :: Text)
    .| log1
    .| unlinesC
    .| encodeUtf8C
    .| stdoutC

But why does the order of await matter?

log1 :: ConduitT Text Text (ResourceT IO) ()
log1 = do
       Just l <- await -- <- has to be first
       liftIO $ print l
       yield l

Answer by K. A. Buhr (best answer)

It's not that clear from your question what you're trying to do, but if you are trying to copy all matching lines from "input.txt" to "output.txt", kind of like the grep command line utility, then you probably want a conduit that looks something like this:

sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
  .| filterC (=~ ("[fd]oo" :: Text))
  .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"

Note that linesUnboundedC is a function in the "conduit" package that's equivalent to the deprecated lines function from "conduit-extra". Also, filterC is probably more natural here than your mapC: it drops the non-matching lines, rather than emitting an empty match for each of them.

Operating on the text file:

A famous linguist once said
that of all the phrases in the English language,
of all the endless combinations of words in all of history, that
"cellar door"
is the most beautiful.
That's some food for thought.

this conduit will copy the two matching lines to the output:

"cellar door"
That's some food for thought.
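To make the filterC versus mapC remark concrete, here is a minimal pure sketch that runs both on a small in-memory stream. The names sampleLines, mapped, and filtered and the sample input are made up for illustration; runConduitPure and sinkList are used only so the results can be inspected without any file I/O.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Data.Text (Text)
import Text.Regex.TDFA ((=~))

-- Hypothetical sample input, not taken from the question.
sampleLines :: [Text]
sampleLines = ["foo", "bar", "doo"]

-- mapC produces one output per input, so a non-matching line
-- comes out as the empty match "".
mapped :: [Text]
mapped = runConduitPure
       $ yieldMany sampleLines
      .| mapC (\l -> l =~ ("[fd]oo" :: Text) :: Text)
      .| sinkList

-- filterC drops non-matching lines and passes matching lines unchanged.
filtered :: [Text]
filtered = runConduitPure
         $ yieldMany sampleLines
        .| filterC (=~ ("[fd]oo" :: Text))
        .| sinkList
```

In GHCi, mapped evaluates to ["foo","","doo"] while filtered evaluates to ["foo","doo"].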

If you want to write the matching lines to both standard output and output.txt simultaneously, the conduit-friendly method is probably to end your conduit with a sequenceSinks component. (The void call here is needed to get the return type right.)

import Control.Monad (void)

... .| void (sequenceSinks [stdoutC, sinkFile "output.txt"])
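As a rough illustration of why void is needed: sequenceSinks returns the results of all its sinks in a container, so two sinks that each return () would give [(), ()] rather than (). The sketch below swaps in sumC and lengthC (not used in the answer, and the name sumAndCount is made up) purely to make the collected results visible.

```haskell
import Conduit

-- sequenceSinks feeds every input to each sink and collects the sinks'
-- results, so the combined sink returns a list rather than ().
sumAndCount :: [Int]
sumAndCount = runConduitPure
            $ yieldMany [1 .. 5 :: Int]
           .| sequenceSinks [sumC, lengthC]
```

Here sumAndCount evaluates to [15,5]: both sinks saw all five elements.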

If you prefer a log conduit that you can insert in the middle to write a copy to stdout, then the following ought to work:

log1 :: (MonadIO m) => ConduitT Text Text m ()
log1 = passthroughSink (unlinesC .| encodeUtf8C .| stdoutC) pure

or, if you're okay with having the Haskell quoted representations printed (i.e., surrounded by quotation marks with character escaping), then:

log2 :: (MonadIO m, Show a) => ConduitT a a m ()
log2 = passthroughSink printC pure
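If you would rather write the logger by hand, in the style of the log1 from the question, it has to loop over the input, because a single await handles only one element. The following is a sketch of two equivalent ways to do that. The names logLoop and logIter are made up here; awaitForever and iterMC are combinators exported by Conduit.

```haskell
import Conduit

-- Hand-written logger: awaitForever repeats the body for every input,
-- whereas a single await would handle only the first element.
logLoop :: (MonadIO m, Show a) => ConduitT a a m ()
logLoop = awaitForever $ \x -> do
  liftIO (print x)  -- side effect: show the element on stdout
  yield x           -- pass the element downstream unchanged

-- The same behaviour using the iterMC combinator.
logIter :: (MonadIO m, Show a) => ConduitT a a m ()
logIter = iterMC (liftIO . print)
```

Like log2, both print the Haskell quoted representation of each element and then pass it along unchanged.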

Some code to play around with:

{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Control.Monad (void)
import Data.ByteString (ByteString)
import Data.Text (Text)
import Text.Regex.TDFA

c1, c2, c3, c4 :: ConduitT () Void (ResourceT IO) ()

-- copy matching lines from input.txt to output.txt
c1 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| filterC (=~ ("[fd]oo" :: Text))
      .| unlinesC .| encodeUtf8C .| sinkFile "output.txt" 

-- copy final stream to both output.txt and stdout
c2 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| filterC (=~ ("[fd]oo" :: Text))
      .| unlinesC .| encodeUtf8C
      .| void (sequenceSinks [stdoutC, sinkFile "output.txt"])

-- log Text to stdout in the middle of a conduit
c3 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| filterC (=~ ("[fd]oo" :: Text)) .| log1
      .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
  where log1 :: (MonadIO m) => ConduitT Text Text m ()
        log1 = passthroughSink (unlinesC .| encodeUtf8C .| stdoutC) pure

-- log Haskell representations of stream in middle of a conduit
c4 = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| filterC (=~ ("[fd]oo" :: Text)) .| log2
      .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
  where log2 :: (MonadIO m, Show a) => ConduitT a a m ()
        log2 = passthroughSink printC pure

main :: IO ()
main = runResourceT $ runConduit $ c4  -- pick your conduit here
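If what you actually want is every match rather than every matching line, as the getAllTextMatches attempt in the question suggests, one possible approach is to replace filterC with concatMapC. This is only a sketch under that assumption: allMatchesC and writeMatches are made-up names, and the pattern and file names are carried over from the conduits above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Conduit
import Data.Text (Text)
import Text.Regex.TDFA ((=~), getAllTextMatches)

-- Emit every match found in each incoming line (zero or more per line).
allMatchesC :: Monad m => ConduitT Text Text m ()
allMatchesC =
  concatMapC (\l -> getAllTextMatches (l =~ ("[fd]oo" :: Text)) :: [Text])

-- Write one match per line to output.txt.
writeMatches :: ConduitT () Void (ResourceT IO) ()
writeMatches = sourceFile "input.txt" .| decodeUtf8C .| linesUnboundedC
      .| allMatchesC
      .| unlinesC .| encodeUtf8C .| sinkFile "output.txt"
```

It can be run the same way as the others, with runResourceT $ runConduit $ writeMatches. A line with several matches contributes several output lines, and a line with none contributes nothing.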