Reading files with references to other files in Haskell


I am trying to expand regular markdown with the ability to have references to other files, such that the content in the referenced files is rendered at the corresponding places in the "master" file.

But the furthest I've come is this implementation:

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
                              (do childStrings <- mapM createF children
                                  withFile (_path f) ReadMode $ \handle ->
                                    do fc <- lines <$> hGetContents handle
                                       return $ merge fc childStrings)

ifNExists is just a helper that can be ignored. The real problem happens when reading from the handle: it just returns the empty string, which I assume is due to lazy IO.

I thought that using withFile filepath ReadMode $ \handle -> {- do stuff -} hGetContents handle would be the right solution, since I've read that fcontent <- withFile filepath ReadMode hGetContents is a bad idea.

Another thing that confuses me is that the function

createFT :: File -> IO FTree
createFT f = ifNExists f Null
               (withFile (_path f) ReadMode $ \handle ->
                  do let thisParse = fparse (_id f : _parents f)
                     children <- rights . map (thisParse . trim) . lines <$> hGetContents handle
                     c <- mapM createFT children
                     return $ Node f c)

works like a charm.

So why does createF return just an empty string?

The whole project, along with a directory/file to test, can be found on GitHub.


Here are the datatype definitions

type ID = String

data File = File {_id :: ID, _path :: FilePath, _parents :: [ID]}
          deriving (Show)
data FTree = Null
           | Node { _file :: File
                  , _children :: [FTree]} deriving (Show)

There are 2 answers

dfeuer (accepted answer)

As you suspected, lazy IO is probably the problem. Here's the (awful) rule you have to follow to use it properly without going totally nuts:

A withFile computation must not complete until all (lazy) I/O required to fully evaluate its result has been performed.

If something forces I/O after the handle is closed, you are not guaranteed to get an error, even though that would be very nice. Instead, you get completely undefined behavior.

You break this rule with return $ merge fc childStrings, because this value is returned before it's been fully evaluated. What you can do instead is something vaguely like

let retVal = merge fc childStrings
deepseq retVal $ return retVal
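
Applied to your createF, that would look roughly like this. This is only a sketch: it reuses the question's ifNExists and merge helpers and the module's existing imports, and deepseq comes from Control.DeepSeq (it needs an NFData instance for the result, which String has):

import Control.DeepSeq (deepseq)

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
                              (do childStrings <- mapM createF children
                                  withFile (_path f) ReadMode $ \handle ->
                                    do fc <- lines <$> hGetContents handle
                                       let retVal = merge fc childStrings
                                       -- Force the merged result to normal form while the handle
                                       -- is still open, so every lazy read happens inside withFile.
                                       deepseq retVal $ return retVal)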

An arguably cleaner alternative is to put all the rest of the code that relies on those results into the withFile argument. The only real reason not to do that is if you do a bunch of other work with the results after you're finished with that file. For example, if you're processing a bunch of different files and accumulating their results, then you want to be sure to close each of them when you're done with it. If you're just reading in one file and then acting on it, you can leave it open till you're finished.
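
A minimal sketch of that shape, with a hypothetical processContents standing in for whatever consumes the file:

import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Hypothetical consumer; printing forces the whole string.
processContents :: String -> IO ()
processContents = putStr

readAndProcess :: FilePath -> IO ()
readAndProcess path =
  withFile path ReadMode $ \handle -> do
    contents <- hGetContents handle
    -- Everything that needs the contents happens here, before
    -- withFile closes the handle.
    processContents contents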


By the way, I just submitted a feature request to the GHC team to see if they might be willing to make these kinds of programs more likely to fail early with useful error messages.


Update

The feature request was accepted, and such programs are now much more likely to produce useful error messages. See What caused this "delayed read on closed handle" error? for details.

Petr

I'd strongly suggest you avoid lazy IO, as it always creates problems like this, as described in What's so bad about Lazy I/O? In your case you need to keep the file open until it's fully read, but that would mean closing the file somewhere in pure code, at the point where the content is actually consumed.

One possibility would be to use strict ByteStrings and read files using readFile. This would also make many operations more efficient.
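
For instance, here is a sketch of createF using a strict read via Data.ByteString.Char8 (unpacking to String only to keep the question's merge signature; Char8 assumes 8-bit text, and a real refactor would stay in ByteString or Text throughout):

import qualified Data.ByteString.Char8 as BC

createF :: FTree -> IO String
createF Null = return ""
createF (Node f children) = ifNExists f (_id f)
                              (do childStrings <- mapM createF children
                                  -- BC.readFile reads the whole file strictly, so no handle
                                  -- stays open and nothing is deferred to later evaluation.
                                  fc <- lines . BC.unpack <$> BC.readFile (_path f)
                                  return $ merge fc childStrings)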

Another option would be to use one of the libraries that address the lazy IO problem (see What are the pros and cons of Enumerators vs. Conduits vs. Pipes?). These libraries allow you to separate content production from its processing or consumption. So you could have a producer that reads input files and produces a stream of some tokens, and a pure consumer (not depending on IO) that consumes the stream and produces some result. For example, conduit-extra has a module that converts an attoparsec parser into a consumer. See also Is there a better way to walk a directory tree?
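
As a taste of the streaming style, here is a minimal sketch with the conduit package (combinator names as exported by the Conduit module in recent conduit versions): the producer reads the file incrementally, the rest of the pipeline is ordinary pure filtering, and the file is closed as soon as the pipeline finishes.

import Conduit
import qualified Data.Text as T

-- Count the non-empty lines of a file without lazy IO.
countNonEmptyLines :: FilePath -> IO Int
countNonEmptyLines path = runConduitRes $
     sourceFile path
  .| decodeUtf8C
  .| linesUnboundedC
  .| filterC (not . T.null)
  .| lengthC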