In C#, how can I copy a file with arbitrary encoding, reading line by line, without adding or deleting a newline?

I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example I am changing the encoding, but there are other uses for this kind of processing.

What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:

Rec1<newline>
Rec2<newline>

And a file with these:

Rec1<newline>
Rec2

How can I tell the difference in my code so that I can take appropriate action?

using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
    bool isFirstLine = true;

    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();

        if (isFirstLine)
        {
            writer.Write(line);
            isFirstLine = false;
        }
        else
        {
            writer.Write("\r\n" + line);
        }
    }


    //if (LastLineHasNewline)
    //{
    //  writer.Write("\n");
    //}

    writer.Flush();
}

The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline. Remember, I have no a priori knowledge of the input file encoding.
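To illustrate why the loop above cannot see the difference on its own, here is a minimal sketch written as a C# 9 top-level program. It uses StringReader instead of a file, and the helper name ReadAllLines is hypothetical. ReadLine strips the line terminator, so both inputs produce the same sequence of lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collect every line that ReadLine returns for the given text.
static List<string> ReadAllLines(string text)
{
    var lines = new List<string>();
    using (var reader = new StringReader(text))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            lines.Add(line);
    }
    return lines;
}

List<string> withNewline = ReadAllLines("Rec1\nRec2\n");
List<string> withoutNewline = ReadAllLines("Rec1\nRec2");

// Both inputs yield the same two lines, so the trailing newline is invisible here.
Console.WriteLine(string.Join("|", withNewline));    // Rec1|Rec2
Console.WriteLine(string.Join("|", withoutNewline)); // Rec1|Rec2
```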

There are 2 answers

4
Jon Skeet (best answer)

"Remember, I have no a priori knowledge of the input file encoding."

That's the fundamental problem to solve.

If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
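As a small illustration of that point, the sketch below (a C# 9 top-level program; the helper name NewlineBytes is hypothetical) prints the bytes that a few common encodings use for "\n". The byte pattern differs per encoding, so a line ending cannot be recognized in raw bytes until the encoding is known.

```csharp
using System;
using System.Text;

// Hypothetical helper: hex dump of how an encoding represents "\n".
static string NewlineBytes(Encoding encoding) =>
    BitConverter.ToString(encoding.GetBytes("\n"));

Console.WriteLine(NewlineBytes(Encoding.UTF8));             // 0A
Console.WriteLine(NewlineBytes(Encoding.Unicode));          // 0A-00 (UTF-16 little-endian)
Console.WriteLine(NewlineBytes(Encoding.BigEndianUnicode)); // 00-0A (UTF-16 big-endian)
Console.WriteLine(NewlineBytes(Encoding.UTF32));            // 0A-00-00-00 (UTF-32 little-endian)
```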

I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
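One heuristic of the kind mentioned above is byte order mark (BOM) detection, which StreamReader can do itself. The following is only a sketch under stated assumptions: the helper name DetectFromBom and the sample byte arrays are hypothetical, the input is held in memory, and a file without a BOM simply keeps whatever fallback encoding is supplied, which may be wrong.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: report the encoding StreamReader infers from a BOM,
// keeping the supplied fallback when no BOM is present.
static Encoding DetectFromBom(byte[] data, Encoding fallback)
{
    using (var stream = new MemoryStream(data))
    using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
    {
        reader.Peek(); // detection happens on the first read, not in the constructor
        return reader.CurrentEncoding;
    }
}

// UTF-16 little-endian BOM (FF FE) followed by "A\n".
byte[] utf16 = { 0xFF, 0xFE, 0x41, 0x00, 0x0A, 0x00 };
// No BOM: the reader keeps the fallback encoding.
byte[] noBom = { 0x41, 0x0A };

Console.WriteLine(DetectFromBom(utf16, Encoding.UTF8).CodePage); // 1200 (UTF-16 LE)
Console.WriteLine(DetectFromBom(noBom, Encoding.UTF8).CodePage); // 65001 (UTF-8)
```

BOM detection says nothing about files that lack a BOM, so asking the user or applying a further heuristic may still be needed in that case.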

3
David Rogers

As often happens, the moment you go to ask for help, the answer comes to the surface. The commented-out code becomes:

if (LastLineHasNewline(reader))
{
    // Use the same "\r\n" terminator that the loop writes between lines.
    writer.Write("\r\n");
}

And the function looks like this:

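// Call after the reader has read at least once; CurrentEncoding is only detected on the first read.
private static bool LastLineHasNewline(StreamReader reader)
{
    byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
    int newlineByteCount = newlineBytes.Length;
    Stream stream = reader.BaseStream; // must be seekable

    // A stream shorter than one encoded newline cannot end with one,
    // and seeking before its start would throw.
    if (stream.Length < newlineByteCount)
        return false;
    stream.Seek(-newlineByteCount, SeekOrigin.End);

    // Read may return fewer bytes than requested, so loop until the buffer is full.
    byte[] inputBytes = new byte[newlineByteCount];
    int total = 0;
    while (total < newlineByteCount)
    {
        int read = stream.Read(inputBytes, total, newlineByteCount - total);
        if (read == 0)
            return false;
        total += read;
    }

    for (int i = 0; i < newlineByteCount; i++)
    {
        if (newlineBytes[i] != inputBytes[i])
            return false;
    }
    return true;
}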
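The check above inspects raw bytes at the end of the stream, so it needs a seekable stream. As an alternative, and not the method used in this answer, the trailing newline can be detected on decoded characters instead. The sketch below is a C# 9 top-level program; the helper names EndsWithNewline and Utf16Reader are hypothetical, and it assumes the reader decodes the input correctly.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: true when the decoded text ends with '\n'.
// It works on characters, not bytes, so it relies on the reader's decoding.
static bool EndsWithNewline(TextReader reader)
{
    char[] buffer = new char[4096];
    char last = '\0';
    int count;
    while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        last = buffer[count - 1]; // remember the final character of each chunk
    return last == '\n';
}

// Hypothetical helper: an in-memory UTF-16 stream standing in for a file.
static StreamReader Utf16Reader(string text) =>
    new StreamReader(new MemoryStream(Encoding.Unicode.GetBytes(text)), Encoding.Unicode);

Console.WriteLine(EndsWithNewline(Utf16Reader("Rec1\nRec2\n"))); // True
Console.WriteLine(EndsWithNewline(Utf16Reader("Rec1\nRec2")));   // False
```

This version consumes the reader, so it costs an extra pass over the input before the line-by-line copy; the byte-level check avoids that pass at the price of requiring a seekable stream.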