Merging multiple text files into one and related problems

Question

Asked by Vynylyn on 19 February 2015 at 18:54

I'm using Windows 7 and Python 3.4.

I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:

import os
os.chdir ('C:\Dir')
with open ('test.txt', 'w', encoding = 'UTF8') as OutFile:
    with open ('news01.txt', 'r', encoding = 'UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len (_Line) == 0:
                break
            else:
                _LineString = str (_Line)
                OutFile.write (_LineString)

It worked for that one file, but the result still takes more than one line in the output file, and the output file also contains stray strings such as &amp and &nbsp. The source files don't contain any of them. I also have other text files to merge: news02.txt, news03.txt, news04.txt ... news09.txt.

Considering all this:

How can I correct my code so that it reads all the files one after another, putting each one on a single line?
How can I clean up these strange strings, or prevent them from appearing in my final text?

There are 2 answers

AudioBubble On 19 February 2015 at 19:56

Answering question 1:

You were right about the UTF-8 part.
You probably want a function that takes the input files as a tuple (or list, or *args) of file objects or path strings. It reads every input file and replaces all "\n" (newlines) with a delimiter (default ""). out_file may also appear in in_files, because all input is read before anything is written, but this assumes the contents of all the files fit into memory. Both out_file and the items of in_files can be either paths or already opened file objects.

def write_from_files(out_file, in_files, delimiter="", dir=r"C:\Dir"):
    import io
    import os
    import html  # See part 2 of answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, io.TextIOWrapper):
            file_ = open(file_, "r", -1, "UTF-8")  # If it isn't a file, make it a file
        file_.seek(0, 0)
        # Replace all newlines with delimiter
        output.append(file_.read().replace("\n", delimiter))
        file_.close()  # Close file to prevent IO errors
    if not isinstance(out_file, io.TextIOWrapper):
        out_file = open(out_file, "w", -1, "UTF-8")
    # One line per input file, with HTML entities decoded
    joined = html.unescape("\n".join(output))
    out_file.write(joined)
    out_file.close()
    return joined  # Do not have to return

Answering question 2:

I think you may have copied the text from a web page; this does not happen to me. &amp and &nbsp are HTML entities: the ampersand (&) and the non-breaking space. You need to replace them with their corresponding characters. In Python 3.4 and later I would use html.unescape (the older spelling, html.parser.HTMLParser().unescape, is deprecated since 3.4 and was removed in 3.9). It turns HTML escape sequences into the Unicode characters they stand for. E.g.:

>>> import html
>>> html.unescape("Alpha &lt &beta;")
'Alpha < β'

This will not work in Python 2.x, where the function lives in the HTMLParser module. There, use this instead:

import HTMLParser
joined = HTMLParser.HTMLParser().unescape("\n".join(output))  # then write this string to the output file
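
As a small illustration of the entity clean-up described above, here is a minimal sketch using html.unescape from the Python 3 standard library. The sample string is made up for the example. One detail worth knowing: &nbsp; decodes to a no-break space (U+00A0), not an ordinary space, so the last step, which swaps it for a plain space, is an optional extra and not part of the answer above.

```python
import html

# Hypothetical sample text containing the entities from the question
raw = "Tom &amp; Jerry&nbsp;show &lt;1940&gt;"

# Decode HTML entities into the characters they stand for
decoded = html.unescape(raw)

# Optional: turn the no-break space (U+00A0) into an ordinary space
cleaned = decoded.replace("\u00a0", " ")

print(cleaned)
```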

**aruisdante** · Accepted Answer · 2015-02-19T19:45:22+00:00

Here is an example that will do the merging portion of your question:

def merge_file(infile, outfile, separator = ""):
    print(separator.join(line.strip("\n") for line in infile), file = outfile)


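def merge_files(paths, outpath, separator = ""):
    # Pass the encoding explicitly; the Windows default is not UTF-8
    with open(outpath, 'w', encoding = 'UTF8') as outfile:
        for path in paths:
            with open(path, encoding = 'UTF8') as infile:
                merge_file(infile, outfile, separator)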

Example use:

# Raw strings, so that "\f" in the path is not read as a form-feed escape
merge_files([r"C:\file1.txt", r"C:\file2.txt"], r"C:\output.txt")

Note that this makes the rather large assumption that the contents of 'infile' can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternate merge_file implementation:

def merge_file(infile, outfile, separator = ""):
    for line in infile:
        outfile.write(line.strip("\n")+separator)
    outfile.write("\n")

It's slower, but shouldn't run into memory problems.
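
Putting the two parts of the question together, the following is a rough sketch only, not the code from either answer. It assumes UTF-8 input as in the question, and the function name merge_to_lines and the default separator of a single space are choices made for this illustration. Each input file is collapsed to one line, HTML entities are decoded with html.unescape, and the result is written as one line of the output file.

```python
import html


def merge_to_lines(paths, outpath, separator=" "):
    """Write each file in paths as a single line of outpath (UTF-8)."""
    with open(outpath, "w", encoding="utf-8") as outfile:
        for path in paths:
            with open(path, encoding="utf-8") as infile:
                # Drop the trailing newline of every line, then join them
                text = separator.join(line.rstrip("\n") for line in infile)
            # Decode entities such as &amp; and end this file's single line
            outfile.write(html.unescape(text) + "\n")
```

With the file names from the question, a call could look like merge_to_lines(["news01.txt", "news02.txt"], "test.txt"), after changing to the right directory or passing full paths. Like the first merge_file above, this holds one whole input file in memory at a time.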
