fileinput.hook_compressed gives me strings sometimes, bytes other times

I'm trying to read lines from a number of files. Some are gzipped, and others are plain text files. In Python 2.7, I have been using the following code and it worked:

for line in fileinput.input(filenames, openhook=fileinput.hook_compressed):
    match = REGEX.match(line)
    if (match):
        # do things with line...

Now I moved to Python 3.8, and it still works ok with plain text files, but when it encounters gzipped files I get the following error:

TypeError: cannot use a string pattern on a bytes-like object

What's the best way to fix this? I know I can check if line is a bytes object and decode it into a string, but I would rather do it with some flag to automatically always iterate lines as string, if possible; and, I would prefer to write code that works with both Python 2 and 3.

There are 2 answers

Answer by Mad Physicist (best answer)

With fileinput.hook_compressed, fileinput.input does fundamentally different things depending on whether it gets a gzipped file or not. Plain text files are opened with the regular open, which defaults to text mode. Gzipped files are opened with gzip.open, whose default mode is binary, which is sensible for compressed files of unknown content.
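The difference in default modes can be seen with gzip.open alone. The following is a minimal sketch for Python 3; the helper names read_first_line and demo and the file name sample.txt.gz are made up for illustration and are not part of fileinput or gzip.

```python
import gzip
import os
import tempfile


def read_first_line(path, mode):
    """Return the first line of a gzip file opened with the given mode."""
    with gzip.open(path, mode) as handle:
        return handle.readline()


def demo():
    """Read the same gzip file in the default mode and in text mode."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical file name, used only for this demonstration.
        path = os.path.join(tmpdir, "sample.txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write("hello\n")

        # For gzip.open, "r" is the same as "rb": lines come back as bytes.
        as_default = read_first_line(path, "r")
        # "rt" asks for text mode: lines come back as str.
        as_text = read_first_line(path, "rt")
    return as_default, as_text


if __name__ == "__main__":
    as_default, as_text = demo()
    print(type(as_default), type(as_text))
```

Since the hook receives the mode 'r' by default, the gzipped branch ends up in the bytes case, which is what triggers the TypeError in the question.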

You cannot simply pass mode='rt' to force text mode everywhere, because fileinput.FileInput artificially restricts the modes it accepts. From the code of the __init__ method:

# restrict mode argument to reading modes
if mode not in ('r', 'rU', 'U', 'rb'):
    raise ValueError("FileInput opening mode must be one of "
                     "'r', 'rU', 'U' and 'rb'")
if 'U' in mode:
    import warnings
    warnings.warn("'U' mode is deprecated",
                  DeprecationWarning, 2)
self._mode = mode

This leaves you with three options for a workaround.

Option 1

Set the _mode attribute after __init__. To avoid adding extra lines of code to your usage, you can subclass fileinput.FileInput and use the class directly:

class TextFileInput(fileinput.FileInput):
    def __init__(self, *args, **kwargs):
        # Pull a text mode such as 'rt' out of kwargs so the base class check does not reject it
        if 'mode' in kwargs and 't' in kwargs['mode']:
            mode = kwargs.pop('mode')
        else:
            mode = ''
        super().__init__(*args, **kwargs)
        if mode:
            self._mode = mode

for line in TextFileInput(filenames, openhook=fileinput.hook_compressed, mode='rt'):
    ...

Option 2

Messing with undocumented leading-underscore attributes is pretty hacky, so instead, you can create a custom hook for compressed files. This is actually pretty easy, since you can use the code for fileinput.hook_compressed as a template:

import os

def my_hook_compressed(filename, mode):
    # Ask for text mode explicitly so gzip and bz2 return str instead of bytes
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    else:
        return open(filename, mode)

Option 3

Finally, you can always decode your bytes to unicode strings. This is clearly not the preferable option.
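For completeness, decoding can be kept out of the main loop with a small generator. This is only a sketch for Python 3: the names iter_text_lines and read_all are hypothetical, and UTF-8 is an assumed encoding that may not match your files.

```python
import fileinput


def iter_text_lines(lines, encoding="utf-8"):
    """Yield every line as str, decoding the ones that arrive as bytes."""
    for line in lines:
        if isinstance(line, bytes):
            # Assumption: the compressed files contain UTF-8 text.
            line = line.decode(encoding)
        yield line


def read_all(filenames):
    """Collect decoded lines from a mix of plain and compressed files."""
    source = fileinput.input(filenames, openhook=fileinput.hook_compressed)
    try:
        return list(iter_text_lines(source))
    finally:
        # Close the fileinput state so a later fileinput.input() call works.
        source.close()
```

Lines that are already str pass through untouched, so the same loop body can run on plain and gzipped input alike.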

Answer by lab115

Extending the answer by Mad Physicist to include xz and zst extensions.

import os

def my_hook_compressed(filename, mode):
    """hook for fileinput so we can also handle compressed files seamlessly"""
    if 'b' not in mode:
        mode += 't'
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        import gzip
        return gzip.open(filename, mode)
    elif ext == '.bz2':
        import bz2
        return bz2.open(filename, mode)
    elif ext == '.xz':
        import lzma
        return lzma.open(filename, mode)
    elif ext == '.zst':
        # zstandard is a third-party package; this branch returns text regardless of mode
        import zstandard, io
        compressed = open(filename, 'rb')
        decompressor = zstandard.ZstdDecompressor()
        stream_reader = decompressor.stream_reader(compressed)
        return io.TextIOWrapper(stream_reader)
    else:
        return open(filename, mode)

I have not tested this on 2.7, but it works with 3.8+:

for line in fileinput.input(filenames, openhook=my_hook_compressed):
    ...