Python bz2 readlines slow in byte-mode


I have a bz2-compressed log file with a lot of lines. Every line has to undergo a small analysis which is of no importance here.

I started by reading the lines in text mode like:

import bz2

path = 'content.log.bz2' 

def method_1(path):
    with bz2.open(path, 'rt') as file:
        lines = [line for line in file]
    return lines

(The analysis happens in the list comprehension.) Profiling with IPython's %prun gives:

1394087 function calls (1394086 primitive calls) in 9.864 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    17350    3.295    0.000    3.295    0.000 {method 'decompress' of '_bz2.BZ2Decompressor' objects}
        1    3.042    3.042    9.780    9.780 <ipython-input-1-77d0033e4930>:7(<listcomp>)
  1143178    2.232    0.000    2.232    0.000 bz2.py:137(closed)
    16617    0.266    0.000    3.894    0.000 _compression.py:66(readinto)
    16617    0.138    0.000    3.474    0.000 _compression.py:72(read)
    16617    0.138    0.000    4.386    0.000 bz2.py:184(read1)
    66467    0.128    0.000    0.128    0.000 {built-in method builtins.len}
    16617    0.108    0.000    4.001    0.000 {method 'read1' of '_io.BufferedReader' objects}
    16617    0.086    0.000    0.153    0.000 codecs.py:319(decode)
        1    0.083    0.083    9.864    9.864 <string>:1(<module>)
    16617    0.076    0.000    0.247    0.000 _compression.py:16(_check_can_read)
    16619    0.070    0.000    0.171    0.000 bz2.py:151(readable)
    16621    0.068    0.000    0.101    0.000 _compression.py:12(_check_not_closed)
    16617    0.067    0.000    0.067    0.000 {built-in method _codecs.utf_8_decode}
    16617    0.058    0.000    0.058    0.000 {method 'cast' of 'memoryview' objects}
      886    0.009    0.000    0.009    0.000 {method 'read' of '_io.BufferedReader' objects}

I thought I might be better off doing this in byte mode:

def method_2(path):
    with bz2.open(path, 'rb') as file:
        lines = [line for line in file]
    return lines

As it turns out, this performs much worse than my first solution. %prun now gives:

 8020433 function calls in 39.857 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1126551   10.761    0.000   36.901    0.000 bz2.py:206(readline)
  1126551    4.739    0.000   11.553    0.000 bz2.py:151(readable)
  1126551    4.655    0.000   16.208    0.000 _compression.py:16(_check_can_read)
  1126551    4.517    0.000    6.814    0.000 _compression.py:12(_check_not_closed)
    17350    4.333    0.000    4.333    0.000 {method 'decompress' of '_bz2.BZ2Decompressor' objects}
  1126551    3.023    0.000    7.947    0.000 {method 'readline' of '_io.BufferedReader' objects}
        1    2.880    2.880   39.780   39.780 <ipython-input-1-77d0033e4930>:12(<listcomp>)
  1126554    2.297    0.000    2.297    0.000 bz2.py:137(closed)
  1126552    1.985    0.000    1.985    0.000 {built-in method builtins.isinstance}
    16617    0.273    0.000    4.924    0.000 _compression.py:66(readinto)
    16617    0.140    0.000    4.515    0.000 _compression.py:72(read)
    66467    0.128    0.000    0.128    0.000 {built-in method builtins.len}
        1    0.073    0.073   39.857   39.857 <string>:1(<module>)
    16617    0.040    0.000    0.040    0.000 {method 'cast' of 'memoryview' objects}
      886    0.009    0.000    0.009    0.000 {method 'read' of '_io.BufferedReader' objects}

Does anyone know what the problem is here? I expected byte mode to be more "machine-like" and therefore faster, but it isn't.
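
One reading of the two profiles: in byte mode every line goes through the Python-level readline in bz2.py (1,126,551 calls, each repeating the readable/closed checks). In text mode the wrapper pulls data in chunks through read1 (16,617 calls) and splits lines itself. If that is the cause, wrapping the binary file object in io.BufferedReader might move the line splitting out of the per-line Python call. The sketch below assumes this; method_3 is a hypothetical name and has not been profiled against the log file above.

```python
import bz2
import io


def method_3(path):
    # Hypothetical variant of method_2: io.BufferedReader pulls decompressed
    # data from the BZ2File in larger chunks and yields lines from its own
    # buffer, instead of calling BZ2File.readline() once per line.
    with io.BufferedReader(bz2.open(path, 'rb')) as file:
        lines = [line for line in file]
    return lines
```

The returned lines are bytes objects, as in method_2, so the per-line analysis should not need to change. Whether this actually beats text mode would have to be checked with the same %prun run.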
