I have a bz2-compressed log file with a large number of lines. Each line undergoes a small analysis, which is not relevant here.
I started by reading the lines in text mode:
import bz2

path = 'content.log.bz2'

def method_1(path):
    with bz2.open(path, 'rt') as file:
        lines = [line for line in file]
    return lines
(The analysis happens in the list comprehension.) Profiling with IPython's %prun gives:
1394087 function calls (1394086 primitive calls) in 9.864 seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    17350    3.295    0.000    3.295    0.000 {method 'decompress' of '_bz2.BZ2Decompressor' objects}
        1    3.042    3.042    9.780    9.780 <ipython-input-1-77d0033e4930>:7(<listcomp>)
  1143178    2.232    0.000    2.232    0.000 bz2.py:137(closed)
    16617    0.266    0.000    3.894    0.000 _compression.py:66(readinto)
    16617    0.138    0.000    3.474    0.000 _compression.py:72(read)
    16617    0.138    0.000    4.386    0.000 bz2.py:184(read1)
    66467    0.128    0.000    0.128    0.000 {built-in method builtins.len}
    16617    0.108    0.000    4.001    0.000 {method 'read1' of '_io.BufferedReader' objects}
    16617    0.086    0.000    0.153    0.000 codecs.py:319(decode)
        1    0.083    0.083    9.864    9.864 <string>:1(<module>)
    16617    0.076    0.000    0.247    0.000 _compression.py:16(_check_can_read)
    16619    0.070    0.000    0.171    0.000 bz2.py:151(readable)
    16621    0.068    0.000    0.101    0.000 _compression.py:12(_check_not_closed)
    16617    0.067    0.000    0.067    0.000 {built-in method _codecs.utf_8_decode}
    16617    0.058    0.000    0.058    0.000 {method 'cast' of 'memoryview' objects}
      886    0.009    0.000    0.009    0.000 {method 'read' of '_io.BufferedReader' objects}
I thought I might be better off doing this in binary mode ('rb'), skipping the text decoding:
def method_2(path):
    with bz2.open(path, 'rb') as file:
        lines = [line for line in file]
    return lines
As it turns out, this performs much worse than my first solution. %prun now gives:
8020433 function calls in 39.857 seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1126551   10.761    0.000   36.901    0.000 bz2.py:206(readline)
  1126551    4.739    0.000   11.553    0.000 bz2.py:151(readable)
  1126551    4.655    0.000   16.208    0.000 _compression.py:16(_check_can_read)
  1126551    4.517    0.000    6.814    0.000 _compression.py:12(_check_not_closed)
    17350    4.333    0.000    4.333    0.000 {method 'decompress' of '_bz2.BZ2Decompressor' objects}
  1126551    3.023    0.000    7.947    0.000 {method 'readline' of '_io.BufferedReader' objects}
        1    2.880    2.880   39.780   39.780 <ipython-input-1-77d0033e4930>:12(<listcomp>)
  1126554    2.297    0.000    2.297    0.000 bz2.py:137(closed)
  1126552    1.985    0.000    1.985    0.000 {built-in method builtins.isinstance}
    16617    0.273    0.000    4.924    0.000 _compression.py:66(readinto)
    16617    0.140    0.000    4.515    0.000 _compression.py:72(read)
    66467    0.128    0.000    0.128    0.000 {built-in method builtins.len}
        1    0.073    0.073   39.857   39.857 <string>:1(<module>)
    16617    0.040    0.000    0.040    0.000 {method 'cast' of 'memoryview' objects}
      886    0.009    0.000    0.009    0.000 {method 'read' of '_io.BufferedReader' objects}
Does anyone know what the problem is here? I expected binary mode to be more "machine-like" and therefore faster, but it isn't.
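In case it helps anyone reproduce the comparison without my log file, here is a minimal sketch. The synthetic file content and the helper names (make_sample, read_text, read_binary, compare) are made up for illustration and are not part of my actual code. Timings will depend on the Python version and the machine, so the gap may look different from the profiles above.

```python
import bz2
import os
import tempfile
import timeit


def make_sample(path, n_lines=20000):
    # Write a synthetic bz2-compressed log; the line content is invented.
    with bz2.open(path, 'wt', encoding='utf-8') as file:
        for i in range(n_lines):
            file.write(f'line {i}: some log message\n')


def read_text(path):
    # Same pattern as method_1: iterate over lines in text mode.
    with bz2.open(path, 'rt', encoding='utf-8') as file:
        return [line for line in file]


def read_binary(path):
    # Same pattern as method_2: iterate over lines in binary mode.
    with bz2.open(path, 'rb') as file:
        return [line for line in file]


def compare(n_lines=20000, repeats=3):
    # Time both readers on the same temporary file and return the totals.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.log.bz2')
        make_sample(path, n_lines)
        t_text = timeit.timeit(lambda: read_text(path), number=repeats)
        t_binary = timeit.timeit(lambda: read_binary(path), number=repeats)
    return t_text, t_binary


if __name__ == '__main__':
    t_text, t_binary = compare()
    print(f'text mode:   {t_text:.3f} s')
    print(f'binary mode: {t_binary:.3f} s')
```

Both readers return the same lines, once as str and once as bytes, so any difference in the timings comes from how the lines are read rather than from the data.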