I have to parse a large log file (2GB) using reg ex in python. In the log file regular expression matches line which I am interested in. Log file can also have unwanted data.
Here is a sample from the file:
"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"
My regular expression is ".DEBUG.*MSG."
First I will split it using the white spaces then the "field:value" patterns are inserted into the sqlite3 database; but for large files it takes around 10 to 15 minutes to parse the file.
Please suggest the best way to do the above task in minimal time.
As others have said, profile your code to see why it is slow. The
cProfile
module in conjunction with thegprof2dot
tool can produce nice readable informationWithout seeing your slow code, I can guess a few things that might help:
First is you can probably get away with using the builtin string methods instead of a regex - this might be marginally quicker. If you need to use regex's, it's worthwhile precompiling outside the main loop using
re.compile
Second is to not do one insert query per line, instead do the insertions in batches, e.g add the parsed info to a list, then when it reaches a certain size, perform one INSERT query with
executemany
method.Some incomplete code, as an example of the above: