Parsing a large CSV file into TinyDB takes forever


I have a large CSV file containing 15 columns and approximately 1 million rows. I want to parse the data into TinyDB. The code I use is below:

import csv
from tinydb import TinyDB

db = TinyDB('db.monitor')
table = db.table('Current')

i = 0

datafile = open('newData.csv', 'rb')
data = csv.reader(datafile, delimiter=';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4], 'Temperature': row[5],
                  'Humidity': row[6], 'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10], 'AirDensity': row[12],
                  'VoltageDC': row[13], 'PowerSec': row[14], 'FurlingAngle': row[15]})
    i = i + 1
    print i

However, it really takes forever. I set the i variable to track progress, and while the first lines were parsed quickly, it has now been running for more than an hour and has only reached about 10,000 lines, at a pace of roughly 1 Hz.

I couldn't find anything similar, so any help would be appreciated.

Thank you


There are 6 answers

iFlo (best answer):

Is TinyDB the best choice? You seem to need a transactional database, and TinyDB is document oriented. On top of that, from the docs: Why not use TinyDB?

If you need advanced features or high performance, TinyDB is the wrong database for you

Your process runs really slowly because you are accumulating the data in RAM. As a workaround, you could split your csv into smaller chunks and feed them to your script one at a time, so that the memory can be cleared between iterations.

TinyDB is simply not able to manage this amount of information.
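For comparison, here is a minimal sketch of the same load into a transactional store using the standard-library sqlite3 module. This is an illustration, not part of the original answer; the database filename is made up, the column names are taken from the question, and it assumes the file really has the 15 columns listed there:

import csv
import sqlite3

conn = sqlite3.connect('monitor.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS Current ('
             'WT_ID, time, MeanCurrent, VapourPressure, MeanVoltage, Temperature, '
             'Humidity, BarPressure, RPM, WindSector, WindSpeed, AirDensity, '
             'VoltageDC, PowerSec, FurlingAngle)')

with open('newData.csv') as datafile:
    data = csv.reader(datafile, delimiter=';')
    # all inserts happen in one transaction (nothing is written until commit),
    # instead of rewriting a JSON file once per row as TinyDB does
    conn.executemany('INSERT INTO Current VALUES (' + ','.join(['?'] * 15) + ')', data)

conn.commit()
conn.close()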

Shijo:

I have never used TinyDB, but you could try the two options below.

  1. Pandas to a database

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Pandas can read csv data in bulk, which is faster than reading the file line by line.

  2. Read the csv file in chunks instead of line by line, and then build one insert statement per chunk instead of an insert statement for each line (a rough sketch combining both ideas is shown below).
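A rough sketch of both ideas combined, using pandas.read_csv with chunksize and DataFrame.to_sql into SQLite. The chunk size, the header handling and the SQLite target are my assumptions, not from the answer:

import pandas as pd
import sqlite3

conn = sqlite3.connect('monitor.sqlite')

# read the ~1M-row csv in chunks of 50,000 rows and bulk-insert each chunk
for chunk in pd.read_csv('newData.csv', sep=';', header=None, chunksize=50000):
    chunk.to_sql('Current', conn, if_exists='append', index=False)

conn.close()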
piterdias:

I had a similar problem this week; the solution was to use a CachingMiddleware:

import csv
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('db.monitor', storage=CachingMiddleware(JSONStorage))
table = db.table('Current')

i = 0

datafile = open('newData.csv', 'rb')
data = csv.reader(datafile, delimiter=';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4], 'Temperature': row[5],
                  'Humidity': row[6], 'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10], 'AirDensity': row[12],
                  'VoltageDC': row[13], 'PowerSec': row[14], 'FurlingAngle': row[15]})
    i = i + 1
    print i
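One detail not shown in the original answer: as far as I understand CachingMiddleware, writes are buffered in memory and only flushed to disk periodically and when the database is closed, so it is worth closing the database after the loop to make sure the last cached rows actually reach db.monitor:

db.close()  # flush any rows still sitting in the write cache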
Stavros Korokithakis:

TinyDB is going to be slow, as it loads everything into RAM. I'm not sure of any good alternatives, though you may be interested in this SQLite-backed document store library I wrote a while back:

https://github.com/skorokithakis/goatfish

You can create indexes on some fields, and it turns everything into SQL queries, so performance is pretty decent.

Sahil Shaikh:

Use Pandas to write the JSON file, which will be a lot quicker.

generated_rules.to_json('./Data/rules.json', orient='index', index='true')

Then edit the generated JSON file and wrap the contents in a table name (or _default), e.g.

Before

{"0": {...}, "1": {...}}

After

{"rules": {"0": {...}, "1": {...}}}

Then just open this JSON file with TinyDB.
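A rough sketch of that idea which skips the manual edit by adding the table key before writing. The file paths, separator and the _default table name here are my assumptions, not from the answer:

import json
import pandas as pd

df = pd.read_csv('newData.csv', sep=';')

# to_json(orient='index') gives {"0": {column: value, ...}, "1": {...}, ...};
# wrap it under the table name so TinyDB can open the file directly
records = json.loads(df.to_json(orient='index'))

with open('rules.json', 'w') as fh:
    json.dump({'_default': records}, fh)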

rbf22:

It is way faster to build the file by hand if you are parsing the data from a csv file. I would suggest something like the following:

import json
counter = 0
with open(csv_file) as fh, open('test.db', 'w') as db:
    keys = None
    db.write('{"_default": {')
    spacer = ''
    for line in fh:
        if not keys:
            # This is the header row; note that this answer assumes
            # tab-delimited data (the file in the question uses ';')
            keys = line.rstrip('\n').replace(" ", "_").split('\t')
        else:
            counter += 1
            # These are the data rows
            values = line.rstrip('\n').split('\t')
            db.write('{}"{}":{}'.format(spacer, counter, json.dumps(dict(zip(keys, values)))))
            spacer = ','
        if counter % 1000 == 0:
            print('counter: {:10d}'.format(counter), end='\r')
    db.write('}}')
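If the generated file matches TinyDB's JSON layout, it should open like any other TinyDB database; a quick sanity check (my addition, not part of the answer):

from tinydb import TinyDB

db = TinyDB('test.db')
print(len(db))  # should equal the number of data rows written above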