Parsing a large CSV file into TinyDB takes forever


I have a large CSV file containing 15 columns and approximately 1 million rows. I want to parse the data into TinyDB. The code I use is below:

import csv
from tinydb import TinyDB

db = TinyDB('db.monitor')
table = db.table('Current')

i = 0

datafile = open('newData.csv', 'rb')
data = csv.reader(datafile, delimiter=';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4], 'Temperature': row[5],
                  'Humidity': row[6], 'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10], 'AirDensity': row[12],
                  'VoltageDC': row[13], 'PowerSec': row[14], 'FurlingAngle': row[15]})
    i = i + 1
    print i

However, it really takes forever. I set the i variable to track progress, and while the first lines were parsed quickly, it has now been running for more than an hour and has only reached about 10,000 lines, at a pace of roughly 1 Hz.

I couldn't find anything similar, so any help would be appreciated.

Thank you


There are 6 answers

iFlo (best answer):

Is TinyDB the best choice? You seem to need a transactional database, and TinyDB is document oriented. On top of that, from the docs: Why not use TinyDB?

If you need advanced features or high performance, TinyDB is the wrong database for you

Your process runs really slowly because you are accumulating the data in RAM. As a workaround, you could split your csv into smaller chunks and feed them to your script one at a time, so that the memory can be cleared between iterations.

TinyDB is simply not able to manage this amount of information.
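For comparison, here is a minimal sketch of the same load into a transactional store using the standard-library sqlite3 module. This is an illustration, not part of the original answer; the database filename is made up, the column names are taken from the question, and it assumes the file really has the 15 columns listed there:

import csv
import sqlite3

conn = sqlite3.connect('monitor.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS Current ('
             'WT_ID, time, MeanCurrent, VapourPressure, MeanVoltage, Temperature, '
             'Humidity, BarPressure, RPM, WindSector, WindSpeed, AirDensity, '
             'VoltageDC, PowerSec, FurlingAngle)')

with open('newData.csv') as datafile:
    data = csv.reader(datafile, delimiter=';')
    # all inserts happen in one transaction (nothing is written until commit),
    # instead of rewriting a JSON file once per row as TinyDB does
    conn.executemany('INSERT INTO Current VALUES (' + ','.join(['?'] * 15) + ')', data)

conn.commit()
conn.close()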

Shijo:

I have never used TinyDB, but you could try the two options below.

  1. Pandas to a database

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Pandas can read csv data in bulk, which is faster than reading the file line by line.

  2. Read the csv file in chunks instead of line by line, and then build one insert statement per chunk instead of an insert statement for each line (a rough sketch combining both ideas is shown below).
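A rough sketch of both ideas combined, using pandas.read_csv with chunksize and DataFrame.to_sql into SQLite. The chunk size, the header handling and the SQLite target are my assumptions, not from the answer:

import pandas as pd
import sqlite3

conn = sqlite3.connect('monitor.sqlite')

# read the ~1M-row csv in chunks of 50,000 rows and bulk-insert each chunk
for chunk in pd.read_csv('newData.csv', sep=';', header=None, chunksize=50000):
    chunk.to_sql('Current', conn, if_exists='append', index=False)

conn.close()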
piterdias:

I had a similar problem this week; the solution was to use a CachingMiddleware:

import csv
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('db.monitor', storage=CachingMiddleware(JSONStorage))
table = db.table('Current')

i = 0

datafile = open('newData.csv', 'rb')
data = csv.reader(datafile, delimiter=';')

for row in data:
    table.insert({'WT_ID': row[0], 'time': row[1], 'MeanCurrent': row[2],
                  'VapourPressure': row[3], 'MeanVoltage': row[4], 'Temperature': row[5],
                  'Humidity': row[6], 'BarPressure': row[7], 'RPM': row[8],
                  'WindSector': row[9], 'WindSpeed': row[10], 'AirDensity': row[12],
                  'VoltageDC': row[13], 'PowerSec': row[14], 'FurlingAngle': row[15]})
    i = i + 1
    print i
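One detail not shown in the original answer: as far as I understand CachingMiddleware, writes are buffered in memory and only flushed to disk periodically and when the database is closed, so it is worth closing the database after the loop to make sure the last cached rows actually reach db.monitor:

db.close()  # flush any rows still sitting in the write cache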
Stavros Korokithakis:

TinyDB is going to be slow, as it loads everything into RAM. I'm not sure of any good alternatives, though you may be interested in this SQLite-backed document store library I wrote a while back:

https://github.com/skorokithakis/goatfish

You can create indexes on some fields, and it turns everything into SQL queries, so performance is pretty decent.

Sahil Shaikh:

Use Pandas to write the JSON file, which will be a lot quicker.

generated_rules.to_json('./Data/rules.json', orient='index', index='true')

Then edit the generated JSON file and wrap the contents in a table name (or _default), e.g.

Before

{"0": {...}, "1": {...}}

After

{"rules": {"0": {...}, "1": {...}}}

Then just open this JSON file with TinyDB.
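A rough sketch of that idea which skips the manual edit by adding the table key before writing. The file paths, separator and the _default table name here are my assumptions, not from the answer:

import json
import pandas as pd

df = pd.read_csv('newData.csv', sep=';')

# to_json(orient='index') gives {"0": {column: value, ...}, "1": {...}, ...};
# wrap it under the table name so TinyDB can open the file directly
records = json.loads(df.to_json(orient='index'))

with open('rules.json', 'w') as fh:
    json.dump({'_default': records}, fh)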

rbf22:

It is way faster to build the file by hand if you are parsing the data from a csv file. I would suggest something like the following:

import json
counter = 0
with open(csv_file) as fh, open('test.db', 'w') as db:
    keys = None
    db.write('{"_default": {')
    spacer = ''
    for line in fh:
        if not keys:
            # This is the header row; note that this answer assumes
            # tab-delimited data (the file in the question uses ';')
            keys = line.rstrip('\n').replace(" ", "_").split('\t')
        else:
            counter += 1
            # These are the data rows
            values = line.rstrip('\n').split('\t')
            db.write('{}"{}":{}'.format(spacer, counter, json.dumps(dict(zip(keys, values)))))
            spacer = ','
        if counter % 1000 == 0:
            print('counter: {:10d}'.format(counter), end='\r')
    db.write('}}')
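If the generated file matches TinyDB's JSON layout, it should open like any other TinyDB database; a quick sanity check (my addition, not part of the answer):

from tinydb import TinyDB

db = TinyDB('test.db')
print(len(db))  # should equal the number of data rows written above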