Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store a large number of small data objects (millions of rows per month). Once they're saved they won't change. I need to:

  • store them securely
  • use them for analysis (mostly time-oriented)
  • retrieve some raw data occasionally
  • ideally, use them with JasperReports or BIRT

My first shot was Infobright Community, which is just a column-oriented, read-only storage mechanism for MySQL.

On the other hand, people say that a NoSQL approach could be better. Hadoop+Hive looks promising, but the documentation looks poor and the version number is below 1.0.

I have also heard about Hypertable, Pentaho, MongoDB...

Do you have any recommendations?

(Yes, I found some topics here, but they are a year or two old.)

Edit: Other solutions: MonetDB, InfiniDB, LucidDB. What do you think?

There are 3 answers

0
u425204 On

I have the same problem here and did some research. There are three types of storage for BI:

  • Column-oriented. Free and well known: MonetDB, LucidDB, Infobright, InfiniDB
  • Distributed: hTable, Cassandra (also column-oriented, in theory)
  • Document-oriented: MongoDB, CouchDB

The answer depends on what you really need. This Star Schema Benchmark comparison of Infobright, InfiniDB and LucidDB may help:

http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/

  • If the rows are added in real time, column-oriented DBs are a bad fit, because they are generally tuned for bulk loads rather than row-by-row inserts. You can either run two separate DBs (that's my choice: one NoSQL store that the front end feeds in real time and that serves the real-time stats, plus a column-oriented DB for BI), or turn to something that mixes column orientation (for reads) with distribution (for writes), like Cassandra.
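
To make the two-DB idea a bit more concrete, here is a minimal sketch of the hand-off between the write side and the BI side: rows are buffered as they arrive and pushed to the analytics store in bulk. It uses an in-memory SQLite database purely as a stand-in for the column-oriented DB, and the names `BatchLoader`, `make_store` and the `facts` table are made up for illustration. A real setup would more likely use the bulk loader of whichever column store you pick.

```python
import sqlite3


def make_store():
    """Create a stand-in analytics store (SQLite here, only for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (ts TEXT, value REAL)")
    return conn


class BatchLoader:
    """Buffers incoming rows and writes them to the analytics store in bulk."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, ts, value):
        # Cheap append on the write path; nothing hits the analytics store yet.
        self.buffer.append((ts, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One bulk insert per batch, committed as a single transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO facts (ts, value) VALUES (?, ?)", self.buffer
            )
        self.buffer = []
```

Call `flush()` once more at shutdown (or on a timer) so that a partially filled batch is not left behind.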

Document-oriented DBs are not well suited to BI; they are more useful for CRM/CMS workloads where you need frequent access to a particular record.

As for the exact choice within a category, I'm still undecided. Cassandra among the distributed stores, and MonetDB or InfiniDB among the column-oriented DBs, are the leaders. MonetDB is reported to have problems loading very big tables because it keeps its indexes in memory.

2
Mason On

You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.

GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
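
As a rough illustration of the subtable idea, the sketch below generates the DDL for one child table per month using plain PostgreSQL table inheritance with a CHECK constraint on the time column. The parent table `events` and the column `created_at` are hypothetical names, and I have not verified that GridSQL accepts this DDL unchanged, so treat it as a starting point only.

```python
from datetime import date


def month_bounds(year, month):
    """Return the first day of the month and the first day of the next month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def child_table_ddl(parent, column, year, month):
    """Build the CREATE TABLE statement for one monthly child table."""
    start, end = month_bounds(year, month)
    name = "%s_%04d_%02d" % (parent, year, month)
    # The CHECK constraint describes which rows belong in this child table.
    return (
        "CREATE TABLE %s (CHECK (%s >= DATE '%s' AND %s < DATE '%s')) INHERITS (%s);"
        % (name, column, start.isoformat(), column, end.isoformat(), parent)
    )


if __name__ == "__main__":
    for m in range(1, 4):
        print(child_table_ddl("events", "created_at", 2010, m))
```

With constraints like these in place, a query that filters on the time column only needs to touch the child tables whose range matches.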

0
blockcipher On

If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may. There are several other solutions as well; however, you may also want to look at plain old MySQL and the ARCHIVE storage engine. Each record is compressed as it is stored and, IIRC, it's designed for your type of workload, though I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.

As for the NoSQL options (e.g. key-value stores), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any of these stores, so I don't know how well any of them work.

My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.
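
A small sketch of the "keep N months online" rule, assuming the data is partitioned by month: given the list of existing monthly partitions, it works out which ones fall outside the retention window. The function names and the `(year, month)` representation are made up for illustration. What "archive" means in practice (dump and drop, move to slower storage, and so on) depends on the database you end up with.

```python
from datetime import date


def months_back(today, n):
    """Return the first day of the month that lies n months before today's month."""
    index = today.year * 12 + (today.month - 1) - n
    return date(index // 12, index % 12 + 1, 1)


def partitions_to_archive(partition_months, today, keep_months):
    """Pick the (year, month) partitions older than the retention window.

    The window covers the current month plus the keep_months - 1 months before it.
    """
    cutoff = months_back(today, keep_months - 1)
    return sorted(m for m in partition_months if date(m[0], m[1], 1) < cutoff)


if __name__ == "__main__":
    existing = [(2009, 11), (2009, 12), (2010, 1), (2010, 2), (2010, 3)]
    print(partitions_to_archive(existing, date(2010, 3, 15), keep_months=3))
```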