Why do column-oriented databases such as Vertica/Infobright/Greenplum make a fuss about Hadoop?


What is the point of feeding a Hadoop cluster and then using that cluster to feed data into a Vertica/Infobright data warehouse?

All these vendors keep saying "we can connect with Hadoop", but I don't understand what the point is. What is the interest of storing data in Hadoop and transferring it into Infobright? Why not have the applications store directly in the Infobright/Vertica DW?

Thank you!


There are 9 answers

Kingz

What is the point of feeding a Hadoop cluster and then using that cluster to feed data into a Vertica/Infobright data warehouse?

The point is that you would not want your users to fire up a query and wait minutes, sometimes hours, before you come back with an answer. Hadoop cannot provide a real-time query response, although this is changing with the advent of Cloudera's Impala and Hortonworks' Stinger, which are real-time data processing engines over Hadoop.

Hadoop's underlying data system, HDFS, allows chunking up your data and distributing it over the nodes in your cluster. In fact, HDFS can also be replaced with third-party data storage like S3. The point is: Hadoop provides both storage and processing. So you are welcome to use Hadoop as a storage engine and extract the data into your data warehouse when needed. You can also use Hadoop to create cubes and marts and store those marts in the warehouse.

However, with the advent of Stinger and Impala, the strength of these claims will eventually be eroded, so keep an eye out.

Miguel Ping

Hadoop is more of a platform than a DB.

Think of Hadoop as a neat filesystem that supports lots of queries over different file types. With this in mind, most people dump raw data onto Hadoop and use it as a staging layer in the data pipeline, where it can chew the data and push it to other systems like Vertica. The advantages can be summed up as decoupling.
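A minimal sketch of that staging pattern: pull a job's output off HDFS, then bulk-load it with Vertica's COPY. The HDFS path, table name, and connection details below are made up, and it assumes the open-source vertica-python client; treat it as an illustration, not the one true hand-off.

    import subprocess
    import vertica_python  # assumes the vertica-python client library

    # 1. Pull the processed extract off HDFS with the standard HDFS CLI.
    subprocess.run(
        ["hdfs", "dfs", "-get", "/staging/daily_extract/part-00000", "extract.tsv"],
        check=True,
    )

    # 2. Bulk-load the extract into the warehouse via Vertica's COPY.
    conn_info = {"host": "vertica-host", "port": 5433, "user": "dbadmin",
                 "password": "secret", "database": "dw"}
    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        with open("extract.tsv", "rb") as f:
            cur.copy("COPY web_events FROM STDIN DELIMITER E'\\t'", f)
        conn.commit()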

So Hadoop is turning into the de facto storage platform for big data. It is simple, fault-tolerant, scales well, and is easy to feed data into and get data out of. So most vendors are trying to push a product to companies that probably already have a Hadoop installation.

geoffrobinson

I'm not a Hadoop user (just a Vertica user/DBA), but I would assume the answer would be something along these lines:

-You already have a setup using Hadoop and you want to add a "Big Data" database for intensive analytical work.

-You want to use Hadoop for non-analytical functions and processing and a database for analysis. But it is the same data, so no need for two feeds.

Arnon Rotem-Gal-Oz

There are several reasons you may want to do that:

  1. Cost per TB. Storage costs in Hadoop are much cheaper than in Vertica/Netezza/Greenplum and the like. You can get long-term retention in Hadoop and keep shorter-term data in the analytics DB.
  2. Data ingestion capabilities (performing transformations) are better in Hadoop.
  3. Programmatic analytics (libraries like Mahout), so you can build advanced text analytics.
  4. Dealing with unstructured data.

The MPP DBs provide better performance for ad-hoc queries, deal better with structured data, and offer connectivity to traditional BI tools (OLAP and reporting). So basically, Hadoop complements the offering of these DBs.

Paul Desjardins

Why combine the solutions? Hadoop has some great capabilities (see URL below). These capabilities, though, do not include allowing business users to run quick analytics. Queries that take 30 minutes to hours in Hadoop are delivered in tens of seconds with Infobright.

BTW, your initial question did not presuppose an MPP architecture and for good reason. Infobright customers Liverail, AdSafe Media & InMobi, among others, utilize IEE with Hadoop.

If you register for an Industry White Paper http://support.infobright.com/Support/Resource-Library/Whitepapers/ you will see a view of the current marketplace where four suggested use cases for Hadoop are outlined. It was authored by Wayne Eckerson, Director of Research, Business Applications and Architecture Group, TechTarget, in September 2011.

1) Create an online archive.
With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.

2) Feed the data warehouse.
Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.
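A sketch of that ETL pattern using Hadoop Streaming, which runs ordinary scripts as the map and reduce steps. The log format, field choices, and paths here are illustrative, not from the white paper.

    #!/usr/bin/env python
    # mapper.py -- parse Apache-style access-log lines into (url, bytes) pairs.
    import re
    import sys

    # Common Log Format: host ident user [time] "method url proto" status bytes
    LOG_RE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) \S+" \d{3} (\d+)')

    for line in sys.stdin:
        m = LOG_RE.match(line)
        if m:  # silently skip malformed lines instead of failing the job
            url, size = m.groups()
            print("%s\t%s" % (url, size))

    #!/usr/bin/env python
    # reducer.py -- Streaming sorts mapper output by key (the url), so we can
    # emit a running total whenever the key changes.
    import sys

    current_url, total = None, 0
    for line in sys.stdin:
        url, size = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, total))
            current_url, total = url, 0
        total += int(size)
    if current_url is not None:
        print("%s\t%d" % (current_url, total))

    # Launched with something like:
    #   hadoop jar hadoop-streaming.jar -input /raw/logs -output /etl/url_bytes \
    #     -mapper mapper.py -reducer reducer.py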

3) Support analytics.
The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
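For instance, sessionization (splitting a user's clickstream wherever the gap between hits exceeds 30 minutes) is a classic inter-row calculation that is awkward in plain SQL but natural as a reduce step over one user's time-sorted hits. A self-contained illustration:

    # Sessionization: a new session starts when the gap between a user's
    # consecutive hits exceeds 30 minutes. In MapReduce this is a natural
    # reduce step over one user's time-sorted clickstream.
    SESSION_GAP = 30 * 60  # seconds

    def count_sessions(timestamps):
        """Count sessions in one user's hit timestamps (epoch seconds)."""
        sessions = 0
        last = None
        for t in sorted(timestamps):
            if last is None or t - last > SESSION_GAP:
                sessions += 1  # gap too large: this hit opens a new session
            last = t
        return sessions

    # Three hits close together, then one about an hour later -> 2 sessions
    print(count_sessions([1000, 1200, 1500, 5200]))  # prints 2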

4) Run reports.
Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.

Steve Severance

To expand slightly on Arnon's answer, Hadoop has been recognized as a force that is not going away and is gaining increasing traction in organizations, many times via grassroots efforts from developers. MPP databases are good at answering questions that we know about at design time, such as "How many transactions do we get per hour by country?".

Hadoop started as a platform for a new type of developer that lives somewhere between analysts and developers, one who can write code but also understands data analysis and machine learning. MPP databases (column or not) are very poor at serving this type of developer, who is often analyzing unstructured data, using algorithms that require too much CPU power to run in a database, or working with datasets that are too large. The sheer amount of CPU power required to build some models makes running these algorithms in any sort of traditional sharded DB impossible.

My personal pipeline using Hadoop typically looks like this:

  1. Run a number of very large global queries in Hadoop to get a basic feel for the data and the distribution of variables.
  2. Use Hadoop to build a smaller dataset with just the data I am interested in.
  3. Export the smaller dataset into a relational DB.
  4. Run lots of small queries on the relational DB, build Excel sheets, sometimes do a little R (steps 3 and 4 are sketched below).
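A minimal sketch of steps 3 and 4; SQLite stands in for the relational DB purely to keep the example self-contained, and the file, table, and column names are made up:

    import csv
    import sqlite3

    conn = sqlite3.connect("analysis.db")
    conn.execute("CREATE TABLE IF NOT EXISTS events (country TEXT, hour INT, txns INT)")

    # Step 3: load the small extract produced by the Hadoop job.
    with open("extract.tsv") as f:
        rows = [(r[0], int(r[1]), int(r[2])) for r in csv.reader(f, delimiter="\t")]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()

    # Step 4: fast, iterative ad-hoc queries over the reduced dataset --
    # the kind of query a relational/MPP DB answers in seconds.
    for country, total in conn.execute(
        "SELECT country, SUM(txns) FROM events GROUP BY country ORDER BY 2 DESC LIMIT 10"
    ):
        print(country, total)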

Bear in mind that this workflow only works for the "analyst developer" or "data scientist". Others' mileage will vary.

Coming back to your question: because people like me are abandoning their tools, these companies are looking for ways to remain relevant in an age where Hadoop is synonymous with big data, the coolest startups, and cutting-edge technology (whether this is earned or not, you may discuss amongst yourselves). Also, many Hadoop installations are an order of magnitude or more larger than an organization's MPP deployments, meaning more data is being retained for longer in Hadoop.

Kingz

Unstructured data, by its nature, is not suitable for loading into your traditional data warehouse. Hadoop MapReduce jobs can extract structure out of your log files (for example), and the result can then be ported into your DW for analytics. Hadoop is batch processing, and therefore not suitable for analytic query processing. So you can process your data using Hadoop to bring some structure to it, and then make it query-ready via your visualization/SQL layer.

dmeister

Massively parallel databases like Greenplum DB are excellent for handling massive amounts of structured data. Hadoop is excellent at handling even more massive amounts of unstructured data, e.g. websites.

Nowadays, a ton of interesting analytics combines both of these types of data to gain insight. Therefore it is important for these database systems to be able to integrate with Hadoop.

For example, you could do text processing on the Hadoop cluster using MapReduce until you have some scoring value per product. This scoring value could then be used by the database to combine it with other data that is already stored in the database, or with data that has been loaded into the database from other sources.
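As a hedged illustration of that hand-off (SQLite stands in for the warehouse, and all table names and values are made up), the database side is just a load plus a join:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (product_id INT, name TEXT, revenue REAL)")
    db.execute("CREATE TABLE scores (product_id INT, sentiment REAL)")

    # Data the warehouse already holds.
    db.executemany("INSERT INTO products VALUES (?, ?, ?)",
                   [(1, "widget", 120.0), (2, "gadget", 80.0)])
    # In practice these rows come from the Hadoop text-scoring job's output.
    db.executemany("INSERT INTO scores VALUES (?, ?)",
                   [(1, 0.82), (2, -0.15)])

    # Combine Hadoop-derived scores with data already in the database.
    for row in db.execute("""
        SELECT p.name, p.revenue, s.sentiment
        FROM products p JOIN scores s USING (product_id)
        ORDER BY s.sentiment DESC"""):
        print(row)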

Up_One

What makes the joint deployment so effective for this software?

First, both platforms have a lot in common:

  • Purpose-built from scratch for Big Data transformation and analytics
  • Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
  • Native HA support with low administration overhead

Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.

By using Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.
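A sketch of what that map step might look like as a Hadoop Streaming script: it flattens each semi-structured JSON event into a pipe-delimited tuple that Vertica's COPY can then bulk-load. The JSON field names are hypothetical.

    #!/usr/bin/env python
    # Map step: flatten semi-structured JSON events into pipe-delimited
    # tuples suitable for loading with Vertica's COPY.
    import json
    import sys

    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip records that are not valid JSON
        # Project the nested event into a flat relational tuple.
        print("|".join([
            str(event.get("user_id", "")),
            event.get("page", {}).get("url", ""),
            str(event.get("ts", "")),
        ]))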

What are the key differences that make Hadoop and Vertica complement each other when addressing Big Data?

  • Interface and extensibility

    Hadoop
    Hadoop’s map-reduce programming interface is designed for developers.The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
    Vertica
    Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC etc). This enables both technologists and business analysts to leverage Vertica in their analytic use cases. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.

  • Tool chain/Ecosystem



    Hadoop
    Hadoop and HDFS integrate well with many other open source tools. Their integration with existing BI tools is still emerging.
    Vertica
    Vertica integrates with BI tools because of its standards-compliant interface. Through Vertica's Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.

  • Storage management



    Hadoop
    Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to the end users and cannot be tweaked to optimize for the analytic jobs.
    Vertica
    Vertica’s columnar compression often achieves 10:1 in its compression ratio. A typical Vertica deployment replicates data once for HA, and both data replicas can attain different physical layout in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.

  • Runtime optimization

    Hadoop
    Because the HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.

    Vertica
    The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch oriented data processing.

  • Auto tuning

    Hadoop
    The map-reduce programs use procedural languages (Java, Python, etc.), which give developers fine-grained control of the analytic logic, but also require that developers optimize their jobs carefully in their programs.
    Vertica
    The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language, and are automatically optimized by the Vertica columnar optimizer.