What are the approaches to the Big-Data problems?

1.5k views Asked by At

Let us consider the following problem. We have a system containing a huge amount of data (Big-Data). So, in fact we have a data base. As the first requirement we want to be able to write to and to read from the data base quickly. We also want to have a web-interface to the data-bases (so that different clients can write to and read from the data base remotely).

But the system that we want to have should be more than a data base. First, we want to be able to run different data-analysis algorithm on the data to find regularities, correlations, abnormalities and so on (as before we do care a lot about the performance). Second, we want to bind a machine learning machinery to the data-base. Which means that we want to run machine learning algorithms on the data to be able to learn "relations" present on the data and based on that predict the values of entries that are not yet in the data base.

Finally, we want to have a nice clicks based interface that visualize the data. So that the users can see the data in form of nice graphics, graphs and other interactive visualisation objects.

What are the standard and widely recognised approaches to the above described problem. What programming languages have to be used to deal with the described problems?

5

There are 5 answers

0
Drakes On BEST ANSWER

I will approach your question like this: I assume you are firmly interested in big data database use already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.


1) The first requirement we want to be able to write to and to read from the database quickly.

You'll want to explore NoSQL databases which are often used for storing “unstructured” Big Data. Some open-source databases include Hadoop and Cassandra. Regarding the Cassandra,

Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).

References:

2) We also want to have a web interface to the database

See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.

Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.

References:

3) We want to be able to run different data-analysis algorithm on the data

Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.

References:

4) We want to run machine learning algorithms on the data to be able to learn "relations"

Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,

Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February, 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).

Regarding Hadoop,

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.

References:

5) Finally, we want to have a nice clicks-based interface that visualize the data.

Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).

References:

2
AliAvci On

These three languages are most commonly used for machine learning and data mining on the Server side: R, Python, SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.

0
Sixty4Bit On

Big Data is a tough problem primarily because it isn't one single problem. First if your original database is a normal OLTP database that is handling business transactions throughout the day, you will not want to also do your big data analysis on this system since the data-analysis you will want to do will interfere with the normal business traffic.

Problem #1 is what type of database do you want to use for data-analysis? You have many choices ranging from RDBMS, Hadoop, MongoDB, and Spark. If you go with RDBMS then you will want to change the schema to be more compliant with data-analysis. You will want to create a data warehouse with a star schema. Doing this will make many tools available to you because this method of data analysis has been around for a very long time. All of the other "big data" and data analysis databases do not have the same level of tooling available, but they are quickly catching up. Each one of these will require research on which one you will want to use based on your problem set. If you have big batches of data RDBMS and Hadoop will be good. If you have streaming types of data then you will want to look at MongoDB and Spark. If you are a Java shop then RDBMS, Hadoop or Spark. If you are JavaScript MongoDB. If you are good with Scala then Spark.

Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases and you will have to decide when and where you will be moving this data. You can use Python, Java or Ruby to do this work.

Problem #3 is your UI. If you decide to go with RDBMS then you can use many of the available tools available or you can build your own. The other data storage solutions will have tool support but it isn't as mature is that available for the RDBMS. You are most likely going to build your own here anyway because your analysts will want to have the tools built to their specifications. Java works with all of these storage mechanisms but you can probably get Python to work too. You may want to provide a service layer built in Java that provides a RESTful interface and then put a web layer in front of that service layer. If you do this, then your web layer can be built in any language you prefer.

0
Giordano On

Start looking at: what kind of data you want to store in the Database? what kind of relationship between data you got? how this data will be accessed? (for instance you need to access a certain set of data quite often) are they documents? text? something else?

Once you got an answer for all those questions, you can start looking at which NoSQL Database you could use that would give you the best results for your needs.

You can choose between 4 different types: Key-Value, Document, Column family stores, and graph databases. Which one will be the best fit can be determined answering the question above.

There are ready to use stack that may really help to start with your project:

Elasticsearch that would be your Database (it has a REST API that you can use to write them to the DB and to make queries and analysis)

Kibana is a visualization tool, it will allows you to explore and visualize your data, it it quite powerful and will be more than enough for most of your needs

Logstash can centralize the data processing and help you process and save them in elasticsearch, it already support quite few sources of logs and events, and you can also write your own plugin as well.

Some people refers to them as the ELK stack.

I don't believe you should worry about the programming language you have to use at this point, try to select the tools first, sometimes the choices are limited by the tools you want to use and you can still use a mixture of languages and make the effort only if/when it make sense.

0
Guy On

A common way to solve such a requirements is to use Amazon Redshift and the ecosystem around it.

Redshift is a peta-scale data warehouse (it can also start with giga-scale), that exposes Ansi SQL interface. As you can put as much data as you like into the DWH and you can run any type of SQL you wish against this data, this is a good infrastructure to build almost any agile and big data analytics system.

Redshift has many analytics functions, mainly using Window functions. You can calculate averages and medians, but also percentiles, dense rank etc.

You can connect almost every SQL client you want using JDBS/ODBC drivers. It can be from R, R studio, psql, but also from MS-Excel.

AWS added lately a new service for Machine Learning. Amazon ML is integrating nicely with Redshift. You can build predictive models based on data from Redshift, by simply giving an SQL query that is pulling the data needed to train the model, and Amazon ML will build a model that you can use both for batch prediction as well as for real-time predictions. You can check this blog post from AWS big data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R

Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QliView, Looker or YellowFin, especially if you don't have any existing DWH, where you might want to keep on using tools like JasperSoft or Oracle BI. Here is a link to a list of such partners that are providing free trial for their visualization on top of Redshift: http://aws.amazon.com/redshift/partners/

BTW, Redshift also provides a free trial for 2 months that you can quickly test and see if it fits your needs: http://aws.amazon.com/redshift/free-trial/