I am a rising senior CS major at a large state university, and I'm working as an intern in the data science department of a large publicly traded technology company. I've learned in school about data structures and algorithms (Maps, Trees, Graphs, Sorting Algorithms, Searching Algorithms, MapReduce, etc.), and I have some experience with MySQL and SQL queries through personal projects.

My project for this internship is to create a dashboard for displaying analytics gathered from a Hadoop database, and I'm struggling to understand how this data is structured and queried. I'm pretty sure all of the data in Hadoop comes from the production Oracle relational database that runs the company's platform. I guess my core question is: why are Hadoop and distributed processing required to gather analytics from a database that is already in a structured format? What does the data look like stored in Hadoop? Are there tables like in MySQL, or JSON documents like in MongoDB? I will be querying Hadoop through Druid, but I'm not sure what is even in this database.

The engineers I have been working with have been great at explaining things, especially questions about their specific implementation, but they only have so much time to dedicate to helping an intern, and I want to take the initiative to learn some of this on my own.

As a side note, it is incredible how different working on a school project is from working on a project at a company with millions of active users and petabytes of sensitive information.

1 Answer

cricket_007

Hadoop is not a database, so it has no such thing as tables, nor any inherent structure of relations or documents.
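To make that concrete: what actually lives in Hadoop is directories and files in HDFS. A minimal sketch of browsing it from Python with pyarrow's HDFS client (the namenode host, port, and the /data/events path are hypothetical placeholders):

```python
# Browsing HDFS like any other filesystem, via pyarrow's HDFS client.
# The namenode host, port, and path below are hypothetical placeholders.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# There are no tables here, just directories and files, typically
# partitioned by date or some other key.
for info in hdfs.get_file_info(fs.FileSelector("/data/events", recursive=True)):
    print(info.path, info.size)
```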

You can place a schema over stored files of various formats (CSV, JSON, Avro, Parquet, etc.) using tools such as Hive, Presto, or SparkSQL, but these are all tools that read from the Hadoop FileSystem (HDFS); they are not part of Hadoop itself. Tables and databases at that level are only metadata, and are not entirely representative of what the raw data looks like.
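As a rough illustration of that "schema on read" idea, here's a minimal SparkSQL sketch; it assumes a working Spark installation, and the file path and column names are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is applied when the files are read, not when they were
# written; the same raw CSV bytes could be read under a different schema.
events = spark.read.option("header", "true").csv("hdfs:///data/events/")

# Registering a "table" creates only session metadata; the underlying
# data stays as plain files in HDFS.
events.createOrReplaceTempView("events")
spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()
```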

Hadoop is simply capable of storing more data than an Oracle database, and it's free. However, for fast analytics, the recommended pattern is to compute statistics within Hadoop frameworks in a distributed fashion, then load the results back into an indexed system (such as Druid) or any actual database.
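A hedged sketch of that pattern in PySpark (paths and columns are invented; in practice Druid would pick up the output through its own batch ingestion spec rather than this exact handoff):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

events = spark.read.parquet("hdfs:///data/events/")

# The heavy aggregation runs across the cluster in a distributed fashion...
daily = (events
         .groupBy(F.to_date("event_time").alias("day"), "country")
         .agg(F.count("*").alias("events"),
              F.countDistinct("user_id").alias("users")))

# ...and only the compact rollup is written out for an indexed
# system (e.g. Druid) to ingest.
daily.write.mode("overwrite").parquet("hdfs:///rollups/daily/")
```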