I am a rising senior CS major at a large state university, and I'm interning at a large publicly traded technology company in their data science department. In school I've learned about data structures and algorithms (maps, trees, graphs, sorting algorithms, searching algorithms, MapReduce, etc.), and I have some experience with MySQL and writing SQL queries through personal projects.
My project for this internship is to build a dashboard that displays analytics gathered from a Hadoop cluster, and I'm struggling to understand how that data is structured and queried. As far as I can tell, all of the data in Hadoop comes from the production Oracle relational database that runs the company's platform. My core question is: why are Hadoop and distributed processing needed to gather analytics from data that is already in a structured, relational format? What does the data actually look like once it's stored in Hadoop? Are there tables like in MySQL, or JSON documents like in MongoDB? I'll be querying it through Druid, but I'm not even sure what's in this database to begin with.
The engineers I've been working with have been great at explaining things, especially questions about their specific implementation, but they only have so much time to dedicate to helping an intern, and I want to take the initiative to learn some of this on my own.
As a side note, it's incredible how different working on a school project is from working on a project at a company with millions of active users and petabytes of sensitive information.