I am working on my thesis, and I have the opportunity to set up a working environment to evaluate how the components work together in practice.
The following points should be covered:
- JupyterHub (within a private cloud)
- pandas, NumPy, SQL, nbconvert, nbviewer
- get data into a DataFrame (from CSV), analyze it, and store it (RDD? HDF5? HDFS?) (see the sketch after this list)
- Spark for future analyses
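To make the data-handling step concrete, here is roughly what I have in mind for a single table (a minimal sketch only; the file name, separator, and HDF5 key are placeholders):

```python
import pandas as pd

# Read one of the exported SAP tables from CSV
# (file name and separator are placeholders)
rseg = pd.read_csv("rseg_export.csv", sep=";", dtype=str)

# ... cleaning / analysis with pandas and numpy ...

# Store the cleaned table in HDF5 (requires the PyTables package),
# so the CSV does not have to be re-parsed on every run
rseg.to_hdf("matching_data.h5", key="rseg", mode="a", format="table")

# Read it back later
rseg = pd.read_hdf("matching_data.h5", key="rseg")
```

My open question here is whether local HDF5 files are enough, or whether the Spark part makes HDFS the better store.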
The test scenario will consist of:
- a multi-user environment with notebooks per user/topic (config sketch below)
- analyzing the structured tables RSEG, MSEG, and EKPO (several million rows, about 3 GB of data across the three tables) in a 3-way match with pandas, NumPy, Spark (Spark SQL), and matplotlib (sketch below)
- exporting notebooks with nbconvert/nbviewer to PDF, read-only notebooks, and/or reveal.js slides (sketch below)
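For the multi-user point, I was thinking of something like this minimal jupyterhub_config.py (a sketch only, assuming a recent JupyterHub; the usernames and notebook directory are placeholders):

```python
# jupyterhub_config.py: minimal multi-user sketch
# (usernames and paths are placeholders)
c = get_config()  # provided by JupyterHub at startup

# Where the Hub listens inside the private cloud
c.JupyterHub.bind_url = "http://0.0.0.0:8000"

# Which system users may log in, and who administers the Hub
c.Authenticator.allowed_users = {"alice", "bob"}
c.Authenticator.admin_users = {"admin"}

# Start each user's server in their own notebook directory (one per user/topic)
c.Spawner.notebook_dir = "~/notebooks"
```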
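For the 3-way match itself, this is roughly the Spark SQL direction I am considering (a sketch; the CSV paths are placeholders, and I am assuming the EBELN/EBELP purchase-order keys and the MENGE quantity columns survive the export):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("three-way-match").getOrCreate()

# Load the three table exports (paths are placeholders)
ekpo = spark.read.csv("ekpo.csv", header=True, inferSchema=True)
mseg = spark.read.csv("mseg.csv", header=True, inferSchema=True)
rseg = spark.read.csv("rseg.csv", header=True, inferSchema=True)

ekpo.createOrReplaceTempView("ekpo")
mseg.createOrReplaceTempView("mseg")
rseg.createOrReplaceTempView("rseg")

# Join purchase-order items, goods receipts, and invoice items on the PO keys
# (assumed: EBELN/EBELP identify the PO item in all three tables)
matches = spark.sql("""
    SELECT e.EBELN, e.EBELP,
           e.MENGE AS ordered_qty,
           m.MENGE AS received_qty,
           r.MENGE AS invoiced_qty
    FROM ekpo e
    JOIN mseg m ON m.EBELN = e.EBELN AND m.EBELP = e.EBELP
    JOIN rseg r ON r.EBELN = e.EBELN AND r.EBELP = e.EBELP
""")
matches.show(10)
```

Mismatches could then be filtered in SQL, and smaller result sets pulled into pandas with matches.toPandas() for matplotlib.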
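And for the export step, the nbconvert Python API seems to cover both PDF and reveal.js (a sketch; the notebook name is a placeholder, and PDF export needs a LaTeX installation):

```python
from nbconvert import PDFExporter, SlidesExporter

# Notebook to PDF (nbconvert calls out to LaTeX under the hood)
pdf_body, _ = PDFExporter().from_filename("analysis.ipynb")
with open("analysis.pdf", "wb") as f:
    f.write(pdf_body)

# The same notebook as reveal.js slides
slides_body, _ = SlidesExporter().from_filename("analysis.ipynb")
with open("analysis.slides.html", "w", encoding="utf-8") as f:
    f.write(slides_body)
```

The same thing is available on the command line via jupyter nbconvert --to pdf / --to slides, and nbviewer can serve the read-only versions.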
Can you please give me some hints or share your experiences on how many nodes I should use for testing, and which Linux distribution is a good starting point? I am sure there are many more questions; my main problem is finding information on how to evaluate the possible options.
Thanks in advance!