I've found several resources on how to manage a single data science project with Git, but nothing on how to manage a set of projects.
In 90% of cases I work alone, and over the months a lot of people ask me to check:
- the performance of our marketing operations
- the impact on sales of special periods like Christmas
- clustering of our customers
- simple predictive models (churn, etc.)
Here is my typical workflow for a single project:
- Prepare the data in SQL
- Run descriptive and predictive analyses in R/Python. I often use my own library of code, which I update over time
- Create the output as Markdown documents or PowerPoint presentations
Here is the folder organisation for each project:
- Data
  - base
  - processed
- R scripts
- Python scripts
- Outputs (figures, Markdown, PowerPoint, ...)
I also have two code libraries, one in R and one in Python, that I use across all the projects.
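For illustration, the per-project layout described above could be scaffolded with a small Python sketch like the following. The function name `create_project` and the project name `clustering_2015` are placeholders chosen for this example; only the folder names come from the layout itself.

```python
from pathlib import Path
import tempfile

# Folder names copied from the layout described above.
SUBFOLDERS = [
    "Data/base",
    "Data/processed",
    "R scripts",
    "Python scripts",
    "Outputs",
]


def create_project(root, name):
    """Create the folder skeleton for one analysis project (hypothetical helper)."""
    project = Path(root) / name
    for sub in SUBFOLDERS:
        # parents=True also creates the intermediate "Data" folder.
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(tmp, "clustering_2015")
        for path in sorted(p.relative_to(project).as_posix() for p in project.rglob("*")):
            print(path)
```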
Question: In this case, what is the best strategy?
- A single repository with all the projects, because the libraries are shared among several projects?
If yes, is it OK to have dozens of branches in the same repository, like:
R_library_prod
R_library_dev
Python_library_prod
Python_library_dev
clustering_2015_prod
clustering_2015_dev
christmas_sales_analysis_prod
christmas_sales_analysis_dev
and so on
- A repository for each project? (with potentially only 2 branches: prod and dev)
If yes, how should I manage updates to the R and Python libraries? Should I have a distinct repo for them and update the libraries manually in the analytics project repositories?
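
To make the scale of the single-repository option concrete, here is a minimal Python sketch that enumerates the branches it would need under the naming scheme listed above. The helper `branch_names` and the generated `project_N` names are hypothetical and only serve to count branches.

```python
def branch_names(projects, libraries=("R_library", "Python_library"), stages=("prod", "dev")):
    """List one branch per (library or project, stage) pair (hypothetical helper)."""
    units = list(libraries) + list(projects)
    return [f"{unit}_{stage}" for unit in units for stage in stages]


if __name__ == "__main__":
    # The two projects named in the branch list above.
    for name in branch_names(["clustering_2015", "christmas_sales_analysis"]):
        print(name)

    # With a dozen projects, the two libraries plus the projects give
    # (2 + 12) * 2 = 28 branches in the single repository.
    print(len(branch_names([f"project_{i}" for i in range(12)])))
```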