My company has different SQL database for different departments. They want to have a single place/warehouse from where they can access all the data/databases for analytical purpose. The data is relational and the volume is not high. All the databases does not have a common primary key. Example: marketing department and HR department does not have any common primary key What will be the overall best solution? Also, best storage solution. Please explain the thoughts behind the decision. I am new to data engineering.
Should I use AWS s3 for storage?
You are looking to create a data lake so that you can do cross-database queries. This is not a straightforward process, you will need to spend a lot of time attempting to connect these databases, creating synthetic keys based on what you do know (names, birth-dates, employee #s, email addresses, etc). You can start with something like AWS Data Lake Formation to get you up with the kinds of structures you will need (you will want more than S3), but it won't magically fix your data. You will need to do the work using Amazon glue or other solutions to clean up and normalize your data so that you can intelligently query it across departments.