As per our AWS environment , we have 2 different types SAGs( service account Group) for Data storage. One SAG is for generic storage , another SAG for secure data which will only hold PII or restricted data. In our environment, we are planning to deploy Glue . In that case , Would we have one metastore over both secure and non-secure? If we needed two meta stores, how would this work with Databricks? If one metastore, how to handle the secure datas ? Please help us to more details on this in .
AWS glue: Deploy model in aws environment
185 views Asked by Karthikeyan Rasipalay Durairaj AtThere are 2 answers
In AWS Glue, each AWS account has one persistent metadata store per region (called Glue Data catalog). It contains database definitions, table definitions, job definitions, and other control information to manage your AWS Glue environment. You manage permissions to that objects using IAM (e.g., who can make GetTable or GetDatabase API calls to that objects).
In addition to AWS Glue permissions, you would also need to configure permissions to the data itself (e.g., who can make GetObject API call to the data stored on S3).
So, answering your questions. Yes, you would have a single data catalog. However, depending on your security requirements, you would be able to define resource-based and role-based permissions on metadata and content.
You can find a detailed overview here - https://aws.amazon.com/blogs/big-data/restrict-access-to-your-aws-glue-data-catalog-with-resource-level-iam-permissions-and-resource-based-policies
To integrate your metastore with Databricks for (1), you will have to create two Glue Catalog instance profiles with resource level access. One instance profile will have access to generic database and tables while the other will have access to the secure databases and tables.
To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles with access to the respective metastore.
It is recommended to go with the second option as it will save you guys a lot of maintenance cost and human errors on longer run. More details on Glue Catalog and Databricks integration.
Edit: Based on the discussion in comments, if we have to access both datasets inside the same Databricks Runtime, option 2 won't work. Option 1 can be used with 2 permission sets. First only for generic data and second for both generic and secure data.