How to use data blocks as source of keys in for_each in terraform?

I have Terraform code that tries to add a set of AD users from certain AD groups to a Databricks workspace. The ad_group_names variable is a simple list of AD group names.

# Make sure the names are correct and groups are present in AD
data "azuread_groups" "this" {
  display_names = var.ad_group_names
}

# Get all of the AD groups details
data "azuread_group" "this" {
  for_each     = toset(data.azuread_groups.this.display_names)
  display_name = each.key
}

# Get all users from all groups
data "azuread_users" "this" {
  object_ids = flatten([for group in data.azuread_group.this : [group.members]])
}

# Add all distinct users to databricks
resource "databricks_user" "this" {
  for_each              = { for user in distinct(data.azuread_users.this.users) : lower(user.mail) => user }
  user_name             = lower(each.value.mail)
  external_id           = each.value.object_id
  display_name          = each.value.display_name
  databricks_sql_access = true
  workspace_access      = true
}

When running the code I am getting the following error:

 Error: Invalid for_each argument
│ 
│   on modules/module-databricks-ad-sync/main.tf line 43, in resource "databricks_user" "this":
│   43:   for_each              = { for user in distinct(data.azuread_users.this.users) : lower(user.mail) => user }
│     ├────────────────
│     │ data.azuread_users.this.users is a list of object, known only after apply
│ 
│ The "for_each" map includes keys derived from resource attributes that
│ cannot be determined until apply, and so Terraform cannot determine the
│ full set of keys that will identify the instances of this resource.
│ 
│ When working with unknown values in for_each, it's better to define the map
│ keys statically in your configuration and place apply-time results only in
│ the map values.
│ 
│ Alternatively, you could use the -target planning option to first apply
│ only the resources that the for_each value depends on, and then apply a
│ second time to fully converge.
╵

Now I understand that, according to the documentation, there are limits on what can serve as the keys of the map given to for_each: the keys must be known values. However, I have an environment where this code has already been deployed and continues to work whenever I run it, and it does add new users from these groups. The plan of the working code was attached as a screenshot (image not preserved; caption: "terraform plan of working code").

What I am failing to understand is why this code works in the other environment and refuses to work in the new one. There are no differences in terms of the service principal or AD groups used. All versions of providers and Terraform are the same (although, most likely, the versions were slightly different from the current ones when the working environment was originally deployed). How do I make this work in the new env?

EDIT:

I've decided to avoid this weird and unclear Terraform behaviour and have created a separate Python script that queries Azure AD and creates a JSON file with all the required groups and users. This file is then supplied to Terraform and it does its magic.
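
A minimal sketch of how such a file could be consumed, assuming the script writes a file named users.json next to the module and that each entry carries mail, object_id and display_name fields. The file name, the JSON layout and the local name ad_users are illustrative assumptions, not details taken from the original setup.

```hcl
# Assumption: users.json is produced by the external script before Terraform runs, e.g.
# [{"mail": "Jane.Doe@example.com", "object_id": "00000000-0000-0000-0000-000000000000", "display_name": "Jane Doe"}]
locals {
  ad_users = jsondecode(file("${path.module}/users.json"))
}

resource "databricks_user" "this" {
  # The file is read while the configuration is evaluated,
  # so every key of this map is known at plan time.
  for_each              = { for user in local.ad_users : lower(user.mail) => user }
  user_name             = each.key
  external_id           = each.value.object_id
  display_name          = each.value.display_name
  databricks_sql_access = true
  workspace_access      = true
}
```

Because the keys come from a local file rather than from a data source, the plan no longer depends on anything that could be deferred to apply.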

VonC On BEST ANSWER

In the databricks_user resource block, I see a for_each argument which is attempting to iterate over a dynamic set of values derived from data.azuread_users.this.users: they are not known until after apply.

43: for_each = { for user in distinct(data.azuread_users.this.users) : lower(user.mail) => user }

In this line, you are trying to create a map for for_each using keys derived from resource attributes (user.mail) that are not yet known to Terraform at the planning phase, hence the error message you received about invalid for_each argument.

The Terraform documentation ("Limitations on values used in for_each") mentions that the map of keys used in for_each must be made of known values.

To address this, you would need to restructure your code such that the keys of the map used in for_each are determined statically from your configuration, or use the -target option as a workaround.

For instance:

# Get all of the AD groups details
data "azuread_group" "this" {
  for_each     = toset(var.ad_group_names)
  display_name = each.key
}

# Rest of your code

# Use a local value to prepare the data for databricks_user resource
locals {
  user_map = { for user in distinct(flatten([for group in data.azuread_group.this : [group.members]])) : lower(user.mail) => user }
}

# Add all distinct users to databricks
resource "databricks_user" "this" {
  for_each              = local.user_map
  user_name             = lower(each.value.mail)
  external_id           = each.value.object_id
  display_name          = each.value.display_name
  databricks_sql_access = true
  workspace_access      = true
}

Here, a locals block is used to create a map of users keyed by their email addresses, which is then used in the for_each argument of the databricks_user resource. That way, the map keys are determined statically from your configuration.

You get the workflow:

+--------------------+     +------------------------+     +-------------+
| azuread_groups     | --> | azuread_group          | --> | local value |
| (list of AD names) |     | (Details of AD groups) |     | (user_map)  |
+--------------------+     +-----------+------------+     +-------------+
                                       |
                                       v
                          +----------------------------+
                          | databricks_user            |
                          | (Sync users to Databricks) |
                          +----------------------------+

It doesn't work, as data.azuread_group.members is a list of IDs and not a list of objects, hence there is a need to use azuread_user to resolve the IDs into mails.
Nonetheless, if I use the user IDs from the members list, the code starts to work, although it doesn't do what I need it to.

Could you explain why the IDs are known and the mails are not? Is it because I am chaining data blocks? The truth is that all of the information is available to Terraform in the planning stage, as long as it is willing to pull it from AD.

The IDs are known, and the emails are not, because of how Terraform's dependency graph works and how it evaluates and fetches data from resources and data sources.

Resource IDs are immediate: When you fetch a list of resources, like AD groups, the identifiers for these resources are typically available immediately. These are static, unchanging identifiers that Terraform can know at plan time because they are how the resource is indexed in the provider's API.

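Attributes depend on Read/Apply: Other resource attributes, like emails in the AD users' case, are not known until the read operation is performed against the provider's API. Terraform normally reads data sources while planning, but it defers the read to the apply phase when the data source's own arguments are not fully known yet, and every attribute of that data source is then reported as "known only after apply". These attributes could also change and are not used by Terraform to uniquely identify the resource during planning.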

When you chain data blocks, you introduce a dependency from one block to the next. Terraform plans out the read operations for data sources in a way that respects these dependencies. If a data source depends on the output of another data source or resource, and that output is not known until the apply phase, Terraform cannot resolve this dependency during the planning phase.

Terraform's planning stage is designed to predict and outline what changes will be made before any real changes are applied. While the information might be available in AD, Terraform operates conservatively to make sure it does not make any changes or external API calls that could alter the state of your resources until it is in the apply stage. That is to avoid any side effects or unexpected changes.

To use the user's mail attribute in your for_each, you would need to first make sure Terraform can resolve these emails during the planning phase. However, because data.azuread_users fetches user details that might not be statically known ahead of time, Terraform errs on the side of caution and requires these details to be known before it can proceed with planning.

That is why, when you switch to using user IDs which are known ahead of time, your code starts to work; Terraform can plan out the resource creation because it is not dependent on any values that need to be fetched at apply time.
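
The distinction between known keys and unknown values can be sketched as follows. This is only an illustration: the labels by_name and group_ids_by_name are hypothetical, and var.ad_group_names is the variable from the question.

```hcl
# Keys are taken from var.ad_group_names, which is part of the
# configuration and therefore known during planning.
data "azuread_group" "by_name" {
  for_each     = toset(var.ad_group_names)
  display_name = each.key
}

locals {
  # Known keys (group names), possibly unknown values (object IDs).
  # A map of this shape is acceptable as a for_each argument, because
  # Terraform only needs the keys to decide which instances exist.
  group_ids_by_name = {
    for name, group in data.azuread_group.by_name : name => group.object_id
  }
}
```

The reverse arrangement, where the keys themselves are computed from values that may be deferred, is what triggers the "Invalid for_each argument" error.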

To work around this, you can use user IDs as the keys for the for_each and then look up the email addresses inside the resource block, where Terraform can handle the unknown values during the apply phase.

You can adjust the Terraform code to use the user IDs as the keys in your for_each map, using azuread_user data sources to resolve the user IDs into their respective mail addresses properly.

# Get all of the AD groups details
data "azuread_group" "this" {
  for_each     = toset(data.azuread_groups.this.display_names)
  display_name = each.key
}

# Get all users from all groups
data "azuread_user" "this" {
  for_each  = toset(flatten([for group in data.azuread_group.this : group.members]))
  object_id = each.key # members holds object IDs; azuread_user looks a user up by object_id
}

# Prepare a map of users with user_id as the key and user object as the value
locals {
  users_map = { for user_id, user in data.azuread_user.this : user_id => user }
}

# Add all distinct users to databricks
resource "databricks_user" "this" {
  for_each              = local.users_map
  user_name             = lower(each.value.mail)
  external_id           = each.key # user_id is used here
  display_name          = each.value.display_name
  databricks_sql_access = true
  workspace_access      = true
}

Each azuread_user is fetched using its ID, which is known during the planning phase.
A local map (users_map) is created to associate each user ID with the corresponding azuread_user data object.
The databricks_user resource uses the users_map for its for_each, ensuring that the iteration is based on known user IDs.
User properties, such as mail and display_name, are resolved at apply time and used to configure each databricks_user.

This should work around the issue of Terraform not knowing the user emails during the planning stage, by using the user IDs which are known.


Any ideas as to why it had originally worked, though?

The original success of the Terraform configuration in a different environment might have been due to:

  • Terraform state: If the code was previously applied and the state file already contained the necessary attributes (like user emails), Terraform would not need to query Azure AD again. It could use the cached data from the state file.
  • Implicit dependencies: Terraform may have been able to implicitly determine the correct order of operations in the original environment. This could have resulted in the necessary data being available at the right time.
  • Provider caching: The Azure AD provider might cache responses or handle data sources differently under certain conditions, which could have allowed the emails to be available during the planning phase.
  • Terraform versions: Changes in Terraform's core behavior between versions, even patch versions, can alter how resources and data sources are handled.
  • API responses: Variations in the Azure AD API's response times or data availability can also affect whether Terraform considers certain data to be known or unknown during the planning phase.
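
One way to compare the two environments is a temporary output that exposes the values in question. This is a rough sketch, not a confirmed diagnosis: the output name debug_user_mails is hypothetical, and it assumes the failing databricks_user resource is commented out for the test so that the plan can complete.

```hcl
# Hypothetical debugging output; remove it once the comparison is done.
# It reuses data.azuread_users.this from the question.
output "debug_user_mails" {
  value = [for user in data.azuread_users.this.users : user.mail]
}
```

If terraform plan prints concrete addresses in one environment and "(known after apply)" in the other, the data source read is being deferred only in the second one, which narrows the search to whatever its arguments depend on there.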