Error Initializing FactorizedTopK in TensorFlow Recommenders on SageMaker: "Cannot convert 'counter' to a shape"


I'm working on a recommendation system using TensorFlow and TensorFlow Recommenders (TFRS), and I've run into a perplexing issue when the FactorizedTopK metric is initialized inside my RecommendationModel. Specifically, the error occurs when the model attempts to add a weight named "counter" in the Streaming layer used by tfrs.metrics.FactorizedTopK. I am following this documentation to build my recommendation model: https://www.tensorflow.org/recommenders/examples/deep_recommenders

My development environment is AWS SageMaker.

Here's the relevant section of my model code:

programs = tf_dataset.map(lambda x: {
    "program_id": x["program_id"],
    "name": x["name"],
    "Country": x["Country"],
    "Studylvl": x["Studylvl"],
    "majors": x["majors"],
})


desired_index = 20

desired_data = next(iter(programs.skip(desired_index).take(1)))

print("Program ID:", desired_data["program_id"].numpy().decode())
print("Name:", desired_data["name"].numpy().decode())
print("Country:", desired_data["Country"].numpy().decode())
print("Study Level:", desired_data["Studylvl"].numpy().decode())
print("Majors:", desired_data["majors"].numpy().decode())


Program ID: 157027
Name: m.s.e in robotics
Country: united states of america
Study Level: postgraduate
Majors: automation science and engineering, biorobotics, control and dynamical systems, medical robotics and computer integrated surgical , perception and cognitive systems, general robotics

class ProgramModel(tf.keras.Model):
    def __init__(self):
        super().__init__()

        max_tokens = 10_000
        embedding_dimension = 32

        self.program_id_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_program_id, mask_token=None),
            tf.keras.layers.Embedding(len(unique_program_id) + 1, embedding_dimension),
        ])

        self.name_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_program_name, mask_token=None),
            tf.keras.layers.Embedding(len(unique_program_name) + 1, embedding_dimension),
        ])

        self.name_text_vectorizer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode='int', output_sequence_length=32)

        self.name_text_embedding = tf.keras.Sequential([
            self.name_text_vectorizer,
            tf.keras.layers.Embedding(max_tokens, embedding_dimension, mask_zero=True),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])

        self.name_text_vectorizer.adapt(unique_program_name)

        self.country_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_countries, mask_token=None),
            tf.keras.layers.Embedding(len(unique_countries) + 1, embedding_dimension),
        ])

        self.study_lvl_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_study_lvl, mask_token=None),
            tf.keras.layers.Embedding(len(unique_study_lvl) + 1, embedding_dimension),
        ])


        self.major_text_vectorizer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode='int', output_sequence_length=32)


        self.major_text_embedding = tf.keras.Sequential([
            self.major_text_vectorizer,
            tf.keras.layers.Embedding(max_tokens, embedding_dimension, mask_zero=True),
            tf.keras.layers.GlobalAveragePooling1D()

        ])

        self.major_text_vectorizer.adapt(majors)




    def call(self, inputs):
        return tf.concat([
            self.country_embedding(inputs["Country"]),
            self.study_lvl_embedding(inputs["Studylvl"]),
            self.name_embedding(inputs["name"]),
            self.name_text_embedding(inputs["name"]),
            self.major_text_embedding(inputs["majors"]),
            self.program_id_embedding(inputs["program_id"]),
        ], axis=1)

class CandidateModel(tf.keras.Model):
    def __init__(self, layer_sizes):
        super().__init__()

        self.embedding_model = ProgramModel()

        self.dense_layers = tf.keras.Sequential()

        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))
            self.dense_layers.add(tf.keras.layers.BatchNormalization())

        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size))

    def call(self, inputs):
        feature_embedding = self.embedding_model(inputs)
        return self.dense_layers(feature_embedding)

class RecommendationModel(tfrs.models.Model):
    def __init__(self, layer_sizes):
        super().__init__()
        self.query_model = QueryModel(layer_sizes)
        self.candidate_model = CandidateModel(layer_sizes) 
        self.task = tfrs.tasks.Retrieval(
            metrics= tfrs.metrics.FactorizedTopK(
                candidates=programs.batch(128).map(self.candidate_model)
            )
        )

    def compute_loss(self, features, training=False):
        query_embeddings = self.query_model({
            "Country": features["Country"],
            "Studylvl": features["Studylvl"],
            "name": features["name"],
            "majors": features["majors"],
        })
        candidate_embeddings = self.candidate_model({
            "Country": features["Country"],
            "Studylvl": features["Studylvl"],
            "name": features["name"],
            "majors": features["majors"],
            "program_id": features["program_id"],
        })

        return self.task(query_embeddings, candidate_embeddings)


model = RecommendationModel([128, 64, 32])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
)

model.fit(
    x=train.batch(2000),
    epochs=20,
    verbose=True,
    validation_data=test.batch(500)
)

Upon attempting to initialize the RecommendationModel, I encounter the following ValueError:

ValueError: Cannot convert '('c', 'o', 'u', 'n', 't', 'e', 'r')' to a shape. Found invalid entry 'c' of type '<class 'str'>'.

Here is the full error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[64], line 1
----> 1 model = RecommendationModel([128, 64, 32])
      2 model.compile(
      3     optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
      4 )
      6 # Train the model

Cell In[63], line 7, in RecommendationModel.__init__(self, layer_sizes)
      4 self.query_model = QueryModel(layer_sizes)
      5 self.candidate_model = CandidateModel(layer_sizes) 
      6 self.task = tfrs.tasks.Retrieval(
----> 7     metrics= tfrs.metrics.FactorizedTopK(
      8         candidates=programs.batch(128).map(self.candidate_model)
      9     )
     10 )

File /usr/local/lib/python3.9/site-packages/tensorflow_recommenders/metrics/factorized_top_k.py:79, in FactorizedTopK.__init__(self, candidates, ks, name)
     75 super().__init__(name=name)
     77 if isinstance(candidates, tf.data.Dataset):
     78   candidates = (
---> 79       layers.factorized_top_k.Streaming(k=max(ks))
     80       .index_from_dataset(candidates)
     81   )
     83 self._ks = ks
     84 self._candidates = candidates

File /usr/local/lib/python3.9/site-packages/tensorflow_recommenders/layers/factorized_top_k.py:376, in Streaming.__init__(self, query_model, k, handle_incomplete_batches, num_parallel_calls, sorted_order)
    373 self._num_parallel_calls = num_parallel_calls
    374 self._sorted = sorted_order
--> 376 self._counter = self.add_weight("counter", dtype=tf.int32, trainable=False)

File /usr/local/lib/python3.9/site-packages/keras/src/layers/layer.py:499, in Layer.add_weight(self, shape, initializer, dtype, trainable, regularizer, constraint, name)
    497 initializer = initializers.get(initializer)
    498 with backend.name_scope(self.name, caller=self):
--> 499     variable = backend.Variable(
    500         initializer=initializer,
    501         shape=shape,
    502         dtype=dtype,
    503         trainable=trainable,
    504         name=name,
    505     )
    506 # Will be added to layer.losses
    507 variable.regularizer = regularizers.get(regularizer)

File /usr/local/lib/python3.9/site-packages/keras/src/backend/common/variables.py:74, in KerasVariable.__init__(self, initializer, shape, dtype, trainable, name)
     72 else:
     73     if callable(initializer):
---> 74         shape = self._validate_shape(shape)
     75         value = initializer(shape, dtype=dtype)
     76     else:

File /usr/local/lib/python3.9/site-packages/keras/src/backend/common/variables.py:97, in KerasVariable._validate_shape(self, shape)
     96 def _validate_shape(self, shape):
---> 97     shape = standardize_shape(shape)
     98     if None in shape:
     99         raise ValueError(
    100             "Shapes used to initialize variables must be "
    101             "fully-defined (no `None` dimensions). Received: "
    102             f"shape={shape} for variable path='{self.path}'"
    103         )

File /usr/local/lib/python3.9/site-packages/keras/src/backend/common/variables.py:426, in standardize_shape(shape)
    424     continue
    425 if not is_int_dtype(type(e)):
--> 426     raise ValueError(
    427         f"Cannot convert '{shape}' to a shape. "
    428         f"Found invalid entry '{e}' of type '{type(e)}'. "
    429     )
    430 if e < 0:
    431     raise ValueError(
    432         f"Cannot convert '{shape}' to a shape. "
    433         "Negative dimensions are not allowed."
    434     )

ValueError: Cannot convert '('c', 'o', 'u', 'n', 't', 'e', 'r')' to a shape. Found invalid entry 'c' of type '<class 'str'>'. 

This error suggests an issue with interpreting parameters during weight initialization within TensorFlow or TFRS's internal code, but I'm at a loss for how to resolve it. I've confirmed that my inputs don't contain any NaN values or other obvious issues, and my learning rate seems reasonable.

After debugging for a while, I realized that I encounter this issue exclusively on AWS SageMaker, regardless of whether I use a CPU-only instance (ml.g4dn.xlarge) or instances with GPU support enabled. The issue seems to be specific to the SageMaker environment, as I don't encounter it in other environments such as Google Colab or my local machine.

Has anyone encountered a similar issue or have suggestions on what might be going wrong? I'm using TensorFlow 2.13.0. Any insights or guidance would be greatly appreciated!
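Since the problem appears to depend on the environment, a minimal sketch like the one below could confirm which package versions the running kernel actually has installed. It uses only the standard library; `installed_versions` is a hypothetical helper name, and the distribution names passed to it are assumed to be the PyPI names of the packages involved.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed PyPI distribution names for the packages in this question.
    packages = ["tensorflow", "keras", "tensorflow-recommenders"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the SageMaker kernel and a working environment makes it easy to compare the two side by side.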

There is 1 answer

Harald Goebel

I solved this issue by explicitly installing TensorFlow v2.15.0:

pip install tensorflow==2.15.0

(After getting the same error you report, I checked which versions Google Colab currently uses: tensorflow v2.15.0, tensorflow_datasets v4.9.4, and tensorflow_recommenders v0.7.3. On my local Windows 10 Pro machine, I had previously installed the latest TF version available via PyPI, v2.16.1, and got the same error in a Python v3.11.8 shell. So it looks like the TF package is the cause of the issue. There may be a bug in TF v2.16.1, or an incompatibility with other packages.)
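The traceback in the question hints at one possible mechanism. TFRS calls `self.add_weight("counter", dtype=tf.int32, trainable=False)`, while the `Layer.add_weight` signature shown in the traceback takes `shape` as its first positional parameter, so the string would be read as a shape. The pure-Python sketch below mimics that situation under this assumption. `add_weight_name_first`, `add_weight_shape_first`, and this `standardize_shape` are hypothetical stand-ins with simplified validation, not the real Keras functions.

```python
def standardize_shape(shape):
    """Simplified shape check: every entry must be an int."""
    if shape is None:
        return ()
    shape = tuple(shape)  # a string becomes a tuple of its characters
    for entry in shape:
        if not isinstance(entry, int):
            raise ValueError(
                f"Cannot convert '{shape}' to a shape. "
                f"Found invalid entry '{entry}' of type '{type(entry)}'."
            )
    return shape


def add_weight_name_first(name=None, shape=None, dtype=None, trainable=True):
    """Stand-in for an API whose first positional parameter is the name."""
    return {"name": name, "shape": standardize_shape(shape)}


def add_weight_shape_first(shape=None, dtype=None, trainable=True, name=None):
    """Stand-in for an API whose first positional parameter is the shape."""
    return {"name": name, "shape": standardize_shape(shape)}


# The same positional call behaves differently under the two signatures.
print(add_weight_name_first("counter", dtype="int32", trainable=False))

try:
    add_weight_shape_first("counter", dtype="int32", trainable=False)
except ValueError as err:
    print(err)

# Passing the name by keyword works with either signature.
print(add_weight_shape_first(name="counter", dtype="int32", trainable=False))
```

If this is indeed what happens, it would be consistent with the error disappearing after pinning TensorFlow to v2.15.0, though I have not verified that this is the only difference between the two versions.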