"TypeError: 'int' object is not iterable" and PCA Assertion Error in Python Clustering Function


I'm working on a Python function (cluster_articles) to perform document clustering and return a dictionary of results. However, I'm encountering the following test errors:

  • TypeError: 'int' object is not iterable (in test_number_of_observations_kmeans10 and possibly test_proper_dict_return)
  • AssertionError: Assertion error at PCA explained value (in test_pca_explained)

import pickle
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, v_measure_score

def cluster_articles(data):
    # K-Means on original data
    kmeans_100 = KMeans(n_clusters=10, random_state=2, tol=0.05, max_iter=50)
    kmeans_100.fit(data['vectors'])
    labels_100 = kmeans_100.labels_

    # PCA Dimensionality Reduction
    pca = PCA(n_components=10, random_state=2)
    reduced_data = pca.fit_transform(data['vectors'])

    # K-Means on reduced data
    kmeans_10 = KMeans(n_clusters=10, random_state=2, tol=0.05, max_iter=50)
    kmeans_10.fit(reduced_data)
    labels_10 = kmeans_10.labels_

    print(type(kmeans_10.n_iter_))  # Debugging output

    # Results Dictionary (Potential issue here)
    result = {
        'nobs_100': kmeans_100.n_iter_,
        'nobs_10': kmeans_10.n_iter_,
        'pca_explained': pca.explained_variance_ratio_[0],
        # ... rest of the results
    }
    return result 

Task and Data Description:

Goal: Cluster documents using K-Means (with and without PCA). Calculate metrics like completeness score, V-measure, and PCA explained variance.

Data Structure (data dictionary):

  • .id: Document IDs
  • .vectors: Doc2Vec vectors (size 100)
  • .groups: True group labels (0 to 9)

Relevant Packages:

  • scikit-learn (0.24.1)
  • NumPy (1.20.1)
  • SciPy (1.6.1)
  • pandas (1.2.3)

Questions:

  • How can I resolve the TypeError: 'int' object is not iterable error? I suspect the issue is in how I'm constructing the results dictionary, but I'm not sure how to fix it.
  • Why is my PCA explained variance failing the assertion? Could this be due to randomness or different data in the tests?
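On the second question, one thing worth checking is that explained_variance_ratio_ holds one entry per retained component, so indexing it with [0] reports only the first component's share. The sketch below assumes the test expects the total variance explained by all 10 components; that is a guess, since the test code is not available. The random matrix is synthetic stand-in data, not the actual Doc2Vec vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for data['vectors'] (assumption): 200 documents, 100 dimensions.
rng = np.random.default_rng(2)
vectors = rng.normal(size=(200, 100))

pca = PCA(n_components=10, random_state=2)
pca.fit(vectors)

# Share of variance captured by the first component only.
first_component = float(pca.explained_variance_ratio_[0])
# Share of variance captured by all 10 retained components together.
all_components = float(pca.explained_variance_ratio_.sum())

print(first_component, all_components)
```

If the two numbers differ noticeably on the real data, the mismatch between the single-component value and the summed value may explain the failed assertion more plausibly than randomness, given that random_state is already fixed.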

What I've Tried: Printing the type of kmeans_10.n_iter_ confirms it's an integer.
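The error can be reproduced in isolation, which suggests the tests loop over (or take the length of) the 'nobs' values. The sketch below assumes that 'nobs' means the number of observations per cluster rather than the iteration count n_iter_; this is inferred only from the test name test_number_of_observations_kmeans10. The labels array is made-up data standing in for kmeans_10.labels_.

```python
import numpy as np

# Made-up cluster labels standing in for kmeans_10.labels_ (assumption).
labels = np.array([0, 2, 1, 2, 2, 0, 1, 2])

# Stand-in for kmeans_10.n_iter_, which is a plain int.
n_iter = 7

# Iterating over an int reproduces the reported error.
try:
    list(n_iter)
except TypeError as exc:
    error_message = str(exc)

# Per-cluster observation counts are iterable; minlength keeps empty clusters as 0.
nobs = np.bincount(labels, minlength=3).tolist()

print(error_message)
print(nobs)
```

In cluster_articles this would correspond to something like np.bincount(labels_10, minlength=10) for 'nobs_10', if the assumption about the key's meaning holds.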

Additional Notes:

  • I don't have access to the test code.
  • There might be a file "subset_documents.p" which could be relevant.

Thank you for your help!
