Fixing 'ValueError: labels must be unique if ordered=True; pass ordered=False for duplicate labels'


My function 'conversion_rates' takes a dataframe 'df' and a target column 'target', and it is supposed to do the following:

  • creates a folder 'output' (if it doesn't already exist) that stores a conversion-rate PDF graph for EACH column in the dataset against the target (column on the x-axis, target on the y-axis). Each bin is a bar in the graph.

  • the categorical variables are to be binned in the graphs by their categories, with the rule that any category present in less than 10% of the dataset is relabelled 'Other'. This groups all the sub-10% categories for a column together under one 'Other' bin.

  • the numerical variables are to be binned into deciles, but the bins are to be labelled with their range of values, rather than just the decile number.

  • lastly, I want the PDFs saved in descending order of their strongest change in conversion rate: for each column, the highest conversion rate minus the lowest. Since each bar is a bin's conversion rate, that is the tallest bar minus the shortest. The column with the biggest difference would be the first PDF listed in my 'output' folder, so the filenames would need a number prefix to preserve that order.
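To illustrate the decile-labelling idea (the third bullet) on its own, here is a minimal sketch; the Series and label format are made up for the example:

```python
import pandas as pd

# Toy data standing in for one numeric column.
s = pd.Series(range(100))

# qcut gives deciles; retbins=True returns the edges it used, and
# duplicates='drop' removes repeated edges, so the labels are built
# from the edges that actually survive.
_, edges = pd.qcut(s, q=10, retbins=True, duplicates='drop')
labels = [f'{edges[i]:.0f} - {edges[i + 1]:.0f}' for i in range(len(edges) - 1)]

# Reuse the same edges with the range labels; include_lowest keeps
# the minimum value inside the first bin.
binned = pd.cut(s, bins=edges, labels=labels, include_lowest=True)
```

This labels each decile with its value range (e.g. '0 - 10') instead of a decile number.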

Now that I have finished explaining my goals: I am encountering the error 'ValueError: labels must be unique if ordered=True; pass ordered=False for duplicate labels'. I am confused and unsure what it means or how to fix it.

My code:

```python
def conversion_rates(self):
    insights_list = []

    for col in self.df.columns:
        if col == self.target:
            continue

        col_type = self.df[col].dtype

        if col_type == 'object':
            value_counts = self.df[col].value_counts(normalize=True)
            mask = value_counts.cumsum() < 0.1
            small_bins = value_counts.index[mask]
            self.df[col] = self.df[col].apply(lambda x: 'Other' if x in small_bins else x)

        elif col_type in ['int64', 'float64']:
            unique_values = self.df[col].nunique()
            num_bins = min(unique_values, 10)
            _, bin_edges = pd.cut(self.df[col], bins=num_bins, retbins=True, duplicates='drop')
            bin_labels = [f'{int(bin_edges[i])} - {int(bin_edges[i+1])}' for i in range(len(bin_edges)-1)]
            self.df[col] = pd.cut(self.df[col], bins=bin_edges, labels=bin_labels, duplicates='drop')

        if self.df[col].nunique() > 1:
            std_dev = self.df.groupby(col)[self.target].mean().std()

            if pd.notna(std_dev) and std_dev > 0:
                conversion_diff = self.df.groupby(col)[self.target].mean().max() - self.df.groupby(col)[self.target].mean().min()

                plt.figure(figsize=(10, 6))
                sns.barplot(x=col, y=self.target, data=self.df, ci=None)
                plt.title(f'Average Target vs {col}')
                plt.xlabel(col)
                plt.ylabel('Average Target')
                plt.xticks(rotation=45, ha='right')
                plt.tight_layout()

                graph_number = len(insights_list) + 1
                graph_path = f'{self.output_folder_name}/{graph_number}_conversion_{col}.pdf'
                plt.savefig(graph_path)
                plt.close()

                insights_list.append((col, std_dev, conversion_diff, graph_path))

    insights_list.sort(key=lambda x: x[2], reverse=True)

    return insights_list
```

Prior to this I had other errors about having an equal number of labels as edges, when there should be one fewer label than edges.

The error comes from this line:

`self.df[col] = pd.cut(self.df[col], bins=bin_edges, labels=bin_labels, duplicates='drop')`

Obviously I tried adding `ordered=False` to that call, but this alters the logic I am trying to implement.
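As far as I can tell, the duplicate labels come from the `int()` truncation in `bin_labels`: when a column's value range is narrow, several distinct edges truncate to the same integer. A minimal reproduction with made-up data:

```python
import pandas as pd

# Made-up narrow-range column: every bin edge truncates to the integer 0.
s = pd.Series([0.1, 0.2, 0.3, 0.4])
_, edges = pd.cut(s, bins=5, retbins=True, duplicates='drop')

# Same label construction as my code: int() collapses the close edges,
# so all five labels come out as '0 - 0'.
labels = [f'{int(edges[i])} - {int(edges[i + 1])}' for i in range(len(edges) - 1)]

try:
    pd.cut(s, bins=edges, labels=labels)
except ValueError as e:
    print(e)  # the ValueError from the title

# One possible fix: keep enough precision that strictly increasing
# edges stay distinct after formatting.
labels = [f'{edges[i]:.2f} - {edges[i + 1]:.2f}' for i in range(len(edges) - 1)]
binned = pd.cut(s, bins=edges, labels=labels)
```

With distinct labels, the call succeeds without `ordered=False`, so the bin ordering is preserved.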
