In Apache Spark, why am I receiving an error that a column reference is ambiguous even though my column names are in different cases?


I'm receiving the following Spark error/exception when grouping/aggregating in Spark:

    pyspark.sql.utils.AnalysisException: Reference 'type' is ambiguous, could be: type, type.

In my DataFrame, the column names are actually Type and type. It's semantically important to include both of these columns in the resulting data, so I can't just rename one of them.

I'm expecting to have both type and Type as columns in the final DataFrame, but I get the exception instead. If I rename one of the columns, say Type to Type2, it works fine, so this is clearly related to case sensitivity.
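
A minimal sketch that reproduces the problem, assuming a hypothetical DataFrame (the column contents and the extra "value" column are illustrative, not from my actual data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("case-sensitivity-repro").getOrCreate()

    # DataFrame with two columns whose names differ only by case.
    df = spark.createDataFrame(
        [("a", "x", 1), ("a", "y", 2)],
        ["Type", "type", "value"],
    )

    # With Spark's defaults, referencing "type" here raises:
    # pyspark.sql.utils.AnalysisException: Reference 'type' is ambiguous, could be: type, type.
    df.groupBy("Type", "type").count().show()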

1 Answer

Answered by beaudet:

This was in fact due to Spark applying case-insensitive column name resolution by default, which is easily disabled with the following configuration option:

    spark.conf.set("spark.sql.caseSensitive", "true")

So, if you're running into ambiguity issues where the only difference between column names is character case, this is quite likely the culprit. In my case, the exception message lower-cased the upper-case column name, which was a hint that Spark lowercases column names before checking whether a column name is unique in the DataFrame.
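
A short sketch of the fix, reusing the hypothetical DataFrame from the question above; set the option before referencing the ambiguous columns:

    # Enable case-sensitive column resolution for this session.
    spark.conf.set("spark.sql.caseSensitive", "true")

    # Now "Type" and "type" resolve to distinct columns, and both
    # are kept in the aggregated result.
    df.groupBy("Type", "type").count().show()

The same setting can also be supplied at session construction via SparkSession.builder.config("spark.sql.caseSensitive", "true") if you want it applied from the start.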