AWS Glue: AnalysisException: Table or view not found


I am trying to create a temporary view from a DataFrame in AWS Glue 4.0, but I get the error: AnalysisException: Table or view not found. The tables in the Glue database are stored in Hudi format.

Code -

import sys
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Define the Glue Data Catalog database and table names
database_name = "hudi_db"
table4_name = 'd_person'

table4 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=table4_name,
)

rows = table4.count()
distinct_rows = table4.distinct().count()
print(f"Number of rows in data frame: {rows} and distinct rows are: {distinct_rows}")


table4.createOrReplaceTempView(table4_name + '_glue_view')


custom_sql_query = """
    SELECT count(*)
    FROM d_person_glue_view
"""

# Execute the custom SQL query
result_df = spark.sql(custom_sql_query)

Is any additional configuration required for this? What could cause this error?

Thank you.

I have tried the following:

  1. Passing my own SparkSession to the GlueContext constructor.
  2. Running the SQL on the spark_session object of the GlueContext.
  3. Using Spark SQL directly instead of creating a DataFrame. This works, but I want to load the data into a DataFrame first and then create a view.
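
One way to narrow this down is to check whether the view is visible in the same session that runs the query. The following is only a minimal sketch, assuming Spark 3.3 or later (where DataFrame.sparkSession is available); the helper name count_via_temp_view is hypothetical and is not part of Glue or Spark:

```python
def count_via_temp_view(df, view_name):
    """Register df as a temp view and count its rows through SQL.

    Hypothetical helper (not a Glue or Spark API). It uses the session that
    owns the DataFrame, so the view and the query share one catalog.
    """
    session = df.sparkSession  # property available from Spark 3.3
    df.createOrReplaceTempView(view_name)

    # Temp views show up in listTables() for the session that created them.
    visible = [t.name for t in session.catalog.listTables()]
    if view_name not in visible:
        raise RuntimeError(f"{view_name} not registered; visible: {visible}")

    # Build the query from the same variable used to register the view,
    # so the two names cannot drift apart.
    return session.sql(f"SELECT count(*) AS row_count FROM {view_name}")
```

For the code in the question this could be called as count_via_temp_view(table4, table4_name + '_glue_view').show(). If the RuntimeError is raised, the view was not registered in the session being queried.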

There is 1 answer

Answer by Shubham Joshi
  1. The following reads a Hudi table from its S3 location into a DataFrame and registers a view; it works well for me:
spark.read.format("hudi").load(S3_basePath).createOrReplaceTempView("test")

res = spark.sql("select * from test")
  2. When reading non-streaming data sources stored in the Glue Data Catalog with "getCatalogSource", use DynamicFrames rather than DataFrames, and convert the DynamicFrame into a Spark DataFrame with "toDF()" if needed. This is because the function "getDataFrameFromCatalog()" is designed for AWS Glue streaming sources.

The following solution therefore worked for me:

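from awsglue.dynamicframe import DynamicFrame

# Read the catalog table as a Spark DataFrame
df1 = glueContext.create_data_frame.from_catalog(
    database="hudidb", table_name="huditable"
)

# Wrap the DataFrame in a DynamicFrame
AWSGlueDataCatalog_node1698763846214 = DynamicFrame.fromDF(
    df1,
    glueContext,
    "AWSGlueDataCatalog_node1698763846214",
)

# Convert back to a Spark DataFrame and register the temp view
df2 = AWSGlueDataCatalog_node1698763846214.toDF()

df2.createOrReplaceTempView("hudiview")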
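
Read literally, point 2 suggests loading the table as a DynamicFrame first and only then converting it. A minimal sketch of that path, assuming glue_context is an initialised GlueContext like the one in the question; load_view_from_catalog is a hypothetical helper name, and whether this read path works for a given Hudi table is not something this answer confirms:

```python
def load_view_from_catalog(glue_context, database, table_name, view_name):
    """Read a catalog table as a DynamicFrame, convert it, register a view.

    Hypothetical helper. create_dynamic_frame.from_catalog is assumed here to
    be the Python counterpart of the catalog read described in point 2.
    """
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )
    df = dyf.toDF()  # DynamicFrame -> Spark DataFrame
    df.createOrReplaceTempView(view_name)
    return df
```

It could be used as df2 = load_view_from_catalog(glueContext, "hudidb", "huditable", "hudiview"), followed by spark.sql("select * from hudiview").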

Hope this helps!