Going through the AWS Glue docs, I can't see any mention of how to connect to a Postgres RDS instance from a Glue job of the "Python shell" type. I've set up an RDS connection in AWS Glue and verified I can connect to my RDS instance. Also, when creating the Python job I can see my connection, and I've added it to the script.

How do I use the connection which I've added to the Glue job to run some raw SQL?

Thanks in advance,

1 Answer

Harsh Bafna On

There are two possible ways to access data from RDS in a Glue ETL (Spark) job:

1st Option:

  • Create a Glue connection on top of the RDS instance.
  • Create a Glue crawler on top of the connection created in the first step.
  • Run the crawler to populate the Glue catalogue with a database and tables pointing to the RDS tables.
  • Create a dynamic frame in the Glue ETL job using the newly created database and table in the Glue catalogue.

Code Sample :

from pyspark.context import SparkContext
from awsglue.context import GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")

2nd Option:

Create a DataFrame using Spark's JDBC reader:

url = "jdbc:postgresql://<rds_host_name>/<database_name>"
properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "org.postgresql.Driver"
}
df = spark.read.jdbc(url=url, table="<schema.table>", properties=properties)

Note :

  • You will need to pass the Postgres JDBC jar to the job (e.g. via the --extra-jars job parameter) for creating the DataFrame with Spark's JDBC reader.
  • I have tried the first method on a Glue ETL job and the second method on a Python shell job (dev endpoint).
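Since the question is specifically about running raw SQL from a Python shell job, here is a minimal sketch of that path as well. It assumes a plain Postgres driver such as psycopg2 is available to the job (it is not bundled by default; you would supply it via the job's Python library settings), and the host, database, and credential values are hypothetical placeholders. The helper simply converts a Glue-style JDBC connection URL into keyword arguments for the driver's connect call:

```python
from urllib.parse import urlparse

def jdbc_url_to_kwargs(jdbc_url, user, password):
    """Convert a Glue-style JDBC URL (jdbc:postgresql://host:port/db)
    into connect() kwargs for a Python Postgres driver."""
    # Strip the leading "jdbc:" so urlparse sees a normal postgresql:// URL.
    parsed = urlparse(jdbc_url[len("jdbc:"):])
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,        # default Postgres port
        "dbname": parsed.path.lstrip("/"),  # path component is the database
        "user": user,
        "password": password,
    }

# Hypothetical connection values for illustration:
kwargs = jdbc_url_to_kwargs(
    "jdbc:postgresql://myrds.abc123.us-east-1.rds.amazonaws.com:5432/mydb",
    user="admin",
    password="secret",
)

# With psycopg2 installed in the job, raw SQL would then run like this:
# import psycopg2
# with psycopg2.connect(**kwargs) as conn, conn.cursor() as cur:
#     cur.execute("SELECT version()")
#     print(cur.fetchone())
```

The point of the helper is that the Glue connection stores a JDBC URL, while Python drivers want host/port/dbname separately, so a small translation step bridges the two.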