How to list all databases and tables in AWS Glue Catalog?

21.9k views Asked by At

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.

How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does not work.

What might help is the CatalogConnection Class but I have no idea in which package it is. I tried importing from awsglue.context and no success.

4

There are 4 answers

4
MP23 On BEST ANSWER

I spend several hours trying to find some info about CatalogConnection class but haven't found anything. (Even in the aws-glue-lib repository https://github.com/awslabs/aws-glue-libs)

In my case I needed table names in Glue Job Script console

Finally I used boto library and retrieved database and table names with Glue client:

import boto3


client = boto3.client('glue',region_name='us-east-1')

responseGetDatabases = client.get_databases()

databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:

    databaseName = databaseDict['Name']
    print '\ndatabaseName: ' + databaseName

    responseGetTables = client.get_tables( DatabaseName = databaseName )
    tableList = responseGetTables['TableList']

    for tableDict in tableList:

         tableName = tableDict['Name']
         print '\n-- tableName: '+tableName

Important thing is to setup the region properly

Reference: get_databases - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_databases

get_tables - http://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.get_tables

0
Will On

I'm posting since I ran into a similar issue and had more than 100 databases and 100 tables in a database (so needed to use tokens). It's hacked together and a bit more verbose than needed, but hopefully easy to modify if you need to.

import boto3
import pdb
import logging


logging.basicConfig(filename='databases_tables.log', level=logging.INFO)


def get_glue_databases() -> list:
    """
    Returns a list of databases, with each database as a dict, like:
    {
        'Name': 'analytics_raw_development',
        'CreateTime': datetime.datetime(2022, 4, 22, 13, 19, 49, tzinfo=tzlocal()),
        'CreateTableDefaultPermissions': [{'Principal': {'DataLakePrincipalIdentifier': 'IAM_ALLOWED_PRINCIPALS'}, 'Permissions': ['ALL']}],
        'CatalogId': '999999999'
    }
    """
    glue_client = boto3.client('glue')
    next_token = ""
    databases = []

    while True:
        # Get first batch of databases (max 100)
        response_databases = glue_client.get_databases(NextToken=next_token)
        database_list = response_databases['DatabaseList']
        for db in database_list:
            databases.append(db)

        # due to pagination, if there's more databases, will have a 'NextToken'
        next_token = response_databases.get('NextToken', None)

        # no more databases, can exit
        if next_token is None:
            break
    return databases


def get_glue_tables(databases) -> list:
    glue_client = boto3.client('glue')
    tables = []

    for database in databases:
        logging.info(f"Database: {database}")
        next_token = ""
        while True:
            # Get first batch of tables (max 100)
            response_tables = glue_client.get_tables(DatabaseName=database['Name'], CatalogId=database['CatalogId'], NextToken=next_token)
            tables_list = response_tables['TableList']
            for table in tables_list:
                logging.info(f"Adding table: {table}")
                tables.append(table)

            # due to pagination, if there's more tables, will have a 'NextToken'
            next_token = response_tables.get('NextToken', None)

            # no more tables, can exit
            if next_token is None:
                break
    return tables


if __name__ == '__main__':
    logging.info("Getting Databases")
    databases = get_glue_databases()  # list
    logging.info(f'Databases: {databases}')
    logging.info("Getting All Tables")
    tables = get_glue_tables(databases)
    logging.info(f"Tables: {tables}")
    print("Finished")
5
Bao Pham On

Glue returns back one page per response. If you have more than 100 tables, make sure you use NextToken to retrieve all tables.

def get_glue_tables(database=None):
    next_token = ""

    while True:
        response = glue_client.get_tables(
            DatabaseName=database,
            NextToken=next_token
        )

        for table in response.get('TableList'):
            print(table.get('Name'))

        next_token = response.get('NextToken')

        if next_token is None:
            break
0
Steve Coleman On

The boto3 api also supports pagination, so you could use the following instead:

import boto3

glue = boto3.client('glue')
paginator = glue.get_paginator('get_tables')
page_iterator = paginator.paginate(
    DatabaseName='database_name'   
)

for page in page_iterator:
    print(page['TableList'])

That way you don't have to mess with while loops or the next token.