I'm working with a database of 60+ tables and I'm trying to write a python function that takes in the tables you wish to work with, and outputs the join statement(s) you should be using to connect those tables. I've put all 60 tables into an ordered dictionary as shown below, that lists the name of the table, the primary key, and foreign keys within each table.

OrderedDict({
    'table1_name':{'pk':'id', 'fk':{'table2_name':'table2_id', 'table3_name':'table3_id'}}
    'table2_name':{'pk':'id'},
    'table3_name':{'pk':'id', 'fk':{'table1_name':'table1_id'}
}) #Etc...

I've started writing a function with little success as the foreign keys intertwine with each other, making it very complex to traverse from table to table and find the shortest path. My attempt at writing a function to do this looked something like so:

def join_creator(main_table, *tables):    
    #Test if we can join other tables directly to main
     try:
        main_table
        main_pk = table_dict[main_table]["pk"]
     except:
        print('No primary key, this cannot be your main table')
        return

     result = f'FROM "public"."{main_table}" \n'
     for table in other_tables:
        try: 
            fk = table_dict[table]['fk'][main_table]
            result += f'LEFT JOIN "public"."{table}" ON {main_table}.{main_pk}={table}.{fk}\n'
        except KeyError:
            pass
     print(result)

To summarize, the input of this function would be something like join_creator('table_1', 'table_2', 'table_3')

and the output would be a string like so:

FROM table_1
LEFT JOIN table_2 ON table_1.id = table_2.t1_id
LEFT JOIN table_3 ON table_1 = table_3.t1_id

Any suggestions on how to accomplish this at a high level would be much appreciated!

1 Answers

2
Community On Best Solutions

Rephrasing, you have a directed graph whose nodes are tables, and directed edges are the foreign keys. If I'm interpreting your meaning correctly, you want to find the smallest set of edges that contains paths from the start node (main table) toward each of the nodes you give.

One thing to ask is whether your graph is acyclic (meaning there is no cycle of foreign keys all referencing each other; a table's child's child's ... child is never the original table). If you don't have cycles, some simplifications can be made, but for the sake of argument, suppose you don't know whether or not there are cycles.

On the other hand, I will assume that any arbitrary valid join order is really what you want (be careful - that is quite the assumption on your data model!)

One algorithm you could use is a breadth-first search (bfs) to get the shortest path from the main table to each child. This will not be guaranteed to be optimal (it greedily takes the shortest path for each destination node when combining paths somehow could be easier). However, it is certainly a good guess for the optimum, and it copes with the possibility of cycles.

Once you have all of those paths figured out, you have a tree rooted at the main table, and you want to add in the tables to the statement, parents first.

Below is some hasty Python code I wrote that does this. Feel free to refactor or comment for clarity.

from collections import deque


def join_sql(schema, main_table, *tables):
    # for each table, which tables does it connect to?
    children_map = {table:set() for table in schema}
    for child, properties in schema.items():
        parents = properties['fk']
        for parent in parents:
            children_map[parent].add(child)

    # What are all the tables in consideration?
    nodes = set(schema.keys())

    # get a tree of parent tables via breadth-first search.
    parent_tree = bfs(nodes, children_map, main_table)

    # Create a topological ordering on the graph;
    # order so that parent is joined before child.
    join_order = []
    used = {main_table}
    def add_to_join_order(t):
        if t in used or t is None:
            return
        parent = parent_tree.get(t, None)
        add_to_join_order(parent)
        join_order.append(t)
        used.add(t)

    for table in tables:
        add_to_join_order(table)

    lines = [f"FROM {main_table}"]
    for fk_table in join_order:
        parent_table = parent_tree[fk_table]
        parent_col = schema[parent_table]['pk']
        fk_col = schema[fk_table]['fk'][parent_table]
        lines.append(f'INNER JOIN {fk_table} ON {fk_table}.{fk_col} = {parent_table}.{parent_col}')
    return "\n".join(lines)


def bfs(nodes, children, start):
    parent = {}
    q = deque([start])

    while q:
        v = q.popleft()
        for w in children[v]:
            if w not in parent:
                parent[w] = v
                q.append(w)

    return parent



if __name__ == "__main__":
    schema = {'table1_name': {'pk': 'id', 'fk': {'table2_name': 'table2_id', 'table3_name': 'table3_id'}},
              'table2_name': {'pk': 'id', 'fk': {}},
              'table3_name': {'pk': 'id', 'fk': {'table1_name': 'table1_id'}},
              'table4_name': {'pk': 'id', 'fk': {'table3_name': 'table3_id'}},
              'table5_name': {'pk': 'id', 'fk': {'table3_name': 'table3_id'}},
              }
    print(join_sql(schema, 'table2_name', 'table2_name', 'table3_name', 'table4_name', 'table5_name'))


# FROM table2_name
# INNER JOIN table1_name ON table1_name.table2_id = table2_name.id
# INNER JOIN table3_name ON table3_name.table1_id = table1_name.id
# INNER JOIN table4_name ON table4_name.table3_id = table3_name.id
# INNER JOIN table5_name ON table5_name.table3_id = table3_name.id