python sqlparse - WHERE clause eats custom keywords

116 views Asked by At

I'm working with sqlparse and defining my own custom keywords for the dialect I'm using, but I'm noticing that if my keyword 'FACET' appears after my WHERE clause, it gets grouped into the WHERE's TokenList. However, if something else comes between WHERE and FACET, (for example: GROUP BY), then FACET jumps to the top level of the parsed TokenList. Am I doing something wrong? I can't find any information on how to correct this.

import re

import sqlparse
from sqlparse import keywords
from sqlparse.lexer import Lexer
from pprint import pprint

# get the lexer singleton object to configure it
lex = Lexer.get_default_instance()

# Clear the default configurations.
# After this call, reg-exps and keyword dictionaries need to be loaded
# to make the lexer functional again.
lex.clear()

my_regex = (r"FACET\b", sqlparse.tokens.Keyword)

# slice the default SQL_REGEX to inject the custom object


sql_regexes = keywords.SQL_REGEX[:38] + [my_regex] + keywords.SQL_REGEX[38:]

lex.set_SQL_REGEX(sql_regexes)

# pprint([(str(sql_regex[0]), sql_regex[1]) for sql_regex in sql_regexes])

pprint(keywords.KEYWORDS)
# add the default keyword dictionaries
lex.add_keywords(keywords.KEYWORDS_COMMON)
lex.add_keywords(keywords.KEYWORDS_ORACLE)
lex.add_keywords(keywords.KEYWORDS_PLPGSQL)
lex.add_keywords(keywords.KEYWORDS_HQL)
lex.add_keywords(keywords.KEYWORDS_MSACCESS)
lex.add_keywords(keywords.KEYWORDS)

# add a custom keyword dictionary

lex.add_keywords({'FACET': sqlparse.tokens.Keyword})

querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 FACET x.name"

# no configuration is passed here. The lexer is used as a singleton.
parsed = sqlparse.parse(querystring)

pprint(parsed[0].tokens)

# produces:
# 
# [<DML 'SELECT' at 0x104A934C0>,
#  <Whitespace ' ' at 0x104A93820>,
#  <IdentifierList 'x, lat...' at 0x104D1A450>,
#  <Whitespace ' ' at 0x104CEEF80>,
#  <Keyword 'FROM' at 0x104CEEFE0>,
#  <Whitespace ' ' at 0x104CEF040>,
#  <Parenthesis '(SELEC...' at 0x104D19DD0>,
#  <Whitespace ' ' at 0x104CEF640>,
#  <Where 'WHERE ...' at 0x104D19ED0>]

# However, when querystring is
# querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 GROUP BY y FACET x.name"
# 
# produces:
# 
# [<DML 'SELECT' at 0x1002674C0>,
#  <Whitespace ' ' at 0x100267820>,
#  <IdentifierList 'x, lat...' at 0x1004EE3D0>,
#  <Whitespace ' ' at 0x1004C2F80>,
#  <Keyword 'FROM' at 0x1004C2FE0>,
#  <Whitespace ' ' at 0x1004C3040>,
#  <Parenthesis '(SELEC...' at 0x1004EDD50>,
#  <Whitespace ' ' at 0x1004C3640>,
#  <Where 'WHERE ...' at 0x1004EDE50>,
#  <Keyword 'GROUP ...' at 0x1004C39A0>,
#  <Whitespace ' ' at 0x1004C3A00>,
#  <Identifier 'y' at 0x1004EE2D0>,
#  <Whitespace ' ' at 0x1004C3AC0>,
#  <Keyword 'FACET' at 0x1004C3B20>,
#  <Whitespace ' ' at 0x1004C3B80>,
#  <Identifier 'x.name' at 0x1004EDED0>]
1

There are 1 answers

0
relent95 On

An LR or LL parser generator such as Bison, ANTLR, etc. can generate a parser which can parse sources according to a BNF specification, where you can easily add a new keyword and the corresponding syntax using non-terminal symbols. But with Sqlparse, you have to modify the internal implementation.

In your case, you have to add a new pattern for closing the WHERE group like this. (See the current implementation, sql.py and grouping.py for details.)

ttype, values = sqlparse.sql.Where.M_CLOSE
sqlparse.sql.Where.M_CLOSE = ttype, values + ('FACET',)

But it may be broken in the future. So, consider using a parser generator.