I'm working with sqlparse and defining my own custom keywords for the dialect I'm using, but I'm noticing that if my keyword 'FACET' appears after my WHERE clause, it gets grouped into the WHERE's TokenList. However, if something else comes between WHERE and FACET, (for example: GROUP BY), then FACET jumps to the top level of the parsed TokenList. Am I doing something wrong? I can't find any information on how to correct this.
import re
import sqlparse
from sqlparse import keywords
from sqlparse.lexer import Lexer
from pprint import pprint
# get the lexer singleton object to configure it
lex = Lexer.get_default_instance()
# Clear the default configurations.
# After this call, reg-exps and keyword dictionaries need to be loaded
# to make the lexer functional again.
lex.clear()
my_regex = (r"FACET\b", sqlparse.tokens.Keyword)
# slice the default SQL_REGEX to inject the custom object
sql_regexes = keywords.SQL_REGEX[:38] + [my_regex] + keywords.SQL_REGEX[38:]
lex.set_SQL_REGEX(sql_regexes)
# pprint([(str(sql_regex[0]), sql_regex[1]) for sql_regex in sql_regexes])
pprint(keywords.KEYWORDS)
# add the default keyword dictionaries
lex.add_keywords(keywords.KEYWORDS_COMMON)
lex.add_keywords(keywords.KEYWORDS_ORACLE)
lex.add_keywords(keywords.KEYWORDS_PLPGSQL)
lex.add_keywords(keywords.KEYWORDS_HQL)
lex.add_keywords(keywords.KEYWORDS_MSACCESS)
lex.add_keywords(keywords.KEYWORDS)
# add a custom keyword dictionary
lex.add_keywords({'FACET': sqlparse.tokens.Keyword})
querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 FACET x.name"
# no configuration is passed here. The lexer is used as a singleton.
parsed = sqlparse.parse(querystring)
pprint(parsed[0].tokens)
# produces:
#
# [<DML 'SELECT' at 0x104A934C0>,
# <Whitespace ' ' at 0x104A93820>,
# <IdentifierList 'x, lat...' at 0x104D1A450>,
# <Whitespace ' ' at 0x104CEEF80>,
# <Keyword 'FROM' at 0x104CEEFE0>,
# <Whitespace ' ' at 0x104CEF040>,
# <Parenthesis '(SELEC...' at 0x104D19DD0>,
# <Whitespace ' ' at 0x104CEF640>,
# <Where 'WHERE ...' at 0x104D19ED0>]
# However, when querystring is
# querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 GROUP BY y FACET x.name"
#
# produces:
#
# [<DML 'SELECT' at 0x1002674C0>,
# <Whitespace ' ' at 0x100267820>,
# <IdentifierList 'x, lat...' at 0x1004EE3D0>,
# <Whitespace ' ' at 0x1004C2F80>,
# <Keyword 'FROM' at 0x1004C2FE0>,
# <Whitespace ' ' at 0x1004C3040>,
# <Parenthesis '(SELEC...' at 0x1004EDD50>,
# <Whitespace ' ' at 0x1004C3640>,
# <Where 'WHERE ...' at 0x1004EDE50>,
# <Keyword 'GROUP ...' at 0x1004C39A0>,
# <Whitespace ' ' at 0x1004C3A00>,
# <Identifier 'y' at 0x1004EE2D0>,
# <Whitespace ' ' at 0x1004C3AC0>,
# <Keyword 'FACET' at 0x1004C3B20>,
# <Whitespace ' ' at 0x1004C3B80>,
# <Identifier 'x.name' at 0x1004EDED0>]
An LR or LL parser generator such as Bison, ANTLR, etc. can generate a parser which can parse sources according to a BNF specification, where you can easily add a new keyword and the corresponding syntax using non-terminal symbols. But with Sqlparse, you have to modify the internal implementation.
In your case, you have to add a new pattern for closing the
WHERE
group like this. (See the current implementation, sql.py and grouping.py for details.)But it may be broken in the future. So, consider using a parser generator.