PostgreSQL: vocabulary overview in json, window functions... recursion?

203 views Asked by At

I have a "dictionary" into PostgreSQL 9.3. I want to get all terms, splitting them by triples (the first three characters) between pages with up to 30 terms per page. So, no any triple should be broken between pages, for example the first page should contain terms "aaa" to "aaf", the second one --- "aag" to "aan", but no any page should contain a part of a "triples collection".

I have this query so far:

WITH results AS (
    WITH terms AS (
        WITH triples AS (
        -- 1. triples with cumulative numbers of appearances:
        SELECT
            LOWER(substring("term" FROM 1 FOR 3)) AS triple,
            ROW_NUMBER() OVER(PARTITION BY LOWER(substring("term" FROM 1 FOR 3))) AS rnum
          FROM terms
          GROUP BY triple, "term"
        )
        -- 2. GROUPs by rnum, removes triple duplicates:
        SELECT
            triples.triple,
            MAX(triples.rnum) AS amount
          FROM triples
          GROUP BY triples.triple
    )
    -- 3. makes { "triple": triple, "amount": amount },
    --    assigns "page number" (~30 per page):
    SELECT
        COALESCE(substring(terms.triple FROM 1 FOR 1), '') AS first,
        ('{ "triple": "' || COALESCE(terms.triple, '') || '", "amount": ' || terms.amount || ' }')::json AS terms,
        (sum((terms.amount)::int) OVER (ORDER BY terms.triple)) / 30 AS chunk
    FROM terms
    GROUP BY first, terms.triple, terms.amount
    ORDER BY first, terms.triple
)
-- 4. collects "page" triples into rows:
SELECT
    first,
    COALESCE(json_agg(results.terms), ('{ "triple" :' || NULL || ', "amount":' || 1 || '}')::json) AS triplesdata,
    sum((results.terms->>'amount')::int) AS sum,
    chunk
FROM results
   GROUP BY first, chunk
   ORDER BY results.first, json_agg(results.terms)->0->>'triple'

To be clear, the SELECT #1 gives me:

 triple | rnum 
--------+------
 аар    |    1
 аба    |    1
 абе    |    1
 абе    |    2
 аби    |    1
 аби    |    2
 абл    |    1
 ...

SELECT #2 gives me all triples and amount of words starting with them:

 triple | amount 
--------+--------
 аар    |      1
 аба    |      1
 абе    |      2
 аби    |      2
 абл    |      1
 або    |      1
 абс    |      1
 ...

SELECT #3 gives me almost the same information, but triples are jsons now and chunk number column added:

 first |              terms               | chunk 
-------+----------------------------------+-------
 а     | { "triple": "аар", "amount": 1 } |     0
 а     | { "triple": "аба", "amount": 1 } |     0
 а     | { "triple": "абе", "amount": 2 } |     0
 а     | { "triple": "аби", "amount": 2 } |     0
 а     | { "triple": "абл", "amount": 1 } |     0
 а     | { "triple": "або", "amount": 1 } |     0
 а     | { "triple": "абс", "amount": 1 } |     0
 ...

And the whole query gives me:

 first |                  triplesdata                  | sum | chunk
-------+-----------------------------------------------+-----+------- 
 а     | [{ "triple": "аар", "amount": 1 } ...(others) |  28 |     0
 a     | [{ "triple": "аве", "amount": 5 } ...(others) |  30 |     1
 ...
 д     | [{ "triple": "доб", "amount": 69 }, ...       |  89 |   138
 ...

I could work with this; however some chunks contain too much data --- some triples should be broken into "quadruples" and deeper into "multi-ples".

I wrote Python script which does this job recursively.

But I am very engaged: is it possible to do this recursive job in PostgreSQL?

And another question --- which index(-es?) will be optimal for terms.term column?

And another question: what I am doing wrong? --- I am somewhat new to sql.

UPDATE: no accepted answer so far, because there is no answer(s) for my questions. And yes, I am using python script now. But I would like to have some answers.

1

There are 1 answers

3
jwg On

My best answer is that you shouldn't try to do this in SQL instead of using Python. SQL isn't good at multiple levels of recursion (you can happily do queries within queries, but not queries nested arbitrarily deeply).

SQL is designed to be 'declarative' - you say what you want, and the database figures out how to give it to you. This is why you can examine the 'plan' of a complex query to understand how the data is retrieved. When you have code using WITH like your above, you have a reached a point where just to express to the SQL interpreter what you want, you have to second-guess what the intermediate steps are. This means you are in a lose-lose situation: the interpreter doesn't have full flexibility to retrieve the data in an optimised way, but you are expressing the steps of your computation in a convoluted language which is best used for expressing what results you want.

On the other hand, Python, or many functional languages, are expressly designed to make tasks like this quick and elegant to express. You will find yourself working with the material instead of against it.

It's an interesting problem. By all means post again if you think that your performance/infrastructure requirements mean that it's difficult to do this task outside of the database server.