Get list of matching keywords for each post

68 views Asked by At

I have two tables keywords and posts in my PieCloudDB Database.

Each topic can be expressed by one or more keywords. If a keyword of a certain topic exists in the content of a post (case insensitive) then the post has this topic.

For example:

topic_id keyword
1 basketball
2 music
3 food
4 war
post_id content
1 A typhoon warning has been issued in southern Japan
2 We are going to play neither basketball nor volleyball
3 I am indulging in both the delightful music and delectable food
4 That basketball player fouled again

Now I want to find the topics of each post according to the following rules:

  • If the post does not have keywords from any topic, its topic should be "Vague!".

  • If the post has at least one keyword of any topic, its topic should be a string of the IDs of its topics sorted in ascending order and separated by commas ','.

For the above example data, the results should be:

post_id topics
1 Vague!
2 1
3 2,3
4 1
SELECT post_id, COALESCE(array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ','), 'Vague!') AS topic
FROM (
    SELECT p.post_id, k.topic_id
    FROM Posts p 
    LEFT JOIN Keywords k 
    ON LOWER(content) LIKE '% ' || keyword || ' %' OR content LIKE keyword || ' %' OR content LIKE '% ' || keyword
  ) a
GROUP BY post_id
ORDER BY post_id

I tried this query but the results I got were not exactly correct. I don't know why the output of post 1 is null:

post_id topics
1
2 1
3 2,3
4 1

Can anyone give me a correct answer?
(If you don’t know the database I use, you can use PostgreSQL instead.)

4

There are 4 answers

0
Erwin Brandstetter On BEST ANSWER

Simple regex without index support

Without index support (only sensible for trivial cardinalities!) this is fastest in Postgres, while doing exactly what you ask for:

SELECT post_id, COALESCE(string_agg(topic_id::text, ',' ), 'Vague!') AS topic
FROM  (
   SELECT p.post_id, k.topic_id
   FROM   post p 
   LEFT   JOIN keyword k ON p.content ~* ('\m' || k.keyword || '\M')
   ORDER  BY p.post_id, k.topic_id
   ) sub
GROUP  BY post_id
ORDER  BY post_id;

~* is the case-insensitive regular-expression match operator. See:

Concerning my regex, I cite the manual:

\m ... matches only at the beginning of a word
\M ... matches only at the end of a word

This covers the start (^) and end ($) of the string implicitly. \W (as suggested in another answer) matches any non-word character, and is wrong for the task.)

Note how I apply ORDER BY once in a subquery instead of per-aggregate, because that's faster. See:

In this constellation, a simple COALESCE catches the case of no matches.

FTS index for strict matching

The simple (naive) approach above scales with O(N*M), i.e. terribly with a non-trivial number of rows in each table. Typically, you want index support.

While strictly matching keywords, the best index should be a Full Text Search index with the 'simple' dictionary, and a query that can actually use that index:

CREATE INDEX post_content_fts_simple_idx ON post USING gin (to_tsvector('simple', content));


SELECT post_id, COALESCE(topics, 'Vague!') AS topics
FROM  (
   SELECT post_id, string_agg(topic_id::text, ',') AS topics
   FROM  (
      SELECT p.post_id, k.topic_id
      FROM   keyword k
      JOIN   post    p ON to_tsvector('simple', p.content) @@ to_tsquery('simple', k.keyword)
      ORDER  BY p.post_id, k.topic_id
      ) sub
   GROUP  BY post_id
   ) sub1
RIGHT  JOIN post p USING (post_id)
ORDER  BY post_id;

FTS index for matching English words

To match natural language words with built-in stemming, use a matching dictionary, 'english' in the example:

CREATE INDEX post_content_fts_en_idx ON post USING gin (to_tsvector('english', content));

SELECT post_id, COALESCE(topics, 'Vague!') AS topics
FROM  (
   SELECT post_id, string_agg(topic_id::text, ',') AS topics
   FROM  (
      SELECT p.post_id, k.topic_id
      FROM   keyword k
      JOIN   post    p ON to_tsvector('english', p.content) @@ to_tsquery('english', k.keyword)
      ORDER  BY p.post_id, k.topic_id
      ) sub
   GROUP  BY post_id
   ) sub1
RIGHT  JOIN post p USING (post_id)
ORDER  BY post_id;

fiddle

For fuzzy matching consider a trigram index. See:

Related:

0
Vérace On

You're not distinguishing between NULL and the empty string ('') - this first query shows you where the problem is, the second shows a solution (all of the code below is available on the fiddle here):

SELECT 
  post_id, 
  
  COALESCE
  (
    array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ','), 
    'Vague!'
  ) AS topic,

  LENGTH(array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ',')),

  LENGTH(NULL) AS "Length of NULL",  -- <<== LENGTH(NULL) is always NULL, and NOT zero 

  PG_TYPEOF(NULL) AS "Type of NULL",
  
  PG_TYPEOF
  (
    array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ',')
  )
FROM 
(
  SELECT p.post_id, k.topic_id
  FROM post p 
  LEFT JOIN keyword k 
    ON LOWER(content) LIKE '% ' || topic || ' %' 
    OR content LIKE topic || ' %' OR content LIKE '% ' || topic
) a
GROUP BY post_id
ORDER BY post_id;

Note the use of the PG_TYPEOF() function (manual) - can be handy for debugging!

Result:

post_id topic length Length of NULL Type of NULL pg_typeof
      1     0                  null      unknown      text
      2     1      1           null      unknown      text
      3   2,3      3           null      unknown      text
      4     1      1           null      unknown      text

So, the correct approach is:

SELECT 
  post_id, 
  CASE
    WHEN
      (array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ',') IS NULL) 
        THEN 'Vague!'
    WHEN
      (array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ',') = '') 
        THEN 'Vague!'      
    ELSE
      array_to_string(array_agg(DISTINCT topic_id ORDER BY topic_id), ',')
  END AS topic
FROM 
(
  SELECT p.post_id, k.topic_id
  FROM post p 
  LEFT JOIN keyword k 
    ON LOWER(content) LIKE '% ' || topic || ' %' 
    OR content LIKE topic || ' %' OR content LIKE '% ' || topic
) a
GROUP BY post_id
ORDER BY post_id;

Result:

post_id    topic
      1   Vague!
      2        1
      3      2,3
      4        1
1
Jonas Metzler On

You can simplify your query and just use STRING_AGG, so you get rid of your subquery. By the way, please take care to always put the table alias before the column name, not only sometimes.

SELECT 
  p.post_id, 
  COALESCE(STRING_AGG(k.topic_id::varchar,',' 
             ORDER BY k.topic_id), 
           'Vague!') AS topic
  FROM Posts p 
  LEFT JOIN Keywords k 
    ON LOWER(p.content) LIKE '% ' || k.keyword || ' %' 
       OR p.content LIKE k.keyword || ' %' 
       OR p.content LIKE '% ' || k.keyword
  GROUP BY p.post_id
  ORDER BY p.post_id;

See this db<>fiddle with your data.

The fiddle also shows you could use CASE rather than COALESCE in your orignal query which would solve your issue too. The issue was COALESCE replaces NULL values, but you got an empty string, not NULL. Anyway, your previous query is far too complicated for your use case.

You might also be able to simplify those LIKE conditions and rather use a Regex approach. Since I don't use Pie DB and am not sure about regex functionality there, I leave this part up to you.

0
Shinichi Kudo On

To match keywords more accurately, you can try to use regex on your pieclouddb database.(The following query was tested on PostgreSQL)

SELECT p.post_id,
    CASE
        WHEN COUNT(k.topic_id) = 0 THEN 'Vague!'
        ELSE STRING_AGG(DISTINCT k.topic_id::TEXT, ',' ORDER BY k.topic_id::TEXT)
    END AS topics
FROM posts p
LEFT JOIN keywords k ON LOWER(p.content) ~* CONCAT('(^|\W)', LOWER(k.keyword), '(\W|$)')
GROUP BY p.post_id
ORDER BY p.post_id