I am creating a content system with Django where users can create and publish content. Each document created by the user can have multiple tags. I am designing a tag implication system where some tags can imply the presence of other tags.
For a simple example, assume "color" and "red" are tags. The following implication exists: "red" implies "color". When a user tags their document with "red", it also has a "color" tag added automatically. Implications are global and apply to every user and document.
What I want to do is make tag implications more powerful, I would like to make it possible to require the presence of multiple tags to imply another tag, like a Definite Horn clause.
For example, I want an implication where the presence of "red" and "blue" implies the tag "multiple_colors". This could be written Horn clause-like: "red", "blue" -> "multiple_colors".
Using Postgres and Django to implement tag implication resolution where only a single tag can imply a tag is simple to implement. You simply have a Django model:
class Tag(Model):
name = CharField()
class TagImplication(Model):
antecedent = ForeignKey(Tag)
consequent = ForeignKey(Tag)
To resolve all implied tags given a set of tags requires an iterative approach. You start with a set of tags, use the model to add the implied tags to the set, and repeat until no new tags are added. It's similar to a breadth-first accumulation of tags in a directed graph of tag implications.
When allowing Horn clause-like tag implications, the models become more complicated:
class Tag(Model):
name = CharField()
class TagImplication(Model):
consequent = ForeignKey(Tag)
class TagImplicationAntecedent(Model):
implication = ForeignKey(TagImplication)
antecedent = ForeignKey(Tag)
For a TagImplication to imply its consequent tag, all child TagImplicationAntecedents' antecedent must be matched by an existing/asserted tag. I believe that an iterative approach is also needed to resolve all implications when provided a set of tags just like the non-Horn clause version.
The Problem
How does one make effective use of Django queries to calculate an iteration of tag implication for Horn clause-like implications? I have a drafted algorithm, but the number of queries is potentially proportional to the number of tags the algorithm starts with and the number of implications defined in the database. I worry about database performance. The ideal solution would use one query per iteration. Maybe there is a better solution that must use raw SQL or skips manual iteration steps in the algorithm.
What I Drafted
This is an algorithm which I have drafted which uses Django queries to perform a single iteration of tag implication resolution with Horn clause-like implications. (In pseudo-python code)
# Let 'tags' be the set of tags that we currently have. At the end of this iteration, 'tags' will also have the directly implied tags.
tags: Set[Tag] = <current tags>
starting_tag_count = len(tags)
# Get candidate TagImplications.
# These TagImplications have at least one Antecedent matched from 'tags'
# this uses reverse relationships to query: https://docs.djangoproject.com/en/4.2/topics/db/queries/#lookups-that-span-relationships
canidate_imps = TagImplication.objects.filter(
Q(tag_implication_antecedent__antecedent=tags[0]
|
...
|
Q(tag_implication_antecedent__antecedent=tags[n]
)
# Get the implications that have satisfied all of their antecedents.
satisfied_imps: List[TagImplication] = []
for imp in canidate_imps:
# this query performance being called repeatedly worries me when there are a large number of implications defined.
# I assume that this query is easy to cache?
# What is worse is that this loop is run for each iteration of the resolution algorithm. Each algorithm iteration can increse the number of tags and canidate implications.
antecedents = TagImplicationAntecedent.objects(
implication=imp
)
satisfied = True
for ant in antecedents:
if ant.antecedent not in tags:
satisfied = False
if satisfied:
satisfied_imps.append(imp)
# add implied tags to current tags
for imp in satisfied_imps:
tags.append(imp.consequent)
if starting_tag_count == len(tags):
# stop algorithm
else:
# do another algorithm iteration. Go to the top.
Therefore I would not try to do this inside the database engine, since while a database is excellent for storing the definition of a graph, it is not so good at running graph theory algorithms.
Instead:
Load the graph specification from the database into any decent graph theory library. Probably you will only need to do this once.
Run your 'breadth-first accumulation of tags in a directed graph' algorithm using the graph theory library methods
Store the results ( the new tag assignments ) to the database
This will have better performance ( since the graph storage in the graph theory library is optimized for running graph theory methods ), require much less and much more understandable code, and minimize the amount of testing and debugging required ( assuming you can trust the graph theory library ).