I am working with Solr via Django-Haystack to refine a basic Django queryset by text-searching a large data table. My primary database is PostgreSQL. I have approximately 3 million Record rows and 13 million Info rows, both of which will grow over time, so I'm using Solr to index the description field (a sketch of the index setup follows the models below).
I am running into significant performance issues when I try to work with the Solr results in Django.
Here are the models (abbreviated):
from django.db import models

class Record(models.Model):
    name = models.CharField(max_length=255)  # max_length is illustrative

class Info(models.Model):
    # on_delete assumed; other options omitted for brevity
    record = models.ForeignKey(Record, on_delete=models.CASCADE)
    description = models.CharField(max_length=255)  # max_length is illustrative
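For reference, the Haystack search index is essentially the following (a minimal sketch; everything except the description mapping is an assumption about my setup):

# search_indexes.py
from haystack import indexes
from myapp.models import Info  # app path is an assumption

class InfoIndex(indexes.SearchIndex, indexes.Indexable):
    # The document field is backed directly by Info.description
    text = indexes.CharField(document=True, model_attr='description')

    def get_model(self):
        return Info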
I'm starting with a queryset of all Records. The goal is to filter it down using a text query against descriptions in the Info model.
queryset = Record.objects.all()
When I query Solr through Django-Haystack, I can get a list of the pks of the Info objects that match the query string:
sqset = SearchQuerySet().filter(text=query_string).values_list('pk', flat=True)
sqset can easily contain 500+ items. So when I use it in a standard __in lookup in Django, I end up executing a massive SQL query whose IN clause lists hundreds of ids:
queryset = queryset.filter(info__id__in=sqset)
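Putting the pieces together, the whole path looks like this (a condensed sketch; the function wrapper is purely illustrative):

from haystack.query import SearchQuerySet
from myapp.models import Record  # app path is an assumption

def search_records(query_string):
    # Solr round trip: pks of Info rows whose description matches
    sqset = SearchQuerySet().filter(text=query_string).values_list('pk', flat=True)
    # Relational filter: one IN (...) clause containing every matched pk
    return Record.objects.filter(info__id__in=sqset)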
Django Debug Toolbar tells me the SQL itself is performant (approx 60ms on average), and the Solr portion of the query completes virtually instantly. But the total query can take more than 7000ms when sqset is around 500 items long, which is not uncommon. Something beyond the SQL and Solr round trips is adding a great deal of overhead.
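To be concrete about what "total query" means, the 7000ms figure comes from forcing full evaluation, along these lines (the timing harness is illustrative, not my exact code):

import time

query_string = 'example'  # placeholder
start = time.time()
records = list(search_records(query_string))  # forces queryset evaluation
print(len(records), time.time() - start)  # > 7s when ~500 pks match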
Please let me know how I can make performant use of Solr in this case. A basic Django icontains lookup is actually slightly faster than the approach above (approx 6000ms given the same search criteria), but I do not see that as a viable solution as the database grows.
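For comparison, the icontains baseline I timed is just a direct PostgreSQL text match (the exact lookup path is an assumption):

# No Solr involved: substring match pushed down to PostgreSQL
queryset = Record.objects.filter(info__description__icontains=query_string)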