Xapian search terms that exceed the 245-character limit: InvalidArgumentError: Term too long (> 245)


I'm using Xapian and Haystack in my Django app. I have a model with a text field that I want to index for searching. The field stores all sorts of content: words, URLs, HTML, etc.

I'm using the default document-based index template:

text = indexes.CharField(document=True, use_template=True)
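With use_template=True, Haystack renders this field from a plain-text template (by convention at a path like search/indexes/myapp/mymodel_text.txt; the app and model names here are placeholders), typically just:

```
{{ object.body }}
```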

This sometimes yields the following error when someone has pasted a particularly long link:

InvalidArgumentError: Term too long (> 245)

Now I understand the error. I've gotten around it before for other fields in other situations.

My question is, what's the preferred way to handle this exception?

It seems that handling this exception requires me to use a prepare_text() method:

def prepare_text(self, obj):
    content = []      
    for word in obj.body.split(' '):
        if len(word) <= 245:
            content += [word]
    return ' '.join(content)
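The same filter can be written more compactly. Note that Xapian's 245-byte limit applies to the UTF-8 encoding of a term, not its character count, so checking the encoded length is safer. A sketch of that variant (not from the original post):

```python
def prepare_text(self, obj):
    # Keep only words whose UTF-8 encoding fits within Xapian's term limit.
    # The limit is on bytes, so encode before measuring: a 245-character
    # word of non-ASCII text can still be too long.
    return ' '.join(
        word for word in obj.body.split(' ')
        if len(word.encode('utf-8')) <= 245
    )
```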

It just seems clunky and prone to problems. Plus I can't use the search templates.

How have you handled this problem?

1 Answer

Answer from neuro:

I think you've got it right. There's a patch on the Inkscape xapian_backend fork, inspired by the Xapian Omega project.

I've done something similar in my project, with a trick that lets me keep using the search index template:

import re

# Regex to efficiently truncate with re.sub: any whitespace-free run longer
# than 240 bytes is cut down to its first 240 bytes; shorter words are left
# untouched. Xapian's limit applies to the UTF-8 byte length of a term, so
# the pattern operates on encoded bytes.
_max_length = 240
_regex = re.compile((r"(\S{%d})\S+" % _max_length).encode("ascii"))

def prepare_text(self, obj):

    # this is for using the template mechanics: the field has already been
    # rendered into prepared_data by the template backend
    field = self.fields["text"]
    text = self.prepared_data[field.index_fieldname]

    encoding = "utf8"
    encoded = text.encode(encoding)

    prepared = _regex.sub(rb"\1", encoded)

    if len(prepared) != len(encoded):
        # something was truncated; decode back, dropping any multi-byte
        # sequence broken by the cut
        return prepared.decode(encoding, "ignore")

    return text
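To see what the truncation does, here is a standalone check of the same regex idea outside the index class (the 240-byte cap matches the answer; the sample URL is illustrative):

```python
import re

_max_length = 240
# Bytes pattern: keep the first 240 non-space bytes of each over-long token.
_regex = re.compile((r"(\S{%d})\S+" % _max_length).encode("ascii"))

# A short word plus one pathologically long URL-like token (319 bytes).
sample = ("see " + "http://example.com/" + "a" * 300).encode("utf-8")
truncated = _regex.sub(rb"\1", sample)

assert truncated.split()[0] == b"see"       # short words untouched
assert len(truncated.split()[-1]) == 240    # long token cut to 240 bytes
```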