Xapian search terms that exceed the 245-character limit: InvalidArgumentError: Term too long (> 245)


I'm using Xapian and Haystack in my Django app. I have a model with a text field that I want to index for searching. The field stores all sorts of content: words, URLs, HTML, etc.

I'm using the default document-based index template:

text = indexes.CharField(document=True, use_template=True)
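With use_template=True, Haystack renders this field from a plain-text template (by convention at a path like search/indexes/myapp/mymodel_text.txt; the app and model names here are placeholders), typically just:

```
{{ object.body }}
```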

This sometimes yields the following error when someone has pasted a particularly long link:

InvalidArgumentError: Term too long (> 245)

Now I understand the error. I've gotten around it before for other fields in other situations.

My question is, what's the preferred way to handle this exception?

It seems that handling this exception requires me to use a prepare_text() method:

def prepare_text(self, obj):
    content = []      
    for word in obj.body.split(' '):
        if len(word) <= 245:
            content += [word]
    return ' '.join(content)
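The same filter can be written more compactly. Note that Xapian's 245-byte limit applies to the UTF-8 encoding of a term, not its character count, so checking the encoded length is safer. A sketch of that variant (not from the original post):

```python
def prepare_text(self, obj):
    # Keep only words whose UTF-8 encoding fits within Xapian's term limit.
    # The limit is on bytes, so encode before measuring: a 245-character
    # word of non-ASCII text can still be too long.
    return ' '.join(
        word for word in obj.body.split(' ')
        if len(word.encode('utf-8')) <= 245
    )
```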

It just seems clunky and prone to problems. Plus I can't use the search templates.

How have you handled this problem?

1 Answer

Answer from neuro:

I think you've got it right. There's a patch on the Inkscape xapian_backend fork, inspired by the Xapian Omega project.

I've done something similar in my project, with a trick that lets me keep using the search index template:

import re

# Regex to efficiently truncate with re.sub: any whitespace-free run longer
# than 240 bytes is cut down to its first 240 bytes; shorter words are left
# untouched. Xapian's limit applies to the UTF-8 byte length of a term, so
# the pattern operates on encoded bytes.
_max_length = 240
_regex = re.compile((r"(\S{%d})\S+" % _max_length).encode("ascii"))

def prepare_text(self, obj):

    # this is for using the template mechanics: the field has already been
    # rendered into prepared_data by the template backend
    field = self.fields["text"]
    text = self.prepared_data[field.index_fieldname]

    encoding = "utf8"
    encoded = text.encode(encoding)

    prepared = _regex.sub(rb"\1", encoded)

    if len(prepared) != len(encoded):
        # something was truncated; decode back, dropping any multi-byte
        # sequence broken by the cut
        return prepared.decode(encoding, "ignore")

    return text
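To see what the truncation does, here is a standalone check of the same regex idea outside the index class (the 240-byte cap matches the answer; the sample URL is illustrative):

```python
import re

_max_length = 240
# Bytes pattern: keep the first 240 non-space bytes of each over-long token.
_regex = re.compile((r"(\S{%d})\S+" % _max_length).encode("ascii"))

# A short word plus one pathologically long URL-like token (319 bytes).
sample = ("see " + "http://example.com/" + "a" * 300).encode("utf-8")
truncated = _regex.sub(rb"\1", sample)

assert truncated.split()[0] == b"see"       # short words untouched
assert len(truncated.split()[-1]) == 240    # long token cut to 240 bytes
```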