How to create an index using Whoosh


I am trying to use Whoosh for text searching for the first time. I want to search for documents containing the word "XML". Because I am new to Whoosh, I started with a small program that searches for a word in a single document, a text file (myRoko.txt).

import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *

if not os.path.exists("indexdir3"):
    os.mkdir("indexdir3")

schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()
path = "myRoko.txt"

with open(path, "r") as f:
    content = f.read()
    writer.add_document(name=path, content=content)

writer.commit()

ix = open_dir("indexdir3")
query_b = QueryParser('content', ix.schema).parse('XML')
with ix.searcher() as srch:
    res_b = srch.search(query_b)
    print res_b[0]

The code above is supposed to print the document that contains the word "XML". However, it raises the following error:

    raise ValueError("%r is not unicode or sequence" % value)

    ValueError: 'A large number of documents are now represented and stored      
    as XML document on the web. Thus ................

What could be the cause of this error?


There are 2 answers

3
Assem On

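You have a Unicode problem. Whoosh expects unicode strings for the fields you index, and under Python 2 a plain open(path, "r") followed by f.read() returns a byte string. Open the text file so that its contents are decoded to unicode: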

import codecs
with codecs.open(path, "r","utf-8") as f:
   content = f.read()

and use a unicode string for the file name:

path = u"myRoko.txt"

After fixes I got this result:

<Hit {'name': u'myRoko.txt'}>
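The reading step can also be sketched with `io.open`, which decodes while reading and behaves the same on Python 2 and Python 3. This is only an illustration: the helper name `read_text` is hypothetical (not part of Whoosh), and UTF-8 is an assumption, since the question does not say how myRoko.txt is encoded.

```python
import io


def read_text(path, encoding="utf-8"):
    """Return the file's contents as unicode text (unicode on Python 2, str on Python 3)."""
    # io.open decodes bytes to text while reading.
    # The UTF-8 default is an assumption; pass the file's real encoding if it differs.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()
```

The returned value can then be passed as the `content` argument of `writer.add_document`, together with a unicode `name`.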
0
AudioBubble On
writer.add_document(name=unicode(path), content=unicode(content))

Both values have to be unicode strings.
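One caveat with this approach: on Python 2, `unicode(content)` without an encoding argument decodes with the default codec, which is normally ASCII, so it raises `UnicodeDecodeError` as soon as the file contains a non-ASCII byte. Decoding with an explicit encoding avoids that. The sketch below shows the difference; the helper `to_text` and the UTF-8 default are assumptions for illustration only.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to unicode text; return text values unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


# Bytes as they would come from a file read without decoding.
raw = u"caf\u00e9 XML".encode("utf-8")

try:
    raw.decode("ascii")  # the codec Python 2's unicode(raw) normally falls back to
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # non-ASCII bytes cannot be decoded as ASCII

text = to_text(raw)  # explicit UTF-8 decoding succeeds
```

With such a helper, the call above would become `writer.add_document(name=to_text(path), content=to_text(content))`.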