I'm currently developing a website which allows the users to upload presentations, documents and e-books (something like scribd and slideshare) so I need to be able to search in the file's content. I'm currently extracting the text from the files in a txt file. I am considering 2 options as I am using MySQL:
- Store the plain text in a separate table and use mysql's fulltext index to search through it.
- Use an inverted index to store words and search through them. (2 new tables - words and many-to-many with the documents table). Now in this case what can I do to work with repeating words that give more relevance to the results.
The text will only be used for searching. The problem with (1) is that the text of an e-book may be huge so I consider limiting it to (for example) 50kb or less. (2) also has a problem with lots of words in an e-book which, again, can be limited.
So can you guide me to the best way to index the text and be able to do fast fulltext searches. I need to get the best out of mysql in this case.
I decided to use Sphinx as suggested by Rob Di Marco. Turns out it is the fastest (and opensource) FullText search engine out there. I had some trouble with compiling and getting SphinxSE not to crash mysql so I now use MariaDB which includes the plugin.
I chose version 1.10 because of the RealTime index. It means that there is no need to wait for the indexer thing to rebuild the entire index if you just add a row. ( I know about the main+delta workarounds but this is way easier to configure and use with SphinxQL )
See also Some questions related to SphinxSE and RT indexes