I have a PHP, Linux server. It has a folder called `notes_docs` which contains over 600,000 txt files. The folder structure of `notes_docs` is as follows -
- notes_docs
  - files_txt
    - 20170831
      - 1_837837472_abc_file.txt
      - 1_579374743_abc2_file.txt
      - 1_291838733_uridjdh.txt
      - 1_482737439_a8weele.txt
      - 1_733839474_dejsde.txt
    - 20170830
    - 20170829
I have to provide a fast text search utility which can show results in the browser. So if my user searches for "new york", all the files which have "new york" in them should be returned in an array. If the user searches for "foo", all files with "foo" in them should be returned.
I already tried the code using `scandir` and `DirectoryIterator`, which is too slow. Searching takes more than a minute, and even then the search was not complete. I tried Ubuntu `find`, which was again slow, taking over a minute to complete, because there are too many folder iterations and the current size of `notes_docs` is over 20 GB.
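For context, this is roughly the kind of brute-force scan I tried (a simplified sketch, not my exact code; the base path and search term are placeholders):

```php
<?php
// Simplified sketch of the brute-force approach: walk every file under
// notes_docs/files_txt and check each one for the search term.
// This is the part that is far too slow for 600k files / 20 GB.
function searchFiles(string $baseDir, string $term): array
{
    $matches = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($baseDir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'txt') {
            // Reads the whole file into memory just to run one substring check.
            if (stripos(file_get_contents($file->getPathname()), $term) !== false) {
                $matches[] = $file->getPathname();
            }
        }
    }

    return $matches;
}

$results = searchFiles('/var/www/notes_docs/files_txt', 'new york');
```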
Any solution I can use to make it faster is welcome. I can make design changes, or have my PHP code curl out to code in another language. I can make infrastructure changes too in extreme cases (such as using something in-memory).
I want to know how people in the industry do this. People at Indeed and ZipRecruiter all provide file search.
Please note I have 2 GB - 4 GB of RAM, so loading all the files into RAM all the time is not acceptable.
EDIT - All the inputs below are great. For those who come later, we ended up using Lucene for indexing and text search. It performed really well.
To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to do a search. Your benchmarks of "over a minute" are probably with a single test account. If you plan to search these via a multi-user website, you can quickly forget about it, because your disk IO will be off the charts and block your entire server.

So your only option is to index all files, just as every other quick search utility does. It doesn't matter whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own: the files will be indexed.
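To make that concrete, here is a minimal sketch of what indexing and searching could look like with Elasticsearch's REST API called from PHP curl (the host `localhost:9200`, the index name `notes`, and the field names are assumptions for illustration; the official PHP client is another option, and the `_doc` endpoint assumes a recent Elasticsearch version):

```php
<?php
// Minimal sketch: push each txt file into an Elasticsearch index once,
// then answer searches from the index instead of scanning the filesystem.
// Host, index name and field names here are illustrative assumptions.
function esRequest(string $method, string $path, ?array $body = null): array
{
    $ch = curl_init('http://localhost:9200' . $path);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    if ($body !== null) {
        curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($body));
    }
    $response = curl_exec($ch);
    curl_close($ch);
    return json_decode($response, true);
}

// Index one file (run once per file, e.g. from a cron job or queue worker).
function indexFile(string $path): void
{
    esRequest('PUT', '/notes/_doc/' . md5($path), [
        'path'    => $path,
        'content' => file_get_contents($path),
    ]);
}

// Search the index and return the matching file paths.
function searchIndex(string $term): array
{
    $result = esRequest('POST', '/notes/_search', [
        'query'   => ['match_phrase' => ['content' => $term]],
        '_source' => ['path'],
        'size'    => 100,
    ]);

    return array_map(
        fn($hit) => $hit['_source']['path'],
        $result['hits']['hits'] ?? []
    );
}

$files = searchIndex('new york');
```

The indexing pass is done once up front (and again for each new file as it arrives), so the cost of an individual search no longer depends on the number of files on disk.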
Considering the `txt` files are text versions of the `pdf` files you receive, I'm betting the easiest solution is to write the text to a database instead of a file. It won't take up much more disk space anyway.

Then you can enable full text search on your database (`mysql`, `mssql` and others support it) and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for other solutions. (A sketch of what this could look like is at the end of this answer.)

Now if you really want to speed things up, you could try to parse the resumes on a more detailed level. Try to retrieve locations, education, spoken languages and other information you regularly search for, and put them in separate tables/columns. This is a very difficult task and almost a project on its own, but if you want a valuable search result, this is the way to go. Searching in text without context gives very different results; just think of your example "new york":
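Coming back to the full text search suggestion above, here is a minimal sketch of what it could look like with MySQL and PDO (the table name `notes`, column names, and connection details are assumptions for illustration):

```php
<?php
// Minimal sketch of full text search in MySQL via PDO.
// Table/column names and credentials are illustrative assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'password');

// One-time setup: store each file's text in a row and add a FULLTEXT index.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS notes (
        id INT AUTO_INCREMENT PRIMARY KEY,
        path VARCHAR(255) NOT NULL,
        content MEDIUMTEXT NOT NULL,
        FULLTEXT KEY ft_content (content)
    ) ENGINE=InnoDB
");

// Import a file (run once per file while building the index).
function importFile(PDO $pdo, string $path): void
{
    $stmt = $pdo->prepare('INSERT INTO notes (path, content) VALUES (?, ?)');
    $stmt->execute([$path, file_get_contents($path)]);
}

// Search: MATCH ... AGAINST uses the FULLTEXT index instead of scanning files.
function searchNotes(PDO $pdo, string $term): array
{
    $stmt = $pdo->prepare(
        'SELECT path FROM notes WHERE MATCH(content) AGAINST (? IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([$term]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

$files = searchNotes($pdo, 'new york');
```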