Fast text search in over 600,000 files


I have a PHP application on a Linux server. It has a folder called notes_docs which contains over 600,000 txt files. The folder structure of notes_docs is as follows:

 - notes_docs
   - files_txt
     - 20170831
       - 1_837837472_abc_file.txt
       - 1_579374743_abc2_file.txt
       - 1_291838733_uridjdh.txt
       - 1_482737439_a8weele.txt
       - 1_733839474_dejsde.txt
     - 20170830
     - 20170829

I have to provide a fast text search utility which can show results in the browser. So if a user searches for "new york", all the files which contain "new york" should be returned in an array. If a user searches for "foo", all files containing "foo" should be returned.

I already tried code using scandir and DirectoryIterator, but it is too slow: a search takes more than a minute and even then does not complete. I also tried Ubuntu's find, which was again slow, taking over a minute, because there are too many folder iterations and notes_docs is currently over 20 GB in size.

Any solution I can use to make it faster is welcome. I can make design changes, or have my PHP code call out (e.g. via curl) to code in another language. In extreme cases I can make infrastructure changes too (such as using something in-memory).

I want to know how people in the industry do this. Companies like Indeed and ZipRecruiter all provide file search.

Please note I have 2 GB - 4 GB of RAM, so keeping all the files loaded in RAM all the time is not acceptable.

EDIT - All the inputs below are great. For those who come later: we ended up using Lucene for indexing and text search, and it performed really well.


There are 4 answers

Hugo Delsing (Best Answer)

To keep it simple: there is no fast way to open, search, and close 600k documents every time you want to run a search. Your "over a minute" benchmarks were probably taken with a single test user; if you plan to expose this search on a multi-user website, you can quickly forget about it, because your disk I/O will go off the charts and block your entire server.

So your only option is to index all the files, just as every other fast search utility does. Whether you use Solr or Elasticsearch as mentioned in the comments, or build something of your own, the files will be indexed.

Considering the txt files are text versions of PDF files you receive, I'm betting the easiest solution is to write the text to a database instead of a file. It won't take up much more disk space anyway.

Then you can enable full text search on your database (MySQL, MSSQL, and others support it), and I'm sure the response times will be a lot better. Keep in mind that creating these indexes does require storage space, but the same goes for the other solutions.
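As a minimal sketch of what that could look like in MySQL (the notes table and connection details are hypothetical), the search boils down to a MATCH ... AGAINST query against a FULLTEXT index:

<?php
// Minimal sketch: a hypothetical `notes` table with a FULLTEXT index, e.g.
//   CREATE TABLE notes (
//       id INT AUTO_INCREMENT PRIMARY KEY,
//       path VARCHAR(255),
//       content MEDIUMTEXT,
//       FULLTEXT KEY ft_content (content)
//   ) ENGINE=InnoDB;
$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'pass');

// MATCH ... AGAINST uses the FULLTEXT index instead of scanning every row.
$stmt = $pdo->prepare(
    'SELECT path FROM notes WHERE MATCH(content) AGAINST(? IN NATURAL LANGUAGE MODE)'
);
$stmt->execute(['new york']);
$files = $stmt->fetchAll(PDO::FETCH_COLUMN); // array of matching file paths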

Now if you really want to speed things up, you could try to parse the resumes at a more detailed level: retrieve locations, education, spoken languages, and other information you regularly search for, and put them in separate tables/columns. This is a very difficult task, almost a project on its own, but if you want valuable search results, this is the way to go, because searching text without context gives very different results. Just think of your example, "new york":

  1. I live in New York
  2. I studied at New York University
  3. I love the song "new york" by Alicia Keys (in a personal bio)
  4. I worked for New York Pizza
  5. I was born in new yorkshire, UK
  6. I spent a summer breeding new yorkshire terriers
Marinos An

I won't go too deep, but I will attempt to provide guidelines for creating a proof-of-concept.

Step 1

First, download and extract Elasticsearch from https://www.elastic.co/downloads/elasticsearch, then run it:

bin/elasticsearch

Step 2

Download fscrawler from https://github.com/dadoonet/fscrawler#download-fscrawler, extract it, and run it:

bin/fscrawler myCustomJob

Then stop it (Ctrl-C) and edit the corresponding myCustomJob/_settings.json (it was created automatically and its path was printed on the console).
You can edit the properties: "url" (the path to be scanned), "update_rate" (you can make it 1m), "includes" (e.g. ["*.pdf","*.doc","*.txt"]), and "index_content" (set it to false to match on the filename only).
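For reference, after those edits the relevant part of myCustomJob/_settings.json might look roughly like this (the path is illustrative, and the exact layout can vary between fscrawler versions):

{
  "name": "myCustomJob",
  "fs": {
    "url": "/path/to/notes_docs/files_txt",
    "update_rate": "1m",
    "includes": ["*.pdf", "*.doc", "*.txt"],
    "index_content": false
  }
}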

Run again:

bin/fscrawler myCustomJob

Note: Indexing is something you might later want to perform from your own code, but for now it will be done automatically by fscrawler, which talks directly to Elasticsearch.

Step 3

Now start adding files to the directory you specified in the "url" property.

Step 4

Download the Advanced REST Client for Chrome and make the following POST request:

URL: http://localhost:9200/_search

Raw payload:

{
  "query": { "wildcard": {"file.filename":"aFileNameToSearchFor*"} }
}

You will get back the list of matched files. Note: fscrawler indexes the filenames under the key file.filename.

Step 5

Now, instead of using Advanced REST Client, you can use PHP to perform this query, either via a REST call to the URL above or by using the php-client API: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_search_operations.html

The same goes for indexing: https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_indexing_documents.html
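For example, here is a minimal sketch of the step-4 query issued from PHP with the curl extension, assuming Elasticsearch is still running on localhost:9200 as set up above:

<?php
// Build the same wildcard query used in step 4.
$payload = json_encode([
    'query' => ['wildcard' => ['file.filename' => 'aFileNameToSearchFor*']]
]);

$ch = curl_init('http://localhost:9200/_search');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// fscrawler stores the filename under _source.file.filename in each hit.
foreach ($response['hits']['hits'] as $hit) {
    echo $hit['_source']['file']['filename'], PHP_EOL;
}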

Hisham Dalal

If you want to save all the file info into a database:

<?php
// Recursively scan $dir: collect file info into $files and build one
// INSERT statement per matching file into $query. Returns the file count.
function deep_scandir( $dir, &$query, &$files) {

    $count = 0;

    if(is_dir($dir)) {
        if ($dh = opendir($dir)) {
            while (($item = readdir($dh)) !== false) {
                if($item != '.' && $item != '..') {
                    if(is_dir($dir.'/'.$item)){
                        // Accumulate the counts returned by subdirectories.
                        $count += deep_scandir($dir.'/'.$item, $query, $files);
                    }else{
                        $count++;
                        // File names look like 1_837837472_abc_file.txt
                        if(preg_match("/(\d_\d+)_(.*)\.txt/i", $item, $matches)){
                            $no = $matches[1];
                            $str = $matches[2];
                            $files[$dir][$no] = $str;
                            // Naive escaping; prepared statements would be safer.
                            $content = addcslashes( htmlspecialchars( file_get_contents($dir.'/'.$item) ), "\\\'\"" );
                            // `key` and `value` are reserved words, so quote them.
                            $query[] =  "INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES\n(NULL, '$no', '$str', '$dir/$item', '$content');";
                        }
                    }
                }
            }
            closedir($dh);
        }
    }
    return $count;
}
    
echo '<pre>';
$dir = 'notes_docs/files_txt';
$query = [];
$files = [];
echo deep_scandir($dir, $query, $files);
echo '<br>';
print_r($files);
echo '<br>';
print_r($query);

Now you can execute every line in the array:

// mysql_* was removed in PHP 7; use mysqli instead (adjust credentials).
$link = mysqli_connect('localhost', 'user', 'pass', 'mydb');
foreach($query as $no=>$line){
    mysqli_query($link, $line) or trigger_error("Couldn't execute query no: '$no' [$line]");
}

Output:

Array
(
    [notes_docs/files_txt/20170831] => Array
        (
            [1_291838733] => uridjdh
            [1_482737439] => a8weele
            [1_579374743] => abc2_file
            [1_733839474] => dejsde
            [1_837837472] => abc_file
        )

)

Array
(
    [0] => INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES
(NULL, '1_291838733', 'uridjdh', 'notes_docs/files_txt/20170831/1_291838733_uridjdh.txt', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus in nisl quis lectus sagittis ullamcorper at faucibus urna. Suspendisse tristique arcu sit amet ligula cursus pretium vitae eu elit. Nullam sed dolor ornare ex lobortis posuere. Quisque venenatis laoreet diam, in imperdiet arcu fermentum eu. Aenean molestie ligula id sem ultricies aliquet non a velit. Proin suscipit interdum vulputate. Nullam finibus gravida est, et fermentum est cursus eu. Integer sed metus ac urna molestie finibus. Aenean hendrerit ante quis diam ultrices pellentesque. Duis luctus turpis id ipsum dictum accumsan. Curabitur ornare nisi ligula, non pretium nulla venenatis sed. Aenean pharetra odio nec mi aliquam molestie. Fusce a condimentum nisl. Quisque mattis, nulla suscipit condimentum finibus, leo ex eleifend felis, vel efficitur eros turpis nec sem. ');
    [1] => INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES
(NULL, '1_482737439', 'a8weele', 'notes_docs/files_txt/20170831/1_482737439_a8weele.txt', 'Nunc et odio sed odio rhoncus venenatis congue non nulla. Aliquam dictum, felis ac aliquam luctus, purus mi dignissim magna, vitae pharetra risus elit ac mi. Sed sodales dui semper commodo iaculis. Nunc vitae neque ut arcu gravida commodo. Fusce feugiat velit et felis pharetra posuere sit amet sit amet neque. Phasellus iaculis turpis odio, non consequat nunc consectetur a. Praesent ornare nisi non accumsan bibendum. Nunc vel ultricies enim, consectetur fermentum nisl. Sed eu augue ac massa efficitur ullamcorper. Ut hendrerit nisi arcu, a sagittis velit viverra ac. Quisque cursus nunc ac tincidunt sollicitudin. Cras eu rhoncus ante, ac varius velit. Mauris nibh lorem, viverra in porttitor at, interdum vel elit. Aliquam imperdiet lacus eu mi tincidunt volutpat. Vestibulum ut dolor risus. ');
    [2] => INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES
(NULL, '1_579374743', 'abc2_file', 'notes_docs/files_txt/20170831/1_579374743_abc2_file.txt', 'Vivamus aliquet id elit vitae blandit. Proin laoreet ipsum sed tincidunt commodo. Fusce faucibus quam quam, in ornare ex fermentum et. Suspendisse dignissim, tortor at fringilla tempus, nibh lacus pretium metus, vel tempus dolor tellus ac orci. Vestibulum in congue dolor, nec porta elit. Donec pellentesque, neque sed commodo blandit, augue sapien dapibus arcu, sit amet hendrerit felis libero id ante. Praesent vitae elit at eros faucibus volutpat. Integer rutrum augue laoreet ex porta, ut faucibus elit accumsan. Donec in neque sagittis, auctor diam ac, viverra diam. Phasellus vel quam dolor. Nullam nisi tellus, faucibus a finibus et, blandit ac nisl. Vestibulum interdum malesuada sem, nec semper mi placerat quis. Nullam non bibendum sem, vitae elementum metus. Donec non ipsum quis turpis semper lobortis.');
    [3] => INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES
(NULL, '1_733839474', 'dejsde', 'notes_docs/files_txt/20170831/1_733839474_dejsde.txt', 'Nunc faucibus, enim non luctus rutrum, lorem urna finibus turpis, sit amet dictum turpis ipsum pharetra ex. Donec at leo vitae massa consectetur viverra eget vel diam. Sed in neque tempor, vulputate quam sed, ullamcorper nisl. Fusce mollis libero in metus tincidunt interdum. Cras tempus porttitor nunc nec dapibus. Vestibulum condimentum, nisl eget venenatis tincidunt, nunc sem placerat dui, quis luctus nisl erat sed orci. Maecenas maximus finibus magna in facilisis. Maecenas maximus turpis eget dignissim fermentum. ');
    [4] => INSERT INTO `mytable` (id, `key`, `value`, path, content)
                            VALUES
(NULL, '1_837837472', 'abc_file', 'notes_docs/files_txt/20170831/1_837837472_abc_file.txt', 'Integer non ex condimentum, aliquet lectus id, accumsan nibh. Quisque aliquet, ante vitae convallis ullamcorper, velit diam tempus diam, et accumsan metus eros at tellus. Sed lacinia mauris sem, scelerisque efficitur mauris aliquam a. Nullam non auctor leo. In mattis mauris eu blandit varius. Phasellus interdum mi nec enim imperdiet tristique. In nec porttitor erat, tempor malesuada ante. Etiam scelerisque ligula et ex maximus, placerat consequat nunc consectetur. Phasellus suscipit ligula sed elit hendrerit laoreet. Suspendisse ex sem, placerat pharetra ligula eu, accumsan rhoncus ex. Sed luctus nisi vitae metus maximus scelerisque. Suspendisse porta nibh eget placerat tempus. Nunc efficitur gravida sagittis. ');
)
tru7

I'd first try creating a simple database. The main table would have the file name and the text contents. This way you can leverage the querying features of the DB engine, and it will probably be a substantial leap in performance over the current solution.

This would be the simplest solution, and it could grow by adding complementary tables relating words with file names (which could be populated dynamically, for example for the most frequent searches).
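As a rough sketch of that idea (the table and column names words and word_files are hypothetical), the complementary tables could map each word to the files containing it, so a search becomes an indexed join instead of a scan:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=notes_db;charset=utf8mb4', 'user', 'pass');

// Hypothetical word -> file mapping tables.
$pdo->exec('CREATE TABLE IF NOT EXISTS words (
    id INT AUTO_INCREMENT PRIMARY KEY,
    word VARCHAR(100) UNIQUE
)');
$pdo->exec('CREATE TABLE IF NOT EXISTS word_files (
    word_id INT NOT NULL,
    file_path VARCHAR(255) NOT NULL,
    PRIMARY KEY (word_id, file_path)
)');

// Finding the files that contain a word is then a simple indexed lookup.
$stmt = $pdo->prepare(
    'SELECT wf.file_path
       FROM words w
       JOIN word_files wf ON wf.word_id = w.id
      WHERE w.word = ?'
);
$stmt->execute(['foo']);
$files = $stmt->fetchAll(PDO::FETCH_COLUMN);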