How to speed up this loop reading PDF files in .NET 6

I have a method SearchPdf(string path, string keyword), where path is the folder that contains all the PDF files to search and keyword is the keyword to look for in each PDF's text or file name. I'm using Spire.Pdf to read the PDFs.

Here is the method:

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string keyword)
{
    var results = new ConcurrentBag<KeyValuePair<string, string>>();

    var directory = new DirectoryInfo(path);
    var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);

    Parallel.ForEach(files, file =>
    {
        // Open the PDF file
        var document = new PdfDocument(file.FullName);
        Console.WriteLine("\n\rSearching for: " + keyword + " in file: " + file.Name + "\n\r");

        // Iterate over the document's pages
        for (int i = 0; i < document.Pages.Count; i++)
        {
            // Extract the page text
            var page = document.Pages[i];
            var text = page.ExtractText();

            // Search for the keyword
            keyword = keyword.ToLower().Trim();
            if (text.ToLower().Contains(keyword) || file.Name.ToLower().Trim().Contains(keyword) || file.FullName.ToLower().Trim().Contains(keyword))
            {
                results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
            }
        }
    });

    return results;
}

It all works fine, but when I have more than 200 keywords to search for and more than 1500 files it gets a bit slow. Is there anything I can do to optimize this loop?

There are 2 answers

Alberto
  1. If you are only interested in the file name, stop processing pages after the first occurrence.
  2. Do not open and extract the PDF if the keyword already matches the file name.
  3. Use StringComparison.OrdinalIgnoreCase to compare strings instead of calling ToLower.
public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string keyword)
{
    var results = new ConcurrentBag<KeyValuePair<string, string>>();

    var directory = new DirectoryInfo(path);
    var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);

    Parallel.ForEach(files, file =>
    {
        if(file.Name.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0 || 
           file.FullName.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
        }
        else 
        {
            // Open the PDF file
            var document = new PdfDocument(file.FullName);
            Console.WriteLine("\n\rSearching for: " + keyword + " in file: " + file.Name + "\n\r");

            // Iterate over the document's pages
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // Extract the page text
                var page = document.Pages[i];
                var text = page.ExtractText();

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
                    break;
                }
            }
        }
    });

    return results;
}

Guru Stron

I have more than 200 keywords

And you are loading and processing every PDF once for every single keyword. I think it would be much more efficient to load each file once and check it against all the keywords:

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string[] keywords)
{
    //...
    Parallel.ForEach(files, file =>
    {
        // ...
        for (int i = 0; i < document.Pages.Count; i++)
        {
            foreach (var keyword in keywords)
            {
                // search for keyword and add it to the results    
            }
        }
    });
    // ...  
}

The next thing you can try to optimize is breaking out of the search for a page/keyword pair: since you only care about a keyword being found in the file, not on a particular page, break out earlier once a keyword has been found (and/or once all keywords have been found), for example by maintaining a local hash set of found keywords.
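Putting those ideas together, a minimal sketch (assuming the multi-keyword signature above, the same Spire.Pdf calls as in the question, and the usual usings such as System.Collections.Generic and System.Collections.Concurrent): the per-file HashSet<string> tracks which keywords have already matched so later pages only search for the remaining ones, and the page loop stops as soon as every keyword has been found.

public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string[] keywords)
{
    var results = new ConcurrentBag<KeyValuePair<string, string>>();

    var directory = new DirectoryInfo(path);
    var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);

    Parallel.ForEach(files, file =>
    {
        var document = new PdfDocument(file.FullName);

        // Keywords already found in this file; they are skipped on later pages.
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Stop paging as soon as every keyword has been found.
        for (int i = 0; i < document.Pages.Count && found.Count < keywords.Length; i++)
        {
            var text = document.Pages[i].ExtractText();

            foreach (var keyword in keywords)
            {
                if (found.Contains(keyword))
                    continue;

                if (text.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                {
                    found.Add(keyword);
                    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
                }
            }
        }
    });

    return results;
}

Using StringComparer.OrdinalIgnoreCase on the hash set keeps its lookups consistent with the case-insensitive Contains checks.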

Then optimize the search (as suggested in the comments): there is no need to create a bunch of strings with ToLower and put extra pressure on the GC.

Instead of

keyword = keyword.ToLower().Trim();
if (text.ToLower().Contains(keyword) || file.Name.ToLower().Trim().Contains(keyword) || file.FullName.ToLower().Trim().Contains(keyword))
{
    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
}

just use:

if (text.Contains(keyword, StringComparison.OrdinalIgnoreCase) || file.Name.Contains(keyword, StringComparison.OrdinalIgnoreCase) || file.FullName.Contains(keyword, StringComparison.OrdinalIgnoreCase))
{
    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
}

Also, you can perform the file name and full path checks before the full-text search (ideally before loading the file/pages at all), as shown in the sketch below.
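
For example, a rough sketch of that pre-check layered on the multi-keyword loop above; the page-by-page part is elided since it is the same as in the previous sketch:

Parallel.ForEach(files, file =>
{
    var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Cheap check first: a keyword matching the full path (which includes the
    // file name) needs no PDF loading or text extraction at all.
    foreach (var keyword in keywords)
    {
        if (file.FullName.Contains(keyword, StringComparison.OrdinalIgnoreCase))
        {
            found.Add(keyword);
            results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
        }
    }

    // Only load the document if some keywords are still missing, then run the
    // page-by-page search shown above for the remaining ones.
    if (found.Count < keywords.Length)
    {
        var document = new PdfDocument(file.FullName);
        // ... page loop as in the previous sketch ...
    }
});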