I have a list of sentences (roughly 20,000) stored in an Excel file named list.xlsx, in a sheet named Sentence, under a column named Sentence.
My intention is to get words from the user and return the sentences in which those exact words appear.
I am currently able to do so with the code I developed using spaCy, but it takes a lot of time to check and return the output.
Is there a faster way of achieving this by any other means?
I see that in Geany or LibreOffice Calc the search function returns matching sentences in a jiffy.
How?
Kindly help.
import pandas as pd
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to extract sentences containing the keyword
def extract_sentences_with_keyword(text, keyword):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents if keyword in sent.text.lower()]
    return sentences

# Lowercase the keyword so the comparison against sent.text.lower() works
keyword = input("Enter Keyword(s):").lower()

# Read the Excel file
file_path = "list.xlsx"
sheet_name = "Sentence"   # Update with your sheet name
column_name = "Sentence"  # Update with the column containing text data
data = pd.read_excel(file_path, sheet_name=sheet_name)

# Iterate over the rows and extract sentences with the keyword
for index, row in data.iterrows():
    text = row[column_name]
    sentences = extract_sentences_with_keyword(text, keyword)
    if sentences:
        for sentence in sentences:
            print(sentence)
        print("\n")
You can use SQLite with a full-text index. I tried a proof-of-concept with a 6 MB text file and it is very fast. You will of course need to adjust the code to your needs; using spaCy for sentence splitting, as you did above, might still be a decent option:
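A minimal sketch of the full-text-index idea, using Python's built-in sqlite3 module with an FTS5 virtual table (the table name, column name, and sample sentences are illustrative; in practice you would load the sentences from list.xlsx with pandas as in the question):

```python
import sqlite3

# Illustrative data; in practice something like:
# sentences = pd.read_excel("list.xlsx", sheet_name="Sentence")["Sentence"].tolist()
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Full-text search in SQLite is very fast.",
    "Pandas can read Excel files into a DataFrame.",
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the index
# An FTS5 virtual table stores the sentences together with a full-text index
conn.execute("CREATE VIRTUAL TABLE sentences USING fts5(text)")
conn.executemany("INSERT INTO sentences(text) VALUES (?)",
                 [(s,) for s in sentences])

keyword = "sqlite"
# MATCH does an indexed token lookup instead of scanning every row;
# the default FTS5 tokenizer is case-insensitive
rows = conn.execute("SELECT text FROM sentences WHERE text MATCH ?",
                    (keyword,)).fetchall()
for (text,) in rows:
    print(text)
```

Building the index is a one-time cost; every subsequent keyword lookup is close to instantaneous, which is essentially what Geany and LibreOffice Calc are doing with their optimized search.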
Your spaCy code is also very slow because you do not disable any pipeline components, so it performs things like part-of-speech tagging that you do not need for your use case. For details, see https://spacy.io/usage/processing-pipelines
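As an illustration of stripping the pipeline down: if sentence splitting is all you need, you can skip the statistical model entirely and use a blank pipeline with the rule-based sentencizer (the sample text and keyword below are made up; alternatively you could keep en_core_web_sm and pass a disable= list to spacy.load):

```python
import spacy

# A blank English pipeline with only the rule-based "sentencizer" splits
# sentences without running the tagger, parser, or NER, so it is far cheaper
# than the full en_core_web_sm pipeline used in the question.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "SQLite is fast. Excel files can be large. Search should be quick."
keyword = "excel"
doc = nlp(text)
matches = [sent.text for sent in doc.sents if keyword in sent.text.lower()]
print(matches)
```

For many texts, feeding them through nlp.pipe(...) in batches is also considerably faster than calling nlp(text) once per row.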
Quoting from the docs (you might need to disable more or fewer pipelines):