I'm building a data-science-focused search engine and have a question for anyone familiar with parsing text that contains mathematical notation. I've set up a standard Wikipedia API class with a method to fetch article text and clean it a bit:
```python
import re
import requests
from pydantic import BaseModel

class WikipediaAPI(BaseModel):
    def fetch_article_data(self, article_title):
        URL = "https://en.wikipedia.org/w/api.php"
        PARAMS = {
            "action": "query",
            "prop": "extracts",
            "titles": article_title,
            "explaintext": True,
            "format": "json",
        }
        response = requests.get(url=URL, params=PARAMS)
        data = response.json()
        pages = data["query"]["pages"]
        page_id = next(iter(pages))
        content = pages[page_id].get("extract", "")
        # drop whitespace-delimited tokens that start with a LaTeX command, e.g. '\frac'
        tokens = content.split()
        cleaned_tokens = [token for token in tokens if not re.match(r'\\[a-zA-Z]+', token)]
        cleaned_text = ' '.join(cleaned_tokens)
        final_text = re.sub(r'[\n\t]+', '', cleaned_text)
        title = pages[page_id].get("title", "")
        return title, page_id, final_text

title, page_id, final_text = WikipediaAPI().fetch_article_data('Normal_distribution')
print(final_text[:1000])
```
Now this 95% works. I think the text is decent for embeddings, but I'd ideally like to remove the LaTeX rendering syntax. See the first ~1000 chars:
In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f ( x ) = 1 σ 2 π e − 1 2 ( x − μ σ ) 2 {\displaystyle f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}} The parameter μ {\displaystyle } is the mean or expectation of the distribution (and also its median and mode), while the parameter σ {\displaystyle } is its standard deviation. The variance of the distribution is σ 2 {\displaystyle ^{2}} . A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem.
What solutions can you think of for removing the LaTeX? Or should I not bother, given that the surrounding content is good enough for a data science RAG application?
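One angle I've been eyeing (a sketch, untested beyond a couple of pages): skip the rendered extract entirely and pull the raw wikitext via `action=parse`. In wikitext the formulas sit inside `<math>...</math>` tags, which one regex can strip; the trade-off is that wikitext carries its own markup (templates, links) that would then need its own cleanup. I'm assuming the default JSON layout here, where the wikitext lives under `data["parse"]["wikitext"]["*"]`.

```python
import re
import requests

def fetch_wikitext(article_title):
    # fetch the raw wikitext for the page instead of the rendered extract
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "parse",
        "page": article_title,
        "prop": "wikitext",
        "format": "json",
    }
    data = requests.get(url, params=params).json()
    wikitext = data["parse"]["wikitext"]["*"]  # default (formatversion=1) layout
    # formulas are delimited as <math>...</math>, so one regex removes them all
    return re.sub(r"<math.*?</math>", "", wikitext, flags=re.DOTALL)
```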
I have also tried a remove_nested_curly_braces function on the extract:
```python
def remove_nested_curly_braces(text):
    # scan once, recording the spans of outermost {...} groups, then delete them
    stack = []
    to_remove = []
    text_list = list(text)
    for i, char in enumerate(text_list):
        if char == '{':
            stack.append(i)
        elif char == '}':
            if stack:
                start = stack.pop()
                if not stack:  # stack empty again: this '}' closes an outermost group
                    to_remove.append((start, i))
    # delete from the end so earlier spans keep their indices
    for start, end in reversed(to_remove):
        del text_list[start:end + 1]
    return ''.join(text_list)
```
And this works for the first sections but trips up when there are {} in the non-LaTeX sections. Any ideas are helpful, thanks.
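One idea for making it less trigger-happy (a sketch along the same lines as the function above, not battle-tested): only cut brace groups that open with the literal `{\displaystyle` marker, so ordinary {} in prose survives.

```python
def remove_displaystyle_blocks(text):
    # delete only brace groups that open with '{\displaystyle'; leave other {} alone
    marker = '{\\displaystyle'
    out = []
    i = 0
    while i < len(text):
        if text.startswith(marker, i):
            # walk forward until the brace opened at i is balanced again
            depth = 0
            j = i
            while j < len(text):
                if text[j] == '{':
                    depth += 1
                elif text[j] == '}':
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            i = j + 1  # resume after the whole {\displaystyle ...} block
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)
```

Leftover double spaces can be collapsed afterwards with ' '.join(result.split()). Even then, the plain-text rendering of each formula (the "f ( x ) = 1 σ 2 π …" run before each {\displaystyle) stays behind in the extract, which is part of why I'm also eyeing the wikitext route above.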
Some options:
1. Just embed and store in the DB with the LaTeX present.
2. Tokenize the whole text, find the token patterns consistent with the start of the LaTeX, e.g. ['{\displaystyle', '^{2}', '}'], and write a function to clean those; or apply remove_nested_curly_braces to each of those tokens and rejoin; or apply a similar but brace-aware function to the whole tokenized text (see the sketch after this list).
3. LLM or NN to clean it: costs $ and is a waste of time; building an NER NN would be a better use of time, but still overkill.
Leaning towards #2 but welcome other ideas or suggestions.
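For the record, here's roughly what I mean by #2 (a sketch, assuming the extract keeps `{\displaystyle` as the start of a whitespace-delimited token, which it does in the sample above): from a `{\displaystyle` opener, keep dropping tokens while tracking brace depth until the group closes.

```python
def strip_displaystyle_tokens(text):
    # option #2: drop tokens from a '{\displaystyle' opener until braces balance
    cleaned = []
    depth = 0
    for token in text.split():
        if depth == 0 and token.startswith('{\\displaystyle'):
            depth = max(token.count('{') - token.count('}'), 0)
            continue
        if depth > 0:
            depth = max(depth + token.count('{') - token.count('}'), 0)
            continue
        cleaned.append(token)
    return ' '.join(cleaned)
```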
thanks y'all