When I run a fulltext MySQL query, thanks to Unicode character collations I will get results matching all of the following, whichever of them I may query for: saka, sakā, śāka, ṣaka, etc.
Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results -- not all the collated matches.
How would one go about solving this? I've initially thought of these approaches:
- Creating a RegEx pattern that would analyze the target results against all possible variants. Would easily turn into one monster of a bloated pattern.
- Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.
However, both these approaches incur a substantial processing overhead compared to regular search-result highlighting. The first approach would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?
P.S. In case it's relevant: The specific character set I'm dealing with (IAST for Sanskrit transliteration with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U; in total A-Z + 19 diacritic variants; + uppercase (that poses no problem here).
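To make the second approach concrete, here is a minimal Python sketch of it. It assumes that stripping combining marks after NFD decomposition folds characters the same way the MySQL collation does, which may not hold for every collation; the function names `fold_with_offsets` and `find_spans` are hypothetical.

```python
import unicodedata

def fold_with_offsets(text):
    """Return a lowercased, diacritic-free copy of text, plus a list giving,
    for each folded character, the index of the original character."""
    folded, offsets = [], []
    for i, ch in enumerate(text):
        # Decompose e.g. "ś" into "s" + combining acute, then drop the marks.
        for part in unicodedata.normalize("NFD", ch.lower()):
            if not unicodedata.combining(part):
                folded.append(part)
                offsets.append(i)
    return "".join(folded), offsets

def find_spans(text, word):
    """Locate word in the folded text and map the hits back to (start, end)
    positions in the original text."""
    folded_text, offsets = fold_with_offsets(text)
    folded_word, _ = fold_with_offsets(word)
    if not folded_word:
        return []
    spans = []
    start = folded_text.find(folded_word)
    while start != -1:
        end = start + len(folded_word)
        stop = offsets[end - 1] + 1
        # Keep any trailing combining marks with the last matched character.
        while stop < len(text) and unicodedata.combining(text[stop]):
            stop += 1
        spans.append((offsets[start], stop))
        start = folded_text.find(folded_word, end)
    return spans

text = "The Śaka era and sakā"
for a, b in find_spans(text, "saka"):
    print(text[a:b])
```

The folded copy and the offset list are the extra memory the second approach pays for; the spans it returns index into the original, unmodified text.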
Here's what I ended up doing. Seems to have negligible impact on performance. (I noticed none!)
First, a function converts the query word into a regular expression that iterates over the variants, turning the words saka, śaka, etc. into (s|ś|ṣ)(a|ā)k(a|ā).
Then the variant-iterated word pattern is used to highlight the search results. Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!
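As a rough illustration of the two steps, a minimal Python sketch might look like this. The variant table is partial and hypothetical (only the a and s groups are taken from the example pattern; the rest are illustrative), and the names `variant_pattern` and `highlight`, as well as the `<mark>` wrapper, are assumptions rather than the code actually used.

```python
import re
import unicodedata

# Hypothetical, partial variant table: each string is a base letter followed
# by its diacritic variants. A real table would cover all 19 variants.
VARIANT_GROUPS = ["aā", "iī", "uū", "sśṣ"]

# Map every character (in composed NFC form) to the group it belongs to.
VARIANTS = {
    ch: group
    for group in (unicodedata.normalize("NFC", g) for g in VARIANT_GROUPS)
    for ch in group
}

def variant_pattern(word):
    """Turn a query word into a regex source string that matches every
    collated variant, e.g. "saka" -> "(s|ś|ṣ)(a|ā)k(a|ā)"."""
    parts = []
    for ch in unicodedata.normalize("NFC", word).lower():
        group = VARIANTS.get(ch)
        if group:
            parts.append("(" + "|".join(group) + ")")
        else:
            # Characters without variants are matched literally.
            parts.append(re.escape(ch))
    return "".join(parts)

def highlight(text, word, tag="mark"):
    """Wrap every variant match of word in text with <tag>...</tag>."""
    pattern = re.compile(variant_pattern(word), re.IGNORECASE)
    text = unicodedata.normalize("NFC", text)
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

print(variant_pattern("saka"))
print(highlight("The Śaka era and sakā", "saka"))
```

Note that this sketch matches inside longer words as well; adding word boundaries to the compiled pattern would restrict it to whole words if that is what the search returns.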