Highlighting Search Results: RegEx Character Collation?

600 views Asked by At

When I run a fulltext MySQL query, thanks to Unicode character collations I will get results matching all of the following, whichever of them I may query for: saka, sakā, śāka, ṣaka etc.

Where I'm stuck is with highlighting the matches in search results. With standard RegEx, I can only match and highlight the original query word in the results -- not all the collated matches.

How would one go about solving this? I've initially thought of these approaches:

  • Creating a RegEx pattern that would analyze the target results against all possible variants. Would easily turn into one monster of a bloated pattern.
  • Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.

However both these approaches incur a substantial processing overhead compared to a regular search result highlighting. The first approach would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?

P.S. In case it's relevant: The specific character set I'm dealing with (IAST for Sanskrit transliteration with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U; in total A-Z + 19 diacritic variants; + uppercase (that poses no problem here).

3

There are 3 answers

1
Markus AO On BEST ANSWER

Here's what I ended up doing. Seems to have negligible impact on performance. (I noticed none!)

First, a function that converts the query word into a regular expression iterating the variants:

function iast_normalize_regex($str) {

    $subst = [ 
        'a|ā', 'd|ḍ', 'e|ӗ', 'h|ḥ', 'i|ī', 'l|ḷ|ḹ', 'm|ṁ|ṃ', 
        'n|ñ|ṅ|ṇ', 'r|ṛ|ṝ', 's|ś|ṣ', 't|ṭ', 'u|ū' 
        ];

    $subst_rex = [];

    foreach($subst as $variants) {
        $chars = explode('|', $variants);
        foreach($chars as $char) {
            $subst_rex[$char] = "({$variants})";
        }
    }

    $str_chars = str_split_unicode($str);

    $str_rex = '';
    foreach($str_chars as $char) {
        $str_rex .= !isset($subst_rex[$char]) ? $char : $subst_rex[$char];
    }

    return $str_rex;
}

Which turns the words saka, śaka etc. into (s|ś|ṣ)(a|ā)k(a|ā). Then, the variant-iterated word-pattern is used to highlight the search results:

$word = iast_normalize_regex($word);
$result = preg_replace("#({$word})#iu", "<b>$1</b>", $result);

Presto: I get all the variants highlighted. Thanks for the contributions so far, and please let me know if you can think of better ways to accomplish this. Cheers!

5
Rick James On

With MySQL and its REGEXP, you can only locate row(s) that match the REGEXP. You cannot locate the match within the column.

REGEXP and LIKE both honor the collation of the column in question, but that does not help in locating the text withing the column.

Check out MariaDB and its REGEXP_REPLACE.

MySQL at least has a bug relating to it: http://bugs.mysql.com/bug.php?id=70767

0
Dale Thompson On

I was able to solve this by writing a custom MySQL function that works like the internal function LOCATE except it matches a unicode string with its non-unicode equivalent string (or vice versa):

DROP FUNCTION IF EXISTS locate_like;

DELIMITER $$

CREATE FUNCTION locate_like(substring VARCHAR(255), string VARCHAR(255)) RETURNS INTEGER
    DETERMINISTIC
BEGIN

    DECLARE i INTEGER;
    SET i = 1;

    myloop: WHILE (i <= CHAR_LENGTH(string)) DO

        IF SUBSTRING(string, i) LIKE CONCAT(substring, "%") THEN    
            -- subtract 1 to return a 0-based position
            RETURN(i - 1);
            LEAVE myloop;        
        END IF;    

        SET i = i + 1;

    END WHILE; 

    RETURN(0);
    
END

Create some test data:

CREATE TABLE search_table AS SELECT "Antonín Leopold Dvořák (8 September 1841 – 1 May 1904)" column1;

Invoke the function:

SELECT LOCATE_LIKE('dvorak', column1) pos FROM search_table WHERE column1 like '%dvorak%';

Results: 16

It works both ways:

CREATE TABLE search_table AS SELECT "Antonín Leopold Dvorak (8 September 1841 – 1 May 1904)" column1;

Invoke the function:

SELECT LOCATE_LIKE('Dvořák', column1) pos FROM search_table WHERE column1 like '%Dvořák%';

Results: 16

To highlight the results using javascript (where the search results are in an input having the id="results", the position is passed from PHP to javascript in a variable named "pos", and the length of the search term is passed as a variable named "searchLength"):

document.getElementById('results').setSelectionRange(pos, pos + searchLength);