Closest match with Oracle text search, including short strings

1.2k views Asked by At

I want to find the closest matching string in a database column for a given string. After searching I came up with the following table and query

CREATE TABLE docs (id NUMBER PRIMARY KEY, text VARCHAR2(200));
INSERT INTO docs VALUES(1, 'California is a state in the US.');
INSERT INTO docs VALUES(2, 'Paris is a city in France.');
INSERT INTO docs VALUES(3, 'France is in Europe.');
INSERT INTO docs VALUES(4, 'Paris');

CREATE INDEX idx_docs ON docs(text)
     INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS
     ('DATASTORE CTXSYS.DEFAULT_DATASTORE');

SELECT SCORE(1), id, text 
  FROM docs 
 WHERE CONTAINS(text, 'fuzzy(Parsi,1,1)', 1) > 0;

I have set the similarity score to the minimum, i.e. 1. It works for search strings like "Parsi" or "Parse". It gives me the wanted results. But if the search string is too small like "par" or "pa" it does not show me any results.

What should I do to get the closest match, even when using very short strings to search?

1

There are 1 answers

2
Tomasz On

You're basically hitting a limit in the fuzzy operator

Unlike stem expansion, the number of words generated by a fuzzy expansion depends on what is in the index. Results can vary significantly according to the contents of the index.

and oracle doesn't index shorter strings unless you change the default:

begin 
ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST'); 
ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX','TRUE');
ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH', '3');
ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', '4');
ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX', 'YES');
end;

In this case you might actually have to combine fuzzy and wildcard queries using query rewrite/relaxation. In my experience, wildcard expansion tends to significantly slow down everything, although maybe it's just a matter of the right index configuration.