SQL Server searching for text in a column

38.1k views Asked by At

I'm confused over what to use?

Basically I need to have a search string that can search a single column for the occurrences of multiple phrases, each input phrase is separated by a space.

So input from user would be like:

"Phrase1 Phrase2 ... PhraseX"     (number of phrases can 0 to unknown!, but say < 6)

I need to search with the logic:

Where 'Phrase1%' **AND** 'Phrase2%' **AND** ... 'PhraseX%'

.. etc... so all phrases need to be found.

Always logical AND

SO speed, performance taken in to account, Do I use:

Lot's of

Like 'Phrase1%' and like 'Phrase2%' and like ... 'PhraseX%' ?

or use

patindex('Phrase1%' , column) > 0 AND  patindex('Phrase2%' , column) > 0 
AND ...     patindex('PhraseX%' , column) 

or use

add a Full Text search Index,

The use:

Where Contatins(Column, 'Phrase1*') AND Contatins(Column, 'Phrase2*') AND ... Contatins(Column, 'PhraseX*')

Or

????

Almost too many options, which is why I'm asking, what would be the most efficient way of doing this...

Your wisdom is appreciated...

3

There are 3 answers

2
Gordon Linoff On

If you are searching for AND, then the correct wildcard search would be:

Like '%Phrase1%' and like '%Phrase2%' and like ... '%PhraseX%' 

There is no reason to use patindex() here, because like is sufficient and well optimized. Well optimized, but this case cannot be made efficient. It is going to require a full table scan. And, if the text field is really, really big (I mean at least thousands or tens of thousands of characters), performance will not be good.

The solution is full text search. You would phrase this as:

where CONTAINS(column, 'Phrase1 AND phrase2 AND . . . ');

The only issue here would be when the "phrases" (which seem to be words) you are looking for are stop words.

In conclusion, if you have more than a few thousand rows or the text field you are searching has more than a few thousand characters, then use the full text option. This is just for guidance. If you are searching through a reference table with 100 rows and looking in the description field that has up to 100 characters, then the like method should be fine.

1
Devart On

Personally I like this solution -

DECLARE @temp TABLE (title NVARCHAR(50))
INSERT INTO @temp (title)
VALUES ('Phrase1 33'), ('test Phrase2'), ('blank')

SELECT t.*
FROM @temp t
WHERE EXISTS(
    SELECT 1
    FROM (
        VALUES ('Phrase1'), ('Phrase2'), ('PhraseX')
    ) c(t)  
    WHERE title LIKE '%' + t + '%'
)
0
JBelfort On

This should ideally be done with the help of Full text search as mentioned above. BUT, If you don't have full text configured for your DB, here is a performance intensive solution for doing a prioritized string search. Note : this returns rows for a partial/complete combination of input words(rows containing one or more words of the search string in any order):-

-- table to search in
drop table if exists dbo.myTable;
go
CREATE TABLE dbo.myTable
    (
    myTableId int NOT NULL IDENTITY (1, 1),
    code varchar(200) NOT NULL, 
    description varchar(200) NOT NULL -- this column contains the values we are going to search in 
    )  ON [PRIMARY]
GO

-- function to split space separated search string into individual words
drop function if exists [dbo].[fnSplit];
go
CREATE FUNCTION [dbo].[fnSplit] (@StringInput nvarchar(max),
@Delimiter nvarchar(1))
RETURNS @OutputTable TABLE (
  id nvarchar(1000)
)
AS
BEGIN
  DECLARE @String nvarchar(100);

  WHILE LEN(@StringInput) > 0
  BEGIN
    SET @String = LEFT(@StringInput, ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput) - 1, -1),
    LEN(@StringInput)));
    SET @StringInput = SUBSTRING(@StringInput, ISNULL(NULLIF(CHARINDEX
    (
    @Delimiter, @StringInput
    ),
    0
    ), LEN
    (
    @StringInput)
    )
    + 1, LEN(@StringInput));

    INSERT INTO @OutputTable (id)
      VALUES (@String);
  END;

  RETURN;
END;
GO

-- this is the search script which can be optionally converted to a stored procedure /function

declare @search varchar(max) = 'infection upper acute genito'; -- enter your search string here
-- the searched string above should give rows containing the following
-- infection in upper side with acute genitointestinal tract
-- acute infection in upper teeth
-- acute genitointestinal pain

if (len(trim(@search)) = 0) -- if search string is empty, just return records ordered alphabetically
begin
 select 1 as Priority ,myTableid, code, Description from myTable order by Description 
 return;
end

declare @splitTable Table(
wordRank int Identity(1,1), -- individual words are assinged priority order (in order of occurence/position)
word varchar(200)
)
declare @nonWordTable Table( -- table to trim out auxiliary verbs, prepositions etc. from the search
id varchar(200)
)

insert into @nonWordTable values
('of'),
('with'),
('at'),
('in'),
('for'),
('on'),
('by'),
('like'),
('up'),
('off'),
('near'),
('is'),
('are'),
(','),
(':'),
(';')

insert into @splitTable
select id from dbo.fnSplit(@search,' '); -- this function gives you a table with rows containing all the space separated words of the search like in this e.g., the output will be -
--  id
-------------
-- infection
-- upper
-- acute
-- genito

delete s from @splitTable s join @nonWordTable n  on s.word = n.id; -- trimming out non-words here
declare @countOfSearchStrings int = (select count(word) from @splitTable);  -- count of space separated words for search
declare @highestPriority int = POWER(@countOfSearchStrings,3);

with plainMatches as
(
select myTableid, @highestPriority as Priority from myTable where Description like @search  -- exact matches have highest priority
union                                      
select myTableid, @highestPriority-1 as Priority from myTable where Description like  @search + '%'  -- then with something at the end
union                                      
select myTableid, @highestPriority-2 as Priority from myTable where Description like '%' + @search -- then with something at the beginning
union                                      
select myTableid, @highestPriority-3 as Priority from myTable where Description like '%' + @search + '%' -- then if the word falls somewhere in between
),
splitWordMatches as( -- give each searched word a rank based on its position in the searched string
                     -- and calculate its char index in the field to search
select myTable.myTableid, (@countOfSearchStrings - s.wordRank) as Priority, s.word,
wordIndex = CHARINDEX(s.word, myTable.Description)  from myTable join @splitTable s on myTable.Description like '%'+ s.word + '%'
-- and not exists(select myTableid from plainMatches p where p.myTableId = myTable.myTableId) -- need not look into rows that have already been found in plainmatches as they are highest ranked
                                                                              -- this one takes a long time though, so commenting it, will have no impact on the result
),
wordIndexRatings as( -- reverse the char indexes retrived above so that words occuring earlier have higher weightage
                     -- and then normalize them to sequential values
select myTableid, Priority, word, ROW_NUMBER() over (partition by myTableid order by wordindex desc) as comparativeWordIndex 
from splitWordMatches 
)
,
wordIndexSequenceRatings as ( -- need to do this to ensure that if the same set of words from search string is found in two rows,
                              -- their sequence in the field value is taken into account for higher priority
    select w.myTableid, w.word, (w.Priority + w.comparativeWordIndex + coalesce(sequncedPriority ,0)) as Priority
    from wordIndexRatings w left join 
    (
     select w1.myTableid, w1.priority, w1.word, w1.comparativeWordIndex, count(w1.myTableid) as sequncedPriority
     from wordIndexRatings w1 join wordIndexRatings w2 on w1.myTableId = w2.myTableId and w1.Priority > w2.Priority and w1.comparativeWordIndex>w2.comparativeWordIndex
     group by w1.myTableid, w1.priority,w1.word, w1.comparativeWordIndex
    ) 
    sequencedPriority on w.myTableId = sequencedPriority.myTableId and w.Priority = sequencedPriority.Priority
),
prioritizedSplitWordMatches as ( -- this calculates the cumulative priority for a field value
select  w1.myTableId, sum(w1.Priority) as OverallPriority from wordIndexSequenceRatings w1 join wordIndexSequenceRatings w2 on w1.myTableId =  w2.myTableId 
where w1.word <> w2.word group by w1.myTableid 
),
completeSet as (
select myTableid, priority from plainMatches -- get plain matches which should be highest ranked
union
select myTableid, OverallPriority as priority from prioritizedSplitWordMatches -- get ranked split word matches (which are ordered based on word rank in search string and sequence)
union
select myTableid, Priority as Priority from splitWordMatches -- get one word matches
),
maximizedCompleteSet as( -- set the priority of a field value = maximum priority for that field value
select myTableid, max(priority) as Priority from completeSet group by myTableId
)
select priority, myTable.myTableid , code, Description from maximizedCompleteSet m join myTable  on m.myTableId = myTable.myTableId 
order by Priority desc, Description -- order by priority desc to get highest rated items on top
--offset 0 rows fetch next 50 rows only -- optional paging