sys.dm_fts_parser sql full text

4.5k views Asked by At

We having a really hard time to figure out two similar strings given to sys.dm_fts_parser gives different results

select * from sys.dm_fts_parser('"0 CAD"', 0, null, 0) 

seems to think that "0 CAD" is one token (returns 2 token)

select * from sys.dm_fts_parser('"0 cad"', 0, null, 0) 

returns 3 tokens - correctly

more importantly and even more confusing is why

select * from Table where contains(*,"point 5 CAD") works and select * from Table where contains(*,"point 5 cad") fails

where the column searched contains "point 5 CAD" -

Shouldn't the full text index builder either ignore noise words (e.g. "5") based upon the index setting or include it.
We have tried both and cant explain why "nnnn CAD" is something special

note that full text is suppose to be case-insensitive according to http://msdn.microsoft.com/en-us/library/ms142583.aspx

What am I missing?

Edit: Using SQL 2012 11.0.2218

4

There are 4 answers

0
Haroon On BEST ANSWER

When using SQL 2008

select * from sys.dm_fts_parser('"0 CAD"', 0, null, 0) - gives 2 tokens   
select * from sys.dm_fts_parser('"0 CAD"', 1033, null, 0) - gives 3 tokens   

On SQL 2012 (11.0.3218):

select * from sys.dm_fts_parser('"0 CAD"', 1033, null, 0) - gives 2 tokens

In SQL 2012 Microsoft introduced a new word breaker (version 14.0.4763.1000) http://msdn.microsoft.com/en-us/library/gg509108.aspx

It seems that the work-breaker now recognizes 3 character ISO 4217 Currency Codes, and if there is a number prior to the 3 char code it is not broken up.

0
Jackson On

I used something like this ..

Select * from FROM sys.dm_fts_parser(N'"'+@P_SEARCHSTRING +'" ', 1033, 0, 0) where display_term NOT LIKE 'nn%'

So it avoids all tokens starting with 'nn'

0
Joseph Sturtevant On

My team also ran into this strange tokenization behavior. Our fix was to apply LOWER to the query and search text.

As Haroon previously identified, the issue appears to arise from the tokenizer identifying certain uppercase currency codes in proximity to numbers and treating them differently. This is illustrated by the following query:

SELECT * FROM sys.dm_fts_parser ('"syp 123"', 1033, 0, 0) -- Works fine
SELECT * FROM sys.dm_fts_parser ('"SYP 123"', 1033, 0, 0) -- Doesn't work

We wrote a script that identifies all the 1-4 letter character combinations that exhibit this behavior:

DECLARE @CurrencyCodes TABLE (CurrencyCode varchar(4), TokenCount int)
DECLARE @Start int = 65 -- A
DECLARE @End int = 90 -- Z

DECLARE @A int, @B int, @C int, @D int

SET @A = @Start
WHILE NOT (@A > @End) BEGIN
    INSERT INTO @CurrencyCodes VALUES (CHAR(@A), NULL)
    SET @B = @Start
    WHILE NOT (@B > @End) BEGIN
        INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B), NULL)
        SET @C = @Start
        WHILE NOT (@C > @End) BEGIN
            INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B) + CHAR(@C), NULL)
                SET @D = @Start
                WHILE NOT (@D > @End) BEGIN
                    INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B) + CHAR(@C) + CHAR(@D), NULL)
                    SET @D = @D + 1
                END
            SET @C = @C + 1
        END
        SET @B = @B + 1
    END
    SET @A = @A + 1
END

UPDATE @CurrencyCodes SET TokenCount = (SELECT COUNT(1) FROM sys.dm_fts_parser ('"' + CurrencyCode + '123,456"', 1033, 0, 0))

SELECT CurrencyCode FROM @CurrencyCodes WHERE TokenCount = 2

From that query we found that we will have problems with any of the following 273 codes:

ADF ADP AED AFA AFN ALK ALL AMD ANG AOA AON AOR ARA ARL ARM ARP ARS ATS AUD AWG AZM AZN BAM BBD BDT BEC BEF BEL BGJ BGK BGL BGN BHD BIF BMD BND BOB BOP BOV BRB BRC BRE BRL BRN BRR BRZ BSD BTN BWP BYR BZD CAD CDF CFP CHE CHF CHW CLF CLP CNX CNY COP COU CRC CSD CSJ CSK CUP CVE CYP CZK DDM DEM DJF DKK DM DOP DZD ECS ECV EEK EGP EQE ERN ESA ESB ESP ETB EUR EURO FF FIM FJD FKP FRF GBP GEL GHC GHS GIP GMD GNE GNF GRD GTQ GWP GYD HKD HNL HRK HTG HUF IDR IEP ILP ILR ILS INR IQD IRR ISJ ISK ITL JMD JOD JPY KES KGS KHR KMF KPW KRW KWD KYD KZT LAJ LAK LBP LKR LRD LSL LTL LUF LVL LYD MAD MAF MCF MDL MGA MGF MKD MKN MMK MNT MOP MRO MTL MUR MVQ MVR MWK MXN MXP MXV MYR MZM MZN NAD NGN NIO NLG NOK NPR NZD OMR PAB PEH PEI PEN PGK PHP PKR PLN PLZ PTE PYG QAR ROL RON RSD RUB RUR RWF SAR SBD SCR SDD SDG SEK SGD SHP SIT SKK SLL SML SOS SRD SRG STD SUR SVC SYP SZL THB TJR TJS TMM TND TOP TPE TRL TRY TTD TWD TZS UAH UAK UGS UGX USD USN USS UYI UYN UYU UZS VAL VEB VEF VNC VND VUV WST XAF XAG XAU XBA XBB XBC XBD XCD XDR XEC XEU XFO XFU XOF XPD XPF XPT XTS YDD YER YUD YUM ZAL ZAR ZMK ZRN ZRZ ZWC ZWD
1
Oleksandr Fedorenko On

You have to use the correct Locale identifier(lcid) of the word breaker. Thus, you need to replace the second argument to 1033

SELECT * 
FROM sys.dm_fts_parser('"0 CAD"', 1033, null, 0)

returns 3 tokens

SELECT * 
FROM sys.dm_fts_parser('"0 cad"', 1033, null, 0)

returns 3 tokens

Also need to select English as the word-breaker language for each columns.

USE [YourDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[YourTable] DROP ([YourColumn]) 
GO
USE [YourDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[YourTable] ADD ([YourColumn] LANGUAGE [English])
GO