ShingleFilter is not working for maxShingleSize=3

Environment: Solr 8.9.0, Java version "11.0.12" 2021-07-20 LTS

The following .csv file is indexed in Solr:

books_id,cat,name
0553573403,book,Game Thrones Clash
0553573404,book,GameThrones Clash
0553573405,book,GameThronesClash
0553573406,book,GameThronesClas

The schema is defined in managed-schema as follows:

<field name="books_id" type="plong" multiValued="false" indexed="false" stored="true"/>
<field name="cat" type="string" multiValued="false" indexed="false" stored="true"/>
<field name="name" type="text_general" multiValued="false" indexed="true" required="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="false">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

I expect that a query for the book 'GameThronesClash' should also return the other three books, so ShingleFilterFactory has been configured with minShingleSize="2" and maxShingleSize="3".

My understanding is that the filter constructs shingles (token n-grams) from the token stream:

In: "Game Thrones Clash"

Tokenizer to Filter: "Game"(1), "Thrones"(2), "Clash"(3)

Out: "Game"(1), "GameThrones"(1), "GameThronesClash"(1), "Thrones"(2), "ThronesClash"(2),"Clash"(3)
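To make the expected output above concrete, here is a minimal Python sketch of shingle construction. It is not Solr's implementation; the `shingles` function and its parameter names are hypothetical and only mirror the ShingleFilterFactory attributes. One point it illustrates: ShingleFilterFactory joins tokens with its `tokenSeparator` attribute, which defaults to a single space, so with the schema above the indexed shingles would be "game thrones" and "game thrones clash" rather than concatenated forms. Concatenated shingles would need tokenSeparator="" on the filter.

```python
def shingles(tokens, min_size=2, max_size=3, separator=" ", output_unigrams=True):
    """Illustrative shingle builder (hypothetical helper, not Solr code).

    Returns (term, position) pairs; every shingle takes the position
    of its first token, as in the expected output above.
    """
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append((tok, i + 1))
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append((separator.join(tokens[i:i + size]), i + 1))
    return out


# Stand-in for StandardTokenizer + LowerCaseFilter on this simple input.
tokens = "Game Thrones Clash".lower().split()

# Assumption: default separator (single space), as in the schema above.
default_out = shingles(tokens)

# Assumption: tokenSeparator="" were configured on the filter.
concat_out = shingles(tokens, separator="")

for term, pos in default_out:
    print(f"{term!r}({pos})")
```

Under these assumptions, the term "gamethronesclash" appears only in `concat_out`, not in `default_out`.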

But the following query returns only three documents:

curl -G http://localhost:8983/solr/shingleConcatenationFuzzyCore/select --data-urlencode "q=(name:'GameThronesClash~')"
{
  "responseHeader":{
    "status":0,
    "QTime":15,
    "params":{
      "q":"(name:'GameThronesClash~')"}},
  "response":{"numFound":3,"start":0,"numFoundExact":true,"docs":[
      {
        "books_id":0553573404,
        "cat":"book",
        "name":"GameThrones Clash",
        "id":"22674fc1-9fc7-4e1b-8d09-231acf39bc25",
        "_version_":1743512855396745216},
      {
        "books_id":0553573405,
        "cat":"book",
        "name":"GamethronesClash",
        "id":"e82a0dee-a3fb-483e-806b-e667490536f4",
        "_version_":1743512855375773696},
      {
        "books_id":0553573406,
        "cat":"book",
        "name":"GameThronesclas",
        "id":"bf240788-81cd-4a51-b62d-5aba778e1dee",
        "_version_":1743512855376822272}
  ]}}

Why does the query not return the book with "books_id":0553573403 ("name":"Game Thrones Clash")? What should I change in the query to retrieve the book named "Game Thrones Clash"?

The "Analysis" page in Solr's Admin UI shows the following for the field 'name':

Field value (Index): name:'Game Thrones Clash' [screenshot: WithoutConcatenation(Index)]

Field value (Index): name:'GameThronesClash' [screenshot: Concatenated(index)]

Field value (Query): name:'Game Thrones Clash' [screenshot not preserved]

Field value (Query): name:'GameThronesClash' [screenshot not preserved]
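Since the query above is a fuzzy query (`~`), and Lucene fuzzy queries allow at most 2 edits, it may help to compute edit distances between the query term and the terms each document would index. The sketch below is only an approximation under stated assumptions: shingles are joined with the default single-space separator; plain Levenshtein distance is used (Lucene also counts transpositions, which does not matter for these strings); and, because single quotes are not phrase delimiters in the standard query parser, the leading quote is assumed to stay part of the fuzzy term. That last point is a guess, not something confirmed here. `levenshtein` and `candidates` are hypothetical names.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Closest indexed term per document, assuming space-joined shingles.
candidates = {
    "0553573403": "game thrones clash",
    "0553573404": "gamethrones clash",
    "0553573405": "gamethronesclash",
    "0553573406": "gamethronesclas",
}

# Query term with and without the leading single quote (assumption).
for query_term in ("gamethronesclash", "'gamethronesclash"):
    distances = {doc: levenshtein(query_term, term)
                 for doc, term in candidates.items()}
    print(query_term, distances)
```

If the quote really is part of the term, only 0553573403 falls outside the 2-edit limit, which would be consistent with the three results returned. Without the quote, all four terms are within 2 edits, so trying `q=name:GameThronesClash~` without the single quotes may be a reasonable experiment.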
