I'm trying to perform a query on my index and get all reviews that do NOT have a reviewer with a gravatar image. To do this I have implemented a PatternAnalyzerDefinition with a host pattern:

"^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)"

which should match URLs such as the following and extract the host:

https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm

becomes:

www.gravatar.com
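
As a quick sanity check outside Elasticsearch, the regex itself can be tried with plain Scala. This is only a sketch: the object and method names (`HostRegexDemo`, `extractHost`) are made up for illustration, and it shows what the capture group contains, not how any analyzer uses the pattern.

```scala
object HostRegexDemo {
  // Same regex as above, written as a raw string so the backslashes are not doubled.
  val HostPattern = """^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)""".r

  // Returns the first capture group (the host) when the URL matches at its start.
  def extractHost(url: String): Option[String] =
    HostPattern.findFirstMatchIn(url).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // Prints the host wrapped in Some, or None if the input is not an http(s) URL.
    println(extractHost("https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"))
  }
}
```

So the capture group does hold the host; whether that group ends up as an indexed token depends on how the analyzer applies the pattern.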

The mapping:

clientProvider.getClient.execute {
  create.index(_index).analysis(
    phraseAnalyzer,
    PatternAnalyzerDefinition("host_pattern", regex = "^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")
  ).mappings(
    "reviews" as (
      .... Cool mappings
      "review" inner (
        "grade" typed LongType,
        "text" typed StringType index "not_analyzed",
        "reviewer" inner (
          "screenName" typed StringType index "not_analyzed",
          "profilePicture" typed StringType analyzer "host_pattern",
          "thumbPicture" typed StringType index "not_analyzed",
          "points" typed LongType index "not_analyzed"
        ),
        .... Other cool mappings
      )
    ) all(false)
  )
} map { response =>
  Logger.info("Create index response: {}", response)
} recover {
  case t: Throwable => play.Logger.error("Error creating index: ", t)
}

The query:

val reviewQuery = (search in path)
      .query(
        bool(
          must(
            not(
              termQuery("review.reviewer.profilePicture", "www.gravatar.com")
            )
          )
        )
      )
      .postFilter(
        bool(
          must(
            rangeFilter("review.grade") from 3
          )
        )
      )
      .size(size)
      .sort(by field "review.created" order SortOrder.DESC)

    clientProvider.getClient.execute {      
      reviewQuery
    }.map(_.getHits.jsonToList[ReviewData])

Checking the index confirms the mapping:

reviewer: {
    properties: {
        id: {
            type: "long"
        },
        points: {
            type: "long"
        },
        profilePicture: {
            type: "string",
            analyzer: "host_pattern"
        },
        screenName: {
            type: "string",
            index: "not_analyzed"
        },
        state: {
            type: "string"
        },
        thumbPicture: {
            type: "string",
            index: "not_analyzed"
        }
    }
}

When I perform the query, the pattern matching does not seem to work: I still get reviews whose reviewer has a gravatar image. What am I doing wrong? Maybe I have misunderstood the PatternAnalyzer?

I'm using "com.sksamuel.elastic4s" %% "elastic4s" % "1.5.9".

jakob On BEST ANSWER

I guess once again RTFM is in order here:

The docs state:

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

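In other words, the pattern analyzer splits the text wherever the regex matches and discards the matched text. In my case the match, which includes www.gravatar.com, is treated as a separator, so the host never ends up among the tokens indexed for the field.

Instead, use the Pattern Capture Token Filter.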
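
To make the separator behaviour concrete, here is a rough approximation in plain Scala. It is not the Lucene implementation: it just splits on the regex, drops empty pieces and lowercases, which is my assumption of what the pattern analyzer roughly does with its defaults. The names `SeparatorSplitDemo` and `tokensWhenRegexIsSeparator` are invented for this sketch.

```scala
object SeparatorSplitDemo {
  // The host regex from the question, as a raw string.
  val HostPattern = """^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)"""

  // Approximates an analyzer that treats regex matches as separators:
  // whatever the regex matches is thrown away, only the text around it survives.
  def tokensWhenRegexIsSeparator(text: String): Seq[String] =
    text.split(HostPattern).toSeq.filter(_.nonEmpty).map(_.toLowerCase)

  def main(args: Array[String]): Unit = {
    val url = "https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"
    // The scheme and host are consumed as the separator, so only the rest of the URL is left.
    tokensWhenRegexIsSeparator(url).foreach(println)
  }
}
```

With the example URL, the only surviving piece is the path and query string, which is why a term query for www.gravatar.com finds nothing to exclude.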

First declare a new CustomAnalyzerDefinition:

val hostAnalyzer = CustomAnalyzerDefinition(
    "host_analyzer",
    StandardTokenizer,
    PatternCaptureTokenFilter(
      name = "hostFilter",
      patterns = List[String]("^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)"),
      preserveOriginal = false
    )
  )

Then add the analyzer to the field:

"review" inner (              
                "reviewer" inner (
                  "screenName" typed StringType index "not_analyzed",
                  "profilePicture" typed StringType analyzer "host_analyzer",
                  "thumbPicture" typed StringType index "not_analyzed",
                  "points" typed LongType index "not_analyzed"
                )
)

Finally, register the analyzer when creating the index:

create.index(_index).analysis(
            someAnalyzer,
            phraseAnalyzer,
            hostAnalyzer
          ).mappings(

And voilà, it works. A very handy way to check which tokens actually ended up in the index is to call the term vector endpoint:

/[index]/[collection]/[id]/_termvector?fields=review.reviewer.profilePicture&pretty=true