I'm trying to perform a query on my index and get all reviews that do NOT have a reviewer with a gravatar image. To do this I have implemented a PatternAnalyzerDefinition with a host pattern:
"^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)"
that should match and extract host of urls like:
https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm
becomes:
www.gravatar.com
The mapping:
clientProvider.getClient.execute {
create.index(_index).analysis(
phraseAnalyzer,
PatternAnalyzerDefinition("host_pattern", regex = "^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")
).mappings(
"reviews" as (
.... Cool mmappings
"review" inner (
"grade" typed LongType,
"text" typed StringType index "not_analyzed",
"reviewer" inner (
"screenName" typed StringType index "not_analyzed",
"profilePicture" typed StringType analyzer "host_pattern",
"thumbPicture" typed StringType index "not_analyzed",
"points" typed LongType index "not_analyzed"
),
.... Other cool mmappings
)
) all(false)
} map { response =>
Logger.info("Create index response: {}", response)
} recover {
case t: Throwable => play.Logger.error("Error creating index: ", t)
}
The query:
val reviewQuery = (search in path)
.query(
bool(
must(
not(
termQuery("review.reviewer.profilePicture", "www.gravatar.com")
)
)
)
)
.postFilter(
bool(
must(
rangeFilter("review.grade") from 3
)
)
)
.size(size)
.sort(by field "review.created" order SortOrder.DESC)
clientProvider.getClient.execute {
reviewQuery
}.map(_.getHits.jsonToList[ReviewData])
Check the index for the mapping:
reviewer: {
properties: {
id: {
type: "long"
},
points: {
type: "long"
},
profilePicture: {
type: "string",
analyzer: "host_pattern"
},
screenName: {
type: "string",
index: "not_analyzed"
},
state: {
type: "string"
},
thumbPicture: {
type: "string",
index: "not_analyzed"
}
}
}
When i perform the query the pattern matching does not seem to work. I still get reviews with a reviewer that has a gravatar image. What am I doing wrong? Maybe I have misunderstood the PatternAnalyzer?
I'm using "com.sksamuel.elastic4s" %% "elastic4s" % "1.5.9",
I guess once again RTFM is in order here:
The docs states:
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
meaning that in my case the matched token www.gravatar.com will not be a part of the tokens after analyzing the field.
Instead use the Pattern Capture Token Filter
First declare a new CustomAnalyzerDefinition:
Then add the analyzer to the field:
And voila. It works. A very nice tool for checking the tokens and the index is calling: