I have a SolrCloud 9.1 installation of three servers on three r5.xlarge EC2 instances, with a shared disk drive using EFS and stunnel. The Solr data directories are on the shared file system, along with ZooKeeper.
Documents come in at about 20,000 per day, and I am trying to index them while also serving some regular queries and some special queries for new functionality.
Right after I restart Solr, these new queries run very fast, but over time they become slower and slower.
The numbers below are typical. At time1 the request took only 2.162 sec, but after waiting overnight (time2) the response took 18.137 sec.
businessId, all count, reduced count, time1 (sec), time2 (sec)
7016274253,8433,4769,2.162,18.137
This query behaves very differently depending on when it is executed. Overnight, the Solr servers slow down and eventually give unacceptable response times. I am not sure whether the request matters, but here it is:
url: "http://xxx.aws01.hibu.int:8983/solr/calls/select",
params: {
q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
fl: "business_id, call_id, call_day, call_date, dialog_merged, call_callerno, call_duration, call_status, caller_name, caller_address, caller_state, caller_city, caller_zip",
rows: limit,
start: 0,
group: true,
"group.main": true,
"group.field": "call_callerno",
sort: "call_day desc"
}
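For anyone who wants to reproduce the request outside the original client, here is a minimal Python sketch that builds the same parameters. The function names `build_select_params` and `build_select_url` are made up for illustration, and the base URL is passed in rather than assumed. Note the mixed brackets in the range: `[` includes 20230101 and `}` excludes 20240101.

```python
from urllib.parse import urlencode

# Field list copied from the "fl" parameter in the request above.
FIELDS = [
    "business_id", "call_id", "call_day", "call_date", "dialog_merged",
    "call_callerno", "call_duration", "call_status", "caller_name",
    "caller_address", "caller_state", "caller_city", "caller_zip",
]


def build_select_params(business_id, limit):
    """Return the /select parameters from the request above as a dict."""
    return {
        # '[' makes the lower bound inclusive, '}' makes the upper bound exclusive.
        # The doubled '}}' is only f-string escaping for a single literal '}'.
        "q": f"business_id:{business_id} AND call_day:[20230101 TO 20240101}}",
        "fl": ",".join(FIELDS),
        "rows": limit,
        "start": 0,
        "group": "true",
        "group.main": "true",
        "group.field": "call_callerno",
        "sort": "call_day desc",
    }


def build_select_url(base_url, business_id, limit):
    """Return a GET URL for the select handler (base_url is caller-supplied)."""
    return f"{base_url}?{urlencode(build_select_params(business_id, limit))}"
```

Calling `build_select_url(base, "7016274253", 10)` with the collection's select URL as `base` gives a string that can be pasted into curl or a browser to time the query by hand.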
Here is what I have for autoCommit and autoSoftCommit. The clients do not issue hard commits, only soft commits.
<autoCommit>
  <maxTime>180000</maxTime>
  <maxSize>512m</maxSize>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>10000</maxTime>
</autoSoftCommit>
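To put these settings next to the stated load of about 20,000 documents per day, here is a rough back-of-the-envelope sketch. It assumes documents arrive evenly around the clock, which is probably not true in practice, and it ignores both the `maxSize` trigger and any soft commits the clients send themselves. Auto commits only fire when updates are pending, so the counts are upper bounds. The function name `commit_cadence` is invented for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commit_cadence(docs_per_day, soft_commit_ms, hard_commit_ms):
    """Estimate commit frequency relative to the indexing rate.

    Assumes a uniform arrival rate; real traffic is likely burstier.
    """
    ms_per_day = SECONDS_PER_DAY * 1000
    return {
        # Upper bounds: a timer only starts once there is a pending update.
        "max_soft_commits_per_day": ms_per_day // soft_commit_ms,
        "max_hard_commits_per_day": ms_per_day // hard_commit_ms,
        "avg_seconds_between_docs": SECONDS_PER_DAY / docs_per_day,
        "avg_docs_per_soft_commit_window": docs_per_day * soft_commit_ms / ms_per_day,
    }


# Values taken from the question: 20,000 docs/day, 10 s soft, 180 s hard.
cadence = commit_cadence(20000, 10000, 180000)
for name, value in cadence.items():
    print(f"{name}: {value}")
```

Under these assumptions a document arrives roughly every 4.3 seconds, so almost every 10-second soft commit window contains new documents, and each soft commit opens a new searcher. Whether that contributes to the slowdown here is not established by the numbers alone.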
I am mainly interested in why it slows down over time. Why does it behave well right after a fresh start and then gradually become quite slow?
Is it my indexing? Is it my caching?
Note: When I look at /solr/admin/metrics, the output shows that this one shard is taking quite a long time:
"QUERY./select.requestTimes":{
"count":4577,
"meanRate":0.09252592498547889,
"1minRate":0.07171534322545538,
"5minRate":0.056511876693544336,
"15minRate":0.05780642380709814,
"min_ms":5.607831,
"max_ms":35447.542165,
"mean_ms":12.160278707076563,
"median_ms":5.988622,
"stddev_ms":14.871542074236968,
"p75_ms":6.307839,
"p95_ms":42.103719,
"p99_ms":42.103719,
"p999_ms":98.124416},
And the other shard is taking a ridiculous amount of time:
"QUERY./select.requestTimes":{
"count":4486,
"meanRate":0.09405828676729713,
"1minRate":0.09345322035169516,
"5minRate":0.062102810330670666,
"15minRate":0.05520043855292057,
"min_ms":5.666243,
"max_ms":34713.632736,
"mean_ms":272.95919728573585,
"median_ms":6.101441,
"stddev_ms":813.2470530531275,
"p75_ms":7.397941,
"p95_ms":3392.606168,
"p99_ms":3392.606168,
"p999_ms":3392.606168},
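The two timer blocks above can be compared side by side with a short script. This is only a sketch: the values are copied by hand from the two `QUERY./select.requestTimes` entries, the labels `shard_a` and `shard_b` are placeholders for the first and second block, and the `tail_ratio` helper and its threshold are arbitrary choices made for illustration.

```python
# Selected values copied from the two requestTimes blocks above.
shard_a = {"count": 4577, "mean_ms": 12.160278707076563,
           "median_ms": 5.988622, "p95_ms": 42.103719, "max_ms": 35447.542165}
shard_b = {"count": 4486, "mean_ms": 272.95919728573585,
           "median_ms": 6.101441, "p95_ms": 3392.606168, "max_ms": 34713.632736}


def tail_ratio(timer, percentile="p95_ms"):
    """How many times slower the chosen percentile is than the median."""
    return timer[percentile] / timer["median_ms"]


def has_heavy_tail(timer, threshold=50.0):
    """Flag a timer whose p95 exceeds `threshold` times its median.

    The threshold of 50 is an arbitrary illustrative value.
    """
    return tail_ratio(timer) > threshold


for name, timer in (("shard_a", shard_a), ("shard_b", shard_b)):
    print(f"{name}: median={timer['median_ms']:.2f} ms, "
          f"mean={timer['mean_ms']:.2f} ms, "
          f"p95/median={tail_ratio(timer):.1f}, "
          f"heavy_tail={has_heavy_tail(timer)}")
```

The medians are nearly identical (about 6 ms on both), while the means differ by a factor of more than 20. Most requests are fast on both shards, and the difference sits in a minority of very slow requests on the second one.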
This question was asked on the Solr users mailing list a few weeks ago and received several responses. That mailing list is normally the best place to ask Solr questions.
The short advice was NOT to use EFS and stunnel, as they give terrible performance for an application like Solr. A search engine needs dedicated fast disks on each node. Avoid NFS and other shared network file systems.
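One way to check the disk claim on your own nodes is to time small random reads against a large file on the EFS mount and against a file on a local volume. The sketch below is an illustrative probe, not a proper benchmark: the helper names are made up, the OS page cache can hide storage latency for data that was read recently, and a dedicated tool such as fio will give more trustworthy numbers.

```python
import os
import random
import statistics
import time


def random_read_latencies_ms(path, reads=200, block_size=4096, seed=0):
    """Time `reads` random block reads from `path`; return latencies in ms."""
    size = os.path.getsize(path)
    rng = random.Random(seed)
    latencies = []
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        for _ in range(reads):
            offset = rng.randrange(0, max(1, size - block_size))
            start = time.perf_counter()
            f.seek(offset)
            f.read(block_size)
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    """Return median, approximate p95 and max of a list of latencies."""
    ordered = sorted(latencies)
    p95_index = int(0.95 * (len(ordered) - 1))
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }
```

Run `summarize(random_read_latencies_ms(path))` once with `path` pointing at a large index file on the shared mount and once with a comparable file on local disk, then compare the p95 and max values rather than only the median.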