SolrJ deleteById does not delete data in Solr

I have a Solr collection with 6 shards based on years, 2019 to 2024. I use this method to delete some documents from this collection:

invoke(() -> solrClient().deleteById(collectionName, ids ));

But this does not actually delete the documents for the corresponding ids, even after waiting for a day. However, the method below works and deletes the documents instantly:

invoke(() -> solrClient().deleteById(collectionName, ids, 1000));
try {
    // Explicit commit issued right after the delete request
    solrClient().commit(collectionName);
} catch (SolrServerException | IOException e) {
    throw new RuntimeException(e);
}

Can someone please explain to me what's going on here, and what the significance is of the commitWithinMs value, which I've set to 1000? I'm not sure whether I should keep this value at 1000 ms or increase it.

I'm using Solr version 8.9

I tried passing 1000 as the commitWithinMs parameter to deleteById and committing right afterwards, and it worked. But I thought Solr does autocommit, and I can see an autocommit time configured in solrconfig.xml:

<autoCommit>
    <!-- in ms, our setting is 10 min -->
    <maxTime>600000</maxTime>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
</autoCommit>

Also, just passing commitWithinMs is not sufficient; I have to commit explicitly right after I invoke the deleteById method.

There are 2 answers

MatsLindh On BEST ANSWER

In your first example you're just submitting your delete request. Changes do not become permanent until a commit happens; until then they're only pending and have not been flushed to the index on disk. If the server restarts before a commit happens, the change can be lost. Nor will the change affect what a search returns until a commit is issued (and a new searcher is opened, which becomes important further down).

In the other two examples you do issue a commit, so your changes become visible. You do it in two different ways: one with commitWithin and one with an explicit commit.

commitWithin tells Solr to automagically issue a commit if none has been issued within the time given. This is useful when multiple clients index content in parallel: you don't issue commits from every client after every document, but updates still become visible within a certain timeframe. In other words, it's especially useful in a busy setting where many updates are made within a short timeframe. If you only update infrequently, you can just issue a commit for every update, since there isn't going to be any performance penalty if the commit would have happened within a second anyway (and there aren't any other updates in that time frame).
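To make the batching effect concrete, here is a minimal toy model in plain Java. It is not Solr's implementation and uses no SolrJ classes; the class ToyCommitWithin and its methods are made up for illustration, and time is passed in by hand so the behaviour is deterministic. It only shows the idea that several updates arriving close together share a single commit at the earliest requested deadline.

```java
// Toy model only: NOT Solr code. All names here are hypothetical.
class ToyCommitWithin {
    private long deadline = Long.MAX_VALUE; // earliest requested commit time; MAX_VALUE means none
    private int pendingUpdates = 0;          // updates received since the last commit
    private int commits = 0;                 // number of commits performed so far

    // An update arriving at nowMs asks to be committed within commitWithinMs.
    void update(long nowMs, long commitWithinMs) {
        pendingUpdates++;
        deadline = Math.min(deadline, nowMs + commitWithinMs);
    }

    // Called periodically; commits once the earliest deadline has been reached.
    void tick(long nowMs) {
        if (pendingUpdates > 0 && nowMs >= deadline) {
            commit();
        }
    }

    // An explicit commit covers everything pending and clears the deadline.
    void commit() {
        commits++;
        pendingUpdates = 0;
        deadline = Long.MAX_VALUE;
    }

    int commitCount() { return commits; }
    int pendingCount() { return pendingUpdates; }
}

class ToyCommitWithinDemo {
    public static void main(String[] args) {
        ToyCommitWithin model = new ToyCommitWithin();
        // Three clients send updates within 200 ms, each with commitWithin = 1000 ms.
        model.update(0, 1000);
        model.update(100, 1000);
        model.update(200, 1000);
        model.tick(500);   // too early, nothing is committed yet
        model.tick(1000);  // the earliest deadline is reached: one commit covers all three
        System.out.println("commits = " + model.commitCount());
    }
}
```

In this sketch the three updates cost one commit instead of three, which is the trade-off described above: a bounded delay before visibility in exchange for fewer commits.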

And the last issue, the one with autoCommit, comes down to this:

You have openSearcher set to false, so your autocommits don't cause a new searcher to be opened. This means the searcher (the module responsible for actually looking up your documents in the index based on your query) is still the old one that uses the old index; it never switches over to the changed index after your commit.

From the reference guide:

If this is false, the commit will flush recent index changes to stable storage, but does not cause a new searcher to be opened to make those changes visible.

So the changes will be persisted within ten minutes, but won't be visible until a new searcher is opened. That can happen by an explicit commit somewhere else, by an optimize, or by the Solr server being restarted.
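The difference between "persisted" and "visible" can be sketched with a small toy model in plain Java. This is not Solr code and does not reflect Solr's internals; ToyIndex and its methods are hypothetical names chosen for illustration. It assumes only what is described above: a delete is pending until a commit, a commit with openSearcher=false makes it durable, and searches keep reading the old snapshot until a new searcher is opened.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: NOT Solr code. All names here are hypothetical.
class ToyIndex {
    private final Set<String> pendingDeletes = new HashSet<>(); // received but not yet committed
    private final Set<String> durable = new HashSet<>();        // state flushed by a commit
    private Set<String> searcherView;                           // snapshot that searches read from

    ToyIndex(Set<String> initialDocs) {
        durable.addAll(initialDocs);
        searcherView = new HashSet<>(initialDocs);
    }

    // Like the first example in the question: the delete is only queued.
    void deleteById(String id) {
        pendingDeletes.add(id);
    }

    // A commit always makes pending deletes durable; it refreshes the
    // search snapshot only when openSearcher is true.
    void commit(boolean openSearcher) {
        durable.removeAll(pendingDeletes);
        pendingDeletes.clear();
        if (openSearcher) {
            searcherView = new HashSet<>(durable);
        }
    }

    boolean isSearchable(String id) { return searcherView.contains(id); }
    boolean isOnDisk(String id) { return durable.contains(id); }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(new HashSet<>(Arrays.asList("a", "b")));
        index.deleteById("a");
        System.out.println(index.isSearchable("a")); // true: nothing has been committed
        index.commit(false);                         // like autoCommit with openSearcher=false
        System.out.println(index.isSearchable("a")); // true: durable, but the old searcher is still in use
        index.commit(true);                          // like an explicit commit that opens a new searcher
        System.out.println(index.isSearchable("a")); // false: the delete is now visible
    }
}
```

The middle step is the situation in the question: the autocommit has run, the delete is safely stored, and searches still return the document.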

git push origin master On

In Apache Solr, the process of adding, updating, or deleting documents involves two main steps: sending the changes to Solr, and then making those changes visible by committing them. The commit operation is what actually persists the changes to the disk and makes them searchable. However, committing is an expensive operation in terms of I/O, and doing it too frequently can negatively impact Solr's performance. This is where the concepts of commitWithinMs and auto-commit come into play, and understanding them can help you manage the balance between data visibility and system performance.

Understanding commitWithinMs

The commitWithinMs parameter specifies that the changes (in your case, deletions) should be committed to the index within the given number of milliseconds. When you call deleteById(collectionName, ids, 1000), you're essentially requesting that Solr commits these deletions within 1000 milliseconds (1 second) of receiving them. This is a way to suggest to Solr that it should try to make the changes visible soon, but without forcing an immediate commit.

However, it's important to note that commitWithinMs is a suggestion to Solr and not a strict guarantee. The actual commit might happen slightly later than the specified time, depending on the server's load and the settings in solrconfig.xml.

Auto-Commit Feature

Solr's auto-commit feature is designed to automatically commit changes after certain conditions are met, such as a specified time interval (maxTime) or a certain number of changes (maxDocs). In your solrconfig.xml, auto-commit is set to trigger every 10 minutes (600000 milliseconds) or after 100,000 documents have been changed. This feature ensures that changes are persisted in a timely manner without requiring manual commits, and it can improve performance by batching multiple changes into a single commit operation. Note that with openSearcher set to false, as in your configuration, an auto-commit persists the changes but does not by itself make them visible to searches.

Why Explicit Commits are Still Needed

Even with commitWithinMs and auto-commit configured, there are scenarios where you might want to explicitly commit changes. For instance, if you need certain changes to be immediately searchable, waiting for the next auto-commit cycle might not be acceptable. This is likely why your deletions are only effective when you explicitly call commit after using deleteById with commitWithinMs.

Explicitly committing after deletions ensures that the changes are made visible immediately, but it should be used judiciously to avoid performance issues.

Recommendations

  1. Use commitWithinMs Judiciously: While specifying a commitWithinMs value can help ensure that your deletions are committed in a timely manner, relying solely on this without understanding the implications on performance can be problematic. It's a helpful parameter for operations where you have flexibility on exactly when the changes become visible but want to suggest a timeframe.

  2. Understand Your Application's Requirements: If immediate visibility of changes is crucial for your application, then following up deletions with an explicit commit, as you've found to work, is necessary. However, if your application can tolerate a slight delay, relying on auto-commit could improve overall performance.

  3. Tuning Auto-Commit Settings: Consider your application's specific needs and adjust the auto-commit settings in solrconfig.xml accordingly. If your update volume is high and updates are frequent, you might want to adjust the maxTime and maxDocs settings to ensure a good balance between visibility of changes and performance.

  4. Monitoring Performance: Pay attention to how these settings impact your Solr cluster's performance. Adjusting these settings might require some trial and error to find the optimal balance for your specific use case.

In conclusion, the use of commitWithinMs and explicit commits should be tailored to the needs of your application, keeping in mind the trade-offs between immediate data visibility and the performance impact of frequent commits.