Deleting items from Azure Search index during an indexer seems, well, broken


I have a large Cosmos DB database of documents, plus an index and an indexer that iterates through the documents and builds the index. All of that works great.

But if a document is removed from the source database, the search index still contains it.

I understand that there is a data deletion policy, but that seems to require a property in the source database that marks a soft delete. In my case the document has been deleted for real; there is no soft delete in the database.

So why can I not get the indexer to remove all documents that are no longer in the source data?


There are 2 answers

1
NotFound On

Because the indexer doesn't know a document has been deleted. Think of it this way: it keeps a cursor that records the last processed change, based on _ts (the timestamp of a document's last modification). Each time the scheduler triggers, it runs a query for changes newer than that value. It can detect updates and inserts, but not deletions, because a deleted document never shows up in the query results.
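To make the cursor behaviour concrete, here is a toy model in Python. It is only a sketch of the idea described above, not the indexer's actual implementation; the names `poll_changes`, `source`, `index` and `mark` are made up for illustration.

```python
def poll_changes(source, high_water_mark):
    """Return documents modified after the cursor, plus the advanced cursor."""
    changed = [d for d in source if d["_ts"] > high_water_mark]
    new_mark = max((d["_ts"] for d in changed), default=high_water_mark)
    return changed, new_mark


# Hypothetical source container and an empty search index.
source = [{"id": "a", "_ts": 100}, {"id": "b", "_ts": 105}]
index = {}
mark = 0

# First indexer run: both documents are newer than the cursor.
changed, mark = poll_changes(source, mark)
index.update({d["id"]: d for d in changed})

# "b" is hard-deleted from the source and "a" is updated.
source = [{"id": "a", "_ts": 110}]

# Second run: only "a" comes back; nothing reports that "b" is gone.
changed, mark = poll_changes(source, mark)
index.update({d["id"]: d for d in changed})

print(sorted(index))  # ['a', 'b'] : "b" is still in the index
```

The second run only ever sees rows that still exist, so the stale entry for "b" stays in the index until something removes it explicitly.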

If you want it to work, there are a few things you can do:

  1. Add a soft delete property (e.g. isDeleted). Setting it is an update in Cosmos DB, so the indexer picks it up, and the deletion policy tells Azure Search to remove the document.
  2. Do the above in combination with a ttl property and a time to live policy on your Cosmos DB container, so the item is also deleted from Cosmos DB some time in the future. Choose a timespan large enough that the scheduler is 'guaranteed' to remove the item from the index first.
  3. Manually delete the item from the index yourself using one of the SDKs or the REST API.
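The three options can be sketched as plain JSON payloads built in Python. This is a rough sketch under stated assumptions: the column name `isDeleted`, the key field `id` and the helper names are illustrative, the TTL value is arbitrary, and the policy fragment still has to be merged into your own data source definition. The `ttl` property only takes effect if time to live is enabled on the container.

```python
import json


def soft_delete_policy(column="isDeleted", marker="true"):
    """Option 1: deletion detection policy for the data source definition."""
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": column,
        "softDeleteMarkerValue": marker,  # the marker is given as a string
    }


def mark_deleted(doc, ttl_seconds):
    """Option 2: flag the Cosmos DB item and let it expire later via ttl."""
    return {**doc, "isDeleted": True, "ttl": ttl_seconds}


def delete_batch(key_field, keys):
    """Option 3: request body for deleting documents through the index API."""
    return {"value": [{"@search.action": "delete", key_field: k} for k in keys]}


print(json.dumps({"dataDeletionDetectionPolicy": soft_delete_policy()}, indent=2))
print(json.dumps(mark_deleted({"id": "doc-1"}, ttl_seconds=7 * 24 * 3600)))
print(json.dumps(delete_batch("id", ["doc-1", "doc-2"])))
```

For option 2, pick a ttl comfortably longer than the indexer schedule (a week in this sketch), so the soft-deleted item is still present when the indexer next runs.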
0
Daisy White On

Here is a creative way you might be able to do it.

Add a field to your index called "IndexedDate". When you repopulate the index, set that to the current date, for every document in the index. Next time you run your index population code, it will update the IndexedDate.

Now, after your code has merged in new documents (and updated existing ones), you can simply query the index for documents with an IndexedDate that is less than your most recent IndexedDate. Those records did not get their IndexedDate updated because they weren't returned by the API you called originally, which means they were deleted from the database and should no longer be in the index. Now you just call SearchClient.DeleteDocuments() and voilà, you have removed all deleted records from your index.
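A minimal Python sketch of that sweep, assuming `client` behaves like `SearchClient` from the `azure-search-documents` package (its `search` and `delete_documents` methods), that the key field is called `id`, and that `IndexedDate` is a filterable `Edm.DateTimeOffset` field. The helper names and the chunk size of 1000 are choices made for this example.

```python
from datetime import datetime, timezone


def stale_filter(run_started, field="IndexedDate"):
    """Build an OData filter matching documents not touched by this run."""
    stamp = run_started.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{field} lt {stamp}"


def purge_stale(client, run_started, key_field="id", field="IndexedDate"):
    """Delete every document whose IndexedDate predates run_started."""
    results = client.search(
        search_text="*",
        filter=stale_filter(run_started, field),
        select=[key_field],
    )
    stale = [{key_field: r[key_field]} for r in results]
    # Send the deletions in chunks to stay within indexing batch limits.
    for start in range(0, len(stale), 1000):
        client.delete_documents(documents=stale[start:start + 1000])
    return [d[key_field] for d in stale]


# run_started must be timezone-aware; record it before repopulating the index.
print(stale_filter(datetime(2024, 1, 2, tzinfo=timezone.utc)))
```

Capture `run_started` before the repopulation begins and stamp every merged document with a value at or after it; otherwise documents written early in the run could be mistaken for stale ones.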