I'm thinking about implementing a system for filtering data. This relies on fast pattern matching of a large dataset (I'm thinking Aho-Corasick). The patterns will be fixed strings. This will probably be in the region of 1000-100000 records - not large for a relation database, but big for an in-memory graph. Key to the performance (regardless of the underlying algorithm) is to create an appropriate representation of the reference dataset. The dataset will change over time.

My problem is how to best architect this as a solution such that updating the dataset does not interrupt the use of the service. The updates do not immediately have to be reflected to service clients.

Is there a fast string matching algorithm which supports ad-hoc changes to the dataset without a full recompilation?

Failing that, presumably my only option would be to separate the front side service from the back end matching engine / and to spin up a new back end engine/switch over when I want to change the dataset.

Is there a more elegant solution?

0 Answers