What datastructure is the fastest to find the best matching prefix?

879 views Asked by At

Context: I'm working on an analyzer for useragent strings (Yauaa) and as part of this analysis I want to make an educated guess what brand of the device should be reported. I have an implementation that I need to rewrite to be a lot more efficient.

Because I do not want to have a complete list of all devices I want to do the detection based on the prefix of the model.

So I have a dataset with prefixes and the brand that is associated:

  • "GT-" --> "Samsung"
  • "LLD-" --> "Huawei"

And then I want to do a .get("GT-1234124") which should result in "Samsung" because that is the "longest matching prefix".

I had a look at the Trie structure but that seems to be for the opposite situation. What I understand is that you start with a set of values and you can efficiently get all the values that starts with the provided prefix.

If I were to implement this from scratch I would use a tree similar to the Trie but walk around it differently. What I'm looking for is a datastructure that does what I need as fast as possible.

What datastructure do you recommend for this usecase?

Is there an existing (proven) implementation I can use?

1

There are 1 answers

4
Niels Basjes On BEST ANSWER

I did some digging into datastructures and found that essentially the Trie structure is what I need with a different way of walking around the structure.

Since this structure is really simple I created my own implementation that works very well.

See: https://github.com/nielsbasjes/yauaa/blob/master/analyzer/src/main/java/nl/basjes/parse/useragent/utils/PrefixLookup.java


Updates:

  1. I wrote an article about this https://techlab.bol.com/finding-the-longest-matching-string-prefix-fast/
  2. I put my implementation into a separate library which I opensourced and which is already available via maven central. See https://github.com/nielsbasjes/prefixmap