My Marqo index is 10x larger than the data itself. How can I reduce the size of the index?

I have indexed 10 GB of data into a Marqo index, and now the index is over 100 GB. Can anyone tell me why this might have occurred?

Here is an example of the data

{"title": "some title", "article_text": "a long text field containing text in an article", "publish_date": 43132132132, "popularity": 4.221}

If I put the data into an inverted index store like ES the data is significantly smaller (around 20GB). However, I want to use semantic search on the text so ES won't work for my use case.

I also want to potentially use Marqo's GPT integration which comes out of the box.

There are 2 answers

Answer by Tom (best answer)

Note that because Marqo uses semantic search the text data is inherently more expensive on storage. However, there are a few options you can explore to reduce the size of your index:

Option 1: Use non_tensor_fields. Tensor fields are encoded into collections of vectors, which take up significantly more storage than a regular inverted index. This is what lets Marqo apply semantic search, but it makes storage more costly. Any text or attributes that you don't need semantic search on can therefore be made non_tensor_fields, which means they are stored only for filtering and lexical search. Assuming you only want semantic search on article_text, you could do:

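import marqo

# Assumes a Marqo instance running locally on the default port
mq = marqo.Client(url="http://localhost:8882")

# Only article_text is vectorised; the other fields are kept for filtering and lexical search
mq.index("your-index").add_documents(
    [{"title": "some title", "article_text": "a long text field containing text in an article", "publish_date": 43132132132, "popularity": 4.221}],
    non_tensor_fields=["popularity", "publish_date", "title"],
)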

Option 2: Adjust your chunking strategy - check the settings you used to create the index.

Marqo chunks based on the number of sentences in a text field.

A larger split length means fewer vectors per text field. The example below uses a split length of 2 and a split overlap of 0.

index_settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 2,
            "split_overlap": 0,
            "split_method": "sentence"
        }
    }
}
mq.create_index("my-first-index", settings_dict=index_settings)

If we changed this to a split length of 2 with an overlap of 1, each chunk would start one sentence after the previous one instead of two. That roughly doubles the number of vectors for each article_text field and significantly increases the storage size of the index.
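The effect of these two settings on the vector count can be sketched with a simple sliding window. This is only an illustration: it assumes each chunk starts split_length - split_overlap sentences after the previous one, Marqo's actual splitter may handle edge cases differently, and chunk_sentences is a hypothetical helper, not a Marqo function.

```python
def chunk_sentences(sentences, split_length, split_overlap):
    """Group sentences into windows of split_length that share split_overlap sentences."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        # Stop once a window has reached the end of the text
        if start + split_length >= len(sentences):
            break
    return chunks


# A hypothetical article of 40 sentences; each chunk becomes one vector
article = [f"Sentence {i}." for i in range(40)]
print(len(chunk_sentences(article, split_length=2, split_overlap=0)))  # 20
print(len(chunk_sentences(article, split_length=2, split_overlap=1)))  # 39
print(len(chunk_sentences(article, split_length=6, split_overlap=0)))  # 7
```

Under these assumptions, adding an overlap of 1 nearly doubles the chunk count, while moving to a split length of 6 cuts it to about a third.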

You should check these settings and see if you can increase the split length (6, for example, can work) and reduce the split overlap.

Note that this comes with some drawbacks: a longer chunk is still summarised by a single vector, so specific details get diluted, and shorter overlapping chunks generally perform better when searching for specific information.

Note that, in the same vein, if you were to use image patching you would also get multiple vectors per image.

Answer by electric_hotdog

I think there are a couple of things to try. It will depend a little on the structure of the data. The best thing you can try is to adjust the splitting that happens for the text. See here https://marqo.pages.dev/0.0.10/Preprocessing/Text/ . For example, the default settings split the text by sentence with a split length of 2. Changing this to 5, 10 or 20 should be fine and will reduce storage roughly in proportion (going from 2 to 20 gives about a 7-10x reduction).

settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 4,
            "split_overlap": 0,
            "split_method": "sentence"
        },
    },
}

response = mq.create_index("my-multimodal-index", settings_dict=settings)
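As a back-of-the-envelope check on that reduction, the sketch below estimates the raw vector storage for one text field. It assumes one float32 vector per chunk, 384 dimensions, no overlap, and it ignores the overhead of the vector index itself. The function name and defaults are illustrative and not part of Marqo.

```python
import math


def estimated_vector_bytes(num_sentences, split_length, dimensions=384, bytes_per_value=4):
    """Rough raw size of the vectors for one text field, assuming no overlap."""
    num_chunks = math.ceil(num_sentences / split_length)
    return num_chunks * dimensions * bytes_per_value


for num_sentences in (100, 30):
    small_splits = estimated_vector_bytes(num_sentences, split_length=2)
    large_splits = estimated_vector_bytes(num_sentences, split_length=20)
    print(num_sentences, small_splits / large_splits)
# 100 sentences -> 10.0x fewer bytes, 30 sentences -> 7.5x fewer bytes
```

Shorter documents see a smaller reduction because the last chunk is only partly filled, which is consistent with the 7-10x range quoted above.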

One thing to note is that each split should fit within the context length of the model being used. The default text models mostly have a context length of 128 tokens, but many can go up to 512 (e.g. BERT-based ones). I think splitting at 10 or 20 sentences will be fine for the defaults, but the context length can be adjusted by specifying custom parameters in the model selection (see https://marqo.pages.dev/0.0.10/Models-Reference/dense_retrieval/#generic-models). For example, the context length of a default model can be increased as follows:

settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_length": 5,
            "split_overlap": 0,
            "split_method": "sentence"
        },
        "model": "unique-model-alias",
        "model_properties": {
            "name": "all_datasets_v4_MiniLM-L6",
            "dimensions": 384,
            "tokens": 256,
            "type": "sbert"
        },
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

which would double the context to 256 tokens. The mapping between tokens and words is not exact, but a common rule of thumb for English text is about 4 characters per token, or a little over one token per word on average. The final option would be to use a model with a lower embedding dimension. However, the space this saves is limited and there may not be many models to choose from.
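To check whether a given split length is likely to fit the model's context length, a rough estimate can be made without loading a tokenizer. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text; the real count depends on the model's tokenizer, and both helper names are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Heuristic token count; an assumption, not the model's real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(sentences, max_tokens=128, chars_per_token=4.0):
    """Return True if the joined sentences are estimated to fit in max_tokens."""
    chunk_text = " ".join(sentences)
    return estimate_tokens(chunk_text, chars_per_token) <= max_tokens


# Hypothetical sentences of about 100 characters each
sentence = "x" * 99
print(fits_context([sentence] * 5, max_tokens=128))  # True
print(fits_context([sentence] * 6, max_tokens=128))  # False
print(fits_context([sentence] * 6, max_tokens=256))  # True
```

For an exact answer, count tokens with the tokenizer of the model the index actually uses.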