I have indexed 10 GB of data into a Marqo index, and the index is now over 100 GB. Can anyone tell me why this might have occurred?
Here is an example of the data:

```json
{"title": "some title", "article_text": "a long text field containing text in an article", "publish_date": 43132132132, "popularity": 4.221}
```
If I put the data into an inverted-index store like Elasticsearch, the index is significantly smaller (around 20 GB). However, I want to use semantic search on the text, so ES won't work for my use case.
I would also like the option of using Marqo's GPT integration, which comes out of the box.
Note that because Marqo uses semantic search, text data is inherently more expensive to store. However, there are a few options you can explore to reduce the size of your index:
Option 1: Use non_tensor_fields. Tensor fields are encoded into collections of vectors, so they take up significantly more storage than a regular inverted index. This is what lets Marqo apply semantic search, but it is costly on storage. Any text or attributes that you don't want to search semantically can be made non_tensor_fields, which means they are stored only for filtering and lexical search. Assuming you only want to do semantic search on article_text, you could do the following:
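A minimal sketch using the Python client, assuming a local Marqo instance and a placeholder index name. Note that the argument name depends on your client version: older releases take non_tensor_fields on add_documents, while newer releases use the inverse, tensor_fields.

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")  # assumed local Marqo endpoint

# Only article_text is tensor-encoded. The other fields are stored for
# filtering and lexical search, so no vectors are generated for them.
mq.index("my-index").add_documents(
    [
        {
            "title": "some title",
            "article_text": "a long text field containing text in an article",
            "publish_date": 43132132132,
            "popularity": 4.221,
        }
    ],
    non_tensor_fields=["title", "publish_date", "popularity"],
)
```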
Option 2: Adjust your chunking strategy. Check the settings you used to create the index.
Marqo chunks a text field based on the number of sentences it contains.
A larger split length means fewer chunks, and therefore fewer vectors, per field. The example below uses a split length of 2 and a split overlap of 0.
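A sketch of index creation with these settings, using the older settings_dict layout (the exact settings format can differ between Marqo versions; the URL and index name are placeholders):

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")  # assumed local Marqo endpoint

index_settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_method": "sentence",  # chunk text by sentences
            "split_length": 2,           # 2 sentences per chunk -> one vector each
            "split_overlap": 0,          # consecutive chunks share no sentences
        }
    }
}
mq.create_index("my-index", settings_dict=index_settings)
```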
If we changed this to a split length of 2 with an overlap of 1, each chunk would start one sentence after the previous one (a stride of split length minus split overlap, i.e. 1 instead of 2), so we would effectively double the number of vectors for each article_text field, significantly increasing the storage size of the index.
You should check these settings and see whether you can increase the split length (6 can work well, for example) and reduce the split overlap, as in the sketch below.
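For example, a variant of the settings above that produces roughly a third as many vectors per field (same assumptions as before):

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")  # assumed local Marqo endpoint

index_settings = {
    "index_defaults": {
        "text_preprocessing": {
            "split_method": "sentence",
            "split_length": 6,   # 6 sentences per chunk -> ~3x fewer vectors than length 2
            "split_overlap": 0,  # no overlap -> no duplicated sentences across chunks
        }
    }
}
mq.create_index("my-index", settings_dict=index_settings)
```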
Note that this comes with a trade-off: the longer a chunk, the more the model's attention is spread across it, so shorter, overlapping chunks generally perform better when searching for specific information.
In the same vein, note that if you were to use image patching, you would also get multiple vectors per image.
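For reference, a sketch of how image patching is enabled in the same settings format; treat the model name and patch method as illustrative choices rather than the only options:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")  # assumed local Marqo endpoint

index_settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,  # index images from URLs
        "model": "ViT-B/16",                        # a CLIP model that handles images
        "image_preprocessing": {
            "patch_method": "simple",  # each image patch becomes its own vector
        },
    }
}
mq.create_index("my-image-index", settings_dict=index_settings)
```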