I would like to store vector features, such as Bag-of-Words or Word-Embedding vectors for a large number of texts, in a dataset held in a SQL database. What are the data structures and best practices for saving and retrieving these features?
How to store Bag of Words or Embeddings in a Database
5.8k views Asked by Gheeroppa At
5
There are 5 answers
0
On
This depends on a number of factors, such as the specific SQL database you intend to use and how you store the embedding. For instance, PostgreSQL lets you store, query, and retrieve JSON values ( https://www.postgresqltutorial.com/postgresql-json/ ). Other options, such as SQLite, let you store string representations of JSON or pickled objects. That is fine for storage, but it makes querying the elements inside the vector difficult and, for pickled objects, impossible from SQL.
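As a minimal sketch of the SQLite option, assuming a hypothetical table named embeddings with a text_id key (neither name comes from the answer), a vector can be serialized to a JSON string on write and parsed again on read. The database treats the value as plain text here, so any work on individual elements happens in Python:

```python
import json
import sqlite3

# Hypothetical schema: one row per text, the embedding kept as a JSON string.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (text_id INTEGER PRIMARY KEY, vector TEXT NOT NULL)"
)


def save_vector(text_id, vector):
    """Serialize the vector to JSON and store (or overwrite) it."""
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (text_id, vector) VALUES (?, ?)",
        (text_id, json.dumps(vector)),
    )


def load_vector(text_id):
    """Return the stored vector as a list of floats, or None if absent."""
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE text_id = ?", (text_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


save_vector(1, [0.25, -0.5, 1.0])
restored = load_vector(1)
```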
0
On
There are databases that specialize in vector data for machine learning. Some examples:
- Milvus https://milvus.io/
- Weaviate https://weaviate.io/
- AquilaDB https://docs.aquila.network
- Pinecone https://www.pinecone.io/
1
On
Milvus is an open-source vector database built to power embedding similarity search and AI applications:
https://github.com/milvus-io/milvus
I am currently testing it.
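To make "embedding similarity search" concrete, here is a brute-force sketch in plain Python. It is not the Milvus API; the function names and the toy vectors are made up for illustration. A vector database answers the same kind of query, typically with an index so that it does not have to scan every vector:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k keys whose vectors are most similar to the query (full scan)."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Toy data, purely illustrative.
toy_vectors = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.2],
    "car": [0.0, 1.0, 0.0],
}
top = nearest([1.0, 0.0, 0.0], toy_vectors, k=2)
```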
0
On
Maybe with one of these:
- the AI-native open-source embedding database: https://www.trychroma.com
- or the open-source vector similarity search for Postgres: https://github.com/pgvector/pgvector
Word vectors should generally be stored as BLOBs if possible. If not, they can be stored as JSON arrays. Since the only reasonable operation for word vectors is to look them up by the word key, the other details don't particularly matter.
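A minimal sketch of the BLOB approach, assuming a hypothetical word_vectors table keyed by word and 32-bit floats packed with Python's standard array module (the table and function names are illustrative, not from the answer):

```python
import sqlite3
from array import array

# Hypothetical schema: the word is the lookup key, the vector is an opaque BLOB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_vectors (word TEXT PRIMARY KEY, vector BLOB NOT NULL)"
)


def put_vector(word, values):
    """Pack the values as 32-bit floats (native byte order) and store them."""
    blob = array("f", values).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO word_vectors (word, vector) VALUES (?, ?)",
        (word, blob),
    )


def get_vector(word):
    """Look a vector up by its word key; returns a list of floats or None."""
    row = conn.execute(
        "SELECT vector FROM word_vectors WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    values = array("f")
    values.frombytes(row[0])
    return values.tolist()


put_vector("king", [0.5, -1.25, 2.0])
king = get_vector("king")
```

Because the bytes are written in the machine's native byte order, this layout is only portable between machines with the same endianness. Values that are not exactly representable as 32-bit floats come back slightly rounded.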
For bag of words you would typically need three columns in SQLite: a document ID, the word, and its count.
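A minimal sketch of such a table, driven from Python's sqlite3 module. The table name bag_of_words, the column name word_count, and the whitespace tokenizer are assumptions made for illustration:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
# Three columns: which document, which word, and how often it occurs there.
conn.execute(
    """
    CREATE TABLE bag_of_words (
        doc_id     INTEGER NOT NULL,
        word       TEXT    NOT NULL,
        word_count INTEGER NOT NULL
    )
    """
)


def store_document(doc_id, text):
    """Count whitespace-separated, lower-cased tokens and insert one row per word."""
    counts = Counter(text.lower().split())
    conn.executemany(
        "INSERT INTO bag_of_words (doc_id, word, word_count) VALUES (?, ?, ?)",
        [(doc_id, word, n) for word, n in counts.items()],
    )


def load_document(doc_id):
    """Fetch the whole bag for one document as a {word: count} dict."""
    rows = conn.execute(
        "SELECT word, word_count FROM bag_of_words WHERE doc_id = ?", (doc_id,)
    )
    return dict(rows.fetchall())


store_document(1, "the cat sat on the mat")
bag = load_document(1)
```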
The document IDs would come from somewhere else. If you need to, you can make (doc_id, word) the key.
However, storing features like this in a SQL DB is generally not helpful. When you access word counts or word vectors you typically don't need a subset of them; you need them all at once, so the relational features of SQL don't buy you anything.