Any way to improve the performance of querying a locally persisted ChromaDB collection?


I am quite new to vector databases. For a dataset containing the abstracts of 570K scientific publications, I created embeddings using a sentence transformer. Then, in ChromaDB, I created a collection and populated it with the embeddings along with their ids. I sent the embeddings in batches (40K per bulk add, the maximum ChromaDB allows me), and the collection below automatically created the folder and persisted the data at the given path.

import chromadb

client = chromadb.PersistentClient(path="/path/to/folder/chromadb")
collection = client.get_or_create_collection(name="Abstracts")
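
For reference, the ingestion loop looked roughly like this (the variable names all_ids and all_embeddings and the exact chunk size are illustrative, not my literal code):

# Illustrative ingestion: bulk-add the 570K embeddings in 40K chunks.
# all_ids (strings) and all_embeddings stand in for my precomputed data.
CHUNK = 40_000
for start in range(0, len(all_ids), CHUNK):
    collection.add(
        ids=all_ids[start:start + CHUNK],
        embeddings=all_embeddings[start:start + CHUNK])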

Now, for each embedding in the collection, I need to fetch its top-10 nearest neighbours so that I can calculate their cosine similarities. By the way, I have also tried to get cosine similarity directly by creating the collection with the cosine space:

collection = client.create_collection(
    name="Abstracts",
    metadata={"hnsw:space": "cosine"})

but the query still returns a distance rather than a cosine similarity.
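
If I understand correctly, with hnsw:space set to "cosine" the returned value is the cosine distance, so the similarity should be recoverable as 1 - distance; this is my assumption rather than something I have verified:

# Assumption: with hnsw:space="cosine", distance = 1 - cosine_similarity,
# so the similarity could be recovered from the query result directly.
# some_embedding is a placeholder for an actual query vector.
r = collection.query(query_embeddings=[some_embedding], n_results=10)
similarities = [1.0 - d for d in r['distances'][0]]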

I wrote the code below for my task:

import numpy as np
import pandas as pd
from multiprocessing import Pool
from sklearn.metrics.pairwise import cosine_similarity


def getl2dist(input_idx):

    df_ress = pd.DataFrame()

    for idx in list(input_idx):

        # Query the 10 nearest neighbours of this id's stored embedding.
        r = collection.query(
            query_embeddings=collection.get(ids=[str(idx)],
                                            include=['embeddings']).get('embeddings')[0],
            n_results=10)

        sim_id_list = r.get('ids')[0]

        for i in sim_id_list:

            # Fetch both embeddings again and compute their cosine similarity.
            simm = cosine_similarity(
                np.array(collection.get(ids=[str(idx)],
                                        include=['embeddings']).get('embeddings')[0]).reshape(1, -1),
                np.array(collection.get(ids=[str(i)],
                                        include=['embeddings']).get('embeddings')[0]).reshape(1, -1))[0][0]

            df_tmp = pd.DataFrame([idx, i, simm]).T
            df_tmp.columns = ['id1', 'id2', 'sim']

            df_ress = pd.concat([df_ress, df_tmp], axis=0, ignore_index=True)

    return df_ress



# Split the 50K ids into 50 chunks and fan them out over 20 worker processes.
pool = Pool(processes=20)
split_list = np.array_split(range(50000, 100000), 50)
prs = pool.map(getl2dist, split_list)
pool.close()
pool.join()
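
Afterwards I concatenate the per-chunk results into a single DataFrame (df_all is just my name for the combined result):

# Combine the per-chunk DataFrames returned by the worker processes.
df_all = pd.concat(prs, ignore_index=True)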

As seen, I use multiprocessing to reduce the processing time, but it seems to have almost no effect: processing 50K ids takes about 14 hours on my personal computer. Am I doing something fundamentally wrong here? Are there other, hopefully much more performant, approaches?
