Any way to increase the performance of querying a locally persisted ChromaDB?


I am quite new to vector databases. For a dataset containing the abstracts of 570K scientific publications, I created embeddings using a sentence transformer. In ChromaDB, I then created a collection and populated it with the embeddings along with their IDs. When I sent the embeddings to the collection below (in batches of 40K, as allowed by ChromaDB), it automatically created the folder and persisted the data under the given path.

client = chromadb.PersistentClient(path="/path/to/folder/chromadb")
collection = client.get_or_create_collection(name="Abstracts")
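The bulk insert itself looked roughly like this (a sketch: embeddings stands for my precomputed vectors, and the IDs are assumed to be stringified row indices, matching the str(idx) lookups further below):

BATCH = 40_000  # the batch size mentioned above
for start in range(0, len(embeddings), BATCH):
    end = min(start + BATCH, len(embeddings))
    collection.add(
        ids=[str(i) for i in range(start, end)],  # assumed id scheme
        embeddings=embeddings[start:end],
    )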

Now, for each embedding in the collection, I need to fetch its top 10 nearest results so that I can calculate their cosine similarities. By the way, I have also tried to get cosine values directly by creating the collection with:

collection = client.create_collection(
    name="Abstracts",
    metadata={"hnsw:space": "cosine"},
)

but the query still returns a distance rather than a cosine similarity.
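If I understand the docs correctly, that is expected: with the cosine space, the reported value is the cosine distance, i.e. 1 - cosine_similarity, so the similarity should be recoverable directly from the query result (a sketch, with emb being any single query embedding):

r = collection.query(query_embeddings=[emb], n_results=10)
# 'distances' is included by default; with the cosine space each value is 1 - cos_sim
sims = [1.0 - d for d in r['distances'][0]]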

I wrote the code below for my task:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# `collection` is the persisted collection created above
def getl2dist(input_idx):

    df_ress = pd.DataFrame()

    for idx in list(input_idx):

        # stored embedding for this id, used both as the query and for the similarity
        emb = collection.get(ids=[str(idx)], include=['embeddings']).get('embeddings')[0]

        r = collection.query(query_embeddings=emb, n_results=10)

        sim_id_list = r.get('ids')[0]

        for i in sim_id_list:

            # fetch the neighbour's embedding and recompute the cosine similarity
            emb_i = collection.get(ids=[str(i)], include=['embeddings']).get('embeddings')[0]
            simm = cosine_similarity(np.array(emb).reshape(1, -1),
                                     np.array(emb_i).reshape(1, -1))[0][0]

            df_tmp = pd.DataFrame([idx, i, simm]).T
            df_tmp.columns = ['id1', 'id2', 'sim']

            df_ress = pd.concat([df_ress, df_tmp], axis=0, ignore_index=True)

    return df_ress



from multiprocessing import Pool

# split ids 50,000-99,999 into 50 chunks and fan them out over 20 worker processes
pool = Pool(processes=20)
split_list = np.array_split(range(50000, 100000), 50)
prs = pool.map(getl2dist, split_list)
pool.close()

As seen, I use multiprocessing to reduce the processing time, but it appears to have almost no effect: processing 50K ids takes about 14 hours on my personal computer. Is there something I am doing fundamentally wrong here? Are there other, hopefully far more performant, approaches? For example, would batching the queries as sketched below help?
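A sketch of what I mean, assuming the collection was created with {"hnsw:space": "cosine"} so that each returned distance equals 1 - cosine_similarity (the 1,000 batch size is arbitrary and not benchmarked):

import pandas as pd

ids = [str(i) for i in range(50000, 100000)]
rows = []
BATCH = 1000  # arbitrary batch size

for start in range(0, len(ids), BATCH):
    # fetch a batch of stored embeddings and query them all in one call
    got = collection.get(ids=ids[start:start + BATCH], include=['embeddings'])
    r = collection.query(query_embeddings=got['embeddings'], n_results=10)
    for q_id, nbr_ids, dists in zip(got['ids'], r['ids'], r['distances']):
        for i, d in zip(nbr_ids, dists):
            rows.append((q_id, i, 1.0 - d))  # similarity = 1 - cosine distance

df_ress = pd.DataFrame(rows, columns=['id1', 'id2', 'sim'])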
