I have a use case where I need to dynamically exclude certain vectors based on specific criteria before performing a similarity search using Faiss. I have explored the Faiss GitHub repository and came across an issue that is closely related to my requirement. One of the responses highlighted that directly filtering the vectors might negatively impact performance. Instead, they suggested using an alternative solution involving IDSelector. I would like to try filtering the vectors using IDSelector.
Here is a description of my use case:
Let's consider a table with 1000 records, where each record corresponds to a row in an SQL table. Each record has fields such as messageId, message, and communityId. These fields allow us to group the messages based on the community ID. Now, I have indexed all 1000 records. During the similarity search, when a query arises from a specific community, I want to search only within the relevant community rather than across all communities.
The workaround suggested in that issue is:
"Filtering must be based on the vector IDs." However, I'm unsure about the process of incorporating the community ID into the vector ID and how to effectively filter the records based on the community ID.
What I have done so far:
import os

import faiss
import numpy as np
from langchain.embeddings import OpenAIEmbeddings  # assumed import path (LangChain)

os.environ['OPENAI_API_KEY'] = 'key'

messages = ["Hello, world!", "How are you?", "Greetings!", "How are you"]
community_ids = [1, 2, 3, 1]

# Custom selector intended to keep only the vectors of community 1
class CommunityIDSelector(faiss.IDSelector):
    def __init__(self):
        pass

    def is_member(self, id):
        return community_ids[id] == 1

id_selector = CommunityIDSelector()

# Embed every message
embs = []
embeddings = OpenAIEmbeddings()
for message in messages:
    embs.append(embeddings.embed_query(text=message))

# Append the community ID to each embedding as one extra dimension
vectors = np.array(embs)
metadata = np.array(community_ids)
concatenated_vectors = np.concatenate((vectors, metadata[:, np.newaxis]), axis=1)

index = faiss.IndexFlatL2(concatenated_vectors.shape[1])
index.add(concatenated_vectors)

target_community_id = 1
query_vector = np.array([embeddings.embed_query(text='I am good')])
# Prepare query vector with target communityId
query_metadata = np.array([[target_community_id]])
concatenated_query = np.concatenate((query_vector, query_metadata), axis=1)

k = 3
# Attempt to pass the selector directly as a third argument to search
distances, indices = index.search(concatenated_query, k, id_selector)
print("Filtered Messages:")
Questions:
- How can I include the community ID in the vector ID?
- How can I filter the records based on the community ID?
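
For reference, here is a minimal sketch of one way the community ID could be packed into a 64-bit vector ID, so that filtering by community becomes filtering by an ID range. The helpers pack_id, unpack_id and community_id_range are hypothetical names, not part of Faiss, and the sketch assumes message IDs fit in 32 bits and community IDs in 30 bits so that every packed ID stays inside a signed 64-bit integer.

```python
from typing import Tuple

# Hypothetical helpers (not part of Faiss): pack a community ID and a
# message ID into one non-negative integer that fits in a signed int64.
MESSAGE_BITS = 32
MESSAGE_MASK = (1 << MESSAGE_BITS) - 1
COMMUNITY_BITS = 30  # assumption: keeps the range upper bound within int64


def pack_id(community_id: int, message_id: int) -> int:
    """High bits hold the community ID, low 32 bits hold the message ID."""
    if not 0 <= message_id <= MESSAGE_MASK:
        raise ValueError("message_id must fit in 32 bits")
    if not 0 <= community_id < (1 << COMMUNITY_BITS):
        raise ValueError("community_id must fit in 30 bits")
    return (community_id << MESSAGE_BITS) | message_id


def unpack_id(vector_id: int) -> Tuple[int, int]:
    """Recover (community_id, message_id) from a packed vector ID."""
    return vector_id >> MESSAGE_BITS, vector_id & MESSAGE_MASK


def community_id_range(community_id: int) -> Tuple[int, int]:
    """Half-open range [lo, hi) covering every packed ID of one community."""
    lo = community_id << MESSAGE_BITS
    return lo, lo + (1 << MESSAGE_BITS)


# Same toy data as above: the list position is used as the message ID.
community_ids = [1, 2, 3, 1]
vector_ids = [pack_id(c, m) for m, c in enumerate(community_ids)]

# Keep only the message IDs whose packed vector ID falls in community 1.
lo, hi = community_id_range(1)
in_community_1 = [unpack_id(v)[1] for v in vector_ids if lo <= v < hi]
```

If this scheme is used with Faiss, the packed IDs could presumably be assigned with add_with_ids on a faiss.IndexIDMap wrapping the flat index, and the range handed to faiss.IDSelectorRange(lo, hi) through faiss.SearchParameters(sel=...). That part is an assumption that depends on a Faiss version recent enough to support search-time selectors, and it is not tested here.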