-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pytx] Implement a new cleaner PDQ index solution #1695
Conversation
Sorry I didn't get to this today, will check first thing tomorrow! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: #1613 is about creating a new PDQ index implementation which follows the SignalType interface, but this PR creates a new SignalTypeIndex interface entirely, which is surprising.
Is this an intentional change, and if so, can you walk me through why we should should go this route than a more tailored approach to provide a second, cleaned up PDQ index implementation of the SignalTypeIndex interface which lives in https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange/threatexchange/signal_type/pdq ?
As part of this, can you include tests that show that the bug in #1318 is solved?
PDQIndexMatch = IndexMatchUntyped[SignalSimilarityInfoWithIntDistance, IndexT] | ||
|
||
|
||
class SignalTypeIndex2(t.Generic[T]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blockin q: Why are we creating this class? Issue #1613 is about creating a new PDQ index, but this is creating a new interface for the overall index class which assumes faiss compatibility, which may not be true for every signal type.
faiss_index: t.Optional[faiss.Index] = None, | ||
) -> None: | ||
""" | ||
Initialize the PDQ index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refers to PDQ, but you've put it at the top level class.
return np.array(hash_arrays, dtype=np.float32) | ||
|
||
|
||
Self = t.TypeVar("Self", bound="SignalTypeIndex2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused?
self._entries.append([entry]) | ||
else: | ||
# If hash exists, append entry to existing entries | ||
idx = list(self._deduper).index(pdq_hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very expensive call! O(n) to convert _deduper into a list, then O(n) to find which index is in the list. I also don't think it will be stable, since the set is not ordered!
I appreciate your feedback on this PR. To be honest, I was a bit uncareful at the start and didn't fully realize that creating a new SignalTypeIndex was changing the whole interface, rather than a more tailored approach. But since now that is cleared up, moving forward, I will provide a second, cleaned up PDQ index implementation that adheres to the existing SignalTypeIndex interface. I'll be sure to include needed tests as well! |
faiss_index = faiss.IndexFlatL2(DIMENSIONALITY) | ||
self.faiss_index = _PDQHashIndex(faiss_index) | ||
self.threshold = threshold | ||
self._deduper: t.Set[str] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the meeting:
Should store this as a mapping from hash => faiss_id
add(h, payload1)
add(h, playload2)
deduper: dict[hash, faiss_id] = dict[]
entries: list[int, payload] = list[]
existing_id = deduper.get(new_hash)
if existing_id is not None:
# Don't add to faiss!
# we add to our internal entry mapping
entries[existing_id].append(payload2)
else:
# faiss id is 0 -> size
next_id = len(deduper)
faiss.add(h)
entries.append([payload])
deduper[h] = next_id
///
lookup(h) -> payloads
faiss_id = faiss.search(h)
entries[faiss_id]
query(h) -> [payload1, payload2]
Summary
Resolve issue #1613 .
This PR introduces a new SignalTypeIndex2 class for managing and querying a PDQ hash-based index using FAISS. Key changes include:
Test Plan
I have included several test cases for this, currently all in one file: