[pytx] Implement a new cleaner PDQ index solution #1695

Closed
wants to merge 4 commits into from

Conversation

haianhng31
Contributor

@haianhng31 haianhng31 commented Nov 11, 2024

Summary

Resolves issue #1613.
This PR introduces a new SignalTypeIndex2 class for managing and querying a PDQ hash-based index using FAISS. Key changes include:

  • _PDQHashIndex: A wrapper around the FAISS index to handle the serialization/deserialization of PDQ hashes, along with methods for adding hashes and performing searches using the FAISS library.
  • SignalTypeIndex2: A class for managing a PDQ index, which includes:
    • The ability to add PDQ hashes and associate them with entries.
    • A query method for searching the index for matching entries based on a query hash.
    • A serialize and deserialize method for persisting and loading the index from binary streams using pickle.
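
One reason an L2-based FAISS index can serve PDQ lookups: if each 256-bit hash is unpacked into a vector of 0.0/1.0 values, the squared L2 distance between two vectors equals the Hamming distance between the hashes. The sketch below illustrates this in plain Python. The helper names are hypothetical, and whether `_PDQHashIndex` unpacks hashes exactly this way is an assumption, not something the PR description states.

```python
import typing as t

BITS_IN_PDQ = 256  # a PDQ hash is 256 bits, written as 64 hex characters


def pdq_hex_to_bits(pdq_hex: str) -> t.List[float]:
    """Unpack a 64-character hex PDQ hash into 256 floats, each 0.0 or 1.0."""
    if len(pdq_hex) != BITS_IN_PDQ // 4:
        raise ValueError("expected 64 hex characters")
    value = int(pdq_hex, 16)
    # Most significant bit first, so position i matches bit i of the hex string.
    return [float((value >> (BITS_IN_PDQ - 1 - i)) & 1) for i in range(BITS_IN_PDQ)]


def squared_l2(a: t.List[float], b: t.List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hamming(a_hex: str, b_hex: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")
```

Under this encoding, a threshold expressed in Hamming distance can be compared directly against the squared distances that an L2 index reports.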

Test Plan

I have included several test cases for this, currently all in one file:

  • test for correct initialization
  • test serialize deserialize (with & without custom index)
  • search functionality
    • test empty index query -> return empty array
    • test sample with 1 exact match
    • test sample with 1 near exact match
    • test distance threshold behavior
    • test duplicate handling

@Dcallies
Contributor

Sorry I didn't get to this today, will check first thing tomorrow!

Contributor

@Dcallies Dcallies left a comment


blocking: #1613 is about creating a new PDQ index implementation which follows the SignalType interface, but this PR creates a new SignalTypeIndex interface entirely, which is surprising.

Is this an intentional change, and if so, can you walk me through why we should go this route rather than a more tailored approach: a second, cleaned-up PDQ index implementation of the SignalTypeIndex interface, living in https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange/threatexchange/signal_type/pdq ?

As part of this, can you include tests that show that the bug in #1318 is solved?

PDQIndexMatch = IndexMatchUntyped[SignalSimilarityInfoWithIntDistance, IndexT]


class SignalTypeIndex2(t.Generic[T]):
Contributor


blocking q: Why are we creating this class? Issue #1613 is about creating a new PDQ index, but this creates a new interface for the overall index class that assumes faiss compatibility, which may not hold for every signal type.

faiss_index: t.Optional[faiss.Index] = None,
) -> None:
"""
Initialize the PDQ index.
Contributor


This refers to PDQ, but you've put it at the top level class.

return np.array(hash_arrays, dtype=np.float32)


Self = t.TypeVar("Self", bound="SignalTypeIndex2")
Contributor


unused?

    self._entries.append([entry])
else:
    # If hash exists, append entry to existing entries
    idx = list(self._deduper).index(pdq_hash)
Contributor


This is a very expensive call! O(n) to convert _deduper into a list, then O(n) to find which index is in the list. I also don't think it will be stable, since the set is not ordered!
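
A minimal sketch of a constant-time alternative, assuming the goal is to recover the position a hash was given when it was first inserted: keep a dict from hash to position and fill it at insertion time. The names `position_for` and `positions` are hypothetical and do not come from the PR.

```python
import typing as t


def position_for(positions: t.Dict[str, int], pdq_hash: str) -> int:
    """Return the position recorded for pdq_hash, assigning the next one if new.

    Each call is a single dict lookup (average O(1)), and the position never
    changes once assigned, unlike list(some_set).index(pdq_hash).
    """
    existing = positions.get(pdq_hash)
    if existing is not None:
        return existing
    position = len(positions)  # positions are handed out as 0, 1, 2, ...
    positions[pdq_hash] = position
    return position
```

Because the position is stored rather than derived from iteration order, it stays aligned with any parallel list that is appended to at the same moment.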

@haianhng31
Contributor Author

I appreciate your feedback on this PR. To be honest, I was a bit careless at the start and didn't fully realize that creating a new SignalTypeIndex changed the whole interface instead of taking a more tailored approach.

Now that this is cleared up, I will provide a second, cleaned-up PDQ index implementation that adheres to the existing SignalTypeIndex interface. I'll be sure to include the needed tests as well!

@haianhng31 haianhng31 closed this Nov 13, 2024
faiss_index = faiss.IndexFlatL2(DIMENSIONALITY)
self.faiss_index = _PDQHashIndex(faiss_index)
self.threshold = threshold
self._deduper: t.Set[str] = set()
Contributor

@Dcallies Dcallies Nov 13, 2024


From the meeting:

Should store this as a mapping from hash => faiss_id

add(h, payload1)
add(h, payload2)

deduper: dict[hash, faiss_id] = {}
entries: list[list[payload]] = []

existing_id = deduper.get(h)
if existing_id is not None:
    # Don't add to faiss!
    # we add to our internal entry mapping
    entries[existing_id].append(payload)
else:
    # faiss ids are assigned sequentially from 0, so the next id is the current size
    next_id = len(deduper)
    faiss.add(h)
    entries.append([payload])
    deduper[h] = next_id

///
lookup(h) -> payloads
    faiss_id = faiss.search(h)
    entries[faiss_id]

query(h) -> [payload1, payload2]
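
The outline above can be sketched as runnable Python. This is an illustration under stated assumptions, not the implementation that was eventually written: the class name `DedupingIndexSketch` is hypothetical, and a brute-force Hamming scan stands in for the FAISS search so that the example has no third-party dependencies.

```python
import typing as t


class DedupingIndexSketch:
    """Hypothetical stand-in: a brute-force Hamming scan replaces FAISS here."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._deduper: t.Dict[str, int] = {}  # hash -> id, ids run 0 .. size - 1
        self._hashes: t.List[int] = []  # id -> hash as int (the "FAISS" side)
        self._entries: t.List[t.List[t.Any]] = []  # id -> payloads for that hash

    def add(self, pdq_hex: str, payload: t.Any) -> None:
        key = pdq_hex.lower()  # treat upper- and lower-case hex as the same hash
        existing_id = self._deduper.get(key)
        if existing_id is not None:
            # Known hash: do not add it to the search structure again.
            self._entries[existing_id].append(payload)
            return
        next_id = len(self._deduper)
        self._hashes.append(int(key, 16))
        self._entries.append([payload])
        self._deduper[key] = next_id

    def query(self, pdq_hex: str) -> t.List[t.Any]:
        """Return the payloads of every stored hash within the threshold."""
        needle = int(pdq_hex, 16)
        found: t.List[t.Any] = []
        for idx, stored in enumerate(self._hashes):
            if bin(needle ^ stored).count("1") <= self.threshold:
                found.extend(self._entries[idx])
        return found
```

Adding the same hash twice with `payload1` and `payload2` stores one hash and two payloads, so a query for that hash (or a near neighbour within the threshold) returns both payloads from a single stored id.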

@haianhng31 haianhng31 reopened this Nov 14, 2024
@haianhng31 haianhng31 closed this Nov 14, 2024
3 participants