Add data_type to mlbf to control filter/stash generation per data type #22775

Open
wants to merge 1 commit into base: master

Conversation

KevinMind
Contributor

@KevinMind KevinMind commented Oct 16, 2024

Relates to: mozilla/addons#15014

Description

Introduces the concept of a data_type to the MLBF class, allowing the executor to decide whether to create a filter/stash for "hard" or "soft" blocked versions.

This PR also modifies the internal data structure of MLBF, since it now needs to store both soft and hard blocks internally regardless of which type of filter you are actually trying to make.

Context

Adding "soft" blocking introduces some interesting complexity on the MLBF filter class. We now need to know the full state of blocked and soft_blocked versions in order to create either filter type, and we need to be able to create any combination of filters/stashes from a single data set depending on the current state. There are 2 really important edge cases for why this is a requirement:

  1. The cascading bloom filter we use is a binary data structure: items can be either blocked or not_blocked. However, in our new block data structure a version can be blocked, soft_blocked, or neither of those.

    This means that when creating the data for the soft_blocked filter, the not_blocked set needs to include both not blocked and hard blocked versions (since neither is soft blocked).

  2. When creating the stash, it is possible for versions to move from one type of block to another. That means a version can be included in the stash in two different states, one for each block type.

    For example, if a version moves from soft blocked to hard blocked, the stash for this state will include the version in both stashes. In the blocked stash, the version will be in "blocked", and in the soft_blocked stash, the version will be in "unblocked". So when creating these stashes it is vitally important that the data set used to create the blocked and soft blocked stashes is the same, otherwise you could end up with versions included in both data sets in invalid states.

So when we create a new "run" of the block list, we must create the MLBF class with a complete data set of blocked, soft_blocked and all versions at that moment. Then we can proceed to create filters and/or stashes as required for that snapshot of the data. This PR ensures that by including all valid sets in a single class instance, and by ensuring the data is queried and stored during initialization rather than when creating the filter/stash.
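To make the first edge case concrete, here is a minimal sketch of deriving the binary input for either filter from one snapshot (a hypothetical helper, not the actual MLBF code; the set arguments are assumptions):

def filter_sets(all_versions, blocked, soft_blocked, data_type):
    # Pick the set that counts as "blocked" for this filter type.
    include = blocked if data_type == 'blocked' else soft_blocked
    # Everything else is "not blocked" for this filter, including versions
    # carrying the *other* kind of block (e.g. hard blocks when building
    # the soft_blocked filter).
    exclude = all_versions - include
    return include, exclude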

Testing

The existing automated tests were modified to make this PR much easier to verify. No tests were removed in this PR; only the rough shape of some APIs was modified to accommodate specifying a specific block type.

Command to verify which versions are blocked for a given type:

BLOCK_TYPE should be either:

  • 0: hard blocks
  • 1: soft blocks
BlockVersion.objects.filter(soft=<BLOCK_TYPE>).values_list('block__guid', 'version__version', 'soft')
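For example, in a Django shell (soft appears to be a boolean field, so 0/1 correspond to False/True):

from olympia.blocklist.models import BlockVersion

# Hard blocks
BlockVersion.objects.filter(soft=False).values_list('block__guid', 'version__version', 'soft')
# Soft blocks
BlockVersion.objects.filter(soft=True).values_list('block__guid', 'version__version', 'soft')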

Scenario 1: Create a new hard block filter

Now you can test the output of a filter by running the export_blocklist management command.

./manage.py export_blocklist

This command already accepts a --block-type argument, but the only valid one for now is BLOCKED. Soft block will be added after this PR.
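For example, passing the block type explicitly:

./manage.py export_blocklist --block-type BLOCKED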

Expect a directory to be created in ./storage/mlbf matching either the parameter you specified for id or a timestamp.

Expect two files: cache.json, containing the blocked/not_blocked versions for that data type (in this case always hard blocked), and filter-blocked, where the suffix matches the data type (in the future it could be blocked or soft_blocked).

Expect the strings in the blocked property of cache.json to match the versions in the query above if you search for hard blocks.
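One way to spot-check this from a Django shell (a rough sketch: it assumes cache.json stores entries in the "<guid>:<version>" format shown in the stash example below, and <run_id> is the directory that was created):

import json

from olympia.blocklist.models import BlockVersion

with open('./storage/mlbf/<run_id>/cache.json') as f:  # replace <run_id>
    cache = json.load(f)

expected = {
    f'{guid}:{version}'
    for guid, version, _soft in BlockVersion.objects.filter(soft=False).values_list(
        'block__guid', 'version__version', 'soft'
    )
}
print(set(cache['blocked']) == expected)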

Scenario 2: create a stash

Now you need to modify one of the hard blocked versions to be soft blocked:

BlockVersion.objects.first().update(soft=True)

Next create another filter using the same command above.

Now open a Django shell so we can create a stash (make sure mlbf is the latest filter and previous is the first filter you created):
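If the classes aren't already imported, something like the following should work (the exact module path is an assumption):

from olympia.blocklist.mlbf import MLBF, MLBFDataType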

mlbf = MLBF.load_from_storage(<second_id>)
previous = MLBF.load_from_storage(<first_id>)

Verify you can diff the filters and see what has changed

mlbf.blocks_changed_since_previous(data_type=MLBFDataType.BLOCKED, previous_mlbf=previous)

The call should return 1 if you changed only one block_version, or however many versions you changed between the two filters.
You can also pass None for previous and get back the number of blocks there are currently.
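For example, to count the current blocks without comparing to a previous filter:

mlbf.blocks_changed_since_previous(data_type=MLBFDataType.BLOCKED, previous_mlbf=None)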

Now you can generate a stash

mlbf.generate_and_write_stash(data_type=MLBFDataType.BLOCKED, previous_mlbf=previous)

This should create a new directory matching the id of the mlbf, containing two files: cache.json and now stash-<block_type>.json, where the block type is "blocked" in this case.


The cache.json should contain the current set of hard blocked versions, and the block you made "soft" should now be in "not_blocked", since it is not hard blocked. I know it's kind of confusing.

The stash-blocked.json should now contain the block you made soft in its "unblocked" list:

{"blocked": [], "unblocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"]}

Once we have the logic for querying soft blocks, we will be able to create both stashes, which would unblock the version from the hard block list and block it in the soft block list.
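For illustration only, the pair of stashes for the example above might then look something like this (the soft stash file name is inferred from the soft_blocked suffix mentioned earlier; it is not produced by this PR):

stash-blocked.json:      {"blocked": [], "unblocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"]}
stash-soft_blocked.json: {"blocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"], "unblocked": []}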

Other tests

There are other edge cases you can test that should not really be impacted by this PR, like the cron job or task to upload a filter/stash to remote settings. These should probably be tested on dev; you can test them manually, but I'm not sure it is worth the effort.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind KevinMind changed the base branch from soft-block-configurable to master October 16, 2024 19:18
@KevinMind KevinMind force-pushed the soft-block-block-type branch 2 times, most recently from 029eea3 to f4bb5a9 Compare October 17, 2024 08:44
@KevinMind KevinMind marked this pull request as ready for review October 17, 2024 10:24
@KevinMind KevinMind requested review from a team and eviljeff and removed request for a team October 17, 2024 10:24
@KevinMind KevinMind changed the title Refactor MLBF in preparation for soft blocking Add data_type to mlbf to control filter/stash generation per data type Oct 17, 2024
Member

@eviljeff eviljeff left a comment

🤔

@@ -89,7 +92,7 @@ def _upload_mlbf_to_remote_settings(*, force_base=False):
         else base_filter
     )
 
-    changes_count = mlbf.blocks_changed_since_previous(previous_filter)
+    changes_count = mlbf.blocks_changed_since_previous(data_type, previous_filter)
     statsd.incr(
         'blocklist.cron.upload_mlbf_to_remote_settings.blocked_changed', changes_count
Member

We'll need an alternate statsd metric name for soft-block counts.

Member

similarly for everywhere else we send statsd pings, if this function/code-path is going to be called twice.
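One hypothetical way to do that (an illustration only, not code from this PR) would be to derive the metric suffix from the data type so hard and soft runs report to separate counters:

suffix = f'{data_type.name.lower()}_changed'  # 'blocked_changed' or, later, 'soft_blocked_changed'
statsd.incr(f'blocklist.cron.upload_mlbf_to_remote_settings.{suffix}', changes_count)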

)
return
else:
mlbf.generate_and_write_stash(data_type, previous_filter)
Member

so a FileNotFoundError can't happen any longer? (and we don't need to fall back to creating a new base?)

Contributor Author

For filters generated from the DB, the cache.json is guaranteed to exist on initialization, so it will raise before this point. For filters loaded from data, if the cache doesn't exist it will raise there, and it will be a catastrophic failure.

Contributor Author

We could consider keeping the try catch but it would have to be somewhere else.

@@ -101,6 +101,7 @@ def setUp(self):
         self.user = user_factory()
         self.addon = addon_factory()
         self.block = block_factory(guid=self.addon.guid, updated_by=self.user)
+        self.data_type = MLBFDataType.BLOCKED
Member

I guess we subclass the test class and run the same tests with the soft-block data_type, to confirm all tests pass in both cases? (i.e. define data_type in the class definition rather than in setUp)
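A rough sketch of that suggestion (hypothetical class names; a SOFT_BLOCKED member does not exist yet in this PR):

class TestUploadMLBF(TestCase):
    data_type = MLBFDataType.BLOCKED

    def setUp(self):
        super().setUp()
        ...


class TestUploadMLBFSoftBlocked(TestUploadMLBF):
    # Inherits and re-runs every test with the soft-block data type.
    data_type = MLBFDataType.SOFT_BLOCKED  # hypothetical future member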

Member

can you add a test for InvalidErrorRateException happening, to confirm it's handled as expected

Comment on lines -19 to -23
-            '--addon-guids-input',
-            help='Path to json file with [[guid, version],...] data for all '
-            'addons. If not provided will be generated from '
-            'Addons&Versions in the database',
-            default=None,
Member

This functionality isn't possible any longer? It was pretty useful in the past to be able to generate a bloom filter from known json blobs.

Contributor Author

I wanted to ask what the use case for that was. I think we should modify it to take a timestamp and load from storage instead of passing arbitrary lists for blocked/unblocked, or just always generate from the DB.

Member

@eviljeff eviljeff Oct 17, 2024

I think it was mainly used while mlbf were a new and exotic thing, and I needed to manually generate bloom filters to double check everything was being created as expected, without the added potential variance coming from building the values from the database. These days the library and implementation are pretty stable and well known, so I guess it's not really needed?

from olympia.blocklist.models import BlockVersion
class MLBFDataType(Enum):
    # The names must match the values in BlockVersion.BLOCK_TYPE_CHOICES
    BLOCKED = 'blocked'
Member

it... reads like you're overloading the meaning of BLOCKED to mean both blocked vs. not blocked in the bloom filter, and (hard-)blocked vs. soft-blocked... which seems confusing (and undesirable).

Comment on lines +128 to +133
# TODO: there is an edge case where you create a new filter from
# a previously used time stamp. This could lead to invalid files.
# A filter using the DB should either clear the storage files
# or raise to not allow reusing the same time stamp.
# It is possibly debatable whether you should be able to
# determine the created_at time as an argument at all.
Member

it's pretty unlikely to have the exact timestamp, no?

Contributor Author

It would lead to really corrupted data, but yeah, probably unlikely in practice. I just think it's odd to leave such a gaping hole to be covered only by the happenstance of time.


    @cached_property
    def cache_path(self):
        return self._file_path(name=self.CACHE_FILE, extension='json')
Member

self.CACHE_FILE doesn't exist in this class (it's in MLBF, that isn't a super or subclass, as far as I can see?)

Member

actually, is this method even used?

class BaseMLBFLoader:
    def __init__(self, storage: SafeStorage, _cache_path: str):
        self.storage = storage
        self._cache_path = _cache_path
Member

Is there an alternate name possible? ._cache_path and .cache_path are getting confusing for me.

@@ -72,178 +74,210 @@ def fetch_all_versions_from_db(excluding_version_ids=None):
     qs = Version.unfiltered.exclude(id__in=excluding_version_ids or ()).values_list(
         'addon__addonguid__guid', 'version'
     )
-    return list(qs)
+    return set(qs)
Member

a nit, I suppose, but I don't really understand why the return from fetch_all_versions_from_db has changed from a list to a set, but fetch_blocked_from_db has changed from a dict (where the keys would be a set) to a list. Shouldn't both be a set?

TMP: simplify export_blocklist command (default id and no more arbitrarily loading data)

TMP: small bug fix + more tests