Add data_type to mlbf to control filter/stash generation per data type #22775

Open
wants to merge 1 commit into base: master

Conversation

KevinMind
Contributor

@KevinMind KevinMind commented Oct 16, 2024

Relates to: mozilla/addons#15014

Description

Introduces the concept of a data_type to the MLBF class, allowing the executor to decide whether to create a filter/stash for "hard" or "soft" blocked versions.

This PR also modifies the internal data structure of MLBF, since it now needs to store both soft and hard blocks internally regardless of which type of filter you are actually trying to make.

Context

Adding "soft" blocking introduces some interesting complexity on the MLBF filter class. We now need to know the full state of blocked and soft_blocked versions in order to create either filter type, and we need to be able to create any combination of filters/stashes from a single data set depending on the current state. There are 2 really important edge cases for why this is a requirement:

  1. The cascading bloom filter we use is a binary data structure: items can be either blocked or not_blocked. However, in our new block data structure a version can be blocked, soft_blocked, or neither of those.

    This means that when creating the data for the soft_blocked filter, the not_blocked set needs to include both not blocked and hard blocked versions (since neither is soft blocked).

  2. When creating the stash, it is possible for versions to move from one type of block to another. That means a version can be included in the stash in two different states, one for each block type.

    For example, if a version moves from soft blocked to hard blocked, the stash for this state will include the version in both stashes. In the blocked stash, the version will be in "blocked", and in the soft_blocked stash, the version will be in "unblocked". So when creating these stashes it is vitally important that the data set used to create the blocked and soft blocked stashes is the same, otherwise you could end up with versions included in both data sets in invalid states.

So when we create a new "run" of the block list, we must create the MLBF class with a complete data set of blocked, soft_blocked and all versions at that moment. Then we can proceed to create filters and/or stashes as required for that snapshot of the data. This PR ensures that by including all valid sets in a single class instance, and by ensuring the data is queried and stored during initialization rather than when creating the filter/stash.
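To make the first edge case concrete, here is a minimal sketch of deriving the binary input for either filter from one snapshot (a hypothetical helper, not the actual MLBF code; the set arguments are assumptions):

def filter_sets(all_versions, blocked, soft_blocked, data_type):
    # Pick the set that counts as "blocked" for this filter type.
    include = blocked if data_type == 'blocked' else soft_blocked
    # Everything else is "not blocked" for this filter, including versions
    # carrying the *other* kind of block (e.g. hard blocks when building
    # the soft_blocked filter).
    exclude = all_versions - include
    return include, exclude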

Testing

The existing automated tests were modified to make this PR much easier to verify. No tests were removed in this PR; only the rough shape of some APIs was modified to accommodate specifying a specific block type.

Command to verify which versions are blocked for a given type:

BLOCK_TYPE should be either:

  • 0: hard blocks
  • 1: soft blocks
BlockVersion.objects.filter(soft=<BLOCK_TYPE>).values_list('block__guid', 'version__version', 'soft')
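For example, in a Django shell (soft appears to be a boolean field, so 0/1 correspond to False/True):

from olympia.blocklist.models import BlockVersion

# Hard blocks
BlockVersion.objects.filter(soft=False).values_list('block__guid', 'version__version', 'soft')
# Soft blocks
BlockVersion.objects.filter(soft=True).values_list('block__guid', 'version__version', 'soft')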

Scenario 1: Create a new hard block filter

Now you can test the output of a filter by running the export_blocklist management command.

./manage.py export_blocklist

This command already accepts a --block-type argument, but the only valid one for now is BLOCKED. Soft block will be added after this PR.
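For example, passing the block type explicitly:

./manage.py export_blocklist --block-type BLOCKED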

Expect a directory to be created in ./storage/mlbf matching either the parameter you specified for id or a timestamp.

Expect two files: cache.json, containing the blocked/not_blocked versions for that data type (in this case always hard blocked), and filter-blocked, where the suffix matches the data type (in the future it could be blocked or soft_blocked).

Expect the strings in the blocked property of cache.json to match the versions in the query above if you search for hard blocks.
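One way to spot-check this from a Django shell (a rough sketch: it assumes cache.json stores entries in the "<guid>:<version>" format shown in the stash example below, and <run_id> is the directory that was created):

import json

from olympia.blocklist.models import BlockVersion

with open('./storage/mlbf/<run_id>/cache.json') as f:  # replace <run_id>
    cache = json.load(f)

expected = {
    f'{guid}:{version}'
    for guid, version, _soft in BlockVersion.objects.filter(soft=False).values_list(
        'block__guid', 'version__version', 'soft'
    )
}
print(set(cache['blocked']) == expected)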

Scenario 2: create a stash

Now you need to modify one of the hard blocked versions to be soft blocked:

BlockVersion.objects.first().update(soft=True)

Next create another filter using the same command above.

Now open a Django shell so we can create a stash (make sure mlbf is the latest filter and previous is the first filter you created):
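If the classes aren't already imported, something like the following should work (the exact module path is an assumption):

from olympia.blocklist.mlbf import MLBF, MLBFDataType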

mlbf = MLBF.load_from_storage(<second_id>)
previous = MLBF.load_from_storage(<first_id>)

Verify you can diff the filters and see what has changed

mlbf.blocks_changed_since_previous(data_type=MLBFDataType.BLOCKED, previous_mlbf=previous)

The call should return 1 if you changed only one block_version, or however many versions you changed between the two filters.
You can also pass None for previous and get back the number of blocks there are currently.
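For example, to count the current blocks without comparing to a previous filter:

mlbf.blocks_changed_since_previous(data_type=MLBFDataType.BLOCKED, previous_mlbf=None)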

Now you can generate a stash

mlbf.generate_and_write_stash(data_type=MLBFDataType.BLOCKED, previous_mlbf=previous)

This should create a new directory matching the id of the mlbf, containing two files: cache.json and now stash-<block_type>.json, where the block type is "blocked" in this case.


The cache.json should contain the current set of hard blocked versions, and the block you made "soft" should now be in "not_blocked", since it is not hard blocked. I know it's kind of confusing.

The stash-blocked.json should now contain the block you made soft in its "unblocked" list:

{"blocked": [], "unblocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"]}

Once we have the logic for querying soft blocks, we will be able to create both stashes, which would unblock the version from the hard block list and block it in the soft block list.
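For illustration only, the pair of stashes for the example above might then look something like this (the soft stash file name is inferred from the soft_blocked suffix mentioned earlier; it is not produced by this PR):

stash-blocked.json:      {"blocked": [], "unblocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"]}
stash-soft_blocked.json: {"blocked": ["{0f1fbd83-d0fe-42ef-b6ce-365860d6ca10}:23609.517.8654.43211"], "unblocked": []}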

Other tests

There are other edge cases you can test that should not really be impacted by this PR, like the cron job or task to upload a filter/stash to remote settings. These should probably be tested on dev; you can test them manually, but I'm not sure it is worth the effort.

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind KevinMind changed the base branch from soft-block-configurable to master October 16, 2024 19:18
@KevinMind KevinMind force-pushed the soft-block-block-type branch 2 times, most recently from 029eea3 to f4bb5a9 Compare October 17, 2024 08:44
@KevinMind KevinMind marked this pull request as ready for review October 17, 2024 10:24
@KevinMind KevinMind requested review from a team and eviljeff and removed request for a team October 17, 2024 10:24
@KevinMind KevinMind changed the title Refactor MLBF in preparation for soft blocking Add data_type to mlbf to control filter/stash generation per data type Oct 17, 2024
Member

@eviljeff eviljeff left a comment

🤔

@@ -89,7 +92,7 @@ def _upload_mlbf_to_remote_settings(*, force_base=False):
         else base_filter
     )
 
-    changes_count = mlbf.blocks_changed_since_previous(previous_filter)
+    changes_count = mlbf.blocks_changed_since_previous(data_type, previous_filter)
     statsd.incr(
         'blocklist.cron.upload_mlbf_to_remote_settings.blocked_changed', changes_count
Member

We'll need an alternate statsd metric name for soft-block counts.

Member

similarly for everywhere else we send statsd pings, if this function/code-path is going to be called twice.
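One hypothetical way to do that (an illustration only, not code from this PR) would be to derive the metric suffix from the data type so hard and soft runs report to separate counters:

suffix = f'{data_type.name.lower()}_changed'  # 'blocked_changed' or, later, 'soft_blocked_changed'
statsd.incr(f'blocklist.cron.upload_mlbf_to_remote_settings.{suffix}', changes_count)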

)
return
else:
mlbf.generate_and_write_stash(data_type, previous_filter)
Member

so a FileNotFoundError can't happen any longer? (and we don't need to fall back to creating a new base?)

Contributor Author

For filters generated from the DB, the cache.json is guaranteed to exist on initialization, so it will raise before this point. For filters loaded from data, if the cache doesn't exist it will raise there, and it will be a catastrophic failure.

Contributor Author

We could consider keeping the try catch but it would have to be somewhere else.

@@ -101,6 +101,7 @@ def setUp(self):
         self.user = user_factory()
         self.addon = addon_factory()
         self.block = block_factory(guid=self.addon.guid, updated_by=self.user)
+        self.data_type = MLBFDataType.BLOCKED
Member

I guess we subclass the test class and run the same tests with the soft-block data_type, to confirm all tests pass in both cases? (i.e. define data_type in the class definition rather than in setUp)
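A rough sketch of that suggestion (hypothetical class names; a SOFT_BLOCKED member does not exist yet in this PR):

class TestUploadMLBF(TestCase):
    data_type = MLBFDataType.BLOCKED

    def setUp(self):
        super().setUp()
        ...


class TestUploadMLBFSoftBlocked(TestUploadMLBF):
    # Inherits and re-runs every test with the soft-block data type.
    data_type = MLBFDataType.SOFT_BLOCKED  # hypothetical future member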

Member

can you add a test for InvalidErrorRateException happening, to confirm it's handled as expected

Comment on lines -19 to -23
-            '--addon-guids-input',
-            help='Path to json file with [[guid, version],...] data for all '
-            'addons. If not provided will be generated from '
-            'Addons&Versions in the database',
-            default=None,
Member

This functionality isn't possible any longer? It was pretty useful in the past to be able to generate a bloom filter from known json blobs.

Contributor Author

I wanted to ask what the use case for that was. I think we should modify it to take a timestamp and load from storage instead of passing arbitrary lists for blocked/unblocked, or just always generate from the DB.

Member

@eviljeff eviljeff Oct 17, 2024

I think it was mainly used while mlbf were a new and exotic thing, and I needed to manually generate bloom filters to double check everything was being created as expected, without the added potential variance coming from building the values from the database. These days the library and implementation are pretty stable and well known, so I guess it's not really needed?

from olympia.blocklist.models import BlockVersion
class MLBFDataType(Enum):
    # The names must match the values in BlockVersion.BLOCK_TYPE_CHOICES
    BLOCKED = 'blocked'
Member

it... reads like you're overloading the meaning of BLOCKED to mean both blocked vs. not blocked in the bloom filter, and (hard-)blocked vs. soft-blocked... which seems confusing (and undesirable).

Comment on lines +128 to +133
# TODO: there is an edge case where you create a new filter from
# a previously used time stamp. This could lead to invalid files.
# A filter using the DB should either clear the storage files
# or raise to not allow reusing the same time stamp.
# It is possibly debatable whether you should be able to
# determine the created_at time as an argument at all.
Member

it's pretty unlikely to have the exact timestamp, no?

Contributor Author

It would lead to really corrupted data, but yeah, probably unlikely in practice. I just think it's odd to leave such a gaping hole to be covered only by the happenstance of time.


    @cached_property
    def cache_path(self):
        return self._file_path(name=self.CACHE_FILE, extension='json')
Member

self.CACHE_FILE doesn't exist in this class (it's in MLBF, that isn't a super or subclass, as far as I can see?)

Member

actually, is this method even used?

class BaseMLBFLoader:
    def __init__(self, storage: SafeStorage, _cache_path: str):
        self.storage = storage
        self._cache_path = _cache_path
Member

Is there an alternate name possible? ._cache_path and .cache_path are getting confusing for me.

@@ -72,178 +74,210 @@ def fetch_all_versions_from_db(excluding_version_ids=None):
     qs = Version.unfiltered.exclude(id__in=excluding_version_ids or ()).values_list(
         'addon__addonguid__guid', 'version'
     )
-    return list(qs)
+    return set(qs)
Member

a nit, I suppose, but I don't really understand why the return from fetch_all_versions_from_db has changed from a list to a set, but fetch_blocked_from_db has changed from a dict (where the keys would be a set) to a list. Shouldn't both be a set?

TMP: simplify export_blocklist command (default id and no more arbitrarily loading data)

TMP: small bug fix + more tests