-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data_type to mlbf to control filter/stash generation per data type #22775
base: master
Are you sure you want to change the base?
Conversation
029eea3
to
f4bb5a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
@@ -89,7 +92,7 @@ def _upload_mlbf_to_remote_settings(*, force_base=False): | |||
else base_filter | |||
) | |||
|
|||
changes_count = mlbf.blocks_changed_since_previous(previous_filter) | |||
changes_count = mlbf.blocks_changed_since_previous(data_type, previous_filter) | |||
statsd.incr( | |||
'blocklist.cron.upload_mlbf_to_remote_settings.blocked_changed', changes_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need an alternate statsd metric name for soft-block counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly for everywhere else we send statsd pings, if this function/code-path is going to be called twice.
) | ||
return | ||
else: | ||
mlbf.generate_and_write_stash(data_type, previous_filter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so a FileNotFoundError can't happen any longer? (and we don't need to fall back to creating a new base?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for filters generated from DB the cache.json is guaranteed to exist on initialiation so it will raise before this point. for filters loaded from data, if the cache doesn't exist it will raise there and it will be a catastrophic failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider keeping the try catch but it would have to be somewhere else.
@@ -101,6 +101,7 @@ def setUp(self): | |||
self.user = user_factory() | |||
self.addon = addon_factory() | |||
self.block = block_factory(guid=self.addon.guid, updated_by=self.user) | |||
self.data_type = MLBFDataType.BLOCKED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we subclass the test class and run the same tests with the soft-block data_type, to confirm all tests pass in both cases? (i.e. define data_type
in the class definition rather than in setUp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test for InvalidErrorRateException
happening, to confirm it's handled as expected
'--addon-guids-input', | ||
help='Path to json file with [[guid, version],...] data for all ' | ||
'addons. If not provided will be generated from ' | ||
'Addons&Versions in the database', | ||
default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality isn't possible any longer? It was pretty useful in the past to be able to generate a bloom filter from known json blobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to ask what was the use case for that. I think we should modify to give a timestamp and load from storage instead of passing arbitrary lists for blocked/unblocked. or just always generate from db.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was mainly used while mlbf were a new and exotic thing, and I needed to manually generate bloom filters to double check everything was being created as expected, without the added potential variance coming from building the values from the database. These days the library and implementation is pretty stable and known so I guess not really needed?
from olympia.blocklist.models import BlockVersion | ||
class MLBFDataType(Enum): | ||
# The names must match the values in BlockVersion.BLOCK_TYPE_CHOICES | ||
BLOCKED = 'blocked' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it... reads to be like you're overloading meaning of BLOCKED
to mean both blocked vs. not blocked in the bloom filter, and (hard-)blocked vs. soft-blocked ... which seems confusing (and undesirable).
# TODO: there is an edge case where you create a new filter from | ||
# a previously used time stamp. THis could lead to invalid files | ||
# a filter using the DB should either clear the storage files | ||
# or raise to not allow reusing the same time stamp. | ||
# it is possibly debatable whether you should be able to | ||
# determine the created_at time as an argument at all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's pretty unlikely to have the exact timestamp, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would lead to really corrupted data but yeah probably unlikely in practice. Just think it's odd to leave such a gaping whole to be covered onnly by the happenstance of time.
|
||
@cached_property | ||
def cache_path(self): | ||
return self._file_path(name=self.CACHE_FILE, extension='json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.CACHE_FILE
doesn't exist in this class (it's in MLBF, that isn't a super or subclass, as far as I can see?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, is this method even used?
class BaseMLBFLoader: | ||
def __init__(self, storage: SafeStorage, _cache_path: str): | ||
self.storage = storage | ||
self._cache_path = _cache_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternate name possible? ._cache_path
and .cache_path
are getting confusing for me.
@@ -72,178 +74,210 @@ def fetch_all_versions_from_db(excluding_version_ids=None): | |||
qs = Version.unfiltered.exclude(id__in=excluding_version_ids or ()).values_list( | |||
'addon__addonguid__guid', 'version' | |||
) | |||
return list(qs) | |||
return set(qs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit, I suppose, but I don't really understand why the return from fetch_all_versions_from_db
has changed from a list to a set, but fetch_blocked_from_db
has changed from a dict (where the keys would be a set) to a list. Shouldn't both be a set?
TMP: simplify export_blocklist command (default id and no more arbitrarily loading data) TMP: small bug fix + more tests
e5ca806
to
abd81b7
Compare
Relates to: mozilla/addons#15014
Description
Introduces the concept of a
data_type
to theMLBF
class allowing the executor to decide if we should create a filter/stash for "hard" or "soft" blocked versions.This PR also modifies the internal data structure of
MLBF
since it now needs to store both soft/hard blocks internally regardless of which type of filter you are actually trying to make.Context
Adding "soft" blocking introduces some interesting complexity on the
MLBF
filter class. We now need to know the full state of blocked and soft_blocked versions in order to create either filter type, and we need to be able to create any combination of filters/stashes from a single data set depending on the current state. There are 2 really important edge cases for why this is a requirement:the cascading-bloom-filter we use is a binary data structure. Items can be either
blocked
ornot_blocked
. However in our new block data structure a version can beblocked
,soft_blocked
or neither of those.This means when creating the data for the filter for
soft_blocked
versions, thenot_blocked
set needs to include both not blocked and blocked (since they are not soft blocked)when creating the stash it is possible for versions to move from one type of block to another. That means that a version can be included in 2 different states on the stash for both block types.
For example, if a version moves from soft blocked, to hard blocked, the stash for this state will include the version for both stashed. IN the blocked stash, the version will be in "blocked" and for the soft_blocked stash, the version will be in "unblocked". So when creating these stashes it is vitally important that the data set used to create blocked and soft blocked is the same, otherwise you could end up with versions included in both data sets in invalid states.
So when we create a new "run" of the block list, we must create the MLBF class with a complete data set of blocked, soft_blocked and all versions at that moment. Then we can proceed to create filters and or stashes as required for that snapshot of the data. This PR ensures that by including all valid sets in the single class instance and also ensuring the data is queried and stored during initialization and not when creating the filter/stash.
Testing
The existing automated tests were modified in PR to make this PR much easier to verify. No tests were removed in this PR only some of the rough shape of the APIs were modified to accommodate specifying a specific block type.
Command to verify which versions are blocked for a given type:
BLOCK_TYPE
should be either:0
: hard blocks1
: soft blocksScenario 1: Create a new hard block filter
Now you can test the output of a filter by running the
export_blocklist
management command.Expect a directory to be created in
./storage/mlbf
matching either the parameter you specified forid
or a timestamp.Expect two files
cache.json
containing the blocked/not_blocked versions for that data type (in this case always hard blocked) andfilter-blocked
where the suffix matches the data type (in the future it could be blocked or soft_blocked)Expect the strings in the
blocked
property of cache.json to match the versions in the query above if you search for hard blocks.Scenario 2: create a stash
Now you need to modify one of the hard blocked versions to be soft blocked
Next create another filter using the same command above.
Now, open a django shell and we can create a stash (make sure mlbf is the latest filter and previous is the first filter you created)
Verify you can diff the filters and see what has changed
The value should return 1 if you only changed on block_version or however many you changed between the 2 filters.
You can also pass None for previous and get back the number of blocks there are currently
Now you can generate a stash
This should create a new directory matching the id of the
mlbf
and should contain 2 filescache.json
and nowstash-<block_type>.json
where block type is "blocked" in this case.The cache.json should contain the current set of hard blocked versions and now the block you made "soft" should be in "not_blocked" since it is NOT, hard, blocked. I know it's kind of confusing.
The stash-blocked.json should now contain in the "unblocked" the block you made soft
Once we have the logic for querying soft blocks we will be able to create both stashes which would unblock from the hard blocked list and block in the soft block list.
Other tests
There are other edge cases you can test, that should not really be impacted by this PR, like the cron job or task to upload a filter/stash to remote settings. These should probably be tested on dev. You can test them manually but I'm not sure it is worth the effort.
Checklist
#ISSUENUM
at the top of your PR to an existing open issue in the mozilla/addons repository.