-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancements for hashed bin delegation #1007
Enhancements for hashed bin delegation #1007
Conversation
Add test to ensure delegated bin names are consistent with the hash prefixes that are delegated to the role. This is an implicit assumption of the current implementation, the testing of which will enable us to modify the code with greater confidence. Signed-off-by: Joshua Lock <[email protected]>
As we are adding and removing items from the hashed bins and checking for their presence/absence it's simplest if we being with the hashed bins initially empty. If we pass a list of targets when we call delgate_hashed_bins() the delegated roles have an initial set of targets delegated to them, which complicates testing of adding then removing a target to a delegated bin. Signed-off-by: Joshua Lock <[email protected]>
This helper will generate a hash of the passed target_filepath using the algorithm defined by the repository_tool's HASH_FUNCTION variable. Signed-off-by: Joshua Lock <[email protected]>
# targets directory. | ||
targets_directory_length = len(self._targets_directory) + 1 | ||
roleinfo = tuf.roledb.get_roleinfo(self._rolename, self._repository_name) | ||
relative_path = filepath[targets_directory_length:].replace('\\', '/') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (can probably be ignored in this PR): It might be good to add (or use) a path canonicalization method instead of calling replace('\\', '/')
throughout the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this would be good cleanup.
@lukpueh any thoughts on whether this belongs in securesystemslib or in tuf? Perhaps as part of a new utils module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! #1008 (review) has some thoughts about path normalization. The TLDR is, if the caller passes a bad path,
- either fail,
- or warn and normalize
In both cases this should happen once at the boundary where a path is added to the system, and thus remove the need for many replace('\\', '/')
throughout the code.
I've pushed a couple of extra patches to this branch, one to address a review nit from @woodruffw and another to address #1004 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is impeccable work, @joshuagl. I only have one blocking comment, which is amazing for a PR of that size. That comment starts with "Important:". All the other comments are just sugar, so that you see I actually reviewed this. :P
Many many thanks!
tuf/repository_tool.py
Outdated
prefix_len = len("{:x}".format(number_of_bins - 1)) | ||
prefix_count = 16 ** prefix_len | ||
|
||
if prefix_count % number_of_bins != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: Maybe it's worth a comment that x % y != 0
does not guarantee that y
is not a power of two for any x
and y
, but due to the relationship between number_of_bins
and prefix_count
it is true for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, I've updated this patch to include a comment
tuf/repository_tool.py
Outdated
total_hash_prefixes = 16 ** prefix_length | ||
bin_size = total_hash_prefixes // number_of_bins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: Maybe above three lines together with the "power of two" check below are worth a helper, to be used here and in _find_bin_for_hash()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the code uses all three of the variables (though one only in a comment iirc) and I'm not sure about adding a function that returns multiple values, so I've left this suggestion for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should also be a comment that this particular LoC is rescaling the index. I didn't really understand what was going on until I looked above to see what number_of_bins
is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with leaving it as it is, although it is the same code (with different names and order) in both cases.
delegate_hashed_bins
:
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2504-L2513
_find_bin_for_hash
:
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2768-L2777
And I personally see no problems with idioms like
pefix_length, prefix_count, bin_size = _get_bin_numbers(number_of_bins)
Also, and this was part of my motivation, we'd only need one place to properly comment on this math-magic (thanks for pointing that out, @trishankatdatadog).
Btw., @trishankatdatadog, I wouldn't say that this particular LoC rescales the index. Did you maybe mean this one...?
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2547
Yes, the general idea here is that we scale 0..total_hash_prefixes
to 0..number_of_bins
. But I wouldn't say that the particular LoC pointed out by you does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convincing points. I will factor out this into a helper now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helper added, with lots of comments, in ec68416
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukpueh Yes, I meant that line, the one with the integer division
fileinfo: | ||
An optional fileinfo object, conforming to tuf.formats.FILEINFO_SCHEMA, | ||
providing full information about the file. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we said that the only reason to have custom
and fileinfo
is to support backwards compatibility (as the former can be passed within the latter) and add_target_to_bin
didn't support this before, I think it is safe to not add custom
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that my suggestion comes at the cost of consistency between add_target
and add_target_to_bin
. I am fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. As we plan to deprecate the custom
parameter it doesn't make sense to add it to add_target_to_bin
. I've added a follow-on patch in 5ea1f63, though maybe I should have squashed it.
# Note: join() discards 'targets_directory' if 'target' contains a leading | ||
# path separator (i.e., is treated as an absolute path). | ||
target_path = os.path.join(targets_directory, target.lstrip(os.sep)) | ||
filedict[target.replace('\\', '/').lstrip('/')] = fileinfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated: I think this should have happened before (see #1007 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully it's OK to remove this as we consolidate path handling under #1008, or as a follow on to those changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under #1008 or a follow on is okay for me!
|
||
# Ensure all target files listed in 'target_files' exist. If just one of | ||
# these files does not exist, raise an exception. | ||
if not os.path.exists(target_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely unrelated: This is the first of three path.exists
/ path.isfile
checks in the stack before we open that file (get_metadata_fileinfo
has one and securesystemslib.util.get_file_details
has one too). As good Pythonistas, we should ask for forgiveness instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a helper function to determine the name of a bin that a hashed targetfile will be delegated to. Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995 Signed-off-by: Joshua Lock <[email protected]>
74d6973
to
5ea1f63
Compare
Thank you for the thorough review @lukpueh . I've addressed most of your comments, some in squashed commits and some in follow-on patches. I have replied to all of your comments so that you can see where I've made changes. |
@@ -369,6 +369,108 @@ def test_writeall(self): | |||
self.assertRaises(securesystemslib.exceptions.FormatError, repository.writeall, 3) | |||
|
|||
|
|||
def test_writeall_no_files(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to give kudos for adding a test case that closely mirrors the current work on Warehouse ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! ❤️ Here are some more comments.
custom_data = None | ||
if len(custom): | ||
custom_data = custom | ||
if fileinfo.get('length', -1) < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool way of checking this. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels a bit C-like, but I couldn't think of a cleaner way to do it.
tuf/repository_tool.py
Outdated
|
||
return self._locate_and_update_target_in_bin(target_filepath, 'add_target') | ||
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo) | |
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo=fileinfo) |
... otherwise fileinfo
is passed as custom
to add_target
.
It would be great to have a test for add_target_to_bin
that adds a fileinfo
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch Lukas. I've squashed this fix and added a test in 62e4364
|
||
# For simplicity, ensure that 'prefix_count' (16 ^ n) can be evenly | ||
# distributed over 'number_of_bins' (must be 2 ^ n). Each bin will contain | ||
# (prefix_count / number_of_bins) hash prefixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping #1007 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. My original comment was lost when factoring out into the helper. I've updated that patch to add a comment as suggested.
Vastly simplify the implementation, using the _get_hash() and _find_bin_for_hash() helpers added in earlier commits. Furthermore, enable passing of the custom parameter to add_target_to_bin() to better match add_target() Signed-off-by: Joshua Lock <[email protected]>
Simplify the delegate_hashed_bins function based on the semi-pseudo-code by Lukas Puehringer in theupdateframework#995. Signed-off-by: Joshua Lock <[email protected]>
In delegate_to_hashed_bins() duplicate the code which determines the target's file path for writing to the metadata in repository_lib's generate_targets_metadata to ensure that the target path hashed by delegate_hashed_bins() matches the target path in the metadata file. Signed-off-by: Joshua Lock <[email protected]>
The file isn't strictly needed on-disk at the time add_target() and add_targets() are called and this duplicates the check for the file's presence in write[_all]() By removing this check we allow extra versatility in adding targets. Signed-off-by: Joshua Lock <[email protected]>
Match the pattern in add_target() where if the filepath already exists in roleinfo['paths'] it is updated to replace the existing entry with the new fileinfo. Signed-off-by: Joshua Lock <[email protected]>
Add an additional optional parameter to add_target() and add_target_to_bin() which is a fileinfo object matching tuf.formats.FILEINFO_OBJECT This parameter and the custom parameter are mutually exclusive and thus cannot be passed at the same time. Signed-off-by: Joshua Lock <[email protected]>
Previously, at the time of writing targets metadata, we accessed all targets files and ensured they exist on disk before generating a filinfo dict containing hashes of the target file. This change enables targets metadata to be generated using the fileinfo data that is already stored in the roledb. Signed-off-by: Joshua Lock <[email protected]>
Merge the logger calls reporting information about the hashed bin delegations into a single logger.info() call to ensure the messages will be grouped together even when integrated into a logging system with multiple parallel sources. Signed-off-by: Joshua Lock <[email protected]>
Test the newly added functionality to: * add a target to the repository without access to the target file on disk * write targets metadata without access to target files on disk, by using the existing fileinfo data from the roledb Signed-off-by: Joshua Lock <[email protected]>
When testing delegate_hashed_bins to ensure that hash_path_prefixes map to the generated name of the bin, also check to ensure that at least one of the delegations contains one or more path_hash_prefixes. Signed-off-by: Joshua Lock <[email protected]>
We intend to deprecate the custom parameter of add_target() in favour of using the fileinfo parameter with the custom value populated, therefore it does not make sense to _add_ the custom parameter to add_target_to_bin() Signed-off-by: Joshua Lock <[email protected]>
Factor out code to calculate the prefix length, total number of prefixes and number of prefixes to be allocated to each bin into a helper function _get_bin_numbers() so that we a) do not duplicate logic, b) can document the logic in one place and c) ensure the logic is consistent. Signed-off-by: Joshua Lock <[email protected]>
Add some additional checks to test_add_target_to_bin to ensure the code to add a target passing a fileinfo is tested. Signed-off-by: Joshua Lock <[email protected]>
ec68416
to
62e4364
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
Fixes issue #: #994, #995, #1003, #1004
Description of the changes being introduced by the pull request:
Note: This is the first PR addressing changes required by the PEP 458 implementation.
_get_hash()
to hash a filepath string_create_bin_name()
to create the string name of a delegated hash bin (mostly to consolidate the instances where we do non-trivial string formatting)_find_bin_for_hash()
to calculate the bin a hashed file string would be allocated toadd_target_to_bin()
andremove_target_from_bin()
to remove the dispatcher and simplify the code as well as allowingadd_target_to_bin()
to take a custom argument, the same asadd_target()
delegate_hashed_bins()
to simplify the codeadd_target()
andadd_target_to_bin()
use_existing_fileinfo
argument towrite()
andwriteall()
to allow writing metadata without access to targets filesPlease verify and check that the pull request fulfils the following
requirements:
cc @woodruffw