Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancements for hashed bin delegation #1007

Merged
merged 17 commits into from
Apr 1, 2020

Conversation

joshuagl
Copy link
Member

@joshuagl joshuagl commented Mar 26, 2020

Fixes issue #: #994, #995, #1003, #1004

Description of the changes being introduced by the pull request:

Note: This is the first PR addressing changes required by the PEP 458 implementation.

  • Make tests around hashed bins simpler and more robust
  • Add various helper functions for common functionality around hashed bins:
    • _get_hash() to hash a filepath string
    • _create_bin_name() to create the string name of a delegated hash bin (mostly to consolidate the instances where we do non-trivial string formatting)
    • _find_bin_for_hash() to calculate the bin a hashed file string would be allocated to
  • Refactor add_target_to_bin() and remove_target_from_bin() to remove the dispatcher and simplify the code as well as allowing add_target_to_bin() to take a custom argument, the same as add_target()
  • Refactor delegate_hashed_bins() to simplify the code
  • Ensure file paths are hashed as they appear in the targets metadata
  • Enable passing a fileinfo to add_target() and add_target_to_bin()
  • Add use_existing_fileinfo argument to write() and writeall() to allow writing metadata without access to targets files

Please verify and check that the pull request fulfils the following
requirements
:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

cc @woodruffw

Add test to ensure delegated bin names are consistent with the hash
prefixes that are delegated to the role.

This is an implicit assumption of the current implementation, the
testing of which will enable us to modify the code with greater
confidence.

Signed-off-by: Joshua Lock <[email protected]>
As we are adding and removing items from the hashed bins and checking
for their presence/absence it's simplest if we being with the hashed
bins initially empty.

If we pass a list of targets when we call delgate_hashed_bins() the
delegated roles have an initial set of targets delegated to them,
which complicates testing of adding then removing a target to a
delegated bin.

Signed-off-by: Joshua Lock <[email protected]>
This helper will generate a hash of the passed target_filepath using the
algorithm defined by the repository_tool's HASH_FUNCTION variable.

Signed-off-by: Joshua Lock <[email protected]>
tuf/repository_tool.py Outdated Show resolved Hide resolved
# targets directory.
targets_directory_length = len(self._targets_directory) + 1
roleinfo = tuf.roledb.get_roleinfo(self._rolename, self._repository_name)
relative_path = filepath[targets_directory_length:].replace('\\', '/')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (can probably be ignored in this PR): It might be good to add (or use) a path canonicalization method instead of calling replace('\\', '/') throughout the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this would be good cleanup.

@lukpueh any thoughts on whether this belongs in securesystemslib or in tuf? Perhaps as part of a new utils module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! #1008 (review) has some thoughts about path normalization. The TLDR is, if the caller passes a bad path,

  • either fail,
  • or warn and normalize

In both cases this should happen once at the boundary where a path is added to the system, and thus remove the need for many replace('\\', '/') throughout the code.

@joshuagl
Copy link
Member Author

I've pushed a couple of extra patches to this branch, one to address a review nit from @woodruffw and another to address #1004

Copy link
Member

@lukpueh lukpueh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impeccable work, @joshuagl. I only have one blocking comment, which is amazing for a PR of that size. That comment starts with "Important:". All the other comments are just sugar, so that you see I actually reviewed this. :P

Many many thanks!

tests/test_repository_tool.py Show resolved Hide resolved
prefix_len = len("{:x}".format(number_of_bins - 1))
prefix_count = 16 ** prefix_len

if prefix_count % number_of_bins != 0:
Copy link
Member

@lukpueh lukpueh Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: Maybe it's worth a comment that x % y != 0 does not guarantee that y is not a power of two for any x and y, but due to the relationship between number_of_bins and prefix_count it is true for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I've updated this patch to include a comment

tuf/repository_tool.py Show resolved Hide resolved
total_hash_prefixes = 16 ** prefix_length
bin_size = total_hash_prefixes // number_of_bins
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: Maybe above three lines together with the "power of two" check below are worth a helper, to be used here and in _find_bin_for_hash().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the code uses all three of the variables (though one only in a comment iirc) and I'm not sure about adding a function that returns multiple values, so I've left this suggestion for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should also be a comment that this particular LoC is rescaling the index. I didn't really understand what was going on until I looked above to see what number_of_bins is.

Copy link
Member

@lukpueh lukpueh Mar 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving it as it is, although it is the same code (with different names and order) in both cases.

delegate_hashed_bins:
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2504-L2513

_find_bin_for_hash:
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2768-L2777

And I personally see no problems with idioms like

pefix_length, prefix_count, bin_size = _get_bin_numbers(number_of_bins)

Also, and this was part of my motivation, we'd only need one place to properly comment on this math-magic (thanks for pointing that out, @trishankatdatadog).

Btw., @trishankatdatadog, I wouldn't say that this particular LoC rescales the index. Did you maybe mean this one...?
https://github.com/theupdateframework/tuf/blob/5ea1f6398a27047a5a299c42b7428c6248d45e0e/tuf/repository_tool.py#L2547

Yes, the general idea here is that we scale 0..total_hash_prefixes to 0..number_of_bins. But I wouldn't say that the particular LoC pointed out by you does this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convincing points. I will factor out this into a helper now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helper added, with lots of comments, in ec68416

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukpueh Yes, I meant that line, the one with the integer division

tuf/repository_tool.py Outdated Show resolved Hide resolved
fileinfo:
An optional fileinfo object, conforming to tuf.formats.FILEINFO_SCHEMA,
providing full information about the file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we said that the only reason to have custom and fileinfo is to support backwards compatibility (as the former can be passed within the latter) and add_target_to_bin didn't support this before, I think it is safe to not add custom here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that my suggestion comes at the cost of consistency between add_target and add_target_to_bin. I am fine either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. As we plan to deprecate the custom parameter it doesn't make sense to add it to add_target_to_bin. I've added a follow-on patch in 5ea1f63, though maybe I should have squashed it.

# Note: join() discards 'targets_directory' if 'target' contains a leading
# path separator (i.e., is treated as an absolute path).
target_path = os.path.join(targets_directory, target.lstrip(os.sep))
filedict[target.replace('\\', '/').lstrip('/')] = fileinfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated: I think this should have happened before (see #1007 (comment))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully it's OK to remove this as we consolidate path handling under #1008, or as a follow on to those changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under #1008 or a follow on is okay for me!

tuf/repository_lib.py Show resolved Hide resolved

# Ensure all target files listed in 'target_files' exist. If just one of
# these files does not exist, raise an exception.
if not os.path.exists(target_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely unrelated: This is the first of three path.exists/ path.isfile checks in the stack before we open that file (get_metadata_fileinfo has one and securesystemslib.util.get_file_details has one too). As good Pythonistas, we should ask for forgiveness instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should absolutely move towards the Python idiom of EAFP. I think we can do that in future changes as we work to address #1009

Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <[email protected]>
@joshuagl joshuagl force-pushed the joshuagl/hashed-bins branch 2 times, most recently from 74d6973 to 5ea1f63 Compare March 30, 2020 18:05
@joshuagl
Copy link
Member Author

Thank you for the thorough review @lukpueh . I've addressed most of your comments, some in squashed commits and some in follow-on patches.

I have replied to all of your comments so that you can see where I've made changes.

@@ -369,6 +369,108 @@ def test_writeall(self):
self.assertRaises(securesystemslib.exceptions.FormatError, repository.writeall, 3)


def test_writeall_no_files(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to give kudos for adding a test case that closely mirrors the current work on Warehouse ❤️

Copy link
Member

@lukpueh lukpueh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! ❤️ Here are some more comments.

custom_data = None
if len(custom):
custom_data = custom
if fileinfo.get('length', -1) < 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool way of checking this. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels a bit C-like, but I couldn't think of a cleaner way to do it.


return self._locate_and_update_target_in_bin(target_filepath, 'add_target')
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo)
Copy link
Member

@lukpueh lukpueh Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo)
self._delegated_roles[bin_name].add_target(target_filepath, fileinfo=fileinfo)

... otherwise fileinfo is passed as custom to add_target.

It would be great to have a test for add_target_to_bin that adds a fileinfo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch Lukas. I've squashed this fix and added a test in 62e4364

tuf/repository_tool.py Outdated Show resolved Hide resolved

# For simplicity, ensure that 'prefix_count' (16 ^ n) can be evenly
# distributed over 'number_of_bins' (must be 2 ^ n). Each bin will contain
# (prefix_count / number_of_bins) hash prefixes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. My original comment was lost when factoring out into the helper. I've updated that patch to add a comment as suggested.

joshuagl added 13 commits April 1, 2020 12:06
Vastly simplify the implementation, using the _get_hash() and
_find_bin_for_hash() helpers added in earlier commits.

Furthermore, enable passing of the custom parameter to
add_target_to_bin() to better match add_target()

Signed-off-by: Joshua Lock <[email protected]>
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <[email protected]>
In delegate_to_hashed_bins() duplicate the code which determines the
target's file path for writing to the metadata in repository_lib's
generate_targets_metadata to ensure that the target path hashed by
delegate_hashed_bins() matches the target path in the metadata file.

Signed-off-by: Joshua Lock <[email protected]>
The file isn't strictly needed on-disk at the time add_target() and
add_targets() are called and this duplicates the check for the file's
presence in write[_all]()

By removing this check we allow extra versatility in adding targets.

Signed-off-by: Joshua Lock <[email protected]>
Match the pattern in add_target() where if the filepath already exists
in roleinfo['paths'] it is updated to replace the existing entry with
the new fileinfo.

Signed-off-by: Joshua Lock <[email protected]>
Add an additional optional parameter to add_target() and
add_target_to_bin() which is a fileinfo object matching
tuf.formats.FILEINFO_OBJECT

This parameter and the custom parameter are mutually exclusive and
thus cannot be passed at the same time.

Signed-off-by: Joshua Lock <[email protected]>
Previously, at the time of writing targets metadata, we accessed all
targets files and ensured they exist on disk before generating a
filinfo dict containing hashes of the target file.

This change enables targets metadata to be generated using the
fileinfo data that is already stored in the roledb.

Signed-off-by: Joshua Lock <[email protected]>
Merge the logger calls reporting information about the hashed bin
delegations into a single logger.info() call to ensure the messages
will be grouped together even when integrated into a logging system
with multiple parallel sources.

Signed-off-by: Joshua Lock <[email protected]>
Test the newly added functionality to:
* add a target to the repository without access to the target file on disk
* write targets metadata without access to target files on disk, by using
  the existing fileinfo data from the roledb

Signed-off-by: Joshua Lock <[email protected]>
When testing delegate_hashed_bins to ensure that hash_path_prefixes
map to the generated name of the bin, also check to ensure that at least
one of the delegations contains one or more path_hash_prefixes.

Signed-off-by: Joshua Lock <[email protected]>
We intend to deprecate the custom parameter of add_target() in favour
of using the fileinfo parameter with the custom value populated,
therefore it does not make sense to _add_ the custom parameter to
add_target_to_bin()

Signed-off-by: Joshua Lock <[email protected]>
Factor out code to calculate the prefix length, total number of prefixes
and number of prefixes to be allocated to each bin into a helper
function _get_bin_numbers() so that we a) do not duplicate logic,
b) can document the logic in one place and c) ensure the logic is
consistent.

Signed-off-by: Joshua Lock <[email protected]>
Add some additional checks to test_add_target_to_bin to ensure the code
to add a target passing a fileinfo is tested.

Signed-off-by: Joshua Lock <[email protected]>
@joshuagl joshuagl force-pushed the joshuagl/hashed-bins branch from ec68416 to 62e4364 Compare April 1, 2020 11:13
Copy link
Member

@lukpueh lukpueh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@lukpueh lukpueh merged commit e095112 into theupdateframework:develop Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants