-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of delegate_hashed_bins
when delegating to a large number of bins
#1012
Improve performance of delegate_hashed_bins
when delegating to a large number of bins
#1012
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, @joshuagl. The performance gain is uncanny! :)
I have a few minor suggestions nothing blocking though. Let me know if you want to change anything, otherwise we can merge this.
tuf/repository_tool.py
Outdated
expiration = expiration.isoformat() + 'Z' | ||
|
||
if not paths: | ||
paths = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion: Both function that use this helper are guaranteed to pass a paths object that is a dict
. I guess we could make the paths kwarg an arg and remove the two lines here (we'll have fewer things to revert :P). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always happy to remove lines of code, thanks for the suggestion. Squashed in to the first patch in 9be4634.
tuf/repository_tool.py
Outdated
|
||
# A queue of roleinfo's that need to be updated in the roledb | ||
delegated_roleinfos = [] | ||
|
||
for bin_rolename in ordered_roles: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I guess I approved this just two days ago, but now that I look at it again, I wonder if it would make sense to call it bin_role
instead of bin_rolename
, given that it is a dict that i.a. contains a name
field. bin_rolename['name']
below reads a bit odd. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it does look a bit odd. I've squashed the suggested change into the second patch in 6703b51
tuf/repository_tool.py
Outdated
list_of_targets=bin_rolename['target_paths'], | ||
path_hash_prefixes=bin_rolename['target_hash_prefixes']) | ||
target = self._create_delegated_target(bin_rolename['name'], keyids, | ||
paths=relative_paths) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest to pass the same hardcoded threshold of 1
as below? This is what delegate()
would have done.
https://github.com/theupdateframework/tuf/blob/32326975e040ad82e22d66ec27823a6122e49136/tuf/repository_tool.py#L2318
https://github.com/theupdateframework/tuf/blob/32326975e040ad82e22d66ec27823a6122e49136/tuf/repository_tool.py#L2336
As a consequence we could remove the default for threshold in _create_delegated_target
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Squashed in to 6703b51
Due to the performance overhead of deepcopy(), as used extensively in roledb, the delegate function is rather slow. This is especially noticeable when we have a large number_of_bins when calling delegate_hashed_bins. In order to be able to easily reduce the number of deepcopy() operations we will need to to be able to replicate the logic of delegate() in delegate_hashed_bins(), only with batched updates to the roledb. To support that work we refactor delegate() to split out logic that will be required in delegate_hashed_bins() into helper functions. Signed-off-by: Joshua Lock <[email protected]>
Due to the performance overhead of deepcopy(), as used extensively in roledb, the delegate function is rather slow. This is especially noticeable when we have a large number_of_bins when calling delegate_hashed_bins. In order to be able to easily reduce the number of deepcopy() operations we remove direct calls to delegate() and instead use the newly added helper functions to replicate the behaviour, only with a single call update to the roledb. This improves the performance of a 16k bins delegation from a 1hr 24min operation on my laptop to 33s. Ideally once Issue theupdateframework#1005 has been properly fixed this commit can be reverted and we can once again just call delegate() here. Signed-off-by: Joshua Lock <[email protected]>
82a3afd
to
6703b51
Compare
Thanks largely to your profiling 🙂
All great suggestions, I've squashed them into the original patches. Thanks for the thoughtful review @lukpueh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fixes! Re-approving...
Let's wait for appveyor to return and merge this. :) |
|
||
expiration = tuf.formats.unix_timestamp_to_datetime( | ||
int(time.time() + TARGETS_EXPIRATION)) | ||
expiration = expiration.isoformat() + 'Z' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to these changes, but does anybody know why unix_timestamp_to_datetime
isn't including the Z
suffix by default? Feels like an API error if the times are intended to always be GMT 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton!
Fixes issue #: Quick fix for #1005
Description of the changes being introduced by the pull request:
delegate()
logic into various helpers:_keys_to_keydict()
_create_delegated_target()
_update_roledb_delegations()
delegated_hashed_bins()
, rather than calls todelegate()
, most significantly queuing up changes to roledb and performing them in one operation with_update_roledb_delegations()
Based on the profiling done by @lukpueh after @woodruffw's report we gain a significant speed improvement when delegating to a large number of hashed bins (i.e. 16k):
before:
python3 profile_tuf_bins.py 4978.27s user 39.37s system 99% cpu 1:24:00.47 total
after:
python3 profile_tuf_bins.py 21.09s user 10.61s system 96% cpu 32.979 total
Please verify and check that the pull request fulfils the following
requirements: