Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Speed-up readxml by caching key lookup instead of using try/except #1691

Merged
merged 3 commits into from
Nov 10, 2021

Conversation

kratsg
Copy link
Contributor

@kratsg kratsg commented Nov 10, 2021

Pull Request Description

This resolves #1690 and was brought up by the discussion in #1687.

Fundamentally, when accessing a path that does not exist in uproot via f.__getitem__, it would then trigger a user-friendly levenshtein distance calculation to recommend similarly-named keys. However, in the case of pyhf, it was assumed that f[name] would be fast to return or at least fast to raise an exception, but this is a bad assumption.

Instead, the set of keys in the file (without cycle numbers) is cached along with the file itself, and we will rely on a key lookup in the set instead.

This only occurs in HistFactory when the XML has <Sample/> with HistoPath attributes and a suboptimal computation path in pyhf is then used, because we need to build fullname in order to figure out the right path for the histogram.

Checklist Before Requesting Reviewer

  • Tests are passing
  • "WIP" removed from the title of the pull request
  • Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR
* Refactor the readxml logic
* Cache the set of keys in a file and use that to look up if a histogram
exists or not
   - Avoids issue of accessing a path that does not exist in uproot via
     f.__getitem__, triggering a user-friendly levenshtein distance
     calculation to recommend similarly-named keys.
* Add tests and test files for key name lookups

@kratsg kratsg added fix A bug fix optimization perf A code change that improves performance refactor A code change that neither fixes a bug nor adds a feature labels Nov 10, 2021
@kratsg kratsg self-assigned this Nov 10, 2021
@kratsg
Copy link
Contributor Author

kratsg commented Nov 10, 2021

@gollumben - this has the fix.

@codecov
Copy link

codecov bot commented Nov 10, 2021

Codecov Report

Merging #1691 (cde983b) into master (c493c7c) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1691      +/-   ##
==========================================
- Coverage   98.05%   98.05%   -0.01%     
==========================================
  Files          64       64              
  Lines        4213     4212       -1     
  Branches      585      587       +2     
==========================================
- Hits         4131     4130       -1     
  Misses         49       49              
  Partials       33       33              
Flag Coverage Δ
contrib 25.30% <0.00%> (+<0.01%) ⬆️
doctest 61.06% <0.00%> (+0.01%) ⬆️
unittests 96.36% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pyhf/readxml.py 94.37% <100.00%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c493c7c...cde983b. Read the comment docs.

Copy link
Member

@matthewfeickert matthewfeickert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 👍 Thanks @kratsg!

Copy link
Member

@matthewfeickert matthewfeickert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the coverage boost. LGTM now.

@matthewfeickert matthewfeickert merged commit 902052f into master Nov 10, 2021
@matthewfeickert matthewfeickert deleted the fix/readxmlDeserialization branch November 10, 2021 21:46
@matthewfeickert
Copy link
Member

@gollumben If you are willing to install a development version of pyhf from GitHub your Issue described in Discussion #1687 should be fixed now. Please let us know if you encounter any other problems. 👍

@gollumben
Copy link

Hi @kratsg, @matthewfeickert and @lukasheinrich, thank you very much for your help and the very fast turn around time! It takes 3 minutes to convert the workspace now instead of the previous 30 h (~600x speed increase), which is amazing!

@kratsg
Copy link
Contributor Author

kratsg commented Jul 1, 2022

Related to scikit-hep/uproot5#595 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix A bug fix optimization perf A code change that improves performance refactor A code change that neither fixes a bug nor adds a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DeserializationError too frequent when HistoPath is specified in HiFa XML
3 participants