feat: warn about TBranch name, alias name conflict. #776
I need convincing that this is a good thing to do. (I'm still thinking about it.) In particular, let me know, @wiso, if this would have caught your problem. (Note: it's only being implemented for 5.x, since it's a new feature; I'm asking if this is the sort of thing that would have fixed it, not whether you can successfully run with this patch.)
I have checked out this branch and I confirm that, with the example in #775, I now get the warning
while in the other case I get the same error as before, as expected. So, yes, this would have caught my problem. Now I am wondering whether this is always the desired behavior. For example, you can have an input file where you know that one branch is wrong, but you can recompute it from other branches. In this case you may want to overwrite it with an alias that has the same name as the branch (since maybe you have other code that assumes that particular branch name). By the way, this is just a warning, and not an error, so I guess it is OK. As a comparison, RDataFrame raises an error when calling
by the way they have a method |
I made it a warning, rather than an error, to prevent undue restrictions that might not even be under a user's control: a ROOT file could contain a TBranch Adding the ability to choose to the API, as RDataFrame has with Ultimately, I don't think physicists should be doing their analysis in the Uproot expressions API. It's not comparable to RDataFrame, in which you're supposed to put code into Why are they there at all? Here's how it went, historically:
So I'm not wanting to build up the expressions API more than it already is, which would include providing an API to express the define/redefine distinction, and raising an error on redefinition might make TTrees unreadable in ways that have little to do with the redefined expression, which users can't change. I think this is the least-bad option. I'd support anyone who wants to revamp formulate (writing parsers is fun!), though it's too late for a PythonLanguage → TTreeFormulaLanguage switch in defaults before Uproot version 5.0.0 (early December). We could do it later with a deprecation cycle. I guess, then, I'll merge this PR.
Thank you @jpivarski. By the way, what you are saying about aliases is worrying me. I am moving from uproot3 to 4. Before, I used uproot to get a pandas DataFrame and then I used numexpr to compute new columns. Now, with uproot4, I am using aliases to do the same. Should I go back? Probably not, since I am limited by I/O and not computation.
I think the problem with the Since that Python code is primarily NumPy ufuncs, NumExpr will be more efficient because it is a single pass over the arrays, rather than one pass per mathematical expression. (It's more complicated than that now with the fact that NumPy can fuse some operations, but with NumExpr, the entire expression is always fused.) This difference circumvents the potential bottleneck between RAM and CPU, which is roughly 10‒20 GB/s. If you're I/O limited, meaning disk I/O, the fastest SSDs are 7 GB/s (and that's extreme), so it's probably not a limiting factor. To directly answer your question: the NumExpr strings would be faster, but you're limited by I/O anyway so you probably won't see it. But more importantly, prefer doing computations in Pandas and only I/O in Uproot because Pandas is a computation library and Uproot is an I/O library. Pandas, RDataFrame, Awkward Array, NumExpr, Numba, Dask, xarray, ... should all be better at doing computations than Uproot.
I crossed out the above after thinking about it for a few minutes. Rushing an interface breaking change on this motivation sounds like a bad idea. We could take our time and add the NumExpr language (possibly with TTreeFormula → NumExpr translation, which would exclude the slice/loop/reducer parts of the TTreeFormula language) as a non-default, and maybe change the default through a deprecation cycle. There could be a range of version numbers in which non-trivial expressions require you to explicitly set the |
I normally dislike warnings, but if we made this case an exception, then there would be reasonable-to-do things that wouldn't be possible to do. So, okay, a warning.
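For readers who want to see the warning-versus-exception trade-off concretely, here is a minimal sketch in plain Python. It is not the code in this PR and does not use Uproot at all: `check_alias_conflicts`, its arguments, and the message text are hypothetical names chosen for illustration. It only shows that a warning leaves the overwrite-a-bad-branch use case possible, while anyone who wants stricter behavior can opt in through the standard `warnings` filters.

```python
import warnings


def check_alias_conflicts(branch_names, aliases):
    """Warn once per alias whose name is also a branch name.

    Hypothetical helper for illustration only; it is not Uproot's
    implementation. Returns the sorted list of conflicting names.
    """
    conflicts = sorted(set(branch_names) & set(aliases))
    for name in conflicts:
        warnings.warn(
            f"alias name {name!r} is also the name of a branch",
            UserWarning,
            stacklevel=2,
        )
    return conflicts


# Callers who prefer a hard failure can promote the warning to an
# exception locally, without the library forcing that on everyone.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        check_alias_conflicts(["pt", "eta"], {"pt": "sqrt(px**2 + py**2)"})
    except UserWarning as err:
        print("conflict treated as an error:", err)
```

With the default filters the same call only emits the warning and returns the conflicting names, so a deliberately redefined branch still works.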