feat: add 'interp_options' mechanism and ak_add_doc. #784

Merged 2 commits on Nov 20, 2022

jpivarski
Member

This takes over a function from Coffea (adding TBranch.title to the __doc__ parameter of Awkward Arrays) in a centralized way. The only file that needed to change to add the feature is library.py (the Awkward singleton). Let me know, @lgray, if it works as needed.

All the rest of the changes propagate the option down. We don't want to always do this; Coffea just needs a hook to be able to enable it. The interp_options mechanism lets us pass more options like this in the future.
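
As a rough illustration of what "propagating the option down" means, here is a minimal sketch of an options dict carried by a callable and forwarded to a reader. Every name in it (`DEFAULT_INTERP_OPTIONS`, `read_arrays`, `ChunkReader`) is hypothetical and is not taken from uproot's code; only the option name `ak_add_doc` and the dict name `interp_options` come from this PR.

```python
# Hypothetical sketch: these names are NOT uproot's actual internals.
DEFAULT_INTERP_OPTIONS = {"ak_add_doc": False}


def read_arrays(keys, ak_add_doc=False):
    """Stand-in for a reader such as TTree.arrays; reports what it was asked to do."""
    return {"keys": list(keys), "ak_add_doc": ak_add_doc}


class ChunkReader:
    """Stand-in for a callable that a lazy front end invokes once per chunk."""

    def __init__(self, keys, interp_options=None):
        self.keys = keys
        # Caller-supplied options override the defaults, so the feature stays opt-in.
        self.interp_options = {**DEFAULT_INTERP_OPTIONS, **(interp_options or {})}

    def __call__(self):
        # Forward the option by name instead of unpacking the whole dict.
        return read_arrays(self.keys, ak_add_doc=self.interp_options["ak_add_doc"])


default_result = ChunkReader(["Muon_pt"])()
opted_in_result = ChunkReader(["Muon_pt"], {"ak_add_doc": True})()
```

With no options supplied the reader sees `ak_add_doc=False`; a caller such as Coffea could switch it on by passing `{"ak_add_doc": True}`.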

@jpivarski
Member Author

If you don't get a chance to look at this, @kkothari2001 (because I know you're busy), that's okay! I just wanted to give you a chance because the interp_options mechanism threads through your uproot.dask work.

@agoose77
Collaborator

@jpivarski why don't we always want to do this? My first thought was "let's not add any options, and just make this the behaviour" before reading your remark

@jpivarski
Member Author

I had the same thought until I saw it:

import uproot
tree = uproot.open("nano_dy.root:Events")
array = tree.arrays(filter_name="Muon*", ak_add_doc=True)
array.show(type=True)
type: 40 * struct[{
    Muon_dxy: [var * float32, parameters={"__doc__": "dxy (with sign) wrt first PV, in cm"}],
    Muon_dxyErr: [var * float32, parameters={"__doc__": "dxy uncertainty, in cm"}],
    Muon_dz: [var * float32, parameters={"__doc__": "dz (with sign) wrt first PV, in cm"}],
    Muon_dzErr: [var * float32, parameters={"__doc__": "dz uncertainty, in cm"}],
    Muon_eta: [var * float32, parameters={"__doc__": "eta"}],
    Muon_ip3d: [var * float32, parameters={"__doc__": "3D impact parameter wrt first PV, in cm"}],
    Muon_jetPtRelv2: [var * float32, parameters={"__doc__": "Relative momentum of the lepton with respect to the closest jet after subtracting the lepton"}],
    Muon_jetRelIso: [var * float32, parameters={"__doc__": "Relative isolation in matched jet (1/ptRatio-1, pfRelIso04_all if no matched jet)"}],
    Muon_mass: [var * float32, parameters={"__doc__": "mass"}],
    Muon_miniPFRelIso_all: [var * float32, parameters={"__doc__": "mini PF relative isolation, total (with scaled rho*EA PU corrections)"}],
    Muon_miniPFRelIso_chg: [var * float32, parameters={"__doc__": "mini PF relative isolation, charged component"}],
    Muon_pfRelIso03_all: [var * float32, parameters={"__doc__": "PF relative isolation dR=0.3, total (deltaBeta corrections)"}],
    Muon_pfRelIso03_chg: [var * float32, parameters={"__doc__": "PF relative isolation dR=0.3, charged component"}],
    Muon_pfRelIso04_all: [var * float32, parameters={"__doc__": "PF relative isolation dR=0.4, total (deltaBeta corrections)"}],
    Muon_phi: [var * float32, parameters={"__doc__": "phi"}],
    Muon_pt: [var * float32, parameters={"__doc__": "pt"}],
    Muon_ptErr: [var * float32, parameters={"__doc__": "ptError of the muon track"}],
    Muon_segmentComp: [var * float32, parameters={"__doc__": "muon segment compatibility"}],
    Muon_sip3d: [var * float32, parameters={"__doc__": "3D impact parameter significance wrt first PV"}],
    Muon_softMva: [var * float32, parameters={"__doc__": "soft MVA ID score"}],
    Muon_tkRelIso: [var * float32, parameters={"__doc__": "Tracker-based relative isolation dR=0.3 for highPt, trkIso/tunePpt"}],
    Muon_tunepRelPt: [var * float32, parameters={"__doc__": "TuneP relative pt, tunePpt/pt"}],
    Muon_mvaLowPt: [var * float32, parameters={"__doc__": "Low pt muon ID score"}],
    Muon_mvaTTH: [var * float32, parameters={"__doc__": "TTH MVA lepton ID score"}],
    Muon_charge: [var * int32, parameters={"__doc__": "electric charge"}],
    Muon_jetIdx: [var * int32, parameters={"__doc__": "index of the associated jet (-1 if none)"}],
    Muon_nStations: [var * int32, parameters={"__doc__": "number of matched stations with default arbitration (segment & track)"}],
    Muon_nTrackerLayers: [var * int32, parameters={"__doc__": "number of layers in the tracker"}],
    Muon_pdgId: [var * int32, parameters={"__doc__": "PDG code assigned by the event reconstruction (not by MC truth)"}],
    Muon_tightCharge: [var * int32, parameters={"__doc__": "Tight charge criterion using pterr/pt of muonBestTrack (0:fail, 2:pass)"}],
    Muon_fsrPhotonIdx: [var * int32, parameters={"__doc__": "Index of the associated FSR photon"}],
    Muon_highPtId: [var * uint8, parameters={"__doc__": "high-pT cut-based ID (1 = tracker high pT, 2 = global high pT, which includes tracker high pT)"}],
    Muon_inTimeMuon: [var * bool, parameters={"__doc__": "inTimeMuon ID"}],
    Muon_isGlobal: [var * bool, parameters={"__doc__": "muon is global muon"}],
    Muon_isPFcand: [var * bool, parameters={"__doc__": "muon is PF candidate"}],
    Muon_isTracker: [var * bool, parameters={"__doc__": "muon is tracker muon"}],
    Muon_looseId: [var * bool, parameters={"__doc__": "muon is loose muon"}],
    Muon_mediumId: [var * bool, parameters={"__doc__": "cut-based ID, medium WP"}],
    Muon_mediumPromptId: [var * bool, parameters={"__doc__": "cut-based ID, medium prompt WP"}],
    Muon_miniIsoId: [var * uint8, parameters={"__doc__": "MiniIso ID from miniAOD selector (1=MiniIsoLoose, 2=MiniIsoMedium, 3=MiniIsoTight, 4=MiniIsoVeryTight)"}],
    Muon_multiIsoId: [var * uint8, parameters={"__doc__": "MultiIsoId from miniAOD selector (1=MultiIsoLoose, 2=MultiIsoMedium)"}],
    Muon_mvaId: [var * uint8, parameters={"__doc__": "Mva ID from miniAOD selector (1=MvaLoose, 2=MvaMedium, 3=MvaTight)"}],
    Muon_pfIsoId: [var * uint8, parameters={"__doc__": "PFIso ID from miniAOD selector (1=PFIsoVeryLoose, 2=PFIsoLoose, 3=PFIsoMedium, 4=PFIsoTight, 5=PFIsoVeryTight, 6=PFIsoVeryVeryTight)"}],
    Muon_softId: [var * bool, parameters={"__doc__": "soft cut-based ID"}],
    Muon_softMvaId: [var * bool, parameters={"__doc__": "soft MVA ID"}],
    Muon_tightId: [var * bool, parameters={"__doc__": "cut-based ID, tight WP"}],
    Muon_tkIsoId: [var * uint8, parameters={"__doc__": "TkIso ID (1=TkIsoLoose, 2=TkIsoTight)"}],
    Muon_triggerIdLoose: [var * bool, parameters={"__doc__": "TriggerIdLoose ID"}],
    Muon_genPartIdx: [var * int32, parameters={"__doc__": "Index into genParticle list for MC matching to status==1 muons"}],
    Muon_genPartFlav: [var * uint8, parameters={"__doc__": "Flavour of genParticle for MC matching to status==1 muons: 1 = prompt muon (including gamma*->mu mu), 15 = muon from prompt tau, 5 = muon from b, 4 = muon from c, 3 = muon from light or unknown, 0 = unmatched"}],
    Muon_cleanmask: [var * uint8, parameters={"__doc__": "simple cleaning mask with priority to leptons"}]
}, parameters={"__doc__": "Events"}]
[{Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [-0.000319, -0.00682], Muon_dxyErr: [...], Muon_dz: [...], ...},
 {Muon_dxy: [-0.00011], Muon_dxyErr: [0.00162], Muon_dz: [0.0026], ...},
 {Muon_dxy: [0.00324, -0.00244], Muon_dxyErr: [0.00229, ...], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 ...,
 {Muon_dxy: [0.000774], Muon_dxyErr: [0.00229], Muon_dz: [-0.000873], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [-0.000587], Muon_dxyErr: [0.00162], Muon_dz: [0.000254], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...}]

versus

array = tree.arrays(filter_name="Muon*")
array.show(type=True)
type: 40 * {
    Muon_dxy: var * float32,
    Muon_dxyErr: var * float32,
    Muon_dz: var * float32,
    Muon_dzErr: var * float32,
    Muon_eta: var * float32,
    Muon_ip3d: var * float32,
    Muon_jetPtRelv2: var * float32,
    Muon_jetRelIso: var * float32,
    Muon_mass: var * float32,
    Muon_miniPFRelIso_all: var * float32,
    Muon_miniPFRelIso_chg: var * float32,
    Muon_pfRelIso03_all: var * float32,
    Muon_pfRelIso03_chg: var * float32,
    Muon_pfRelIso04_all: var * float32,
    Muon_phi: var * float32,
    Muon_pt: var * float32,
    Muon_ptErr: var * float32,
    Muon_segmentComp: var * float32,
    Muon_sip3d: var * float32,
    Muon_softMva: var * float32,
    Muon_tkRelIso: var * float32,
    Muon_tunepRelPt: var * float32,
    Muon_mvaLowPt: var * float32,
    Muon_mvaTTH: var * float32,
    Muon_charge: var * int32,
    Muon_jetIdx: var * int32,
    Muon_nStations: var * int32,
    Muon_nTrackerLayers: var * int32,
    Muon_pdgId: var * int32,
    Muon_tightCharge: var * int32,
    Muon_fsrPhotonIdx: var * int32,
    Muon_highPtId: var * uint8,
    Muon_inTimeMuon: var * bool,
    Muon_isGlobal: var * bool,
    Muon_isPFcand: var * bool,
    Muon_isTracker: var * bool,
    Muon_looseId: var * bool,
    Muon_mediumId: var * bool,
    Muon_mediumPromptId: var * bool,
    Muon_miniIsoId: var * uint8,
    Muon_multiIsoId: var * uint8,
    Muon_mvaId: var * uint8,
    Muon_pfIsoId: var * uint8,
    Muon_softId: var * bool,
    Muon_softMvaId: var * bool,
    Muon_tightId: var * bool,
    Muon_tkIsoId: var * uint8,
    Muon_triggerIdLoose: var * bool,
    Muon_genPartIdx: var * int32,
    Muon_genPartFlav: var * uint8,
    Muon_cleanmask: var * uint8
}
[{Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [-0.000319, -0.00682], Muon_dxyErr: [...], Muon_dz: [...], ...},
 {Muon_dxy: [-0.00011], Muon_dxyErr: [0.00162], Muon_dz: [0.0026], ...},
 {Muon_dxy: [0.00324, -0.00244], Muon_dxyErr: [0.00229, ...], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 ...,
 {Muon_dxy: [0.000774], Muon_dxyErr: [0.00229], Muon_dz: [-0.000873], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [-0.000587], Muon_dxyErr: [0.00162], Muon_dz: [0.000254], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...},
 {Muon_dxy: [], Muon_dxyErr: [], Muon_dz: [], Muon_dzErr: [], ...}]

@lgray
Contributor

lgray commented Nov 18, 2022

I agree with Jim it should be opt-in, we use it for some very particular user-facing features that tie into notebook use.

That gives me an idea: should it detect whether you're in IPython/Jupyter and change the default accordingly?
That has some sense to it, since it gives useful features in situations where you can take advantage of them.

@agoose77
Collaborator

I'm generally not in favour of environment-specific behaviour; it's hard to know that it's happening without discovering it (usually accidentally), and harder still to google what's happening!

What kind of features do you use __doc__ for @lgray? You've piqued my interest!

@jpivarski that suggests to me that we should elide long parameters in the repr, rather than leave them unset unless the user opts in? Could a solution be to make parameters > N characters collapse to an ellipsis, and make __doc__ non-optional?
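
To make the elision idea concrete, here is a minimal sketch of collapsing long parameter values when formatting them. It is not Awkward Array's repr code; the helper names `elide` and `format_parameters` and the default limit of 20 characters are assumptions chosen for illustration.

```python
def elide(text, limit=20):
    """Return text unchanged if it fits, otherwise cut it and end with '...'."""
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def format_parameters(parameters, limit=20):
    """Render a dict of string parameters, eliding each value to `limit` characters."""
    parts = [f'"{key}": "{elide(value, limit)}"' for key, value in parameters.items()]
    return "parameters={" + ", ".join(parts) + "}"


short = format_parameters({"__doc__": "pt"})
long = format_parameters({"__doc__": "dxy (with sign) wrt first PV, in cm"})
print(short)
print(long)
```

The stored value is untouched in this approach; only the displayed text is shortened, which is the distinction between changing the repr and not storing `__doc__` at all.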

Collaborator

@agoose77 agoose77 left a comment

As for the solution, I refer back to my question about changing the repr versus not storing the __doc__ parameter in the first place. However, this PR is good to go if you decide against that course of action!

entry_start=start,
entry_stop=stop,
library="np",
ak_add_doc=self.interp_options["ak_add_doc"],
Collaborator

Suggested change
-ak_add_doc=self.interp_options["ak_add_doc"],
+**self.interp_options,

?

Member Author

I think it's better to be explicit, to control the list of options. Then adding a new one would be a matter of searching for all instances of ak_add_doc and adding the new one next to that.

**self.interp_options passes everything in the interp_options dict through, which might be right or it might silently overshadow arguments of TTree.arrays that aren't interp_options. We shouldn't create different types of arguments with the same names, but it would just be easier to catch such mistakes with an explicit pass-through.
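
The difference between the two styles can be sketched with a stand-in function. Here `arrays` is a hypothetical substitute for TTree.arrays that only reports the arguments it received, and the stray `"library"` key in `interp_options` is a deliberately planted mistake, not something from this PR.

```python
def arrays(keys, library="ak", ak_add_doc=False):
    """Hypothetical stand-in for TTree.arrays; reports the arguments it received."""
    return {"keys": keys, "library": library, "ak_add_doc": ak_add_doc}


# An options dict that has accidentally picked up a name `arrays` also uses.
interp_options = {"ak_add_doc": True, "library": "pd"}

# Unpacking forwards every key, so "library" changes without any warning.
unpacked = arrays(["Muon_pt"], **interp_options)

# Explicit pass-through forwards only the option named at the call site.
explicit = arrays(["Muon_pt"], ak_add_doc=interp_options["ak_add_doc"])

print(unpacked["library"], explicit["library"])
```

With unpacking, the stray key quietly replaces the default `library`; with the explicit form it is ignored, and adding a new option means touching each call site on purpose.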

@@ -339,7 +355,9 @@ def __call__(self, file_path_object_path):
self.allow_missing,
self.real_options,
)
-return ttree[self.key].array(library="np")
+return ttree[self.key].array(
+    library="np", ak_add_doc=self.interp_options["ak_add_doc"]
Collaborator

As above

self.branches,
entry_start=start,
entry_stop=stop,
ak_add_doc=self.interp_options["ak_add_doc"],
Collaborator

You know where this is going ;)

@@ -554,11 +583,17 @@ def __call__(self, file_path_object_path):
self.allow_missing,
self.real_options,
)
-return ttree.arrays(self.common_keys)
+return ttree.arrays(
+    self.common_keys, ak_add_doc=self.interp_options["ak_add_doc"]
Collaborator

One more

@@ -658,6 +666,12 @@ def to_global(self, global_offset):
)


def _ak_add_doc(array, hasbranches, ak_add_doc):
    if ak_add_doc and type(array).__module__ == "awkward.highlevel":
Collaborator

Could we use an isinstance here? It slightly reduces the strictness of the coupling if we can promise to provide ak.Array vs ak.highlevel.Array.

Member Author

We can't use isinstance here because we don't know if awkward is installed.

Regarding looseness/strictness: we'll never be able to move Array out of awkward.highlevel, anyway. That much of the public API is fixed by widespread use.

Also, while we know that array is either an ak.Array or a dict, list, or tuple, it's nice to narrow in on the three classes in the awkward.highlevel submodule, rather than accepting anything that might be defined elsewhere in the Awkward library.
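
The trade-off between the two checks can be sketched with the standard library. In this illustration `fractions` stands in for an optional dependency such as Awkward Array, and `is_from_module` and `HalfFraction` are hypothetical names made up for the example.

```python
from fractions import Fraction


def is_from_module(obj, module_name):
    """True if obj's class was defined in `module_name`; this check imports nothing."""
    return type(obj).__module__ == module_name


class HalfFraction(Fraction):
    """A subclass defined outside the `fractions` module."""


# The class itself lives in `fractions`, so the string comparison matches.
direct = is_from_module(Fraction(1, 2), "fractions")

# A float comes from `builtins`, so it is rejected.
unrelated = is_from_module(0.5, "fractions")

# A subclass defined elsewhere passes isinstance but fails the module-name check.
subclass_by_name = is_from_module(HalfFraction(1, 2), "fractions")
subclass_by_isinstance = isinstance(HalfFraction(1, 2), Fraction)
```

The string comparison never needs the optional package to be importable, and it is narrower than isinstance: only classes defined in the named module match.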

@lgray
Contributor

lgray commented Nov 18, 2022

> What kind of features do you use __doc__ for @lgray? You've piqued my interest!

Right now it's really this very user facing documentation of what branches in TTrees do (if the designer of the TTree cares to fill it). The point is largely to have the capability there so that it can be exploited and so the data further serves as its own documentation. I could imagine people filling fairly rich descriptions of TTrees or branches or using doc strings to contain example analysis patterns for the data.

Collaborator

@kkothari2001 kkothari2001 left a comment

Looking at _dask.py, everything looks great to me. All callable classes and code paths have been covered.

@jpivarski
Copy link
Member Author

Thanks, @agoose77 and @kkothari2001!

@jpivarski jpivarski merged commit b36a022 into main Nov 20, 2022
@jpivarski jpivarski deleted the jpivarski/add-interp_options-and-ak_add_doc branch November 20, 2022 03:52
jpivarski added a commit that referenced this pull request Nov 28, 2022
…Pandas Dataframes (#734)

* Token change to get PR number

* Revert "Token change to get PR number"

This reverts commit 5a631b3.

* Complete basic Awkward Pandas port, and start changing tests

* make some of the suggested changes

* Solve some tests

* Finalize tests

* Add awkward-pandas to dev dependencies

* awkward-pandas only supports Python 3.8+.

* Declare awkward-pandas requirement in affected tests.

* Spell it right.

* Get this PR up to date with #784.

Co-authored-by: Jim Pivarski <[email protected]>