Status of the new commondata format implementation #1709
Comments
@t7phy could you add the specific datasets (instead of the name of the whole set)? That way we can achieve more granularity in the checks.
@scarlehoff I had somehow missed your above comment and just saw it. Actually, at this point, every dataset in the branch
Hi @t7phy I would like to test all of the datasets, since we want to ensure full backwards compatibility. I would like to register even the factors of -1 in the eigenvalues, since any difference can potentially create issues downstream.
Hi @t7phy what does NA mean? Are these completely new datasets? Note that in order to test them I also need the old dataset name I should test against.
Hi @scarlehoff, the ones with NA are all new. It would still be good to test them, as they were implemented while we were making many changes to the keys in the metadata, and this will ensure that I implemented everything correctly. The empty ones are the ones which are also there in the old format. I will be adding the names to them, I just need to track them all.
Ok. When I test I'll separate the two scenarios, since for the datasets where we have both old and new I'll also include the differences with respect to the new ones in the test. Please leave a comment in this issue when you are done adding the names to the tables (GitHub doesn't notify about edits).
Yes yes, I will do that. I hadn't commented precisely because I still had to add the old dataset names.
@scarlehoff, the old dataset names are now added.
I'm noticing that some of the datasets have as Should we put there instead the URL from which the data was taken? Or some explanation saying where the original data comes from? One example of this is ATLAS_TTBAR_13TEV_TOT_X-SEC. Some specific errors I'm finding for some datasets:
There's also a typo in the kinematics of It has:
instead of
Then, one thing that worries me is that the one point for this dataset in the old implementation had the numerical value The same problems can be found for In general I was considering the (more comments to come)
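A cheap consistency check on the kinematics entries would catch typos of this kind (swapped or inconsistent bin edges) automatically. A minimal sketch, assuming each kinematic variable is stored as a min/mid/max mapping as in the new format; `check_bin` is a hypothetical helper, not code from the repository:

```python
def check_bin(kin):
    """Check that a kinematic bin dictionary is internally consistent.

    `kin` is assumed to look like {"min": ..., "mid": ..., "max": ...},
    mirroring the per-variable entries of the new kinematics files.
    """
    lo, mid, hi = kin["min"], kin["mid"], kin["max"]
    if lo is None or hi is None:
        # Open-ended bins (e.g. a last bin with no right edge) cannot be checked.
        return True
    if lo > hi:
        # Edges swapped: almost certainly a typo.
        return False
    if mid is not None and not (lo <= mid <= hi):
        # Mid value falls outside the bin.
        return False
    return True
```

Running such a check over every kinematics file before merging would turn hand-typing mistakes into immediate failures instead of downstream surprises.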
For all the total cross sections, the values are taken directly from the abstract of the papers, thus there is no HepData entry. As for the kinematic bin typo: since these entries for total cross sections were typed by hand, I must have made a mistake; I will fix them. Concerning the mid value, from what I have seen, some experiments provide the bin edges and the central value, but the central value isn't necessarily the middle value between min and max, so I think it depends on the distribution. However, since some experiments don't provide the central values, I guess in these cases the reader could take the min and max and average them to get the central 'true' value.
Then we should add this information as a comment somewhere.
Ok! Then I'll take the average for these. Thanks! Now for some more specific questions (pinging @enocera here as well, as his input might be needed), expanding on what I wrote above.
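The fallback agreed here, when only the edges are given, is the arithmetic mean. A trivial sketch of that convention (`bin_midpoint` is a hypothetical helper name, not the actual reader code):

```python
def bin_midpoint(kmin, kmax):
    """Return the mid value of a kinematic bin when the experiment
    does not provide a central value: the arithmetic mean of the edges."""
    return 0.5 * (kmin + kmax)
```

When the experiment does supply a central value, that value should be kept as-is, since it need not coincide with the midpoint.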
Total cross section and statistical errors are very different: old: 818 +- 8 By looking at the paper, the second value is clearly correct; does that mean that the old value was wrong?
For this dataset the old and new numerical values for both the data and the uncertainties are the same; what has changed, however, is the name (and type) of the uncertainties. In the old dataset all uncertainties were considered multiplicative and the names were: instead, in the new one, two are multiplicative with names: My question here is: should I worry about the change of name? I want to test as many things as possible, but I don't want to raise warnings / errors for things that are just a convention which might have changed.
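A name-agnostic comparison would sidestep the renaming question: pair old and new uncertainty columns by their numerical content first, and only then inspect the labels. A sketch under that idea (`match_uncertainties` is a hypothetical helper, not the actual test code):

```python
def match_uncertainties(old_cols, new_cols, tol=1e-6):
    """Pair old and new uncertainty columns by numerical content,
    ignoring the (possibly renamed) labels.

    Each argument is a dict mapping an uncertainty name to a list of
    per-point values. Returns a list of (old_name, new_name) pairs;
    columns that cannot be matched are simply left out.
    """
    pairs = []
    unused = dict(new_cols)
    for old_name, old_vals in old_cols.items():
        for new_name, new_vals in list(unused.items()):
            if len(old_vals) == len(new_vals) and all(
                abs(o - n) <= tol for o, n in zip(old_vals, new_vals)
            ):
                pairs.append((old_name, new_name))
                del unused[new_name]
                break
    return pairs
```

Any old column left unpaired would then flag a genuine numerical difference rather than a mere convention change.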
Something has also gone wrong for these datasets, "ATLAS_TTBAR_8TEV_LJ_DIF_PTT-NORM": the names of some of the uncertainties contain commas, slashes and spaces, which I'm not sure are intentional (it looks more like a parsing problem). If it is intentional (and the names should contain commas and slashes), please remove the trailing whitespace.
I guess that @t7phy is using the names given by the experimentalists to the systematic uncertainties (as given on HepData), which do include commas and slashes. The advantage of this is having a one-to-one correspondence with HepData. Would it be a problem to keep these names?
No, it's not. They just looked funny in comparison with other names, so I thought it might've been a bug. The only problematic thing is the trailing whitespace at the end.
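A one-line normalisation at parse time would keep the HepData names verbatim (commas, slashes and all) while dropping the uninformative whitespace. A sketch, assuming a hypothetical helper rather than the actual parser code:

```python
def normalise_sysname(name):
    """Keep the experimentalists' HepData uncertainty name as-is,
    but strip leading/trailing whitespace, which carries no information
    and only causes lookup mismatches."""
    return name.strip()
```

Applying this uniformly on both the writing and the reading side would make the whitespace question moot.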
OK. @t7phy please check.
If the whitespaces are needed for whatever reason I can work around them. Another thing I would like to make sure of is that the uppercase/lowercase convention (for instance for the kinematic variables) is consistent. For instance in Btw, @enocera, I'm having problems with two datasets in theory 200 for which the fktables are missing (although the compound files do exist). These are
That is weird. They should be there. The applgrids are there, so in the worst case I can re-generate the FK tables with apfelcomb. It is also weird because we used these datasets in NNPDF4.0 in order to define the baseline dataset. But maybe I removed them at some point because we decided not to use these datasets. Another thing to check.
It sounded weird indeed. They are there for theory 163. (Of course they are also missing from theory 400, but that's expected, since we ignored everything that was not in the final dataset.)
Other than the comments above, I've tested all the ATLAS datasets that have an older counterpart (see table). For some, the number of systematic uncertainties is wildly different. For instance, for Is this a bug or a feature? For other quantities (for instance central data) there is agreement between old and new at a better-than-percent level.
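The old-vs-new comparison described here can be sketched as a simple relative-tolerance scan over the central values. This is a hypothetical helper, not the actual test code; the 1% default echoes the "better-than-percent" agreement mentioned above:

```python
import math

def compare_central_values(old, new, rtol=1e-2):
    """Return the indices of data points whose old and new central
    values differ by more than `rtol` in relative terms."""
    assert len(old) == len(new), "old and new datasets must have the same length"
    return [
        i for i, (o, n) in enumerate(zip(old, new))
        if not math.isclose(o, n, rel_tol=rtol, abs_tol=0.0)
    ]
```

An empty return value means full agreement at the chosen tolerance; a non-empty one pinpoints exactly which bins to inspect by hand.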
I have pushed the following corrections:
Concerning data values, number of uncertainties, etc.: I don't have any idea about this, because tbh I did not refer to any old implementations. I implemented all datasets based on the agreed-upon principles and conventions as explained to me by @enocera. I also did not want to read cpp code :P What I can say is that I am pretty confident that all the value and error data provided to us is taken into account, as I have tried to limit manual handling in favour of automatic handling for every piece of data and uncertainty, to minimize errors introduced by me in the implementation. Concerning name changes: the name and type changes for the uncertainties in ***_TOT_X-SEC are ok, because @enocera had told me that I am free to choose the naming convention for the new implementation, as long as it is consistent across datasets. Going with ATLASLUMI7 (for 7 TeV) and so on makes it easy to remember and easy to keep consistent, so I think that can be safely ignored, as there is no material change.
Ok! Just to confirm: is this only for the totals? (i.e., for the others I should check the names as well?)
Ok, I'll then compile a list with the differences so that @enocera and others can check whether they are acceptable/expected.
I don't think the names need to be checked for the others either, because in the new implementation all names are taken from the HepData entries. Also, within the types there will definitely be differences: for instance, the lumi type will be ATLASLUMI7 or ATLASLUMI8 instead of whatever was used previously. So I don't think it's needed.
Thanks, that would be helpful.
@scarlehoff The reason why the number of sys uncs differs may be that, in one version (the old one), + and - uncs are implemented as independent uncs (as sometimes suggested by the experimentalists), while in the other version (the new one), + and - uncs are symmetrised, and therefore two uncs are reduced to one. This is a case in which we should have two variants. The version implemented by @t7phy is the one with symmetrised uncs (I'm going to take care of the old version).
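The symmetrisation @enocera describes can be sketched as follows, using one common convention in which a symmetrised uncertainty plus a shift of the central value are derived from the signed up/down shifts. This is an illustration only, not necessarily the exact prescription used in the new implementation:

```python
def symmetrise(delta_plus, delta_minus):
    """Symmetrise an asymmetric uncertainty pair.

    `delta_plus` is the (positive) upward shift and `delta_minus` the
    (negative) downward shift, as often quoted by experimentalists.
    Returns (sigma, shift): the symmetrised uncertainty and the shift
    to apply to the central value, in a common convention:
        sigma = (delta_plus - delta_minus) / 2
        shift = (delta_plus + delta_minus) / 2
    """
    sigma = (delta_plus - delta_minus) / 2.0
    shift = (delta_plus + delta_minus) / 2.0
    return sigma, shift
```

Under this convention the two independent columns of the old format collapse into a single column in the new one, which is exactly why the column counts cannot match without a variant.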
I've now tested all ATLAS datasets (I put a checkmark in the first post: ✔ when it is read correctly and is compatible with the old data, x when it isn't compatible, but is read correctly). The ATLAS datasets all work. For CMS there are a few "errors" (typos and convention differences):
Same for: Then, for edit: note that most of the CMS datasets agree perfectly between the old and new versions, which is a good thing, but makes me more nervous about the ones that don't in the ATLAS case :P
For the H1 and ZEUS datasets instead, the only problem has been
Hi @scarlehoff
😎
Thanks! fixed.
We agreed that there would be a minimum of 3
Same as above.
That's weird, have a look at this filter file from the old buildmaster: https://github.com/NNPDF/nnpdf/blob/master/buildmaster/filters/CMSTOPDIFF.cc . If you look at the individual distributions, this filter should generate it... shouldn't it?
Actually, no, it's not a typo: the dataset gives the last bin as 5000 < Q^2, so that's what I added. @enocera, for this last bin, both the paper and the HepData table do not give a right-side bin edge. However, there are some other distributions where the last bin edges are given as 5000 < Q^2 < 100000, so I have (now) set the bin edge the same in this case also, because we need both edges to set the mid value. Is that fine, or can that cause any issues?
Then please add that information to the documentation PR. However, in many places in the code the number of kinematic variables is hardcoded to 3, so in the parser I will need to ignore any variables beyond the 3rd one.
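The parser-side workaround mentioned here could look something like the following. This is an illustrative sketch, not the actual parser; `limit_kinematics` is a hypothetical name, and the cut-off of 3 mirrors the hardcoded value mentioned above:

```python
def limit_kinematics(kin_variables, nkin=3):
    """Truncate the list of kinematic variables to the first `nkin`.

    Parts of the existing code hardcode three kinematic variables, so
    any extra variables are dropped at parse time for now; they remain
    available in the commondata files themselves.
    """
    if len(kin_variables) > nkin:
        return kin_variables[:nkin]
    return list(kin_variables)
```

As noted later in the thread, this is only a stopgap: once the hardcoded limit is removed throughout the code, the truncation can be dropped.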
It seems the dataset was deemed as useless and removed at some point 4 years ago 1be1969#diff-a0021bc54a20bac4b2e758a1b0945c8699dd0de5b2781bac8d81910375b4b3bd
This is ok, but it is something that can also be dealt with in the parser, if you think it is going to happen again in the future (because other datasets will be similar, for instance).
The 3rd and 4th variables are the ones in which the cross section is double differential, so I don't think we can ignore them. Also, I don't understand this, as @enocera had told me that one of the benefits of the new format is that we can use 3 or more variables now.
In the commondata file yes (I just updated the parser), and once everything is done someone can go through the code and make sure that the number
:)
This issue collects the status of the new data implementation.
The fully implemented datasets can be found in PR #1813
The reader for the new commondata format is in PR #1678
All datasets merged to #1813 have been tested and work with the reader of PR #1678.
The documentation for the new CD format is instead in PR #1708
For people implementing new datasets: whenever a new dataset is implemented and is considered final,
please open a PR pointing to #1813. See for instance #1821
It is not necessary to check that it works with the reader, although it would be much appreciated, of course.
It is understood that datasets for which there is no PR pointing to #1813 are not yet complete.
All datasets which have an "older" version will be compared and checked, and if a dataset is deemed equal (or compatible up to bug fixes) it will be added to this list; only those are considered fully implemented.
Reports and examples
Implemented branches
New jet data (implemented in #1821)
https://vp.nnpdf.science/3vtMtF9yQFe1V-kmiGbh2w==
Branches under review
DIS+j
New DIS + j #1825
DY (reimplementation) #1846
LHCB DY (1) #1826
ATLAS DY (1) #1866
CMS DY (1) #1869
Jets
Atlas/CMS jet/dijet #1785
Top
ttb ATLAS #1837
ttb CMS #1836
ttb total #1834
Datasets from NNPDF4.0 (all done in #1813)