Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New CommonData Reader #1678

Merged
merged 53 commits into from
Mar 4, 2024
Merged

Conversation

scarlehoff
Copy link
Member

@scarlehoff scarlehoff commented Feb 23, 2023

This branch contains not only the final reader, but also the new commondata and all changes from master.

Some checks performed

The fitbot there act as a check that old theories can still be used (the fitbot uses theory 200... it will soon be changed to theory 700). There are small changes in the fitbot (while the regression tests have not changed) because there are some dataset for which the precision of the data has changed (e.g., we had 4 digits before and now have 5), but it is only a few LHCB and maybe maybe jets, so it doesn't show up in any regression.

I will submit another fitbot just before the merge to update the reference bot.

This is a comparison to the last baseline (NNPDF40_nnlo_as_01180_qcd) where I've used exactly the same runcard (i.e., I haven't changed the names of the datasets that have been translated automatically by vp) https://vp.nnpdf.science/ClK5YFI-TjCBkzTeewuFow== (since the report is also done with new data, it might be informative to compare to a report done with old data 1)

And this one is the same fit but this time the names of the runcard have also been updated: https://vp.nnpdf.science/QaBlf8XvSmSe8UWMvzIy3g==

In both cases they are <60 replica fits.

The fits have been done with this commit 1a8bf48 which corresponds to this one 7e6599a before the rebase on top of the last batch of comments in the other branch.

Checks for sets of data separated by experiment

https://vp.nnpdf.science/chwFM_lJR025vREJ1I9VPQ==
https://vp.nnpdf.science/Toy_r6uFRm-h1oUFEZF6hw==
https://vp.nnpdf.science/-2VyNN3CTHWnb26fUQkUJQ==
https://vp.nnpdf.science/YPAQHvMtTeyBuNq12nl6RQ==
https://vp.nnpdf.science/UHU-TYLJQCuE-8lRHe8Aog==
https://vp.nnpdf.science/oSZLrPg3Tyyf-i2mE33G-Q==
https://vp.nnpdf.science/wCFRKBNsSA2U2O_6Sa75Zw==
https://vp.nnpdf.science/E0IDIgWFRF6tvdFk3pD0mA==
https://vp.nnpdf.science/0YL4eaPTT7eR3wM9IQwj5w==


You should be able to use it with

git checkout final_reader_for_new_commondata_mk2

You can even install this in a separated isolated virtualenvironment in your computer without having to clone or checkout the repository with:

python -m venv nnpdf_venv
source nnpdf_venv/bin/activate
python -m pip install git+https://github.com/NNPDF/nnpdf.git@final_reader_for_new_commondata_mk2

Leaving the original msg here:

This branch / PR is the final reader for the new commondata files.

As you can see, this is a bit less ambitious than the previous PR. I've added a reader and only a reader. The changes to other parts of the code are minimal and well contained*. The reader is able to read the new commondata but the output is an object of the old type. This will allow me to keep this branch on track with master.

@enocera in principle people can branch out of this branch to implement the new dataset so that they can test. As long as they don't change the code itself there will be no conflicts. If people find it difficult I can test the new commondatas myself as they get implemented.

@Zaharid since this branch is only going to add a reader and thus there will be no conflicts with master, feel free to comment or even modify the parsers to your heart's content. This is even almost orthogonal to people's implementation of the new data so it should be fine.

For people implementing new commondata:

The new data should be found together with the old commondata inside the commondata/new folder found in: validphy2/src/validphys/datafiles/. Note that the "new" part will dissapear. It is just to have the new and old commondata files differentiated.

So, for instance, if you implement a set of data like CMS_TTBAR_13TEV_2L_DIF then it will be inside: validphy2/src/validphys/datafiles/commondata/new/CMS_TTBAR_13TEV_2L_DIF .

If you installed the code with pip install -e . then you can update the metadata there, no need to reinstall or anything (like it used to be the case with cmake).

Then you can load the new commondata in the same way you would do so with Validphys
The you can load the new commondata with:

from validphys.api import API
ds = API.dataset(dataset_input={"dataset": "DATASET"}, theoryid=400, use_cuts="internal")

Important: take into account that the theory id must be one of the new ones, since the theory is read from the commondata itself.

TODO

  • There's a few "TODO" within the code. These are things to be changed at the end since they might require changes somewhere else or that are useful for debugging.
  • One thing missing is to make sure that anywhere in vp where peek_commondata_metadata is used one can also use the new CommonMetaData class. It should be a trivial change but will wait until the end.
  • Remove the cases in which there are both load and load_commondata (as now they will both do the same thing)

(feel free to edit this comment with other to-do items that might be important)

*note that this is possible only now that there is no longer the need to keep C++ compatible or literal objects in the pipeline.

@scarlehoff
Copy link
Member Author

@enocera how can I check that I'm reading all data and uncertainties correctly? The number of systematics in this commondata is not the same as in the old one, isn't it? Or the old one contains all the variations in the same table?

Screenshot 2023-02-23 at 14 50 52

vs

Screenshot 2023-02-23 at 14 51 05

so, data and statistics seem correct but the systematic part not really?

Also, the process name is different. Is this fine? I'm using the nnpdf31_process key.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

I think you may have merged something in a funny way: There are a bunch of duplicated classes such as ValidKinematics, Variant and so on.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

Continuing from here, #1679 (comment),
one goal I definitively want to keep is that data files work fine in an incomplete state, with data, theory or various other pieces missing, with only kinematics and metadata being strictly required. There should be mechanisms (like e.g. in vp config) that check that the dataset has enough stuff for the requisite action but e.g. actions not requiring experimental uncertainties should not crash if we don't have them.

Now, at the moment one needs to define some thoery metadata always. The easiest way to satisfy the above would be to make that optional, but I think one can make an argument to also offload it to a separate file like we do for the uncertainties.

@enocera
Copy link
Contributor

enocera commented Feb 23, 2023

@enocera how can I check that I'm reading all data and uncertainties correctly? The number of systematics in this commondata is not the same as in the old one, isn't it?

It isn't indeed. I realise that all the data set variants are actually the same (they shouldn't) and that they correspond to the variant of the data set with no nuclear uncertainties. So the extra systematics in the old data set are for the extra nuclear uncertainties that have not been implemented (yet?) in any of the variants of the new data set.

Also, the process name is different. Is this fine? I'm using the nnpdf31_process key.

That is just a mistake.

@scarlehoff
Copy link
Member Author

one goal I definitively want to keep is that data files work fine in an incomplete state

Having the theory be optional (so that one can deal with the data in some ways before having a prediction for it) I think it is a good idea.
However, this

with only kinematics and metadata being strictly required

is too lenient.

Of the current set of required options (I've taken out theory)

    setname: str
    ndata: int
    observable: dict
    kinematics: ValidKinematics
    kinematic_coverage: dict
    data_central: ValidPath
    data_uncertainties: list[ValidPath]
    dataset_label: str
    plot_x: str
    figure_by: list[str]
    nnpdf_metadata: dict
    version: int

I think only nnpdf_metadata could be made optional. All the others I believe are important from the first minute to have a valid implementation.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

I meant that out of the possible external per-point files (fk-{stuff}, uncertainties, experimental values, kinematics), only the kinematics are intrinsic to what the dataset it, with the rest being optional and subject to being overwritten by variants.

I don't have strong opinions on what should be required wrt the top level metadata but, if we start more strict, we can always add defaults later on.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

I guess it follows that the current coredata.Commondata structure is but one of the possible views of the data, which is enabled when there are central values and uncertainties. I guess that suggests that the parser for commondata (or at any rate something in config) should be more complicated (by having checks for having the relevant files).

We could also use a bunch of boolean methods for checks like has_experimental_values, has_theory and so on.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

Would you like to move the notebook to some example directory or similar?

@scarlehoff
Copy link
Member Author

The notebook is just there for people to use it while they are implementing new cd. It won't be merged to master.

should be more complicated (by having checks for having the relevant files)

I was planning to check on construction. But if we allow for them to be empty I guess yes, the parser will need to do that.

We could also use a bunch of boolean methods for checks like has_experimental_values, has_theory and so on.

sure

the current coredata.Commondata structure is but one of the possible views of the data

And will be the only one for the time being I guess. It is just the CommonMetaData that will be allowed to have missing pieces (so that if in the future there's an use case for a half-constructed CommonData it can be done) but I consider the current* coredata.Commondata as the final realisation of the dataset.

*understanding current as "whatever is in master". I'm happy changing that and as I said at the beginning I'll keep rebasing this branch so that it is able to track whatever the CommonData needs to have in master.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

And will be the only one for the time being I guess. It is just the CommonMetaData that will be allowed to have missing pieces (so that if in the future there's an use case for a half-constructed CommonData it can be done) but I consider the current* coredata.Commondata as the final realisation of the dataset.

I guess there are a few possible ways to go about it, which have different trade offs in terms of how much code we break and how elegant they are. For example, we have said that we want positivity predictions to be basically commondata files without experimental central values (which is why the central values are in a separate, optional, file, different from the uncertainties). We could make that be a "normal" commondata file that just returns zero when asked about central values,. That is close what we have now, but less nice. We could make the commondata object itself have only optional central values, but that would break a lot of stuff unless we walk over the whole code and add checks as needed.

Or we could have a bunch of different "final realisations" that depend on what capabilities an action needs (e.g. the inputs would be things like theory_predictions, experimental_values, systematics_matrix and so on; maybe wrapped with the relevant metadata), in which case, we would have to have several data structures. This is closes to how I envision it. We would start only with commondata but move positivity to that framework fairly quickly (likely before any of this gets merged into master). This is closer to how I envision it.

@Zaharid
Copy link
Contributor

Zaharid commented Feb 23, 2023

Actually would be kind of nice to have some example notebook. People do seem to like those.

@scarlehoff
Copy link
Member Author

scarlehoff commented Feb 23, 2023

People do seem to like those.

But then the example notebooks need to be maintained. And so they would need to be tested, etc, etc.

Or we could have a bunch of different "final realisations" that depend on what capabilities an action needs (e.g. the inputs would be things like theory_predictions, experimental_values, systematics_matrix and so on; maybe wrapped with the relevant metadata)

This I like. It was also my original idea (two branches ago)

We would start only with commondata but move positivity to that framework fairly quickly (likely before any of this gets merged into master)

If you mean, "this is done in master and then this is modified to follow that master as required" I completely agree. I'm not planning to merge this until all the new commondata is ready.

I'm currently pruning the dangling methods, attributes and functions that have been left from the python commondata and closure test. That will allow us to take out the C++. Then I'll try to unify the interfaces (so we don't have things like the len / ndata problem of the other day or the fact that currently CommonData gives the central value with two different methods). I'm hoping that will also help me with the to-do in the OP about the commonmetadata.
After that is done we can see about change the systematics / data / theory within the commondata so that it is general enough.

@scarlehoff scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from 5290ba1 to 9849418 Compare March 29, 2023 15:10
@scarlehoff
Copy link
Member Author

The reader is now updated for the new format.

The only thing I'm not happy with how I've solved (@Zaharid) is the way I'm keeping track of the parent since with validphys I'm loading the SetMetaData which can include many ObservableMetaData. Then I select a particular observable.

But I would like this Observable to still know everything about its parent Set. My easy solution is to say, at loading time (since there is only one way to load it and it is the parent doing it), observable._parent = self but that can be broken. Since it is an attribute I could be evil and say somewhere else ._parent = None... but maybe I'm overthinking this.

For the rest, nothing changes with respect to the previous loader. @t7phy if you update the top data to use the new naming format (maybe you already did, I don't remember right now) and @enocera adds a few DIS datasets (with variants, which is something not well tested yet!) I'll update the notebook to do a test to ensure that the data/uncertainties/kinematics have not changed (I would like to do such a test before merging this into master).

I believe it might be interesting to merge this into master in a way in which both commondata formats are allowed at the same time since it will surely help people with debugging and porting (but this is a discussion for another moment ofc)

@t7phy
Copy link
Member

t7phy commented Mar 29, 2023

For the rest, nothing changes with respect to the previous loader. @t7phy if you update the top data to use the new naming format (maybe you already did, I don't remember right now) and @enocera adds a few DIS datasets (with variants, which is something not well tested yet!) I'll update the notebook to do a test to ensure that the data/uncertainties/kinematics have not changed (I would like to do such a test before merging this into master).

#1684 is now updated for testing with the reader based on all newly agreed conventions. Have a go at the reader test, in case I have missed something, please let me know. @scarlehoff

@Zaharid
Copy link
Contributor

Zaharid commented Mar 29, 2023

At this point it would really help to have #1691 , because it is starting to confuse me.

Do I remember correctly that "all the things about the parent" are the metadata? I don't think there is anything wrong with passing the metadata as self, but I think that instead of adding hacks to the dataclass that is supposed to more or less represent the input verbatim, I would make a new one.

@scarlehoff
Copy link
Member Author

At this point it would really help to have #1691 , because it is starting to confuse me.

I think we are close to be able to document it.

Do I remember correctly that "all the things about the parent" are the metadata?

No. The metadata we are interested in within vp is the ObservableMetaData (plotting information, ndata, etc). The parent only contains information about the folder where the observable is, the experiment and the hepdata/arxiv/etc information.

@t7phy
Copy link
Member

t7phy commented Mar 30, 2023

@scarlehoff could you please test the reader on this specific dataset/folder: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/ATLAS_TTBAR_8TEV_LJ_DIF As you know, for the other PR i still need to delete theory, add 3rd kinematic variable and so on. This particular dataset should have every single point that we have agreed on in the last few days. Have a go and please let me know if this is fine.

@scarlehoff
Copy link
Member Author

You have to remove the file key.

But I'll try it tomorrow

@t7phy
Copy link
Member

t7phy commented Mar 30, 2023

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested:
https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1
https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1

so I expect the 4 of these so far to be fully compatible with the reader (I hope).

@enocera
Copy link
Contributor

enocera commented Mar 30, 2023

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1 https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1

so I expect the 4 of these so far to be fully compatible with the reader (I hope).

@t7phy Although these are for new data sets (not in NNPDF4.0), aren't they? So perhaps @scarlehoff wants to try first the old data sets.

@t7phy
Copy link
Member

t7phy commented Mar 31, 2023

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1 https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1
so I expect the 4 of these so far to be fully compatible with the reader (I hope).

@t7phy Although these are for new data sets (not in NNPDF4.0), aren't they? So perhaps @scarlehoff wants to try first the old data sets.

https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/ATLAS_TTBAR_8TEV_LJ_DIF and #1684 are old, the other 2 are new.

Copy link
Member

@RoyStegeman RoyStegeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I see you're implementing comments now, here are some more

validphys2/src/validphys/config.py Outdated Show resolved Hide resolved
validphys2/src/validphys/coredata.py Show resolved Hide resolved
validphys2/src/validphys/coredata.py Show resolved Hide resolved
validphys2/src/validphys/coredata.py Show resolved Hide resolved
validphys2/src/validphys/commondataparser.py Outdated Show resolved Hide resolved
validphys2/src/validphys/commondataparser.py Outdated Show resolved Hide resolved
validphys2/src/validphys/commondataparser.py Show resolved Hide resolved
@RoyStegeman
Copy link
Member

What is the status of #1708?

@scarlehoff
Copy link
Member Author

What is the status of #1708?

Not bad, but not fantastic.

Copy link
Member

@Radonirinaunimi Radonirinaunimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very tiny one (again) that I've just stumbled upon when skimming through.

validphys2/src/validphys/pineparser.py Outdated Show resolved Hide resolved
@RoyStegeman
Copy link
Member

Not bad, but not fantastic.

Okay I think that would deserve relatively high priority, especially once this is merged.

Add a library of process dependent options
@scarlehoff scarlehoff added the run-fit-bot Starts fit bot from a PR. label Mar 3, 2024
validphys2/src/validphys/filters.py Outdated Show resolved Hide resolved
validphys2/src/validphys/loader.py Outdated Show resolved Hide resolved
validphys2/src/validphys/loader.py Outdated Show resolved Hide resolved
validphys2/src/validphys/loader.py Outdated Show resolved Hide resolved
validphys2/src/validphys/commondataparser.py Show resolved Hide resolved
@scarlehoff scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Mar 4, 2024
@scarlehoff
Copy link
Member Author

The previous fitbot failed because of a problem in the server, but the fit was uploaded.
I just submitted the fitbot so that we have a final report, and then will merge this PR.

Copy link

github-actions bot commented Mar 4, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

Copy link
Member

@RoyStegeman RoyStegeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you put a lot of work in this

@RoyStegeman RoyStegeman linked an issue Mar 4, 2024 that may be closed by this pull request
@scarlehoff scarlehoff merged commit 22e2a9f into master Mar 4, 2024
8 checks passed
@scarlehoff scarlehoff deleted the final_reader_for_new_commondata_mk2 branch March 4, 2024 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset variants
8 participants