New CommonData Reader #1678

scarlehoff · 2023-02-23T13:26:33Z

This branch contains not only the final reader, but also the new commondata and all changes from master.

Some checks performed

The fitbot there act as a check that old theories can still be used (the fitbot uses theory 200... it will soon be changed to theory 700). There are small changes in the fitbot (while the regression tests have not changed) because there are some dataset for which the precision of the data has changed (e.g., we had 4 digits before and now have 5), but it is only a few LHCB and maybe maybe jets, so it doesn't show up in any regression.

I will submit another fitbot just before the merge to update the reference bot.

This is a comparison to the last baseline (NNPDF40_nnlo_as_01180_qcd) where I've used exactly the same runcard (i.e., I haven't changed the names of the datasets that have been translated automatically by vp) https://vp.nnpdf.science/ClK5YFI-TjCBkzTeewuFow== (since the report is also done with new data, it might be informative to compare to a report done with old data 1)

And this one is the same fit but this time the names of the runcard have also been updated: https://vp.nnpdf.science/QaBlf8XvSmSe8UWMvzIy3g==

In both cases they are <60 replica fits.

The fits have been done with this commit 1a8bf48 which corresponds to this one 7e6599a before the rebase on top of the last batch of comments in the other branch.

Checks for sets of data separated by experiment

https://vp.nnpdf.science/chwFM_lJR025vREJ1I9VPQ==
https://vp.nnpdf.science/Toy_r6uFRm-h1oUFEZF6hw==
https://vp.nnpdf.science/-2VyNN3CTHWnb26fUQkUJQ==
https://vp.nnpdf.science/YPAQHvMtTeyBuNq12nl6RQ==
https://vp.nnpdf.science/UHU-TYLJQCuE-8lRHe8Aog==
https://vp.nnpdf.science/oSZLrPg3Tyyf-i2mE33G-Q==
https://vp.nnpdf.science/wCFRKBNsSA2U2O_6Sa75Zw==
https://vp.nnpdf.science/E0IDIgWFRF6tvdFk3pD0mA==
https://vp.nnpdf.science/0YL4eaPTT7eR3wM9IQwj5w==

You should be able to use it with

git checkout final_reader_for_new_commondata_mk2

You can even install this in a separated isolated virtualenvironment in your computer without having to clone or checkout the repository with:

python -m venv nnpdf_venv
source nnpdf_venv/bin/activate
python -m pip install git+https://github.com/NNPDF/nnpdf.git@final_reader_for_new_commondata_mk2

Leaving the original msg here:

This branch / PR is the final reader for the new commondata files.

As you can see, this is a bit less ambitious than the previous PR. I've added a reader and only a reader. The changes to other parts of the code are minimal and well contained*. The reader is able to read the new commondata but the output is an object of the old type. This will allow me to keep this branch on track with master.

@enocera in principle people can branch out of this branch to implement the new dataset so that they can test. As long as they don't change the code itself there will be no conflicts. If people find it difficult I can test the new commondatas myself as they get implemented.

@Zaharid since this branch is only going to add a reader and thus there will be no conflicts with master, feel free to comment or even modify the parsers to your heart's content. This is even almost orthogonal to people's implementation of the new data so it should be fine.

For people implementing new commondata:

The new data should be found together with the old commondata inside the commondata/new folder found in: validphy2/src/validphys/datafiles/. Note that the "new" part will dissapear. It is just to have the new and old commondata files differentiated.

So, for instance, if you implement a set of data like CMS_TTBAR_13TEV_2L_DIF then it will be inside: validphy2/src/validphys/datafiles/commondata/new/CMS_TTBAR_13TEV_2L_DIF .

If you installed the code with pip install -e . then you can update the metadata there, no need to reinstall or anything (like it used to be the case with cmake).

Then you can load the new commondata in the same way you would do so with Validphys
The you can load the new commondata with:

from validphys.api import API
ds = API.dataset(dataset_input={"dataset": "DATASET"}, theoryid=400, use_cuts="internal")

Important: take into account that the theory id must be one of the new ones, since the theory is read from the commondata itself.

TODO

There's a few "TODO" within the code. These are things to be changed at the end since they might require changes somewhere else or that are useful for debugging.
One thing missing is to make sure that anywhere in vp where peek_commondata_metadata is used one can also use the new CommonMetaData class. It should be a trivial change but will wait until the end.
Remove the cases in which there are both load and load_commondata (as now they will both do the same thing)

(feel free to edit this comment with other to-do items that might be important)

*note that this is possible only now that there is no longer the need to keep C++ compatible or literal objects in the pipeline.

scarlehoff · 2023-02-23T13:52:12Z

@enocera how can I check that I'm reading all data and uncertainties correctly? The number of systematics in this commondata is not the same as in the old one, isn't it? Or the old one contains all the variations in the same table?

vs

so, data and statistics seem correct but the systematic part not really?

Also, the process name is different. Is this fine? I'm using the nnpdf31_process key.

Zaharid · 2023-02-23T14:48:23Z

I think you may have merged something in a funny way: There are a bunch of duplicated classes such as ValidKinematics, Variant and so on.

Zaharid · 2023-02-23T15:06:44Z

Continuing from here, #1679 (comment),
one goal I definitively want to keep is that data files work fine in an incomplete state, with data, theory or various other pieces missing, with only kinematics and metadata being strictly required. There should be mechanisms (like e.g. in vp config) that check that the dataset has enough stuff for the requisite action but e.g. actions not requiring experimental uncertainties should not crash if we don't have them.

Now, at the moment one needs to define some thoery metadata always. The easiest way to satisfy the above would be to make that optional, but I think one can make an argument to also offload it to a separate file like we do for the uncertainties.

enocera · 2023-02-23T15:21:09Z

@enocera how can I check that I'm reading all data and uncertainties correctly? The number of systematics in this commondata is not the same as in the old one, isn't it?

It isn't indeed. I realise that all the data set variants are actually the same (they shouldn't) and that they correspond to the variant of the data set with no nuclear uncertainties. So the extra systematics in the old data set are for the extra nuclear uncertainties that have not been implemented (yet?) in any of the variants of the new data set.

Also, the process name is different. Is this fine? I'm using the nnpdf31_process key.

That is just a mistake.

scarlehoff · 2023-02-23T15:23:14Z

one goal I definitively want to keep is that data files work fine in an incomplete state

Having the theory be optional (so that one can deal with the data in some ways before having a prediction for it) I think it is a good idea.
However, this

with only kinematics and metadata being strictly required

is too lenient.

Of the current set of required options (I've taken out theory)

    setname: str
    ndata: int
    observable: dict
    kinematics: ValidKinematics
    kinematic_coverage: dict
    data_central: ValidPath
    data_uncertainties: list[ValidPath]
    dataset_label: str
    plot_x: str
    figure_by: list[str]
    nnpdf_metadata: dict
    version: int

I think only nnpdf_metadata could be made optional. All the others I believe are important from the first minute to have a valid implementation.

Zaharid · 2023-02-23T15:28:42Z

I meant that out of the possible external per-point files (fk-{stuff}, uncertainties, experimental values, kinematics), only the kinematics are intrinsic to what the dataset it, with the rest being optional and subject to being overwritten by variants.

I don't have strong opinions on what should be required wrt the top level metadata but, if we start more strict, we can always add defaults later on.

Zaharid · 2023-02-23T16:51:58Z

I guess it follows that the current coredata.Commondata structure is but one of the possible views of the data, which is enabled when there are central values and uncertainties. I guess that suggests that the parser for commondata (or at any rate something in config) should be more complicated (by having checks for having the relevant files).

We could also use a bunch of boolean methods for checks like has_experimental_values, has_theory and so on.

Zaharid · 2023-02-23T17:37:52Z

Would you like to move the notebook to some example directory or similar?

scarlehoff · 2023-02-23T19:04:13Z

The notebook is just there for people to use it while they are implementing new cd. It won't be merged to master.

should be more complicated (by having checks for having the relevant files)

I was planning to check on construction. But if we allow for them to be empty I guess yes, the parser will need to do that.

We could also use a bunch of boolean methods for checks like has_experimental_values, has_theory and so on.

sure

the current coredata.Commondata structure is but one of the possible views of the data

And will be the only one for the time being I guess. It is just the CommonMetaData that will be allowed to have missing pieces (so that if in the future there's an use case for a half-constructed CommonData it can be done) but I consider the current* coredata.Commondata as the final realisation of the dataset.

*understanding current as "whatever is in master". I'm happy changing that and as I said at the beginning I'll keep rebasing this branch so that it is able to track whatever the CommonData needs to have in master.

Zaharid · 2023-02-23T19:26:49Z

And will be the only one for the time being I guess. It is just the CommonMetaData that will be allowed to have missing pieces (so that if in the future there's an use case for a half-constructed CommonData it can be done) but I consider the current* coredata.Commondata as the final realisation of the dataset.

I guess there are a few possible ways to go about it, which have different trade offs in terms of how much code we break and how elegant they are. For example, we have said that we want positivity predictions to be basically commondata files without experimental central values (which is why the central values are in a separate, optional, file, different from the uncertainties). We could make that be a "normal" commondata file that just returns zero when asked about central values,. That is close what we have now, but less nice. We could make the commondata object itself have only optional central values, but that would break a lot of stuff unless we walk over the whole code and add checks as needed.

Or we could have a bunch of different "final realisations" that depend on what capabilities an action needs (e.g. the inputs would be things like theory_predictions, experimental_values, systematics_matrix and so on; maybe wrapped with the relevant metadata), in which case, we would have to have several data structures. This is closes to how I envision it. We would start only with commondata but move positivity to that framework fairly quickly (likely before any of this gets merged into master). This is closer to how I envision it.

Zaharid · 2023-02-23T19:27:31Z

Actually would be kind of nice to have some example notebook. People do seem to like those.

scarlehoff · 2023-02-23T20:25:59Z

People do seem to like those.

But then the example notebooks need to be maintained. And so they would need to be tested, etc, etc.

Or we could have a bunch of different "final realisations" that depend on what capabilities an action needs (e.g. the inputs would be things like theory_predictions, experimental_values, systematics_matrix and so on; maybe wrapped with the relevant metadata)

This I like. It was also my original idea (two branches ago)

We would start only with commondata but move positivity to that framework fairly quickly (likely before any of this gets merged into master)

If you mean, "this is done in master and then this is modified to follow that master as required" I completely agree. I'm not planning to merge this until all the new commondata is ready.

I'm currently pruning the dangling methods, attributes and functions that have been left from the python commondata and closure test. That will allow us to take out the C++. Then I'll try to unify the interfaces (so we don't have things like the len / ndata problem of the other day or the fact that currently CommonData gives the central value with two different methods). I'm hoping that will also help me with the to-do in the OP about the commonmetadata.
After that is done we can see about change the systematics / data / theory within the commondata so that it is general enough.

scarlehoff · 2023-03-29T15:27:44Z

The reader is now updated for the new format.

The only thing I'm not happy with how I've solved (@Zaharid) is the way I'm keeping track of the parent since with validphys I'm loading the SetMetaData which can include many ObservableMetaData. Then I select a particular observable.

But I would like this Observable to still know everything about its parent Set. My easy solution is to say, at loading time (since there is only one way to load it and it is the parent doing it), observable._parent = self but that can be broken. Since it is an attribute I could be evil and say somewhere else ._parent = None... but maybe I'm overthinking this.

For the rest, nothing changes with respect to the previous loader. @t7phy if you update the top data to use the new naming format (maybe you already did, I don't remember right now) and @enocera adds a few DIS datasets (with variants, which is something not well tested yet!) I'll update the notebook to do a test to ensure that the data/uncertainties/kinematics have not changed (I would like to do such a test before merging this into master).

I believe it might be interesting to merge this into master in a way in which both commondata formats are allowed at the same time since it will surely help people with debugging and porting (but this is a discussion for another moment ofc)

validphys2/src/validphys/commondataparser.py

t7phy · 2023-03-29T17:58:23Z

For the rest, nothing changes with respect to the previous loader. @t7phy if you update the top data to use the new naming format (maybe you already did, I don't remember right now) and @enocera adds a few DIS datasets (with variants, which is something not well tested yet!) I'll update the notebook to do a test to ensure that the data/uncertainties/kinematics have not changed (I would like to do such a test before merging this into master).

#1684 is now updated for testing with the reader based on all newly agreed conventions. Have a go at the reader test, in case I have missed something, please let me know. @scarlehoff

Zaharid · 2023-03-29T18:01:53Z

At this point it would really help to have #1691 , because it is starting to confuse me.

Do I remember correctly that "all the things about the parent" are the metadata? I don't think there is anything wrong with passing the metadata as self, but I think that instead of adding hacks to the dataclass that is supposed to more or less represent the input verbatim, I would make a new one.

scarlehoff · 2023-03-29T18:06:16Z

At this point it would really help to have #1691 , because it is starting to confuse me.

I think we are close to be able to document it.

Do I remember correctly that "all the things about the parent" are the metadata?

No. The metadata we are interested in within vp is the ObservableMetaData (plotting information, ndata, etc). The parent only contains information about the folder where the observable is, the experiment and the hepdata/arxiv/etc information.

t7phy · 2023-03-30T16:52:02Z

@scarlehoff could you please test the reader on this specific dataset/folder: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/ATLAS_TTBAR_8TEV_LJ_DIF As you know, for the other PR i still need to delete theory, add 3rd kinematic variable and so on. This particular dataset should have every single point that we have agreed on in the last few days. Have a go and please let me know if this is fine.

scarlehoff · 2023-03-30T17:10:03Z

You have to remove the file key.

But I'll try it tomorrow

t7phy · 2023-03-30T19:51:25Z

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested:
https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1
https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1

so I expect the 4 of these so far to be fully compatible with the reader (I hope).

enocera · 2023-03-30T20:32:23Z

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1 https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1

so I expect the 4 of these so far to be fully compatible with the reader (I hope).

@t7phy Although these are for new data sets (not in NNPDF4.0), aren't they? So perhaps @scarlehoff wants to try first the old data sets.

t7phy · 2023-03-31T07:23:25Z

You have to remove the file key.
fixed it, thanks!
But I'll try it tomorrow

Ok, you can test this and also #1684 is now fixed. Besides these, the following two can also be tested: https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_1JET_319GEV_351PB-1 https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/H1_2JET_319GEV_351PB-1
so I expect the 4 of these so far to be fully compatible with the reader (I hope).

@t7phy Although these are for new data sets (not in NNPDF4.0), aren't they? So perhaps @scarlehoff wants to try first the old data sets.

https://github.com/NNPDF/nnpdf/tree/gluon_pdf_ncd/buildmaster/ATLAS_TTBAR_8TEV_LJ_DIF and #1684 are old, the other 2 are new.

RoyStegeman

Since I see you're implementing comments now, here are some more

validphys2/src/validphys/config.py

validphys2/src/validphys/coredata.py

validphys2/src/validphys/commondataparser.py

RoyStegeman · 2024-03-01T18:20:55Z

What is the status of #1708?

Co-authored-by: Roy Stegeman <[email protected]>

scarlehoff · 2024-03-01T18:24:15Z

What is the status of #1708?

Not bad, but not fantastic.

Radonirinaunimi

A very tiny one (again) that I've just stumbled upon when skimming through.

validphys2/src/validphys/pineparser.py

RoyStegeman · 2024-03-01T18:28:19Z

Not bad, but not fantastic.

Okay I think that would deserve relatively high priority, especially once this is merged.

Co-authored-by: Tanjona Rabemananjara <[email protected]>

Add a library of process dependent options

validphys2/src/validphys/filters.py

validphys2/src/validphys/loader.py

validphys2/src/validphys/commondataparser.py

Co-authored-by: Roy Stegeman <[email protected]>

scarlehoff · 2024-03-04T10:14:06Z

The previous fitbot failed because of a problem in the server, but the fit was uploaded.
I just submitted the fitbot so that we have a final report, and then will merge this PR.

github-actions · 2024-03-04T12:14:12Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-e8ea50626-2024-03-04
Fit Report wrt master: https://vp.nnpdf.science/0IFTWiE-RzCV0yxUPqn4-g==
Fit Report wrt latest stable reference: https://vp.nnpdf.science/lEB12oQ1RiyLQIjQKARDiQ==
Fit Data: https://data.nnpdf.science/fits/NNBOT-e8ea50626-2024-03-04.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

validphys2/src/validphys/plotoptions/kintransforms.py

RoyStegeman

Thanks, you put a lot of work in this

scarlehoff added the data toolchain label Feb 23, 2023

scarlehoff assigned Zaharid, enocera and scarlehoff Feb 23, 2023

scarlehoff mentioned this pull request Mar 13, 2023

Utils for the new commondata #1693

Closed

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from 5290ba1 to 9849418 Compare March 29, 2023 15:10

Zaharid reviewed Mar 29, 2023

View reviewed changes

validphys2/src/validphys/commondataparser.py Outdated Show resolved Hide resolved

Zaharid reviewed Mar 29, 2023

View reviewed changes

validphys2/src/validphys/commondataparser.py Show resolved Hide resolved

scarlehoff mentioned this pull request Mar 31, 2023

Status of the new commondata format implementation #1709

Closed

77 tasks

fix tevatron energy... it is not gevatron...

5e0df7c

RoyStegeman reviewed Mar 1, 2024

View reviewed changes

scarlehoff added 2 commits March 1, 2024 19:16

updated filters and filters tests

376ec47

3.9 compatibility

7e6599a

Apply suggestions from code review

00be99a

Co-authored-by: Roy Stegeman <[email protected]>

Radonirinaunimi reviewed Mar 1, 2024

View reviewed changes

validphys2/src/validphys/pineparser.py Outdated Show resolved Hide resolved

Update validphys2/src/validphys/pineparser.py

7b8ff6d

Co-authored-by: Tanjona Rabemananjara <[email protected]>

Radonirinaunimi approved these changes Mar 3, 2024

View reviewed changes

Radonirinaunimi mentioned this pull request Mar 3, 2024

Removing traces of Old Commondata #1974

Open

19 tasks

Merge pull request #1949 from NNPDF/add_a_library_of_process_options

07cf9e1

Add a library of process dependent options

scarlehoff added the run-fit-bot Starts fit bot from a PR. label Mar 3, 2024

RoyStegeman reviewed Mar 3, 2024

View reviewed changes

scarlehoff and others added 2 commits March 4, 2024 08:59

Apply suggestions from code review

c95e3a6

Co-authored-by: Roy Stegeman <[email protected]>

more comments ; updated fitbot before merge

c49c8f1

scarlehoff added run-fit-bot Starts fit bot from a PR. and removed run-fit-bot Starts fit bot from a PR. labels Mar 4, 2024

clean up files

0ada415

RoyStegeman reviewed Mar 4, 2024

View reviewed changes

validphys2/src/validphys/plotoptions/kintransforms.py Outdated Show resolved Hide resolved

place (GeV) outside of 40529 in kinlabels

13db094

RoyStegeman approved these changes Mar 4, 2024

View reviewed changes

RoyStegeman linked an issue Mar 4, 2024 that may be closed by this pull request

Dataset variants #494

Closed

scarlehoff merged commit 22e2a9f into master Mar 4, 2024
8 checks passed

scarlehoff deleted the final_reader_for_new_commondata_mk2 branch March 4, 2024 15:32

This was referenced Mar 15, 2024

Positivity sets should be just datasets without experimental data #759

Closed

Replace plotoptions commondata functionality with commondataparser #1172

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New CommonData Reader #1678

New CommonData Reader #1678

scarlehoff commented Feb 23, 2023 •

edited

Loading

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023 •

edited

Loading

enocera commented Feb 23, 2023

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

scarlehoff commented Feb 23, 2023 •

edited

Loading

scarlehoff commented Mar 29, 2023

t7phy commented Mar 29, 2023 •

edited

Loading

Zaharid commented Mar 29, 2023

scarlehoff commented Mar 29, 2023

t7phy commented Mar 30, 2023

scarlehoff commented Mar 30, 2023

t7phy commented Mar 30, 2023

enocera commented Mar 30, 2023

t7phy commented Mar 31, 2023 •

edited

Loading

RoyStegeman left a comment

RoyStegeman commented Mar 1, 2024

scarlehoff commented Mar 1, 2024

Radonirinaunimi left a comment

RoyStegeman commented Mar 1, 2024

scarlehoff commented Mar 4, 2024

github-actions bot commented Mar 4, 2024

RoyStegeman left a comment

New CommonData Reader #1678

New CommonData Reader #1678

Conversation

scarlehoff commented Feb 23, 2023 • edited Loading

Some checks performed

For people implementing new commondata:

TODO

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023 • edited Loading

enocera commented Feb 23, 2023

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

scarlehoff commented Feb 23, 2023

Zaharid commented Feb 23, 2023

Zaharid commented Feb 23, 2023

scarlehoff commented Feb 23, 2023 • edited Loading

scarlehoff commented Mar 29, 2023

t7phy commented Mar 29, 2023 • edited Loading

Zaharid commented Mar 29, 2023

scarlehoff commented Mar 29, 2023

t7phy commented Mar 30, 2023

scarlehoff commented Mar 30, 2023

t7phy commented Mar 30, 2023

enocera commented Mar 30, 2023

t7phy commented Mar 31, 2023 • edited Loading

RoyStegeman left a comment

Choose a reason for hiding this comment

RoyStegeman commented Mar 1, 2024

scarlehoff commented Mar 1, 2024

Radonirinaunimi left a comment

Choose a reason for hiding this comment

RoyStegeman commented Mar 1, 2024

scarlehoff commented Mar 4, 2024

github-actions bot commented Mar 4, 2024

RoyStegeman left a comment

Choose a reason for hiding this comment

scarlehoff commented Feb 23, 2023 •

edited

Loading

Zaharid commented Feb 23, 2023 •

edited

Loading

scarlehoff commented Feb 23, 2023 •

edited

Loading

t7phy commented Mar 29, 2023 •

edited

Loading

t7phy commented Mar 31, 2023 •

edited

Loading