Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make pseudodata_table correctly deal with multiple replicas #2034

Merged
merged 5 commits into from
Apr 11, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 20 additions & 6 deletions validphys2/src/validphys/n3fit_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
Providers which prepare the data ready for
:py:func:`n3fit.performfit.performfit`.
"""

from collections import defaultdict
import functools
import hashlib
Expand Down Expand Up @@ -70,8 +71,7 @@ def __init__(self, group_name, seed, masks=None):
super().__init__(group_name, seed)

def __iter__(self):
for m in self.masks:
yield m
yield from self.masks


def tr_masks(data, replica_trvlseed, parallel_models=False, replica=1, replicas=(1,)):
Expand Down Expand Up @@ -343,7 +343,7 @@ def replica_nnseed_fitting_data_dict(replica, exps_fitting_data_dict, replica_nn

replicas_nnseed_fitting_data_dict = collect("replica_nnseed_fitting_data_dict", ("replicas",))
groups_replicas_indexed_make_replica = collect(
"indexed_make_replica", ("group_dataset_inputs_by_experiment", "replicas")
"indexed_make_replica", ("replicas", "group_dataset_inputs_by_experiment")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the change of order necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thin it makes what we do in pseudodata_table more readable. There we group the entries to groups_replicas_indexed_make_replica corresponding to a given replica. I think that's easier to understand the way it's done now than if we had to take e.g. index 0 and then skip a number of indexes equal to the number of groups to get the second input corresponding to the same replica

)


Expand All @@ -359,10 +359,24 @@ def pseudodata_table(groups_replicas_indexed_make_replica, replicas):
`fitting::savepseudodata` is `true` (as per the default setting) and
replicas are fitted one at a time. The table can be found in the replica
folder i.e. <fit dir>/nnfit/replica_*/

"""
# Concatenate over replicas
df = pd.concat(groups_replicas_indexed_make_replica)
# groups_replicas_indexed_make_replica is collected over both replicas and dataset_input groups,
# in that order. What this means is that groups_replicas_indexed_make_replica is a list of size
# number_of_replicas x number_of_data_groups. Where the ordering inside the list is as follows:
# [data1_rep1, data2_rep1, ..., datan_rep1, ..., data1_repn, data2_repn, ..., datan_repn].

# To correctly put this into a single dataframe, we first need to know the number of
# dataset_input groups there are for each replica
groups_per_replica = len(groups_replicas_indexed_make_replica) // len(replicas)
# then we make a list of pandas dataframes, each containing the pseudodata of all datasets
# generated for a single replica
df = [
pd.concat(groups_replicas_indexed_make_replica[i : i + groups_per_replica])
for i in range(0, len(groups_replicas_indexed_make_replica), groups_per_replica)
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you achieve this with a reshape or permutation of groups_replicas_indexed_make_replica

Copy link
Member Author

@RoyStegeman RoyStegeman Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because groups_replicas_indexed_make_replica is a list of indexed_make_replica containing a number of dataframes equal to number_of_replicas x number_of_datagroups. It's not a really clean input.

Here I group the list items (all different data groups) that correspond to the same replica into a single dataframe for each replica

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just complicating your life here, but would it be possible to do something along the lines of

np.array(groups_replicas_indexed_make_replica).reshape(replicas, groups_per_replica) ?

(or the other way around)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because it's a list of dataframes and I do need to retain the information on the labels

# then we concatentate the pseudodata of all replicas into a single dataframe
df = pd.concat(df, axis=1)
# and finally we add as column titles the replica name
df.columns = [f"replica {rep}" for rep in replicas]
return df

Expand Down