
User data input validation #366

Open
AdrianSosic opened this issue Sep 4, 2024 · 0 comments
Assignees: AdrianSosic
Labels: new feature New functionality

AdrianSosic (Collaborator) commented:
We should discuss whether we want to include a validation step that checks if the user data matches the campaign specifications they have provided. The rationale is that this could avoid hard-to-detect / silent bugs where a user simply forgets to include a parameter in their campaign definition, which could lead to the worst-case scenario of producing recommendations that have been optimized using the wrong problem specs. I think there is a good chance of this happening while people are still experimenting with their setup and trying different problem specifications.

So what we want to avoid is that someone starts with a configuration like this ...

```python
from baybe import Campaign
from baybe.parameters import NumericalContinuousParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

searchspace = SearchSpace.from_product(
    [
        NumericalContinuousParameter("Feature_1", (0, 1)),
        NumericalContinuousParameter("Feature_2", (0, 1)),
    ]
)
objective = NumericalTarget("Target", "MAX")
campaign = Campaign(searchspace, objective)
campaign.add_measurements(df)  # `df` as shown below
campaign.recommend(1)
```

`df` =
| Feature_1 | Feature_2 | Target |
|-----------|-----------|--------|
| 0.3       | 0.6       | 155    |
| 0.7       | 0.1       | 203    |
| 0.1       | 0.2       | 103    |

... and then thinks "great, works, now let me pull in the real data", overlooking that the latter has additional context that is relevant for the model. For instance, they could swap out the dataframe for something like the following, resulting in a situation where the different tasks would be mixed up:

`df` =
| Feature_1 | Feature_2 | Target | Task |
|-----------|-----------|--------|------|
| 0.3       | 0.6       | 155    | A    |
| 0.7       | 0.1       | 203    | A    |
| 0.1       | 0.2       | 103    | B    |

A simple (and at the same time user-friendly) approach could be to add an `allow_extra: bool = False` flag to `Campaign.add_measurements`, which users can explicitly set to `True` if they are certain about what they are doing and want to skip the check, e.g. in order to keep metadata columns in their dataframe.
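A minimal sketch of what such a check could look like, written here as a standalone helper rather than the actual method. Note that the attribute names `campaign.searchspace.parameters` and `campaign.targets` are assumptions for illustration, not a confirmed part of the API:

```python
import pandas as pd


def validate_input_columns(
    campaign, df: pd.DataFrame, allow_extra: bool = False
) -> None:
    """Raise if `df` contains columns the campaign does not know about."""
    # Columns the campaign actually expects: all parameter and target names.
    # NOTE: the attribute names below are assumptions for illustration only.
    expected = {p.name for p in campaign.searchspace.parameters} | {
        t.name for t in campaign.targets
    }
    extra = set(df.columns) - expected
    if extra and not allow_extra:
        raise ValueError(
            f"Unexpected columns {sorted(extra)} in the measurement data. "
            "Pass `allow_extra=True` if they should be ignored (e.g. metadata)."
        )
```

With such a check in place, the `Task` example above would fail loudly instead of silently fitting a model against the wrong problem specs.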

With the check activated, a simple explicit filtering in the spirit of `campaign.add_measurements(df.filter(campaign.columns))` would still do the job.
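For illustration, assuming a hypothetical `campaign.columns` property that lists the expected column names, the two escape hatches would look like:

```python
# Option 1: explicitly drop the metadata columns before ingestion
# (`campaign.columns` is the hypothetical property mentioned above).
campaign.add_measurements(df.filter(campaign.columns))

# Option 2: keep the metadata columns and opt out of the check instead.
campaign.add_measurements(df, allow_extra=True)
```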

Potentially, other places could benefit from this as well.

AdrianSosic added the new feature New functionality label on Sep 4, 2024
AdrianSosic self-assigned this on Sep 4, 2024