
User data input validation #366

Open
AdrianSosic opened this issue Sep 4, 2024 · 0 comments
Assignees: AdrianSosic
Labels: new feature New functionality

AdrianSosic (Collaborator) commented:
We should discuss whether we want to include a validation step that checks if the user data matches the campaign specifications they have provided. The rationale is that this could avoid hard-to-detect / silent bugs where a user simply forgets to include a parameter in their campaign definition, which could lead to the worst-case scenario of producing recommendations that have been optimized using the wrong problem specs. I think there is a good chance of this happening while people are still experimenting with their setup and trying different problem specifications.

So what we want to avoid is that someone starts with a configuration like this ...

```python
from baybe import Campaign
from baybe.parameters import NumericalContinuousParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

searchspace = SearchSpace.from_product(
    [
        NumericalContinuousParameter("Feature_1", (0, 1)),
        NumericalContinuousParameter("Feature_2", (0, 1)),
    ]
)
objective = NumericalTarget("Target", "MAX")
campaign = Campaign(searchspace, objective)
campaign.add_measurements(df)  # `df` as shown below
campaign.recommend(1)
```

`df` =
| Feature_1 | Feature_2 | Target |
|-----------|-----------|--------|
| 0.3       | 0.6       | 155    |
| 0.7       | 0.1       | 203    |
| 0.1       | 0.2       | 103    |

... and then thinks "great, works, now let me pull in the real data", overlooking that the latter has additional context that is relevant for the model. For instance, they could swap out the dataframe for something like the following, resulting in a situation where the different tasks would be mixed up:

`df` =
| Feature_1 | Feature_2 | Target | Task |
|-----------|-----------|--------|------|
| 0.3       | 0.6       | 155    | A    |
| 0.7       | 0.1       | 203    | A    |
| 0.1       | 0.2       | 103    | B    |

A simple (and at the same time user-friendly) approach could be to add an `allow_extra: bool = False` flag to `Campaign.add_measurements`, which users can explicitly set to `True` if they are certain about what they are doing and want to skip the check, e.g. in order to keep metadata columns in their dataframe.
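A minimal sketch of what such a check could look like, written here as a standalone helper rather than the actual method. Note that the attribute names `campaign.searchspace.parameters` and `campaign.targets` are assumptions for illustration, not a confirmed part of the API:

```python
import pandas as pd


def validate_input_columns(
    campaign, df: pd.DataFrame, allow_extra: bool = False
) -> None:
    """Raise if `df` contains columns the campaign does not know about."""
    # Columns the campaign actually expects: all parameter and target names.
    # NOTE: the attribute names below are assumptions for illustration only.
    expected = {p.name for p in campaign.searchspace.parameters} | {
        t.name for t in campaign.targets
    }
    extra = set(df.columns) - expected
    if extra and not allow_extra:
        raise ValueError(
            f"Unexpected columns {sorted(extra)} in the measurement data. "
            "Pass `allow_extra=True` if they should be ignored (e.g. metadata)."
        )
```

With such a check in place, the `Task` example above would fail loudly instead of silently fitting a model against the wrong problem specs.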

With the check activated, a simple explicit filtering in the spirit of `campaign.add_measurements(df.filter(campaign.columns))` would still do the job.
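For illustration, assuming a hypothetical `campaign.columns` property that lists the expected column names, the two escape hatches would look like:

```python
# Option 1: explicitly drop the metadata columns before ingestion
# (`campaign.columns` is the hypothetical property mentioned above).
campaign.add_measurements(df.filter(campaign.columns))

# Option 2: keep the metadata columns and opt out of the check instead.
campaign.add_measurements(df, allow_extra=True)
```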

Potentially, other places could benefit from this as well.

AdrianSosic added the new feature New functionality label on Sep 4, 2024
AdrianSosic self-assigned this on Sep 4, 2024