Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added capability to handle duplicated timestamps in timeseries data. #36

Merged
merged 8 commits into from
Oct 23, 2023

Conversation

pedrofluxa
Copy link
Contributor

This PR fixes the problem of time-series having duplicated timestamps by averaging groups of duplicates (for numerical data) and/or keeping the first duplicated value for non-numerical data.

A very simple unit test was added as well.

@paxcema
Copy link
Contributor

paxcema commented Sep 25, 2023

I think the failing test is due to the input DF no longer being equal in size to the output DF. This is fine, but the test needs to change.

Copy link
Contributor

@paxcema paxcema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While overall the approach is nice, there is one important omission: the input df for clean_timeseries may contain several different groups, each of them with duplicate (but valid!) measurements.

We need to handle this differently for the case when tss['group_by'] is a list. I suggest moving the new procedure into something like deduplicate_single_series, then use this for each existing group.

As it stands, this breaks grouped series flows (e.g. lightwood's grouped time series unit tests), so we shouldn't merge yet.

@paxcema paxcema linked an issue Sep 28, 2023 that may be closed by this pull request
@paxcema
Copy link
Contributor

paxcema commented Oct 23, 2023

Added a test that replicates the original Lightwood test which caught a bug in the original implementation of this PR 👍

dataprep_ml/cleaners.py Show resolved Hide resolved
@paxcema paxcema merged commit 1fbc73f into staging Oct 23, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Multiple time stamp observation handling
2 participants