Added capability to handle duplicated timestamps in timeseries data. #36
Conversation
I think the failing test is due to the input DF no longer being equal in size to the output DF. This is fine, but the test needs to change.
While the overall approach is nice, there is one important omission: the input df for clean_timeseries may contain several different groups, each of them with duplicate (but valid!) measurements. We need to handle this differently for the case when tss['group_by'] is a list. I suggest moving the new procedure into something like deduplicate_single_series, then using it for each existing group.
As it stands, this breaks grouped series flows (e.g. lightwood's grouped time series unit tests), so we shouldn't merge yet.
…adapted accordingly.
Added a test that replicates the original Lightwood test, which caught a bug in the original implementation of this PR 👍
This PR fixes the problem of time series having duplicated timestamps by averaging groups of duplicates (for numerical columns) and keeping the first duplicated value for non-numerical columns.
A very simple unit test was added as well.
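For reference, the behavior described above, including the per-group handling requested in review, could be sketched as follows. This is a hedged illustration, not the actual code in this PR: the function names deduplicate_single_series and clean_timeseries follow the reviewer's suggestion, and the signatures (time_col, group_by) are assumptions.

```python
import pandas as pd

def deduplicate_single_series(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
    """Collapse rows sharing a timestamp: numerical columns are averaged,
    non-numerical columns keep the first duplicated value."""
    numeric = df.select_dtypes(include="number").columns
    agg = {c: ("mean" if c in numeric else "first")
           for c in df.columns if c != time_col}
    return df.groupby(time_col, as_index=False, sort=True).agg(agg)

def clean_timeseries(df: pd.DataFrame, time_col: str, group_by=None) -> pd.DataFrame:
    """Deduplicate within each group separately, so that identical
    timestamps appearing in *different* groups are preserved."""
    if not group_by:
        return deduplicate_single_series(df, time_col)
    parts = [deduplicate_single_series(g, time_col)
             for _, g in df.groupby(group_by, sort=False)]
    return pd.concat(parts, ignore_index=True)

# Example: ts=1 is duplicated within group "x" (averaged, first label kept)
# but also appears in group "y", where it is a valid, distinct measurement.
df = pd.DataFrame({
    "ts":    [1, 1, 2, 1],
    "value": [10.0, 20.0, 5.0, 7.0],
    "name":  ["a", "b", "c", "d"],
    "grp":   ["x", "x", "x", "y"],
})
out = clean_timeseries(df, "ts", group_by=["grp"])
# 3 rows: (1, 15.0, "a", "x"), (2, 5.0, "c", "x"), (1, 7.0, "d", "y")
```

Note that without the per-group dispatch, the rows at ts=1 in groups "x" and "y" would be wrongly merged, which is exactly the grouped-series breakage flagged in the review.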