added capability to handle duplicated timestamps in timeseries data. #36

pedrofluxa · 2023-09-22T20:15:40Z

This PR handles time series with duplicated timestamps: each group of duplicates is averaged for numerical data, and the first value of the group is kept for non-numerical data.
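As a rough illustration of the rule described above (not the code in this PR), the following pandas sketch averages numeric columns and keeps the first value of non-numeric columns for each duplicated timestamp. The function name `deduplicate_by_timestamp` and the column names `ts`, `value` and `label` are assumptions made for this example only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def deduplicate_by_timestamp(df: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    """Collapse rows that share a timestamp (hypothetical helper, not the PR's code)."""
    value_cols = [c for c in df.columns if c != ts_col]
    # Numeric columns are averaged; everything else keeps the first
    # non-null value seen within each group of duplicates.
    agg = {c: "mean" if is_numeric_dtype(df[c]) else "first" for c in value_cols}
    return df.groupby(ts_col, as_index=False).agg(agg)


df = pd.DataFrame({
    "ts": [1, 1, 2, 3, 3],
    "value": [10.0, 20.0, 5.0, 1.0, 3.0],
    "label": ["a", "b", "c", "d", "e"],
})
deduped = deduplicate_by_timestamp(df, "ts")
# Three rows remain, one per distinct timestamp.
```

Note that the output has fewer rows than the input whenever duplicates exist, so any test comparing the two frames by size has to account for that.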

A very simple unit test was added as well.

paxcema · 2023-09-25T19:56:12Z

I think the failing test is due to the input DF no longer being equal in size to the output DF. This is fine, but the test needs to change.

paxcema

While overall the approach is nice, there is one important omission: the input df for clean_timeseries may contain several different groups, so the same timestamp can appear in more than one group as duplicate (but valid!) measurements.

We need to handle this differently for the case when tss['group_by'] is a list. I suggest moving the new procedure into something like deduplicate_single_series, then using it for each existing group.

As it stands, this breaks grouped series flows (e.g. lightwood's grouped time series unit tests), so we shouldn't merge yet.
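The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the merged implementation: `deduplicate_single_series` takes its name from the review comment, but its signature, the `keep_first` parameter, the `deduplicate_grouped` wrapper and the example columns are all assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def deduplicate_single_series(df, ts_col, keep_first=()):
    """Collapse duplicated timestamps within ONE series (hypothetical signature)."""
    value_cols = [c for c in df.columns if c != ts_col]
    # Columns listed in keep_first (e.g. group keys) are never averaged.
    agg = {
        c: "mean" if is_numeric_dtype(df[c]) and c not in keep_first else "first"
        for c in value_cols
    }
    return df.groupby(ts_col, as_index=False).agg(agg)


def deduplicate_grouped(df, ts_col, group_by=None):
    """Apply the single-series procedure to each group separately (hypothetical wrapper)."""
    if not group_by:
        return deduplicate_single_series(df, ts_col)
    if isinstance(group_by, str):
        group_by = [group_by]
    parts = [
        deduplicate_single_series(group_df, ts_col, keep_first=tuple(group_by))
        for _, group_df in df.groupby(group_by, sort=False)
    ]
    return pd.concat(parts, ignore_index=True)


df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "ts": [1, 1, 1, 2, 2],
    "sales": [10.0, 30.0, 100.0, 7.0, 9.0],
})
out = deduplicate_grouped(df, "ts", group_by=["store"])
# ts=1 appears in both stores and is kept once per store, not merged across them.
```

Deduplicating the whole frame by timestamp alone would instead merge the `ts=1` rows of stores A and B into one, which is the grouped-series breakage described in the review.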

tests/integration_tests/test_cleaners.py

paxcema · 2023-10-23T20:43:21Z

Added a test that replicates the original Lightwood test which caught a bug in the original implementation of this PR 👍

dataprep_ml/cleaners.py

added capability to handle duplicated timestamps in timeseries data.

f0f45bb

pedrofluxa requested a review from paxcema September 22, 2023 20:15

pedrofluxa added 2 commits September 22, 2023 20:18

fixed stupid trailing whitespace

389a2ef

another typo

a154e24

fixed integration test by comparing apples to apples

4380a01

paxcema suggested changes Sep 26, 2023

this commit fixes breaking the group_by functionality. unit test was adapted accordingly.

3c035c9

paxcema linked an issue Sep 28, 2023 that may be closed by this pull request

[Bug]: Multiple time stamp observation handling mindsdb/mindsdb#7021

Closed

paxcema reviewed Sep 29, 2023

tests/integration_tests/test_cleaners.py (outdated, resolved)

paxcema and others added 3 commits September 29, 2023 18:40

fix dict retrieval

3b31f33

fixed algorithm to remove data with duplicated timestamps

f36d42b

add deduping test

ffee73e

paxcema reviewed Oct 23, 2023

dataprep_ml/cleaners.py (resolved)

paxcema approved these changes Oct 23, 2023

dataprep_ml/cleaners.py (resolved)

paxcema merged commit 1fbc73f into staging Oct 23, 2023
13 checks passed
