Resample beliefs about instantaneous sensors #118

Merged · 30 commits · Nov 22, 2022
Changes shown from 8 commits.

Commits (30)
55341e8
Support downsampling instantaneous sensors
Flix6x Oct 21, 2022
7f13512
More robust implementation of resampling instantaneous sensor data, t…
Flix6x Oct 31, 2022
53a78dc
Add tests
Flix6x Oct 31, 2022
4f5de41
Add NotImplementedError case
Flix6x Oct 31, 2022
13f1fbc
Simplify util function
Flix6x Oct 31, 2022
66ee209
Expand docstring
Flix6x Oct 31, 2022
a904765
isort
Flix6x Oct 31, 2022
f3feabc
Fix test
Flix6x Oct 31, 2022
167f007
Add two regular (non-DST-transition) resampling cases
Flix6x Nov 9, 2022
3537784
Move index reset dance to within util function
Flix6x Nov 9, 2022
707ea63
Remove upsampling example in a test for downsampling (upsampling shou…
Flix6x Nov 9, 2022
e262dd8
Remove redundant call to drop_duplicates
Flix6x Nov 9, 2022
15f6429
Merge remote-tracking branch 'origin/main' into resample-beliefs-abou…
Flix6x Nov 10, 2022
d1bbb5b
typo
Flix6x Nov 13, 2022
cc00933
Clarify inline comment
Flix6x Nov 13, 2022
9330d16
Add inline explanation about handling DST transitions
Flix6x Nov 17, 2022
b951c78
Revert "Remove upsampling example in a test for downsampling (upsampl…
Flix6x Nov 17, 2022
7bf2217
Add new property: event_frequency
Flix6x Nov 21, 2022
5395a6d
Rename function to resample instantaneous events, rewrite its logic b…
Flix6x Nov 21, 2022
3bd3d35
Add a lot more test cases for resampling instantaneous sensor data
Flix6x Nov 21, 2022
1d98721
Test resolution and frequency for resample instantaneous BeliefsDataF…
Flix6x Nov 21, 2022
b3a70b9
Expand test for resampling instantaneous BeliefsDataFrame with 'first…
Flix6x Nov 21, 2022
bb703f9
clarifications
Flix6x Nov 21, 2022
5621506
Do not cast floats to ints (loss of information)
Flix6x Nov 21, 2022
9dec4da
Restrict resampling to BeliefsDataFrames with 1 row per event
Flix6x Nov 21, 2022
ee00e40
isort and flake8
Flix6x Nov 21, 2022
23353ea
Merge branch 'main' into resample-beliefs-about-instantaneous-sensors
Flix6x Nov 21, 2022
ed18518
Rename test
Flix6x Nov 22, 2022
f943d06
Clarify docstring of function parameters
Flix6x Nov 22, 2022
a4fca7e
Expand resample_events docstring
Flix6x Nov 22, 2022
9 changes: 9 additions & 0 deletions timely_beliefs/beliefs/classes.py
@@ -1347,6 +1347,15 @@ def resample_events(
return self
df = self

# Resample instantaneous sensors
Contributor commented:

I can't tell from the code why your new function only applies to instantaneous sensors. Can you add an explanation of that?

And why is that the one case in which we take so much care over DST transitions? (This resample_events function has two other cases.)

Note: the NB about keep_only_most_recent_belief=True applies to only one case; the reader might assume it applies to all of them.

# The event resolution stays zero, but the data frequency updates
if df.event_resolution == timedelta(0):
index_names = df.index.names
df = df.reset_index().set_index("event_start")
df = belief_utils.downsample_first(df, event_resolution)
df = df.reset_index().set_index(index_names)
return df

belief_timing_col = (
"belief_time" if "belief_time" in df.index.names else "belief_horizon"
)
22 changes: 21 additions & 1 deletion timely_beliefs/beliefs/utils.py
@@ -283,7 +283,9 @@ def join_beliefs(
if output_resolution > input_resolution:

# Create new BeliefsDataFrame with downsampled event_start
if output_resolution % input_resolution != timedelta(0):
if input_resolution == timedelta(
0
) or output_resolution % input_resolution != timedelta(0):
raise NotImplementedError(
"Cannot downsample from resolution %s to %s."
% (input_resolution, output_resolution)
@@ -749,3 +751,21 @@ def extreme_timedeltas_not_equal(
if isinstance(td_a, pd.Timedelta):
td_a = td_a.to_pytimedelta()
return td_a != td_b


def downsample_first(df: pd.DataFrame, resolution: timedelta) -> pd.DataFrame:
Contributor commented:

This function name was not helpful to me at all, and the comment also didn't help much.

Is this downsampling that opts for the first event in non-obvious cases?

Also, handling DST correctly is so crucial that we could even consider including it in the name.

Collaborator Author (Flix6x) replied:

Thanks for raising this issue. For me it boils down to a more fundamental question: does resampling affect the data frequency or the event resolution? Hopefully these examples illustrate the question:

Let's say there is a sensor recording the average wind speed over a period of 3 seconds. That is, its event resolution is 3 seconds. And let's say a record is made every second, so we get a sliding window of measurements. That is, its data frequency is 1 second. (To be explicit, I'm following pandas terminology here, which measures frequency in units of time, say, s rather than 1/s.)

Now we want to resample and save the results to a new sensor that records the maximum 3-second wind gust in a 10 minute period, for every sequential period of 10 minutes (this is actually a useful measure, e.g. reported by KNMI). That is, the new sensor has an event resolution of 10 minutes, and also a data frequency of 10 minutes.

This resampling operation is downsampling, with both the event resolution and the data frequency being downsampled to 10 minutes. Something like .resample(pd.Timedelta("PT10M")).max() should do the trick to transform the data, and the event resolution would be updated to 10 minutes.
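
In plain pandas, that operation could look roughly like this (a minimal sketch with synthetic wind data, just for illustration; the variable names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical 3-second average wind speed, recorded every second:
# event resolution = 3 s, data frequency = 1 s.
index = pd.date_range("2022-06-01 12:00", periods=3600, freq="1s")
wind_3s = pd.Series(
    np.random.default_rng(0).gamma(2.0, 3.0, size=len(index)), index=index
)

# Maximum 3-second gust per 10-minute period: both the event resolution
# and the data frequency become 10 minutes.
gust_10min = wind_3s.resample(pd.Timedelta("PT10M")).max()
```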

Now consider a sensor recording instantaneous temperature readings, once per hour. That is, its event resolution is 0 hours and its data frequency is 1 hour. What would it mean to resample to 30 minutes? I see two possibilities:

  1. We measure instantaneous temperature every 30 minutes. The event resolution stays 0 hours and the data frequency becomes 30 minutes. Something like .resample(pd.Timedelta("PT30M")).ffill() (or .interpolate()) would do the trick to transform the data. The event resolution would not change. We are upsampling (with regards to the data frequency).
  2. We measure average temperatures over a period of 30 minutes. Some combination of .resample().interpolate().rolling().mean() should do the trick to transform the data. The event resolution would be updated to 30 minutes. We are upsampling (with regards to the data frequency) and downsampling (with regards to the event resolution).

For example, given temperature readings:

10 °C at 1 PM
12 °C at 2 PM

With linear interpolation, option 1 yields:

10 °C at 1 PM
11 °C at 1.30 PM
12 °C at 2 PM

With linear interpolation, option 2 yields:

10.5 °C average between 1 PM and 1.30 PM
11.5 °C average between 1.30 PM and 2 PM
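
In plain pandas, both options could be sketched roughly like this (the date is arbitrary; option 2 comes out exact here only because the interpolated signal is piecewise linear):

```python
import pandas as pd

# Hourly instantaneous temperature readings (°C) from the example above.
s = pd.Series(
    [10.0, 12.0],
    index=pd.to_datetime(["2000-01-01 13:00", "2000-01-01 14:00"]),
)

# Option 1: keep the zero event resolution, only increase the data frequency.
option_1 = s.resample("30min").interpolate()
# 13:00 -> 10.0, 13:30 -> 11.0, 14:00 -> 12.0

# Option 2: 30-minute averages. Because the interpolated signal is piecewise
# linear, the average over each period equals the mean of its two endpoints,
# so a rolling mean over the upsampled series does the trick.
option_2 = option_1.rolling(2).mean().shift(-1).dropna()
# 13:00 -> 10.5 (average over 13:00-13:30), 13:30 -> 11.5 (average over 13:30-14:00)
```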

For now, I'm gravitating towards a policy like "in timely-beliefs, event resampling, by default, updates both the data frequency and the event resolution for non-instantaneous sensors, and only the data frequency for instantaneous sensors." But I'd like to support updating the event resolution for instantaneous sensors, too, maybe only when explicitly requested. For example, consider instantaneous battery SoC measurements every 5 minutes. When resampled to a day, one might not really be interested in the SoC at midnight of every day (.first(), and keep the original zero event resolution), but rather in something like the daily min or max SoC (.min()/.max(), and update the event resolution to 1 day).
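
Roughly, with synthetic SoC data (a sketch, not the timely-beliefs API):

```python
import numpy as np
import pandas as pd

# Hypothetical instantaneous battery SoC readings (kWh), one every 5 minutes for 3 days.
index = pd.date_range("2022-06-01", periods=3 * 288, freq="5min", tz="utc")
soc = pd.Series(50 + 30 * np.sin(np.linspace(0, 6 * np.pi, len(index))), index=index)

# Keep the zero event resolution: just the SoC reading at midnight of each day.
soc_at_midnight = soc.resample("1D").first()

# Update the event resolution to 1 day: the daily maximum (or minimum) SoC.
daily_max_soc = soc.resample("1D").max()
daily_min_soc = soc.resample("1D").min()
```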

Contributor (@nhoening, Nov 15, 2022) commented:

I understand the distinction now (between the data frequency and the event resolution).

W.r.t. your last paragraph,

  1. I can imagine resampling policies, which could be chosen by way of a general configuration setting or even a per-sensor attribute. You named two policies here; hopefully we can keep it that simple for now. And you know which default you'd prefer.
  2. The text you've written can probably enter the timely-beliefs documentation, so everybody can learn about our thoughts here.

Finally, you haven't suggested a better function name yet.

Contributor replied:

resample_to_frequency or resample_frequency?

Collaborator Author (Flix6x) replied:

I renamed the function.

> "I can imagine resampling policies"

I now made the util function accept different resampling methods, and restricted its use in resample_events to BeliefsDataFrames containing as many events as data points. Solving the general case is a lot harder, but this should already support most use cases.

> "The text you've written can probably enter the timely-beliefs documentation, so everybody can learn about our thoughts here."

I opened #123 for this.

"""Resample data representing instantaneous events.
Contributor commented:

This assumption (instantaneous events) is never checked, nor is it checked that the index is even a DatetimeIndex.

Contributor replied:

Can you reply to this?

Collaborator Author (Flix6x) replied:

I'm leaning more towards making this a private function. Would that take away your concern?

Contributor replied:

Not quite. If it matters so much, why not add an assert statement? I'm guessing you wouldn't be sure what the outcomes would be if df wasn't instantaneous?

Collaborator Author (Flix6x) replied:

I opened #124 as a follow-up.


Updates the data frequency, while keeping the event resolution.

Note that the data frequency may not be constant due to DST transitions.
The duration between observations is longer for the fall DST transition,
and shorter for the spring DST transition.
"""
ds_index = df.index.floor(
resolution, ambiguous=[True] * len(df), nonexistent="shift_forward"
).drop_duplicates()
ds_df = df[df.index.isin(df.index.join(ds_index, how="inner"))]
if ds_df.index.freq is None and len(ds_df) > 2:
ds_df.index.freq = pd.infer_freq(ds_df.index)
return ds_df
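
For illustration, a quick usage sketch of downsample_first around the spring DST transition, mirroring the first parametrized case of test_downsample_first added below:

```python
import pandas as pd

from timely_beliefs.beliefs.utils import downsample_first

# Hourly instantaneous events (values 1..7) spanning the spring DST transition.
index = pd.date_range("2022-03-27 01:00+01", periods=7, freq="1H").tz_convert(
    "Europe/Amsterdam"
)
df = pd.DataFrame(list(range(1, 8)), index=index)

ds_df = downsample_first(df, pd.Timedelta("PT2H"))
print(ds_df.values.flatten().tolist())  # [2, 3, 5, 7], as asserted in the test below
```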
56 changes: 55 additions & 1 deletion timely_beliefs/tests/test_belief_utils.py
@@ -1,8 +1,10 @@
import pandas as pd
import pytest

from timely_beliefs.beliefs.probabilistic_utils import get_median_belief
from timely_beliefs.beliefs.utils import propagate_beliefs
from timely_beliefs.beliefs.utils import downsample_first, propagate_beliefs
from timely_beliefs.examples import get_example_df
from timely_beliefs.tests.utils import equal_lists


def test_propagate_multi_sourced_deterministic_beliefs():
@@ -31,3 +33,55 @@ def test_propagate_multi_sourced_deterministic_beliefs():
== pd.Timestamp("2000-01-01 01:00:00+00:00")
].droplevel("belief_time"),
)


@pytest.mark.parametrize(
("start", "periods", "resolution", "exp_event_values"),
[
(
"2022-03-27 01:00+01",
7,
"PT2H",
[2, 3, 5, 7],
), # DST transition from +01 to +02 (spring forward, contracted event)
(
"2022-10-30 01:00+02",
7,
"PT2H",
[2, 5, 7],
), # DST transition from +02 to +01 (fall back -> extended event)
(
"2022-03-26 01:00+01",
23 + 23 + 23,
"PT24H",
[24, 47],
), # midnight of 1 full (contracted) day, plus the following midnight of 1 partial day
(
"2022-03-26 01:00+01",
23 + 23 + 23,
"P1D",
[24, 47],
), # midnight of 1 full (contracted) day, plus the following midnight of 1 partial day
(
"2022-10-29 01:00+02",
23 + 25 + 23,
"PT24H",
[24, 49],
), # midnight of 1 full (extended) day, plus the following midnight of 1 partial day
(
"2022-10-29 01:00+02",
23 + 25 + 24 + 23,
"P1D",
[24, 49, 73],
), # midnight of 1 full (extended) day and 1 full (regular) day, plus the following midnight of 1 partial day
],
)
def test_downsample_first(start, periods, resolution, exp_event_values):
"""Enumerate the events and check whether downsampling returns the expected events."""
index = pd.date_range(start, periods=periods, freq="1H").tz_convert(
"Europe/Amsterdam"
)
df = pd.DataFrame(list(range(1, periods + 1)), index=index)
ds_df = downsample_first(df, pd.Timedelta(resolution))
print(ds_df)
assert equal_lists(ds_df.values, exp_event_values)
65 changes: 55 additions & 10 deletions timely_beliefs/tests/test_df_resampling.py
@@ -15,26 +15,27 @@

@pytest.fixture(scope="function", autouse=True)
def df_wxyz(
time_slot_sensor: Sensor, test_source_a: BeliefSource, test_source_b: BeliefSource
) -> Callable[[int, int, int, int, Optional[datetime]], BeliefsDataFrame]:
test_source_a: BeliefSource, test_source_b: BeliefSource
) -> Callable[[Sensor, int, int, int, int, Optional[datetime]], BeliefsDataFrame]:
"""Convenient BeliefsDataFrame to run tests on.
For a single sensor, it contains w events, for each of which x beliefs by y sources each (max 2),
described by z probabilistic values (max 3).
Note that the event resolution of the sensor is 15 minutes.
"""

sources = [test_source_a, test_source_b] # expand to increase max y
cps = [0.1587, 0.5, 0.8413] # expand to increase max z

def f(w: int, x: int, y: int, z: int, start: Optional[datetime] = None):
def f(
sensor: Sensor, w: int, x: int, y: int, z: int, start: Optional[datetime] = None
):
if start is None:
start = datetime(2000, 1, 3, 9, tzinfo=pytz.utc)

# Build up a BeliefsDataFrame with various events, beliefs, sources and probabilistic accuracy (for a single sensor)
beliefs = [
TimedBelief(
source=sources[s],
sensor=time_slot_sensor,
sensor=sensor,
value=1000 * e + 100 * b + 10 * s + p,
belief_time=datetime(2000, 1, 1, tzinfo=pytz.utc) + timedelta(hours=b),
event_start=start + timedelta(hours=e),
@@ -45,7 +46,7 @@ def f(w: int, x: int, y: int, z: int, start: Optional[datetime] = None):
for s in range(y) # y sources
for p in range(z) # z cumulative probabilities
]
return BeliefsDataFrame(sensor=time_slot_sensor, beliefs=beliefs)
return BeliefsDataFrame(sensor=sensor, beliefs=beliefs)

return f

Expand All @@ -55,15 +56,35 @@ def df_4323(
time_slot_sensor: Sensor,
test_source_a: BeliefSource,
test_source_b: BeliefSource,
df_wxyz: Callable[[int, int, int, int, Optional[datetime]], BeliefsDataFrame],
df_wxyz: Callable[
[Sensor, int, int, int, int, Optional[datetime]], BeliefsDataFrame
],
) -> BeliefsDataFrame:
"""Convenient BeliefsDataFrame to run tests on.
For a single sensor, it contains 4 events, for each of which 3 beliefs by 2 sources each, described by 3
probabilistic values.
Note that the event resolution of the sensor is 15 minutes.
"""
start = pytz.timezone("utc").localize(datetime(2000, 1, 3, 9))
return df_wxyz(time_slot_sensor, 4, 3, 2, 3, start)


@pytest.fixture(scope="function", autouse=True)
def df_instantaneous_4323(
instantaneous_sensor: Sensor,
test_source_a: BeliefSource,
test_source_b: BeliefSource,
df_wxyz: Callable[
[Sensor, int, int, int, int, Optional[datetime]], BeliefsDataFrame
],
) -> BeliefsDataFrame:
"""Convenient BeliefsDataFrame to run tests on.
For a single sensor, it contains 4 events, for each of which 3 beliefs by 2 sources each, described by 3
probabilistic values.
Note that the event resolution of the sensor is 15 minutes.
"""
start = pytz.timezone("utc").localize(datetime(2000, 1, 3, 9))
return df_wxyz(4, 3, 2, 3, start)
return df_wxyz(instantaneous_sensor, 4, 3, 2, 3, start)


def test_replace_index_level_with_intersect(df_4323):
@@ -241,13 +262,18 @@ def test_percentages_and_accuracy_of_probabilistic_model(df_4323: BeliefsDataFra


def test_downsample_once_upsample_once_around_dst(
df_wxyz: Callable[[int, int, int, int, Optional[datetime]], BeliefsDataFrame]
time_slot_sensor: Sensor,
df_wxyz: Callable[
[Sensor, int, int, int, int, Optional[datetime]], BeliefsDataFrame
],
):
"""Fast track resampling is enabled because the data contains 1 deterministic belief per event and a unique belief time and source."""
downsampled_event_resolution = timedelta(hours=24)
upsampled_event_resolution = timedelta(minutes=10)
start = pytz.timezone("Europe/Amsterdam").localize(datetime(2020, 3, 29, 0))
df = df_wxyz(25, 1, 1, 1, start) # 1 deterministic belief per event
df = df_wxyz(
time_slot_sensor, 25, 1, 1, 1, start
) # 1 deterministic belief per event
df.iloc[0] = np.NaN # introduce 1 NaN value
print(df)

@@ -311,3 +337,22 @@ def test_groupby_preserves_metadata(df_4323: BeliefsDataFrame):
assert slice_0.sensor == df.sensor
df_2 = grouper.apply(lambda x: x.head(1))
assert df_2.sensor == df.sensor


def test_downsample_instantaneous(df_instantaneous_4323):
"""Check resampling instantaneous events from hourly readings to 2-hourly readings.

Given data for 9, 10, 11 and 12 o'clock, we expect to get out only data for 10 and 12 o'clock.
"""
pd.set_option("display.max_rows", None)
print(df_instantaneous_4323)
# Downsample the original frame
downsampled_event_resolution = timedelta(hours=2)
df_resampled_1 = df_instantaneous_4323.resample_events(downsampled_event_resolution)
print(df_resampled_1)
df_expected = df_instantaneous_4323[
df_instantaneous_4323.index.get_level_values("event_start").isin(
["2000-01-03 10:00:00+00:00", "2000-01-03 12:00:00+00:00"]
)
]
pd.testing.assert_frame_equal(df_resampled_1, df_expected)