Speed up metrics computation by optimizing segment validation #1338
Conversation
Script for testing:

```python
import time
import json

import numpy as np
import pandas as pd
from loguru import logger

from etna.models import NaiveModel
from etna.datasets import TSDataset, generate_ar_df
from etna.metrics import MAE
from etna.pipeline import Pipeline

HORIZON = 14


def make_df(num_segments: int, num_features: int, num_periods: int, random_state: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    df = generate_ar_df(
        periods=num_periods, start_time="2020-01-01", n_segments=num_segments
    )
    for i in range(num_features):
        # add int column
        df[f"new_int_{i}"] = rng.integers(low=-100, high=100, size=df.shape[0])
    return df


def check_time(num_segments: int, num_features: int, num_periods: int = 365):
    df = make_df(num_segments=num_segments, num_features=num_features, num_periods=num_periods)
    df_wide = TSDataset.to_dataset(df)
    ts = TSDataset(df=df_wide, freq="D")

    model = NaiveModel(lag=1)
    transforms = []
    pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

    start_time = time.perf_counter()
    metrics, _, _ = pipeline.backtest(ts=ts, metrics=[MAE()], n_folds=3)
    elapsed_time = time.perf_counter() - start_time

    return elapsed_time


def main():
    num_segments = [10, 100, 1000, 10_000]
    num_features = [0, 3, 10]
    results = []
    for cur_num_segments in num_segments:
        for cur_num_features in num_features:
            time_result = check_time(num_segments=cur_num_segments, num_features=cur_num_features)
            record = {"num_segments": cur_num_segments, "num_features": cur_num_features, "time": time_result}
            results.append(record)
            logger.info(json.dumps(record))

    json.dump(results, open("records_2.json", "w"), indent=2)


if __name__ == "__main__":
    main()
```

Results without changes:

```json
[
{
"num_segments": 10,
"num_features": 0,
"time": 0.3590500030000001
},
{
"num_segments": 10,
"num_features": 3,
"time": 0.4593144890000005
},
{
"num_segments": 10,
"num_features": 10,
"time": 0.3713757409999978
},
{
"num_segments": 100,
"num_features": 0,
"time": 1.2579138940000014
},
{
"num_segments": 100,
"num_features": 3,
"time": 1.4334653250000002
},
{
"num_segments": 100,
"num_features": 10,
"time": 1.562795714
},
{
"num_segments": 1000,
"num_features": 0,
"time": 9.964996322999998
},
{
"num_segments": 1000,
"num_features": 3,
"time": 13.49794635
},
{
"num_segments": 1000,
"num_features": 10,
"time": 15.799086332999998
},
{
"num_segments": 10000,
"num_features": 0,
"time": 104.504586417
},
{
"num_segments": 10000,
"num_features": 3,
"time": 235.44386497699998
},
{
"num_segments": 10000,
"num_features": 10,
"time": 281.90819511
}
]
```

Results with changes in

```json
[
{
"num_segments": 10,
"num_features": 0,
"time": 0.21762080500000014
},
{
"num_segments": 10,
"num_features": 3,
"time": 0.25818068400000005
},
{
"num_segments": 10,
"num_features": 10,
"time": 0.2985194660000001
},
{
"num_segments": 100,
"num_features": 0,
"time": 0.7977682599999998
},
{
"num_segments": 100,
"num_features": 3,
"time": 0.8086340990000007
},
{
"num_segments": 100,
"num_features": 10,
"time": 0.9596903270000006
},
{
"num_segments": 1000,
"num_features": 0,
"time": 6.304263003999999
},
{
"num_segments": 1000,
"num_features": 3,
"time": 6.816117811000002
},
{
"num_segments": 1000,
"num_features": 10,
"time": 8.042914625000002
},
{
"num_segments": 10000,
"num_features": 0,
"time": 62.120309238999994
},
{
"num_segments": 10000,
"num_features": 3,
"time": 70.261712954
},
{
"num_segments": 10000,
"num_features": 10,
"time": 85.14670703500002
}
]
```

Results after removing the extra dataframe selection:

```json
[
{
"num_segments": 10,
"num_features": 0,
"time": 0.21787695200000012
},
{
"num_segments": 10,
"num_features": 3,
"time": 0.2454114409999999
},
{
"num_segments": 10,
"num_features": 10,
"time": 0.27747879200000014
},
{
"num_segments": 100,
"num_features": 0,
"time": 0.6796172610000006
},
{
"num_segments": 100,
"num_features": 3,
"time": 0.678246498
},
{
"num_segments": 100,
"num_features": 10,
"time": 0.7758322989999993
},
{
"num_segments": 1000,
"num_features": 0,
"time": 4.967111137000002
},
{
"num_segments": 1000,
"num_features": 3,
"time": 5.283647876000002
},
{
"num_segments": 1000,
"num_features": 10,
"time": 6.464345726000001
},
{
"num_segments": 10000,
"num_features": 0,
"time": 49.259913931999996
},
{
"num_segments": 10000,
"num_features": 3,
"time": 54.42493236100002
},
{
"num_segments": 10000,
"num_features": 10,
"time": 66.683967212
}
]
```
There was also profiling with py-spy, using this script:

```python
import time
import json

import numpy as np
import pandas as pd
from loguru import logger

from etna.models import NaiveModel
from etna.datasets import TSDataset, generate_ar_df
from etna.metrics import MAE
from etna.pipeline import Pipeline

HORIZON = 14


def make_df(num_segments: int, num_features: int, num_periods: int, random_state: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    df = generate_ar_df(
        periods=num_periods, start_time="2020-01-01", n_segments=num_segments
    )
    for i in range(num_features):
        # add int column
        df[f"new_int_{i}"] = rng.integers(low=-100, high=100, size=df.shape[0])
    return df


def check_time(num_segments: int, num_features: int, num_periods: int = 365):
    df = make_df(num_segments=num_segments, num_features=num_features, num_periods=num_periods)
    df_wide = TSDataset.to_dataset(df)
    ts = TSDataset(df=df_wide, freq="D")

    model = NaiveModel(lag=1)
    transforms = []
    pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

    start_time = time.perf_counter()
    metrics, _, _ = pipeline.backtest(ts=ts, metrics=[MAE()], n_folds=3)
    elapsed_time = time.perf_counter() - start_time

    return elapsed_time


def main():
    check_time(num_segments=10_000, num_features=3)


if __name__ == "__main__":
    main()
```

Key notions:
🚀 Deployed on https://deploy-preview-1338--etna-docs.netlify.app
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1338      +/-   ##
==========================================
+ Coverage   89.09%   89.12%   +0.02%
==========================================
  Files         204      204
  Lines       12642    12665      +23
==========================================
+ Hits        11264    11288      +24
+ Misses       1378     1377       -1
```
We actually need to investigate this deeper. Places to speed up:
- Per-segment iteration here -- do we really need to check each segment separately? I guess this is done only for the convenient error message.
- We should validate the timestamps and NaNs in a vectorized way here -- first check for the existence of NaNs in the passed datasets and then compare the index. We don't need to do it in a per-segment fashion. This way we don't need to call `dropna` and `_validate_timestamp_columns` in the loop.
- For the built-in metrics (MAE, SMAPE, ...) we can implement a vectorized version of `metric_fn`; this might be hard to do without changing the base classes. At least try adding the `@njit` decorator to speed up the computation. One solution is to create a separate class `VectorizedMetric(Metric)` and override `__call__` (see the sketch below).
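A minimal sketch of the vectorized idea, operating directly on the wide dataframes. The helper name, the MAE formula, and the assumption that both frames are already validated and share the same index are illustrative, not the PR's implementation; a `VectorizedMetric(Metric)` subclass could delegate its `__call__` to something like this:

```python
import numpy as np
import pandas as pd


def mae_per_segment(df_true: pd.DataFrame, df_pred: pd.DataFrame) -> dict:
    # Both frames are assumed to be wide TSDataset frames with a
    # (segment, feature) column MultiIndex, already validated and aligned.
    true_target = df_true.loc[:, pd.IndexSlice[:, "target"]].sort_index(axis=1)
    pred_target = df_pred.loc[:, pd.IndexSlice[:, "target"]].sort_index(axis=1)
    # One numpy reduction over the timestamp axis instead of a per-segment loop.
    values = np.abs(true_target.values - pred_target.values).mean(axis=0)
    segments = true_target.columns.get_level_values("segment")
    return dict(zip(segments, values))
```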
Results after reworking:

```json
[
{
"num_segments": 10,
"num_features": 0,
"time": 0.2025824759999999
},
{
"num_segments": 10,
"num_features": 3,
"time": 0.24138570799999925
},
{
"num_segments": 10,
"num_features": 10,
"time": 0.272586608000001
},
{
"num_segments": 100,
"num_features": 0,
"time": 0.6173706429999992
},
{
"num_segments": 100,
"num_features": 3,
"time": 0.622621066999999
},
{
"num_segments": 100,
"num_features": 10,
"time": 0.7178853730000014
},
{
"num_segments": 1000,
"num_features": 0,
"time": 4.244795499
},
{
"num_segments": 1000,
"num_features": 3,
"time": 4.704148591000001
},
{
"num_segments": 1000,
"num_features": 10,
"time": 5.926159300000002
},
{
"num_segments": 10000,
"num_features": 0,
"time": 40.73026384199999
},
{
"num_segments": 10000,
"num_features": 3,
"time": 46.77938691599999
},
{
"num_segments": 10000,
"num_features": 10,
"time": 60.692529713
}
]
```
Results after iteration optimization:

```json
[
{
"num_segments": 10,
"num_features": 0,
"time": 0.22758442000000034
},
{
"num_segments": 10,
"num_features": 3,
"time": 0.303228195
},
{
"num_segments": 10,
"num_features": 10,
"time": 0.2589208110000003
},
{
"num_segments": 100,
"num_features": 0,
"time": 0.5286539329999993
},
{
"num_segments": 100,
"num_features": 3,
"time": 0.575157707999999
},
{
"num_segments": 100,
"num_features": 10,
"time": 0.6489074320000014
},
{
"num_segments": 1000,
"num_features": 0,
"time": 3.4659758200000006
},
{
"num_segments": 1000,
"num_features": 3,
"time": 4.007518799999998
},
{
"num_segments": 1000,
"num_features": 10,
"time": 5.0106977860000015
},
{
"num_segments": 10000,
"num_features": 0,
"time": 33.155186359
},
{
"num_segments": 10000,
"num_features": 3,
"time": 39.375860941000006
},
{
"num_segments": 10000,
"num_features": 10,
"time": 53.37710269499999
}
]
```
etna/metrics/base.py (outdated)

```python
df_true = y_true.df.loc[:, pd.IndexSlice[:, "target"]].sort_index(axis=1)
df_pred = y_pred.df.loc[:, pd.IndexSlice[:, "target"]].sort_index(axis=1)

df_true_isna = df_true.isna()
```
Why can't we just check that both `df_true_isna` and `df_pred_isna` sum to 0? As I understand it, we also need to compare the index here; does `equals` do that?
I'm not sure that your suggested solution gives the same result as the initial one. In the initial solution we select a segment from `ts`; that uses `first_valid_index` under the hood and skips the leading NaNs.
Here we apply `first_valid_index` to the whole dataframe, and it skips only the NaNs that are present in all segments. So if segments start at different timestamps, some of them will still contain NaNs.
`DataFrame.equals` compares values taking the index into account.
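A small example of that behaviour (toy data):

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=pd.to_datetime(["2020-01-01", "2020-01-02"]))
b = pd.Series([1.0, 2.0], index=pd.to_datetime(["2020-01-02", "2020-01-03"]))

print(a.equals(b))  # False: identical values, but the indexes differ
print(a.reset_index(drop=True).equals(b.reset_index(drop=True)))  # True once the indexes match
```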
Ok, I made a mistake: in the initial solution we check timestamps after `dropna`, so we can't really check that the sum of `isna` is zero; that would not give an equivalent result.
If we want to make a non-equivalent check, we should discuss what kind of check we really want here, because I'm not sure the existing check is reasonable enough.
Here we need to check the following things (see the sketch below):
- There are no NaNs in the datasets
- The indexes of the datasets are the same
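Both checks can be expressed without a per-segment loop; a minimal sketch (the function name and error messages are illustrative, not the PR's code):

```python
import pandas as pd


def validate_targets(df_true: pd.DataFrame, df_pred: pd.DataFrame) -> None:
    # Check 1: no NaNs anywhere in either frame.
    if df_true.isna().values.any() or df_pred.isna().values.any():
        raise ValueError("y_true or y_pred contains NaNs")
    # Check 2: both frames cover exactly the same timestamps.
    if not df_true.index.equals(df_pred.index):
        raise ValueError("y_true and y_pred have different timestamps")
```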
```python
metrics_per_segment[segment] = self.metric_fn(
    y_true=y_true[:, segment, "target"].values, y_pred=y_pred[:, segment, "target"].values, **self.kwargs
)
segments = df_true.columns.get_level_values("segment").unique()
```
Shouldn't it be sorted, since the index in the dataframe is sorted?
Maybe we need a test for such behaviour (input datasets having unsorted segments).
Yes, it will be sorted because we sorted the index of `df_true`. Also, we have a guarantee that `unique` returns values in the order of their appearance.
Ok, I'll try to add a test for this.
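A possible shape for such a test, as a standalone pandas sketch of the property discussed above (hypothetical test name and data, not the test actually added in the PR):

```python
import pandas as pd


def test_segments_are_sorted_for_unsorted_input():
    # Even if the input frame lists segments in unsorted order,
    # sorting the columns first makes `unique()` return them sorted.
    columns = pd.MultiIndex.from_tuples(
        [("b", "target"), ("a", "target")], names=["segment", "feature"]
    )
    df_true = pd.DataFrame([[1.0, 2.0]], columns=columns).sort_index(axis=1)
    segments = df_true.columns.get_level_values("segment").unique()
    assert list(segments) == ["a", "b"]
```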
Before submitting (must do checklist)
Proposed Changes
- `_validate_segment_columns`
Closing issues
Closes #1336.