Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend context for usage statistics collection & add latencies for performance analysis #1983

Merged
merged 21 commits into from
Nov 5, 2021

Conversation

pyalex
Copy link
Collaborator

@pyalex pyalex commented Oct 28, 2021

Signed-off-by: pyalex [email protected]

What this PR does / why we need it:

This PR extends existing usage collection system with time profiling. This will inform how we make step improvements in latency reduction, which can be a blocker for customer adoption.

In addition I also refactored existing usage collection system to group all events produced for a single call stack into one event that will share useful context from all nested functions making analysis much easier. For example FeatureStore.get_online_features (entrypoint function) starts with enriching context with some request specific details (whether ODFV was used in request), then it calls Provider.online_read which adds information about type of provider and in turn calls Store.online_read, which adds to context type of store and detailed traces of each remote call to the Database. All this data will be gathered in single event and will help to answer more comprehensive question like performance of specific store implementation or popularity of this store among all installations.

Usage event will have next attributes:

  • installation_id: str - identifier stored in $HOME/.feast on the first Feast run (basically, it’s installation id and can be used for uniquely identifying user)
  • session_id: str - unique identifier for Python session (generated on the first import of Feast module)
  • entrypoint: str, usually root level function name
  • provider: str, provider type
  • store: str, store type (Optional, for get_*_features calls)
  • calls: List[Dict], list of all function calls in the stack with start and end timestamp
  • exception: Option[str], short string representation of error
  • traceback: Option[List[Tuple]]
  • is_test: bool, whether Feast called inside test
  • is_webserver: bool, whether Feast is called in production (in webserver context)
  • version: str, Feast version
  • python_version: str
  • platform: str, platform identifier
  • timestamp: str, datetime in ISO format

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

NONE

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2021

Codecov Report

Merging #1983 (bfa4b92) into master (600d38e) will decrease coverage by 1.68%.
The diff coverage is 95.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1983      +/-   ##
==========================================
- Coverage   82.21%   80.52%   -1.69%     
==========================================
  Files         100      103       +3     
  Lines        8052     9161    +1109     
==========================================
+ Hits         6620     7377     +757     
- Misses       1432     1784     +352     
Flag Coverage Δ
integrationtests 74.22% <65.39%> (-0.45%) ⬇️
unittests 59.96% <84.07%> (+1.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py 81.97% <78.57%> (+1.97%) ⬆️
sdk/python/feast/infra/provider.py 90.90% <85.71%> (+1.47%) ⬆️
sdk/python/feast/usage.py 91.44% <90.47%> (+24.09%) ⬆️
sdk/python/feast/feature_store.py 82.80% <100.00%> (-8.52%) ⬇️
sdk/python/feast/infra/aws.py 34.25% <100.00%> (-5.42%) ⬇️
sdk/python/feast/infra/gcp.py 88.13% <100.00%> (+0.63%) ⬆️
sdk/python/feast/infra/local.py 87.23% <100.00%> (+0.87%) ⬆️
sdk/python/feast/infra/offline_stores/file.py 97.60% <100.00%> (+0.09%) ⬆️
sdk/python/feast/infra/offline_stores/redshift.py 87.80% <100.00%> (+0.73%) ⬆️
sdk/python/feast/infra/online_stores/datastore.py 93.70% <100.00%> (+0.25%) ⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 600d38e...bfa4b92. Read the comment docs.

@feast-ci-bot
Copy link
Collaborator

@pyalex: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
test-usage e5e1aae link /test test-usage

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pyalex pyalex changed the title [WIP] Extend context for usage statistics collection & add latencies for performance analysis Extend context for usage statistics collection & add latencies for performance analysis Nov 1, 2021
Signed-off-by: pyalex <[email protected]>
Signed-off-by: pyalex <[email protected]>
Signed-off-by: pyalex <[email protected]>
Signed-off-by: pyalex <[email protected]>
@pyalex
Copy link
Collaborator Author

pyalex commented Nov 2, 2021

Benchmarks showed that usage decorators & context managers don't add significant overhead:

Master (FEAST_USAGE=False) vs Refactored usage decorators (FEAST_USAGE=True)

----------------------------------------------------------------------------- benchmark 'test_online_retrieval[Provider: local-Fil-redis]': 2 tests -----------------------------------------------------------------------------
Name (time in ms)                                                      Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_online_retrieval[Provider: local-Fil-redis] (master)       8.9013 (1.0)      10.9754 (1.0)      9.3332 (1.0)      0.3007 (1.0)      9.2611 (1.00)     0.2614 (1.0)          15;9  107.1440 (1.0)         112           1
test_online_retrieval[Provider: local-Fil-redis] (this branch)  8.9959 (1.01)     11.5883 (1.06)     9.4398 (1.01)     0.4705 (1.56)     9.2559 (1.0)      0.3520 (1.35)        13;11  105.9342 (0.99)        111           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Signed-off-by: pyalex <[email protected]>
Signed-off-by: pyalex <[email protected]>
Signed-off-by: pyalex <[email protected]>
@felixwang9817 felixwang9817 mentioned this pull request Nov 3, 2021
Comment on lines +148 to +152
f"SELECT entity_key, feature_name, value, event_ts "
f"FROM {_table_id(config.project, table)} "
f"WHERE entity_key IN ({','.join('?' * len(entity_keys))}) "
f"ORDER BY entity_key",
[serialize_entity_key(entity_key) for entity_key in entity_keys],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this SQL query got changed. Any reasons why it's part of this PR?

Copy link
Collaborator Author

@pyalex pyalex Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done to avoid using tracing_span context manager in the loop. So I refactored this function to have one call to sqlite instead. As a side effect this sped up sqlite retrieval in total by 10-15%. And this change seemed pretty small to create separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concerns with query itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, no concerns with the query

Comment on lines 167 to 174
if config.provider == "gcp":
from feast.infra.gcp import GcpProvider

return PassthroughProvider(config)
return GcpProvider(config)

from feast.infra.local import LocalProvider

return LocalProvider(config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we make this logic similar to how we deal with other plugins (online stores, offline stores, etc). In other words create a mapping of provider name to the class path (

ONLINE_STORE_CLASS_FOR_TYPE = {
) and then importing the class by the common logic on lines 178-184 the current file (under else statement).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +217 to +219
try:
yield
finally:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we record exception in except block here?

Copy link
Collaborator Author

@pyalex pyalex Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception is handled in @log_usage decorator. And tracing_span context manager must be called only after this decorator applied, other words UsageContext must be already available.

@tsotnet
Copy link
Collaborator

tsotnet commented Nov 5, 2021

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pyalex, tsotnet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 963d3ac into master Nov 5, 2021
@adchia adchia deleted the telemetry branch November 12, 2021 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants