feat: Add Cassandra/AstraDB online store contribution (#2873)
* Cassandra online store

* Refactor file-editing to a shared utils module
* Use f-strings in the CassandraOnlineStoreCreator
* Specify version 2 in serializing to make the entity key
* Remove unnecessary empty comment lines
* Rename proj to columns in _read_rows_by_entity_key
* Introduce Cassandra-specific pytest targets
* Adapt roadmaps and docs to cover/index Cassandra online store
* Add license notes to code files

Signed-off-by: Stefano Lottini

* remove from main CI path and update entity key serialization in template

Signed-off-by: Danny Chiao

* revert makefile change

Signed-off-by: Danny Chiao

Co-authored-by: Kevin Zhang
Co-authored-by: Danny Chiao
3 people authored Aug 9, 2022
1 parent d4c15e7 commit feb6cb8
Showing 32 changed files with 1,432 additions and 57 deletions.
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -317,6 +317,7 @@ The services with containerized replacements currently implemented are:
- Trino
- HBase
- Postgres
- Cassandra

You can run `make test-python-integration-container` to run tests against the containerized versions of dependencies.

25 changes: 25 additions & 0 deletions Makefile
@@ -156,6 +156,31 @@ test-python-universal-postgres:
			not test_universal_types" \
		sdk/python/tests

test-python-universal-cassandra:
	PYTHONPATH='.' \
		FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.online_stores.contrib.cassandra_repo_configuration \
		FEAST_USAGE=False \
		IS_TEST=True \
		python -m pytest -x --integration \
		sdk/python/tests

test-python-universal-cassandra-no-cloud-providers:
	PYTHONPATH='.' \
		FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.online_stores.contrib.cassandra_repo_configuration \
		FEAST_USAGE=False \
		IS_TEST=True \
		python -m pytest -x --integration \
		-k "not test_lambda_materialization_consistency and \
			not test_apply_entity_integration and \
			not test_apply_feature_view_integration and \
			not test_apply_entity_integration and \
			not test_apply_feature_view_integration and \
			not test_apply_data_source_integration and \
			not test_nullable_online_store and \
			not gcs_registry and \
			not s3_registry" \
		sdk/python/tests

test-python-universal:
	FEAST_USAGE=False IS_TEST=True python -m pytest -n 8 --integration sdk/python/tests

2 changes: 1 addition & 1 deletion README.md
@@ -177,7 +177,7 @@ The list below contains the functionality that contributors are planning to deve
* [x] [Azure Cache for Redis (community plugin)](https://github.com/Azure/feast-azure)
* [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/online-stores/postgres)
* [x] [Custom online store support](https://docs.feast.dev/how-to-guides/adding-support-for-a-new-online-store)
- * [x] [Cassandra / AstraDB](https://github.com/datastaxdevs/feast-cassandra-online-store)
+ * [x] [Cassandra / AstraDB](https://docs.feast.dev/reference/online-stores/cassandra)
* [ ] Bigtable (in progress)
* **Feature Engineering**
* [x] On-demand Transformations (Alpha release. See [RFC](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit#))
4 changes: 4 additions & 0 deletions docs/reference/online-stores/README.md
@@ -25,3 +25,7 @@ Please see [Online Store](../../getting-started/architecture-and-components/onli
{% content-ref url="postgres.md" %}
[postgres.md](postgres.md)
{% endcontent-ref %}

{% content-ref url="cassandra.md" %}
[cassandra.md](cassandra.md)
{% endcontent-ref %}
61 changes: 61 additions & 0 deletions docs/reference/online-stores/cassandra.md
@@ -0,0 +1,61 @@
# Cassandra / Astra DB online store

## Description

The Cassandra / Astra DB online store materializes feature values into an Apache Cassandra or Astra DB database, from which they are served as online features.

* The whole project is contained within a Cassandra keyspace
* Each feature view is mapped one-to-one to a specific Cassandra table
* This implementation inherits all strengths of Cassandra such as high availability, fault-tolerance, and data distribution

An easy way to get started is the command `feast init REPO_NAME -t cassandra`.

### Example (Cassandra)

{% code title="feature_store.yaml" %}
```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: cassandra
    hosts:
        - 192.168.1.1
        - 192.168.1.2
        - 192.168.1.3
    keyspace: KeyspaceName
    port: 9042 # optional
    username: user # optional
    password: secret # optional
    protocol_version: 5 # optional
    load_balancing: # optional
        local_dc: 'datacenter1' # optional
        load_balancing_policy: 'TokenAwarePolicy(DCAwareRoundRobinPolicy)' # optional
```
{% endcode %}
### Example (Astra DB)
{% code title="feature_store.yaml" %}
```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: cassandra
    secure_bundle_path: /path/to/secure/bundle.zip
    keyspace: KeyspaceName
    username: Client_ID
    password: Client_Secret
    protocol_version: 4 # optional
    load_balancing: # optional
        local_dc: 'eu-central-1' # optional
        load_balancing_policy: 'TokenAwarePolicy(DCAwareRoundRobinPolicy)' # optional
```
{% endcode %}
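
The two examples differ mainly in how the cluster is reached: a list of `hosts` for Cassandra, a `secure_bundle_path` for Astra DB. The following is a minimal sketch of a sanity check on a parsed `online_store` mapping. It assumes (this is not stated above) that the two settings are mutually exclusive and that `keyspace` is always required. The helper name `check_online_store_config` is hypothetical and is not part of Feast.

```python
from typing import Any, Dict


def check_online_store_config(cfg: Dict[str, Any]) -> str:
    """Return "cassandra" or "astra" for a parsed `online_store` mapping.

    Hypothetical helper, not part of Feast. Assumes `hosts` and
    `secure_bundle_path` are mutually exclusive and `keyspace` is mandatory.
    """
    if cfg.get("type") != "cassandra":
        raise ValueError("online_store.type must be 'cassandra'")
    if not cfg.get("keyspace"):
        raise ValueError("a keyspace is required")
    has_hosts = bool(cfg.get("hosts"))
    has_bundle = bool(cfg.get("secure_bundle_path"))
    if has_hosts == has_bundle:
        # Either both were given or neither was.
        raise ValueError("provide exactly one of 'hosts' or 'secure_bundle_path'")
    return "cassandra" if has_hosts else "astra"


# The two configurations from the examples above, as plain dictionaries.
cassandra_cfg = {
    "type": "cassandra",
    "hosts": ["192.168.1.1", "192.168.1.2", "192.168.1.3"],
    "keyspace": "KeyspaceName",
}
astra_cfg = {
    "type": "cassandra",
    "secure_bundle_path": "/path/to/secure/bundle.zip",
    "keyspace": "KeyspaceName",
    "username": "Client_ID",
    "password": "Client_Secret",
}

print(check_online_store_config(cassandra_cfg))
print(check_online_store_config(astra_cfg))
```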

For a full explanation of the configuration options, see
`sdk/python/feast/infra/online_stores/contrib/cassandra_online_store/README.md`.

Storage specifications can be found at `docs/specs/online_store_format.md`.
2 changes: 1 addition & 1 deletion docs/roadmap.md
@@ -35,7 +35,7 @@ The list below contains the functionality that contributors are planning to deve
* [x] [Azure Cache for Redis (community plugin)](https://github.com/Azure/feast-azure)
* [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/online-stores/postgres)
* [x] [Custom online store support](https://docs.feast.dev/how-to-guides/adding-support-for-a-new-online-store)
- * [x] [Cassandra / AstraDB](https://github.com/datastaxdevs/feast-cassandra-online-store)
+ * [x] [Cassandra / AstraDB](https://docs.feast.dev/reference/online-stores/cassandra)
* [ ] Bigtable (in progress)
* **Feature Engineering**
* [x] On-demand Transformations (Alpha release. See [RFC](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit#))
80 changes: 80 additions & 0 deletions docs/specs/online_store_format.md
@@ -92,6 +92,86 @@ Other types of entity keys are not supported in this version of the specificatio

![Datastore Online Example](datastore_online_example.png)

## Cassandra/Astra DB Online Store Format

### Overview

Apache Cassandra™ is a table-oriented NoSQL distributed database. Astra DB is a managed database-as-a-service
built on Cassandra and is treated as equivalent to Cassandra in what follows.

In Cassandra, tables are grouped in _keyspaces_ (groups of related tables). Each table consists of
_rows_, each containing data for a given set of _columns_. Moreover, rows are grouped in _partitions_ according
to a _partition key_ (a portion of the uniqueness-defining _primary key_ set of columns), so that all rows
with the same values for the partition key are guaranteed to be stored on the same Cassandra nodes, next to each other,
which makes retrieving them fast.

This architecture makes Cassandra a good fit for an online feature store in Feast.

### Cassandra Online Store Format

Each project (denoted by its name, called "feature store name" elsewhere) may contain an
arbitrary number of `FeatureView`s: each corresponds to a specific table, and
all tables for a project are contained in a single keyspace. The Feast user must
create the keyspace beforehand and specify it in the feature store
configuration `yaml`.

The table for a project `project` and feature view `FeatureView` is named
`project_FeatureView` (e.g. `feature_repo_driver_hourly_stats`).

All tables have the same structure. Cassandra is schemaful and the columns are strongly typed.
The following table schema (which also serves as a Chebotko diagram) specifies both the Cassandra
and the Python data types; `K` marks the partition key and `C↑` a clustering column in ascending order:

|Table: |`<project>`_`<FeatureView>` | | _(Python type)_ |
|---------------|-----------------------------|--|----------------------|
|`entity_key` |`TEXT` |K | `str` |
|`feature_name` |`TEXT` |C↑| `str` |
|`value` |`BLOB` | | `bytes` |
|`event_ts` |`TIMESTAMP` | | `datetime.datetime` |
|`created_ts` |`TIMESTAMP` | | `datetime.datetime` |
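
As an illustration of the naming rule and the schema above, the sketch below builds a CQL `CREATE TABLE` statement as a string. It is not the statement Feast itself issues: the helper names are hypothetical, and the quoting of identifiers and the `IF NOT EXISTS` clause are assumptions made for the example.

```python
def table_name(project: str, feature_view: str) -> str:
    """Table name for a feature view, following the `project_FeatureView` rule."""
    return f"{project}_{feature_view}"


def create_table_cql(keyspace: str, project: str, feature_view: str) -> str:
    """Build an illustrative CREATE TABLE statement for one feature view.

    Hypothetical helper: `entity_key` is the partition key (K) and
    `feature_name` an ascending clustering column (C↑), as in the schema table.
    """
    fqtable = f'"{keyspace}"."{table_name(project, feature_view)}"'
    return (
        f"CREATE TABLE IF NOT EXISTS {fqtable} (\n"
        "    entity_key   TEXT,\n"
        "    feature_name TEXT,\n"
        "    value        BLOB,\n"
        "    event_ts     TIMESTAMP,\n"
        "    created_ts   TIMESTAMP,\n"
        "    PRIMARY KEY ((entity_key), feature_name)\n"
        ") WITH CLUSTERING ORDER BY (feature_name ASC);"
    )


print(create_table_cql("KeyspaceName", "feature_repo", "driver_hourly_stats"))
```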

Each row in the table represents a single value for a feature in a feature view,
and is thus associated with a specific entity. The choice of partitioning ensures that,
within a given feature view (i.e. a single table), any number of features for a given
entity can be retrieved with a single, best-practice-respecting query
(which is what happens in the `online_read` method implementation).
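
A single-partition read of this kind can be sketched as follows. The helper name, the default column list and the identifier quoting are assumptions for illustration, not a copy of the `online_read` implementation. The `?` is the CQL bind marker for a prepared statement.

```python
from typing import Sequence


def select_features_cql(
    keyspace: str,
    table: str,
    columns: Sequence[str] = ("feature_name", "value", "event_ts"),
) -> str:
    """Build an illustrative query that reads all features of one entity.

    Hypothetical helper. The WHERE clause restricts the whole partition key
    (`entity_key`), so the query touches a single partition.
    """
    column_list = ", ".join(columns)
    return f'SELECT {column_list} FROM "{keyspace}"."{table}" WHERE entity_key = ?;'


print(select_features_cql("KeyspaceName", "feature_repo_driver_hourly_stats"))
```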


The `entity_key` column is computed as `serialize_entity_key(entityKey).hex()`,
where `entityKey` is of type `feast.protos.feast.types.EntityKey_pb2.EntityKey`.
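
The sketch below is a hand-rolled illustration of the byte layout, not Feast's `serialize_entity_key`. It assumes a single join key whose value is an integer packed as 4 little-endian bytes, and it assumes the type codes 2 (string) and 4 (64-bit integer). These assumptions are inferred from the `driver_id = 1002` example entry in this specification, whose hex string the sketch reproduces.

```python
import struct

STRING_TYPE_CODE = 2  # assumed code for the type of the join-key name
INT64_TYPE_CODE = 4   # assumed code for the type of the entity value


def sketch_entity_key_hex(join_key: str, value: int) -> str:
    """Reproduce the hex layout of the example `entity_key` (illustrative only)."""
    value_bytes = struct.pack("<l", value)  # 4-byte little-endian integer
    parts = [
        struct.pack("<I", STRING_TYPE_CODE),  # type of the join-key name
        join_key.encode("utf8"),              # the join-key name itself
        struct.pack("<I", INT64_TYPE_CODE),   # type of the value
        struct.pack("<I", len(value_bytes)),  # length of the value in bytes
        value_bytes,                          # the value
    ]
    return b"".join(parts).hex()


print(sketch_entity_key_hex("driver_id", 1002))
```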

The value of `feature_name` is the plain-text name of the feature as defined
in the corresponding `FeatureView`.

For `value`, the bytes from `[protoValue].SerializeToString()`
are used, where `protoValue` is of type `feast.protos.feast.types.Value_pb2.Value`.
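
For the special case of a `float_val`, the serialized bytes can be reproduced with the standard library alone. This is a sketch of the protobuf wire format, not a replacement for `SerializeToString()`. It assumes that `float_val` is field number 6 of the `Value` message, which is consistent with the leading byte `0x35` (field 6, wire type 5) of the `conv_rate` example entry in this specification.

```python
import struct

# Protobuf tag byte: (field_number << 3) | wire_type.
# Assumption: float_val is field 6; wire type 5 means a fixed 32-bit value.
FLOAT_VAL_TAG = (6 << 3) | 5  # 0x35


def sketch_float_value_bytes(x: float) -> bytes:
    """Encode a float the way a protobuf message with one fixed32 float field would."""
    return bytes([FLOAT_VAL_TAG]) + struct.pack("<f", x)


def sketch_decode_float_value(blob: bytes) -> float:
    """Inverse of `sketch_float_value_bytes`; rejects any other layout."""
    if len(blob) != 5 or blob[0] != FLOAT_VAL_TAG:
        raise ValueError("not a single float_val payload")
    return struct.unpack("<f", blob[1:])[0]


encoded = sketch_float_value_bytes(0.9273980259895325)
print(encoded.hex())
print(sketch_decode_float_value(encoded))
```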

Column `event_ts` stores the timestamp the feature value refers to, as passed
to the store method. Column `created_ts`, by contrast, was meant to store the write
time for the entry; it is being deprecated and is never written by this
online-store implementation. Because Cassandra does not store columns that are left unset,
this incurs no noticeable performance penalty (hence, for the time being,
the column can remain in the schema).
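
A write that leaves `created_ts` untouched can be sketched as an `INSERT` that simply omits the column, rather than binding a null to it (which would create a tombstone). The helper name, column order and identifier quoting are assumptions for illustration and may differ from the actual implementation.

```python
def insert_row_cql(keyspace: str, table: str) -> str:
    """Build an illustrative INSERT that never mentions `created_ts`.

    Hypothetical helper. The four `?` bind markers correspond, in order,
    to entity_key, feature_name, value and event_ts.
    """
    return (
        f'INSERT INTO "{keyspace}"."{table}" '
        "(entity_key, feature_name, value, event_ts) VALUES (?, ?, ?, ?);"
    )


print(insert_row_cql("KeyspaceName", "feature_repo_driver_hourly_stats"))
```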

### Example entry

For a project `feature_repo` and feature view named `driver_hourly_stats`,
a typical row in table `feature_repo_driver_hourly_stats` might look like:

|Column |content | notes |
|---------------|-----------------------------------------------------|-------------------------------------------------------------------|
|`entity_key` |`020000006472697665725f69640400000004000000ea030000` | from `"driver_id = 1002"` |
|`feature_name` |`conv_rate` | |
|`value` |`0x35f5696d3f` | from `float_val: 0.9273980259895325`, i.e. `(b'5\xf5im?').hex()` |
|`event_ts` |`2022-07-07 09:00:00.000000+0000` | from `datetime.datetime(2022, 7, 7, 9, 0)` |
|`created_ts` |`null` | not explicitly written to avoid unnecessary tombstones |

### Known Issues

If a `FeatureView` is ever _re-defined_ in a schema-breaking way, the implementation cannot rearrange the
schema of the underlying table accordingly: it neither drops the existing data nor migrates it.
Avoid such redefinitions, as they lead to data-retrieval issues throughout Feast.

# Appendix

##### Appendix A. Value proto format.
7 changes: 7 additions & 0 deletions sdk/python/docs/index.rst
@@ -287,6 +287,13 @@ HBase Online Store
    :members:
    :noindex:

Cassandra Online Store
-----------------------

.. automodule:: feast.infra.online_stores.contrib.cassandra_online_store.cassandra_online_store
    :members:
    :noindex:


Batch Materialization Engine
============================
20 changes: 14 additions & 6 deletions sdk/python/docs/source/feast.infra.offline_stores.contrib.rst
@@ -14,18 +14,26 @@ Subpackages
Submodules
----------

- feast.infra.offline\_stores.contrib.contrib\_repo\_configuration module
- -----------------------------------------------------------------------
+ feast.infra.offline\_stores.contrib.postgres\_repo\_configuration module
+ ------------------------------------------------------------------------

- .. automodule:: feast.infra.offline_stores.contrib.contrib_repo_configuration
+ .. automodule:: feast.infra.offline_stores.contrib.postgres_repo_configuration
     :members:
     :undoc-members:
     :show-inheritance:

- feast.infra.offline\_stores.contrib.postgres\_repo\_configuration module
- ------------------------------------------------------------------------
+ feast.infra.offline\_stores.contrib.spark\_repo\_configuration module
+ ---------------------------------------------------------------------

- .. automodule:: feast.infra.offline_stores.contrib.postgres_repo_configuration
+ .. automodule:: feast.infra.offline_stores.contrib.spark_repo_configuration
     :members:
     :undoc-members:
     :show-inheritance:
+
+ feast.infra.offline\_stores.contrib.trino\_repo\_configuration module
+ ---------------------------------------------------------------------
+
+ .. automodule:: feast.infra.offline_stores.contrib.trino_repo_configuration
+    :members:
+    :undoc-members:
+    :show-inheritance:
@@ -0,0 +1,21 @@
feast.infra.online\_stores.contrib.cassandra\_online\_store package
===================================================================

Submodules
----------

feast.infra.online\_stores.contrib.cassandra\_online\_store.cassandra\_online\_store module
-------------------------------------------------------------------------------------------

.. automodule:: feast.infra.online_stores.contrib.cassandra_online_store.cassandra_online_store
   :members:
   :undoc-members:
   :show-inheritance:

Module contents
---------------

.. automodule:: feast.infra.online_stores.contrib.cassandra_online_store
:members:
:undoc-members:
:show-inheritance:
9 changes: 9 additions & 0 deletions sdk/python/docs/source/feast.infra.online_stores.contrib.rst
@@ -7,11 +7,20 @@ Subpackages
.. toctree::
   :maxdepth: 4

   feast.infra.online_stores.contrib.cassandra_online_store
   feast.infra.online_stores.contrib.hbase_online_store

Submodules
----------

feast.infra.online\_stores.contrib.cassandra\_repo\_configuration module
------------------------------------------------------------------------

.. automodule:: feast.infra.online_stores.contrib.cassandra_repo_configuration
   :members:
   :undoc-members:
   :show-inheritance:

feast.infra.online\_stores.contrib.hbase\_repo\_configuration module
--------------------------------------------------------------------

8 changes: 8 additions & 0 deletions sdk/python/docs/source/feast.rst
@@ -169,6 +169,14 @@ feast.field module
   :undoc-members:
   :show-inheritance:

feast.file\_utils module
------------------------

.. automodule:: feast.file_utils
   :members:
   :undoc-members:
   :show-inheritance:

feast.flags\_helper module
--------------------------

7 changes: 7 additions & 0 deletions sdk/python/docs/source/index.rst
@@ -287,6 +287,13 @@ HBase Online Store
    :members:
    :noindex:

Cassandra Online Store
-----------------------

.. automodule:: feast.infra.online_stores.contrib.cassandra_online_store.cassandra_online_store
    :members:
    :noindex:


Batch Materialization Engine
============================
2 changes: 1 addition & 1 deletion sdk/python/feast/cli.py
@@ -590,7 +590,7 @@ def materialize_incremental_command(ctx: click.Context, end_ts: str, views: List
      "--template",
      "-t",
      type=click.Choice(
-         ["local", "gcp", "aws", "snowflake", "spark", "postgres", "hbase"],
+         ["local", "gcp", "aws", "snowflake", "spark", "postgres", "hbase", "cassandra"],
          case_sensitive=False,
      ),
      help="Specify a template for the created project",