EMR launcher #1061
Conversation
/kind feature
"--entity-df-path", | ||
"-e", | ||
help="Path to entity df in CSV format. It is assumed to have event_timestamp column and a header.", | ||
required=True, |
I am wondering if it might be better if users are expected to provide a URI that is recognizable by the Spark launcher, such as s3:// for EMR, gs:// for Dataproc, and file:// for standalone cluster launchers running locally. That way, we skip the process of reading the file into a Pandas dataframe and converting it again.
Staging a Pandas dataframe is still a useful method to have, though, because we plan to add support for a Pandas dataframe as an input argument to the historical feature retrieval method on the Feast Client.
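For illustration, a minimal sketch of the scheme check this would imply (the helper name and the exact set of supported schemes are assumptions, not part of this PR):

```python
from urllib.parse import urlparse


def is_spark_readable(path: str) -> bool:
    """True if the entity df path can be handed to the Spark launcher as-is.

    Assumption from the comment above: EMR reads s3://, Dataproc reads gs://,
    and the standalone local launcher reads file:// URIs.
    """
    return urlparse(path).scheme in ("s3", "gs", "file")


# Only scheme-less local paths would still need the read-into-Pandas staging step.
assert is_spark_readable("s3://bucket/entities.csv")
assert not is_spark_readable("/tmp/entities.csv")
```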
I agree, it is mostly for convenience/testing for now, to reduce the number of steps someone needs to take to see whether historical retrieval works. I wouldn't expect people to normally use a local CSV for entity dfs. I'd tweak this interface in later PRs though.
@@ -773,7 +773,7 @@ def _feature_table_from_dict(dct: Dict[str, Any]) -> FeatureTable:
     spark = SparkSession.builder.getOrCreate()
     args = _get_args()
     feature_tables_conf = json.loads(args.feature_tables)
-    feature_tables_sources_conf = json.loads(args.feature_tables_source)
+    feature_tables_sources_conf = json.loads(args.feature_tables_sources)
Thanks for catching the typo
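For context, a minimal sketch of the argument parsing this diff assumes (the real Spark job defines more options than shown here):

```python
import argparse
import json

# Sketch only: argparse turns "--feature-tables-sources" into
# args.feature_tables_sources, which is why the attribute name on the
# last line of the diff had to match.
parser = argparse.ArgumentParser()
parser.add_argument("--feature-tables", type=str, required=True)
parser.add_argument("--feature-tables-sources", type=str, required=True)

args = parser.parse_args(
    ["--feature-tables", "[]", "--feature-tables-sources", "[]"]
)
feature_tables_conf = json.loads(args.feature_tables)
feature_tables_sources_conf = json.loads(args.feature_tables_sources)
```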
    self, df: pandas.DataFrame, event_timestamp: str, created_timestamp_column: str
) -> FileSource:
    with tempfile.NamedTemporaryFile() as f:
        df.to_parquet(f)
@pyalex As of Feast 0.7 the dataframe is typically converted to pyarrow / Avro format. I am wondering if we should make Parquet the standard staging format as opposed to pyarrow / Avro, since Spark doesn't support the Avro format out of the box.
> As of Feast 0.7 the dataframe is typically converted to pyarrow / Avro format
In which instances do you see that? Uploading to BQ?
We already write Parquet in client.ingest, but with the DataFormat PR we'll have to support all formats that the user may specify as the Batch Source format, since client.ingest writes to the existing batch source.
@pyalex In the case of historical feature retrieval, where the user input is a Pandas dataframe, do we actually want to make the user specify the format of the staged file? Or simply standardize on Parquet?
Didn't we previously agree to only support parquet?
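If Parquet does become the single staging format, a minimal sketch of the local write step could look like the following (assuming pyarrow is installed; the upload to the staging location is out of scope here, and this is not the code in this PR):

```python
import tempfile

import pandas as pd


def write_parquet_for_staging(df: pd.DataFrame) -> str:
    """Write the dataframe to a local Parquet file to be uploaded afterwards."""
    # delete=False so the file outlives the context manager and can be uploaded.
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
        df.to_parquet(f.name)  # needs pyarrow or fastparquet
        return f.name


local_path = write_parquet_for_staging(
    pd.DataFrame(
        {
            "driver_id": [1, 2],
            "event_timestamp": pd.to_datetime(["2020-10-01", "2020-10-02"]),
        }
    )
)
```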
entity_df["event_timestamp"] = pandas.to_datetime(entity_df["event_timestamp"]) | ||
|
||
uploaded_df = client.stage_dataframe( | ||
entity_df, "event_timestamp", "created_timestamp" |
Just a heads up: "created_timestamp" is actually supposed to be optional for the entity, so it's a bug that we need to resolve in another PR.
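A sketch of how the signature could look once created_timestamp is optional (parameter names are hypothetical, not the actual Feast API):

```python
from typing import Optional

import pandas as pd


def stage_dataframe(
    df: pd.DataFrame,
    event_timestamp_column: str,
    created_timestamp_column: Optional[str] = None,  # optional, per the comment above
) -> pd.DataFrame:
    """Hypothetical signature only; validates columns and returns the df unchanged."""
    required = [event_timestamp_column]
    if created_timestamp_column is not None:
        required.append(created_timestamp_column)
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"entity df is missing columns: {missing}")
    return df
```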
) -> SparkJob:
    return start_offline_to_online_ingestion(feature_table, start, end, self)  # type: ignore

def stage_dataframe(
We actually already have a method in the Feast Client that takes a dataframe and puts it in offline storage. Currently it's called ingest, but I guess we'll rename it. Anyway, this shouldn't be part of the Spark interop, since it's not related to Spark.
In my mind this is somewhat different from ingest. It is not intended for permanent storage; this is a convenience method: "take this dataframe, put it in some temp location where Spark can access it". I agree the launcher might not be the best place for it; it just has to be some code that can read staging_location from the config to construct the temp path.
If it's some upload-to-temp-location function, it probably shouldn't be part of the Feast Client API. Maybe contrib? Or just keep it internal. What's the user use case?
Just user convenience: if you're getting started with Feast and want to run historical retrieval, we can upload your Pandas entity dataframe to S3 so you don't have to think about how to upload it and which bucket to use. We'll just put it in the staging location for you. Basically, trying to remove an extra friction point in onboarding and tutorials. Right now it is only used for the CLI historical-retrieval command.
I agree this may not be the best place for it, but at the same time it needs access to the config to figure out where to upload the dataframe, so I can't make it completely detached from the client (which has the config object).
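A minimal sketch of that idea: the only piece of config the helper needs is staging_location, from which it derives a unique temp URI (names are assumed for illustration, not the implementation in this PR):

```python
import uuid


def temp_entity_df_uri(staging_location: str) -> str:
    """Build a unique temp URI under the configured staging location.

    `staging_location` would come from the client config,
    e.g. "s3://my-bucket/staging".
    """
    return f"{staging_location.rstrip('/')}/entity_dfs/{uuid.uuid4()}.parquet"


print(temp_entity_df_uri("s3://my-bucket/staging"))
# s3://my-bucket/staging/entity_dfs/<uuid>.parquet
```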
    return h.hexdigest()


def _s3_upload(
Can this be moved to the staging client?
Yes, though if it is OK I'd do it in a separate PR. It is not exactly 1:1 with the current staging client interface.
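For reference, a rough sketch of what such an upload helper might boil down to with boto3 (not the implementation from this PR or the linked one):

```python
import boto3


def s3_upload_file(local_path: str, bucket: str, key: str) -> str:
    """Upload a local file to S3 and return its s3:// URI.

    Sketch only; error handling, multipart tuning, etc. are omitted.
    """
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```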
I submitted a PR here: https://github.com/feast-dev/feast/pull/1063/files
Signed-off-by: Oleg Avdeev <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: oavdeev, pyalex. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
We agreed with @oavdeev to refactor the staging client in a separate PR and merge this to unblock further development.
/lgtm
What this PR does / why we need it:
This adds an EMR launcher in line with the new launcher interface. It does historical retrieval and offline-to-online ingestion. I'll do stream ingestion next; figured this is large enough already.
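(Purely illustrative: a rough sketch of what a launcher implementing such an interface could look like; class and method names here are assumptions, not the actual Feast classes.)

```python
import abc


class JobLauncher(abc.ABC):
    """Illustrative launcher interface; names are assumed for this sketch."""

    @abc.abstractmethod
    def historical_feature_retrieval(self, job_params) -> str:
        """Submit a historical retrieval job and return a job id."""

    @abc.abstractmethod
    def offline_to_online_ingestion(self, job_params) -> str:
        """Submit an offline-to-online ingestion job and return a job id."""


class EmrLauncher(JobLauncher):
    """An EMR-backed launcher would submit each job as an EMR step."""

    def historical_feature_retrieval(self, job_params) -> str:
        raise NotImplementedError("sketch only")

    def offline_to_online_ingestion(self, job_params) -> str:
        raise NotImplementedError("sketch only")
```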
A few notable bits (aside from just moving things around):
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: