
Great Expectations Plugin #495

Merged (48 commits) on Aug 18, 2021

Conversation

@samhita-alla (Contributor) commented May 31, 2021

Signed-off-by: Samhita Alla [email protected]

TL;DR

Add a Great Expectations plugin to validate data.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

TASK

Define GETask and validate the dataset by passing it as an argument to the task.

Valid data example:

import os

import pandas as pd
from flytekitplugins.greatexpectations import GETask

from flytekit import kwtypes, task, workflow

task_object = GETask(
    name="test4",
    data_source="data",
    inputs=kwtypes(dataset=str),
    expectation_suite="test.demo",
    data_connector="data_example_data_connector",
)


@task
def my_task(csv_file: str) -> int:
    task_object(dataset=csv_file)
    df = pd.read_csv(os.path.join("data", csv_file))
    return df.shape[0]


@workflow
def valid_wf(dataset: str = "yellow_tripdata_sample_2019-01.csv") -> int:
    return my_task(csv_file=dataset)


if __name__ == "__main__":
    print(f"Running {__file__} main...")
    print(valid_wf())

Invalid data example:

@workflow
def invalid_wf(dataset: str = "yellow_tripdata_sample_2019-02.csv") -> int:
    return my_task(csv_file=dataset)


if __name__ == "__main__":
    print(f"Running {__file__} main...")
    invalid_wf()

Output:

...
COLUMN          FAILED EXPECTATION
passenger_count -> expect_column_min_to_be_between
passenger_count -> expect_column_mean_to_be_between
passenger_count -> expect_column_quantile_values_to_be_between
passenger_count -> expect_column_values_to_be_in_set
passenger_count -> expect_column_proportion_of_unique_values_to_be_between
trip_distance -> expect_column_max_to_be_between
trip_distance -> expect_column_mean_to_be_between
trip_distance -> expect_column_median_to_be_between
trip_distance -> expect_column_quantile_values_to_be_between
trip_distance -> expect_column_proportion_of_unique_values_to_be_between
rate_code_id -> expect_column_max_to_be_between
rate_code_id -> expect_column_mean_to_be_between
rate_code_id -> expect_column_proportion_of_unique_values_to_be_between
...
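The `COLUMN -> FAILED EXPECTATION` table above can be derived from a Great Expectations validation result. A minimal stdlib sketch, assuming a result dict shaped like GE's serialized validation output (the `sample` data and `failed_expectations` helper are hypothetical illustrations, not the plugin's code):

```python
# Sketch: collect (column, expectation_type) pairs for failed
# expectations from a validation-result-like dict (structure assumed).
def failed_expectations(result: dict) -> list:
    rows = []
    for r in result.get("results", []):
        if not r["success"]:
            cfg = r["expectation_config"]
            # Table-level expectations have no "column" kwarg.
            column = cfg["kwargs"].get("column", "<table>")
            rows.append((column, cfg["expectation_type"]))
    return rows

sample = {
    "results": [
        {"success": False,
         "expectation_config": {
             "expectation_type": "expect_column_min_to_be_between",
             "kwargs": {"column": "passenger_count"}}},
        {"success": True,
         "expectation_config": {
             "expectation_type": "expect_column_max_to_be_between",
             "kwargs": {"column": "trip_distance"}}},
    ]
}

for col, exp in failed_expectations(sample):
    print(f"{col} -> {exp}")
```

Only the failing rows are reported, which is what produces the compact table in the output above.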

TYPE

Attach GEType to your dataset in a task; the data will then be validated automatically.

import os

import pandas as pd
from flytekitplugins.greatexpectations import GEConfig, GEType

from flytekit import task, workflow


ge_config = GEConfig(
    data_source="data",
    expectation_suite="test.demo",
    data_connector="data_example_data_connector",
)


@task
def my_task(csv_file: GEType[str, ge_config]) -> int:
    df = pd.read_csv(os.path.join("data", csv_file))
    df.drop(5, axis=0, inplace=True)
    return df.shape[0]


@workflow
def valid_wf(dataset: str = "yellow_tripdata_sample_2019-01.csv") -> int:
    return my_task(csv_file=dataset)


valid_wf()

Examples covering FlyteSchema & FlyteFile are available in the tests.
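The `GEType[str, ge_config]` annotation above relies on a type parameterized with its validation config. A minimal stdlib sketch of that general mechanism, assuming hypothetical stand-ins for `GEType` and `GEConfig` (this is not the plugin's implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GEConfig:
    # Hypothetical mirror of the plugin's config object.
    data_source: str
    expectation_suite: str
    data_connector: str


class GEType:
    # GEType[str, cfg] returns a new class that remembers both the
    # underlying Python type and the attached validation config, so a
    # type engine can look them up when the task input is materialized.
    def __class_getitem__(cls, item):
        base, config = item
        return type("GEType", (cls,), {
            "__base_type__": base,
            "__ge_config__": config,
        })


cfg = GEConfig("data", "test.demo", "data_example_data_connector")
DatasetStr = GEType[str, cfg]
print(DatasetStr.__base_type__)                    # <class 'str'>
print(DatasetStr.__ge_config__.expectation_suite)  # test.demo
```

A type transformer can then read `__ge_config__` off the annotation and run validation before handing the value to the task body.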

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

NA

Signed-off-by: Samhita Alla <[email protected]>
…d of same dataclass name in both task and schema

Signed-off-by: Samhita Alla <[email protected]>
@samhita-alla force-pushed the great-expectations-plugin branch from e08b0f0 to 2ac96ab on June 10, 2021 at 11:53
@samhita-alla (Contributor, Author)

@kumare3 Added support for FlyteSchema and FlyteFile. Please review.

Signed-off-by: Samhita Alla <[email protected]>
@talagluck

Thanks for these updates, @samhita-alla! This looks great.

talagluck previously approved these changes Aug 10, 2021
@samhita-alla (Contributor, Author)

Just a heads up: I pinned the Great Expectations version to 0.13.23. With the latest release, test_flytekit_sagemaker_runner.py throws an error:

_______ ERROR collecting tests/scripts/test_flytekit_sagemaker_runner.py _______ 
ImportError while importing test module '/home/runner/work/flytekit/flytekit/tests/scripts/test_flytekit_sagemaker_runner.py'. 
Hint: make sure your test modules/packages have valid Python names. 

Traceback: 
/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/importlib/__init__.py:127: in import_module 
return _bootstrap._gcd_import(name[level:], package, level) 
tests/scripts/test_flytekit_sagemaker_runner.py:5: in <module> 
from scripts.flytekit_sagemaker_runner import run as _flyte_sagemaker_run 
E ModuleNotFoundError: No module named 'scripts.flytekit_sagemaker_runner'

@talagluck

Hi @samhita-alla - thanks for letting us know. That error doesn't look like it would be due to Great Expectations. Do you have any ideas about why it might be occurring?

I think pinning will be ok temporarily, but we're in the process of rolling out some important fixes and powerful features, so it would be good to unpin in the near future.

@samhita-alla (Contributor, Author)

> Hi @samhita-alla - thanks for letting us know. That error doesn't look like it would be due to Great Expectations. Do you have any ideas about why it might be occurring?
>
> I think pinning will be ok temporarily, but we're in the process of rolling out some important fixes and powerful features, so it would be good to unpin in the near future.

I'm not sure why it's happening; we'll have to dig deeper to find out. I'm sure the problem is on our end, though.
@wild-endeavor, any thoughts about the test error?

@eapolinario (Collaborator)

I was able to reproduce this. It was caused by great-expectations/great_expectations#3003, which ends up defining a package called scripts that takes precedence during package resolution.

I opened #598 to fix the issue.
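The shadowing described above comes down to sys.path order: whichever directory providing a `scripts` package appears earlier wins. A small stdlib sketch (temporary directories; not the actual flytekit or Great Expectations layout) demonstrating the resolution order:

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

# Two directories, each providing a package named "scripts" -
# analogous to an installed distribution vs. the repo checkout.
site_a = Path(tempfile.mkdtemp())
site_b = Path(tempfile.mkdtemp())
for site in (site_a, site_b):
    pkg = site / "scripts"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(f"ORIGIN = {str(site)!r}\n")

# The entry earlier on sys.path wins: a site-packages entry named
# "scripts" placed before the repo root shadows the repo's scripts/.
sys.path.insert(0, str(site_b))
sys.path.insert(0, str(site_a))
spec = importlib.util.find_spec("scripts")
print(spec.origin)  # resolves inside site_a, not site_b
```

This is why a dependency that ships a top-level `scripts` package can break `from scripts.flytekit_sagemaker_runner import ...` in a repo that has its own `scripts/` directory.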

@kumare3 (Contributor) commented Aug 17, 2021

@samhita-alla this is great. I have a few nits; we can handle them in two ways. I'm ok with merging and then revisiting the nits, especially refactoring the extremely large functions and handling other types, either with an error or by supporting them.

Signed-off-by: Samhita Alla <[email protected]>
}

# Great Expectations' RuntimeBatchRequest
if self._batch_request_config and (self._batch_request_config.runtime_parameters or is_runtime):
Contributor

is this always a json? not a typed structure?

Contributor Author

_batch_request_config is a dataclass, whereas final_batch_request is a dictionary.
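The distinction in the answer above (a typed dataclass config flattened into a plain request dictionary) can be sketched like this; all names here (`BatchRequestConfig`, `build_batch_request`) are hypothetical illustrations, not the plugin's actual API:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BatchRequestConfig:
    # Typed configuration the user writes (hypothetical mirror of the
    # plugin's _batch_request_config dataclass).
    data_connector_query: Optional[dict] = None
    runtime_parameters: Optional[dict] = None
    batch_identifiers: Optional[dict] = None


def build_batch_request(cfg: BatchRequestConfig,
                        datasource: str, asset: str) -> dict:
    # The final request handed to Great Expectations is a plain dict,
    # built from the typed dataclass plus call-site values; unset
    # fields are dropped rather than passed as None.
    final = {"datasource_name": datasource, "data_asset_name": asset}
    final.update({k: v for k, v in asdict(cfg).items() if v is not None})
    return final


cfg = BatchRequestConfig(runtime_parameters={"path": "data/file.csv"})
req = build_batch_request(cfg, "data", "yellow_tripdata")
print(req)
```

The dataclass gives users type checking and defaults, while the dict form is what the downstream validation call consumes.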

@kumare3 self-requested a review on August 17, 2021 at 17:47
@samhita-alla merged commit b8a3c7d into flyteorg:master on Aug 18, 2021
@samhita-alla deleted the great-expectations-plugin branch on August 18, 2021 at 07:25
4 participants