[DOCS] Update Expectation conditions docs #10661

Open
wants to merge 45 commits into
base: develop
218e6c9
Add tests and update api
NathanFarmer Nov 14, 2024
c1d5c73
Update tests
NathanFarmer Nov 14, 2024
62f0ac2
Make docs changes
NathanFarmer Nov 14, 2024
7ce9718
Update md and py file content
NathanFarmer Nov 14, 2024
d15babc
Remove sample code and add tabs for logic near the top
NathanFarmer Nov 14, 2024
495e82d
Minor text change
NathanFarmer Nov 14, 2024
6f67e83
Add more test cases
NathanFarmer Nov 14, 2024
4358e82
Add more test cases
NathanFarmer Nov 14, 2024
a819dc0
Revert "Add more test cases"
NathanFarmer Nov 14, 2024
37b15ed
Minor text change again
NathanFarmer Nov 14, 2024
3106d5f
Minor update to snippet
NathanFarmer Nov 14, 2024
1204b7d
Merge branch 'm/ph-1627/row-conditions-api' into d/ph-1627/update-row…
NathanFarmer Nov 14, 2024
28ec591
Make ruff happy
NathanFarmer Nov 14, 2024
105bdd3
Use type alias
NathanFarmer Nov 14, 2024
2805f47
Update enum handling for python 3.12
NathanFarmer Nov 14, 2024
5ad1ebc
Simplfy
NathanFarmer Nov 14, 2024
0d30b1d
Merge branch 'm/ph-1627/row-conditions-api' into d/ph-1627/update-row…
NathanFarmer Nov 14, 2024
d8ead36
Update schemas again
NathanFarmer Nov 14, 2024
5cea1c6
condition_parser validator should allow None
NathanFarmer Nov 14, 2024
c0e5042
Merge branch 'm/ph-1627/row-conditions-api' into d/ph-1627/update-row…
NathanFarmer Nov 14, 2024
ece1e6b
Merge branch 'develop' into d/ph-1627/update-row-condition-docs
NathanFarmer Nov 14, 2024
e8a5d9a
Move functional changes into another PR
NathanFarmer Nov 14, 2024
543db2f
Merge branch 'd/ph-1627/update-row-condition-docs' of github.com:grea…
NathanFarmer Nov 14, 2024
d0561a5
Move functional changes into another PR
NathanFarmer Nov 14, 2024
0639ba5
Indentation
NathanFarmer Nov 14, 2024
bc05790
Indentation
NathanFarmer Nov 14, 2024
0223962
Move tabs up below procedure
NathanFarmer Nov 14, 2024
5e33dfd
Try to make all tabs one
NathanFarmer Nov 14, 2024
3b85965
Revert "Try to make all tabs one"
NathanFarmer Nov 14, 2024
26f7956
Try again
NathanFarmer Nov 14, 2024
7a4511b
Revert "Try again"
NathanFarmer Nov 14, 2024
cb8a40b
Hide unwanted tabs
NathanFarmer Nov 14, 2024
b681498
Try to outdent Tabs>
NathanFarmer Nov 15, 2024
6c00233
Revert "Try to outdent Tabs>"
NathanFarmer Nov 15, 2024
90e8790
Exclude new link due to limitations of link checker
NathanFarmer Nov 15, 2024
c812f13
Escape periods
NathanFarmer Nov 15, 2024
b189d52
Try adding index.html
NathanFarmer Nov 15, 2024
6e0ca97
Try wildcard
NathanFarmer Nov 15, 2024
ee3ae04
Fix syntax
NathanFarmer Nov 15, 2024
425db6f
Don't use exclude keyword twice
NathanFarmer Nov 15, 2024
6495bac
Don't escape periods
NathanFarmer Nov 15, 2024
f5ad018
Try wildcard
NathanFarmer Nov 15, 2024
72f86a9
Try again
NathanFarmer Nov 15, 2024
c297e44
Remove include
NathanFarmer Nov 15, 2024
e2a2c27
Trying everything I can
NathanFarmer Nov 15, 2024
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -308,7 +308,7 @@ jobs:
# We decided to exclude all external HTTP requests but the ones that under the domain greatexpectations.io
# The reason is to avoid having network errors such as pages that throw 429 after too many requests (like Github)
# and to prevent other possible errors related to user agent or lychee capturing hrefs from metadata that don't resolve to a specific page (preconnects in JS)
args: "--exclude='http.*' --include='^https://(.+\\.)?greatexpectations\\.io/' 'docs/docusaurus/build/**/*.html'"
Contributor Author


This is a workaround due to limitations of the new link checker. It doesn't know how to check new pages. I will put in a follow-up PR reverting this one line change.

args: "--exclude='http.*' 'docs/docusaurus/build/**/*.html'"

docs-tests:
runs-on: ubuntu-latest
6 changes: 3 additions & 3 deletions docs/docusaurus/docs/components/examples_under_test.py
@@ -473,9 +473,9 @@
),
IntegrationTestFixture(
# To test, run:
# pytest --docs-tests -k "docs_example_expectation_row_conditions" tests/integration/test_script_runner.py
name="docs_example_expectation_row_conditions",
user_flow_script="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py",
# pytest --docs-tests -k "docs_example_expectation_conditions" tests/integration/test_script_runner.py
name="docs_example_expectation_conditions",
user_flow_script="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py",
data_dir="docs/docusaurus/docs/components/_testing/test_data_sets/titantic_test_file",
# data_context_dir="",
backend_dependencies=[],
@@ -1,8 +1,8 @@
"""
This is an example script for how to use Expectation row conditions.
This is an example script for how to use Expectation conditions.

To test, run:
pytest --docs-tests -k "doc_example_expectation_row_conditions" tests/integration/test_script_runner.py
pytest --docs-tests -k "docs_example_expectation_conditions" tests/integration/test_script_runner.py
"""


@@ -31,7 +31,7 @@ def set_up_context_for_example(context):


# EXAMPLE SCRIPT STARTS HERE:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py - full code example">
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - full code example">
import great_expectations as gx

context = gx.get_context()
@@ -49,8 +49,8 @@ def set_up_context_for_example(context):
.get_batch()
)

# An unconditional Expectation is defined without the `row_condition` or `condition_parser` parameters:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py - example unconditional Expectation">
# An Expectation without conditions is defined without the `row_condition` or `condition_parser` parameters:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - example Expectation without conditions">
expectation = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived", value_set=[0, 1]
)
@@ -59,28 +59,42 @@ def set_up_context_for_example(context):
# Test the Expectation:
print(batch.validate(expectation))

# A Conditional Expectation for a pandas Data Source would be defined like this:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py - example conditional Expectation">
conditional_expectation = gx.expectations.ExpectColumnValuesToBeInSet(
# An Expectation condition for a pandas Data Source would be defined like this:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - pandas example Expectation with conditions">
expectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py - pandas example row_condition">
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - pandas example row_condition">
condition_parser="pandas",
row_condition='PClass=="1st"',
# </snippet>
)
# </snippet>

# Test the Conditional Expectation:
print(batch.validate(conditional_expectation))
# Test the Expectation condition:
print(batch.validate(expectation_with_condition))

# A Conditional Expectation for a Spark or SQL Data Source would be defined like this:
conditional_expectation = gx.expectations.ExpectColumnValuesToBeInSet(
# A Conditional Expectation for a Spark Data Source would be defined like this:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - spark example Expectation with conditions">
expectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_row_conditions.py - spark example row_condition">
condition_parser="spark",
row_condition='PClass=="1st"',
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - spark example row_condition">
condition_parser="great_expectations",
row_condition='col("PClass")=="1st"',
# </snippet>
)
# </snippet>

# A Conditional Expectation for a SQL Data Source would be defined like this:
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - sql example Expectation with conditions">
expectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
# <snippet name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - sql example row_condition">
condition_parser="great_expectations",
row_condition='col("PClass")=="1st"',
# </snippet>
)
# </snippet>
# </snippet>
@@ -18,7 +18,7 @@ import OverviewCard from '@site/src/components/OverviewCard';
topIcon
label="Restrict an Expectation with row conditions"
description="Use `row_conditions` to restrict the data an Expectation evaluates"
to="/core/customize_expectations/expectation_row_conditions"
to="/core/customize_expectations/expectation_conditions"
icon="/img/expectation_icon.svg"
/>
<LinkCard
@@ -0,0 +1,261 @@
---
title: Apply Expectation conditions to specific rows within a Batch
---
import TabItem from '@theme/TabItem';
import Tabs from '@theme/Tabs';

import PrereqPythonInstalled from '../_core_components/prerequisites/_python_installation.md';
import PrereqGxInstalled from '../_core_components/prerequisites/_gx_installation.md';
import PrereqPreconfiguredDataContext from '../_core_components/prerequisites/_preconfigured_data_context.md';
import PrereqPreconfiguredDataSourceAndAsset from '../_core_components/prerequisites/_data_source_and_asset_connected_to_data.md';

By default, Expectations apply to the entire dataset retrieved in a Batch. However, there are instances when an Expectation may not be relevant for every row. Validating every row could lead to false positives or false negatives in the Validation Results.

For example, you might define an Expectation that a column indicating the country of origin for a product should not be null. If this Expectation is only applicable when the product is an import, applying it to every row in the Batch could result in many false negatives when the country of origin column is null for products produced locally.

To address this issue, GX allows you to define Expectation conditions that apply only to a subset of the data retrieved in a Batch.
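
As a rough illustration of the idea, independent of GX, the following plain-Python sketch applies a not-null check only to the rows that satisfy a condition. The `products` rows, the `is_import` and `country_of_origin` field names, and the `failing_rows` helper are all invented for this example.

```python
# Hypothetical product rows; the field names are assumptions for illustration.
products = [
    {"name": "tea", "is_import": True, "country_of_origin": "IN"},
    {"name": "bread", "is_import": False, "country_of_origin": None},
    {"name": "coffee", "is_import": True, "country_of_origin": None},
]


def failing_rows(rows, condition=None):
    """Return the names of rows with a null country_of_origin.

    When a condition is given, only rows for which it returns True are checked.
    """
    selected = [row for row in rows if condition is None or condition(row)]
    return [row["name"] for row in selected if row["country_of_origin"] is None]


# Without a condition, the locally produced "bread" row is reported as a failure.
unconditional = failing_rows(products)

# With a condition, only imported products are checked.
conditional = failing_rows(products, condition=lambda row: row["is_import"])

print(unconditional)
print(conditional)
```

Only the check with the condition leaves out the locally produced row, which is the behavior an Expectation condition is meant to provide.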

## Create an Expectation condition

GX lets you restrict which rows an Expectation validates with the `row_condition` argument, which is available on all Expectations that assess rows within a Batch. The `row_condition` argument must be a string containing a boolean expression. A row is validated when the expression evaluates to `True` and is skipped by the Expectation when it evaluates to `False`.

### Prerequisites

- <PrereqPythonInstalled/>.
- <PrereqGxInstalled/>.
- <PrereqPreconfiguredDataContext/>.
- Recommended. <PrereqPreconfiguredDataSourceAndAsset/> for [testing your customized Expectation](/core/define_expectations/test_an_expectation.md).

### Procedure

<Tabs queryString="condition_parser" groupId="condition_parser" defaultValue='pandas' values={[{label: 'pandas', value:'pandas'}, {label: 'Spark', value:'spark'}, {label: 'SQL', value:'sql'}]}>

<TabItem value="pandas" label="pandas">

In this procedure, it is assumed that your Data Context is stored in the variable `context`, and your Expectation Suite is stored in the variable `suite`. The `suite` can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.

The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.

1. Determine the `condition_parser` for your `row_condition`.

The `condition_parser` defines the syntax of `row_condition` strings. When implementing Expectation conditions with pandas, set this argument to `"pandas"`.

</TabItem>

<TabItem value="spark" label="Spark">

In this procedure, it is assumed that your Data Context is stored in the variable `context`, and your Expectation Suite is stored in the variable `suite`. The `suite` can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.

The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.

1. Determine the `condition_parser` for your `row_condition`.

The `condition_parser` defines the syntax of `row_condition` strings. When implementing Expectation conditions with Spark, set this argument to `"great_expectations"`.

</TabItem>

<TabItem value="sql" label="SQL">

In this procedure, it is assumed that your Data Context is stored in the variable `context`, and your Expectation Suite is stored in the variable `suite`. The `suite` can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.

The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.

1. Determine the `condition_parser` for your `row_condition`.

The `condition_parser` defines the syntax of `row_condition` strings. When implementing Expectation conditions with SQL, set this argument to `"great_expectations"`.

</TabItem>

</Tabs>

Note that the Expectation with conditions will fail if the Batch being validated is from a different type of Data Source than indicated by the `condition_parser`.

2. Determine the `row_condition` expression.

The `row_condition` argument should be a boolean expression string that is evaluated for each row in the Batch that the Expectation validates. If the `row_condition` evaluates to `True`, the row will be included in the Expectation's validations. If it evaluates to `False`, the Expectation will be skipped for that row.

The syntax of the `row_condition` argument is based on the `condition_parser` specified earlier.

3. Create the Expectation.

An Expectation with conditions is created like a regular Expectation, with the addition of the `row_condition` and `condition_parser` parameters alongside the Expectation's other arguments.

<Tabs className="hidden" queryString="condition_parser" groupId="condition_parser" defaultValue='pandas' values={[{label: 'pandas', value:'pandas'}, {label: 'Spark', value:'spark'}, {label: 'SQL', value:'sql'}]}>

<TabItem value="pandas" label="pandas">

In pandas, the `row_condition` value is passed to `pandas.DataFrame.query()` prior to Expectation Validation, and the resulting rows from the evaluated Batch will undergo validation by the Expectation.

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - pandas example row_condition"
```

Do not use single quotes, newlines, or `\n` in the specified `row_condition` as shown in the following examples:

```python title="Python"
row_condition = "PClass=='1st'" # Don't do this. Single quotes aren't valid!

row_condition="""
PClass=="1st"
""" # Don't do this. Newlines and \n aren't valid!

row_condition = 'PClass=="1st"' # Do this instead.
```
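
To see what this filtering step amounts to, here is a minimal sketch that calls `pandas.DataFrame.query()` directly, outside of GX. The small DataFrame is a made-up stand-in for the Titanic data, and the final check is a hand-written approximation of an in-set check, not GX's implementation.

```python
import pandas as pd

# Tiny stand-in for the Titanic data; the values are made up for illustration.
df = pd.DataFrame(
    {
        "PClass": ["1st", "1st", "2nd", "3rd"],
        "Survived": [1, 0, 1, 0],
    }
)

row_condition = 'PClass=="1st"'

# Keep only the rows for which the condition is True.
first_class = df.query(row_condition)

# Among the remaining rows, find values outside the value set [1].
unexpected = first_class[~first_class["Survived"].isin([1])]

print(len(first_class), len(unexpected))
```

Rows for second and third class passengers never reach the in-set check, so they cannot contribute unexpected values.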

</TabItem>

<TabItem value="spark" label="Spark">

In Spark, the `row_condition` uses custom syntax, which is parsed as a data filter or query prior to Expectation Validation.

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - spark example row_condition"
```

Do not use single quotes, newlines, or `\n` in the specified `row_condition` as shown in the following examples:

```python title="Python"
row_condition = "col('PClass')=='1st'" # Don't do this. Single quotes aren't valid!

row_condition="""
col("PClass")=="1st"
""" # Don't do this. Newlines and \n aren't valid!

row_condition = 'col("PClass")=="1st"' # Do this instead.
```
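
If you build `row_condition` strings programmatically, a small guard can catch these mistakes before the Expectation is created. The `check_row_condition` helper below is hypothetical and is not part of the GX API; it only encodes the two restrictions described above.

```python
def check_row_condition(row_condition: str) -> str:
    """Hypothetical pre-flight check; not part of the GX API.

    Raises ValueError if the string contains single quotes, a newline
    character, or a literal backslash followed by "n".
    """
    if "'" in row_condition:
        raise ValueError("row_condition must not contain single quotes")
    if "\n" in row_condition or "\\n" in row_condition:
        raise ValueError("row_condition must not contain newlines or \\n")
    return row_condition


# A valid condition is returned unchanged.
print(check_row_condition('col("PClass")=="1st"'))
```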

</TabItem>

<TabItem value="sql" label="SQL">

In SQL, the `row_condition` uses custom syntax, which is parsed as a data filter or query prior to Expectation Validation.

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - sql example row_condition"
```

Do not use single quotes, newlines, or `\n` in the specified `row_condition` as shown in the following examples:

```python title="Python"
row_condition = "col('PClass')=='1st'" # Don't do this. Single quotes aren't valid!

row_condition="""
col("PClass")=="1st"
""" # Don't do this. Newlines and \n aren't valid!

row_condition = 'col("PClass")=="1st"' # Do this instead.
```

</TabItem>

</Tabs>

<Tabs className="hidden" queryString="condition_parser" groupId="condition_parser" defaultValue='pandas' values={[{label: 'pandas', value:'pandas'}, {label: 'Spark', value:'spark'}, {label: 'SQL', value:'sql'}]}>

<TabItem value="pandas" label="pandas">

In pandas, you can reference variables from the environment by prefixing them with `@`. Additionally, when a column name contains spaces, you can specify it by enclosing the name in backticks: `` ` ``.

Some examples of valid `row_condition` values for pandas include:

```python title="Python"
row_condition = '`foo foo`=="bar bar"' # The value of the column "foo foo" is "bar bar"

row_condition = 'foo==@bar' # The value of the column "foo" equals the value of the Python variable bar
```

For more information on the syntax accepted by pandas `row_condition` values, see [pandas.DataFrame.query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html).
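
The two syntax features above can be tried directly with pandas, outside of GX. In this sketch the DataFrame, its column names, and the variable `bar` are made up for illustration.

```python
import pandas as pd

# Made-up data with one column name that contains a space.
df = pd.DataFrame({"foo foo": ["bar bar", "baz"], "foo": [1, 2]})

# Backticks quote a column name that contains a space.
spaced = df.query('`foo foo`=="bar bar"')

# @bar refers to the Python variable `bar` visible where query() is called.
bar = 2
matched = df.query('foo==@bar')

print(spaced.index.tolist(), matched.index.tolist())
```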

</TabItem>

<TabItem value="spark" label="Spark">

For Spark, you should specify your columns using the `col()` function.

Some examples of valid `row_condition` values for Spark include:

```python title="Python"
row_condition='col("foo") == "Two Two"' # foo is 'Two Two'

row_condition='col("foo").notNull()' # foo is not null

row_condition='col("foo") > 5' # foo is greater than 5

row_condition='col("foo") <= 3.14' # foo is less than or equal to 3.14

row_condition='col("foo") <= date("2023-03-13")' # foo is on or before 2023-03-13

```
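
To make the semantics of these strings concrete, here is a toy evaluator for a small subset of the `col()` syntax: a single comparison between one column and one literal. It is a sketch under that assumption and is not GX's parser; `filter_rows` and the sample rows are invented for illustration, and features such as `notNull()` and `date()` are deliberately left out.

```python
import ast
import operator
import re

# Toy grammar: col("<name>") <op> <Python literal>. This is an assumption made
# for the sketch; the real "great_expectations" parser accepts more than this.
_PATTERN = re.compile(
    r'^col\("(?P<name>[^"]+)"\)\s*(?P<op>==|!=|<=|>=|<|>)\s*(?P<value>.+)$'
)
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def filter_rows(rows, row_condition):
    """Return the rows (dicts) for which the toy condition evaluates to True."""
    match = _PATTERN.match(row_condition.strip())
    if match is None:
        raise ValueError(f"unsupported row_condition: {row_condition!r}")
    compare = _OPS[match["op"]]
    value = ast.literal_eval(match["value"])  # parse the literal safely
    name = match["name"]
    # Rows with a null value in the column never satisfy a comparison here.
    return [
        row for row in rows if row[name] is not None and compare(row[name], value)
    ]


rows = [{"foo": 2}, {"foo": 7}, {"foo": None}]
print(filter_rows(rows, 'col("foo") > 5'))
print(filter_rows(rows, 'col("foo") <= 3.14'))
```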

</TabItem>

<TabItem value="sql" label="SQL">

For SQL, you should specify your columns using the `col()` function.

Some examples of valid `row_condition` values for SQL include:

```python title="Python"
row_condition='col("foo") == "Two Two"' # foo is 'Two Two'

row_condition='col("foo").notNull()' # foo is not null

row_condition='col("foo") > 5' # foo is greater than 5

row_condition='col("foo") <= 3.14' # foo is less than or equal to 3.14

row_condition='col("foo") <= date("2023-03-13")' # foo is on or before 2023-03-13

```

</TabItem>

</Tabs>

4. Optional. Create additional Expectation conditions.

Expectations that have different conditions are treated as unique, even if they are of the same type and apply to the same column within an Expectation Suite. This allows you to create one Expectation without conditions and any number of Expectations with conditions, each with a distinct condition.

For instance, the following code establishes an Expectation that the value in the `"Survived"` column is either 0 or 1:

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - example Expectation without conditions"
```

The following code defines an Expectation with a condition: the value of the `"Survived"` column must be `1` when the passenger held a first class ticket:

<Tabs className="hidden" queryString="condition_parser" groupId="condition_parser" defaultValue='pandas' values={[{label: 'pandas', value:'pandas'}, {label: 'Spark', value:'spark'}, {label: 'SQL', value:'sql'}]}>

<TabItem value="pandas" label="pandas">

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - pandas example Expectation with conditions"
```
</TabItem>

<TabItem value="spark" label="Spark">

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - spark example Expectation with conditions"
```
</TabItem>

<TabItem value="sql" label="SQL">

```python title="Python" name="docs/docusaurus/docs/core/customize_expectations/_examples/expectation_conditions.py - sql example Expectation with conditions"
```
</TabItem>

</Tabs>
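
One way to picture the uniqueness rule from this step is to treat the condition as part of an Expectation's identity. The sketch below is purely illustrative: `ExpectationKey` is a hypothetical record, not a GX class, and it does not reflect how GX compares Expectations internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectationKey:
    """Hypothetical identity record; not a GX class."""

    expectation_type: str
    column: str
    row_condition: Optional[str] = None


suite_keys = {
    # No condition.
    ExpectationKey("expect_column_values_to_be_in_set", "Survived"),
    # Same type and column, but a condition makes it a distinct entry.
    ExpectationKey("expect_column_values_to_be_in_set", "Survived", 'PClass=="1st"'),
    # An exact repeat of the previous entry collapses into it.
    ExpectationKey("expect_column_values_to_be_in_set", "Survived", 'PClass=="1st"'),
}

print(len(suite_keys))
```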

## Data Docs and Expectation conditions

Expectations with conditions are presented differently from standard Expectations in the Data Docs. Each Expectation with conditions is prefaced with *if 'row_condition_string', then values must be...* as illustrated in the following image:

![Image](/docs/oss/images/conditional_data_docs_screenshot.png)

If the *'row_condition_string'* is a complex expression, it will be divided into several components to enhance readability.

## Scope and limitations

While conditions can be applied to most Expectations, the following Expectations cannot be conditioned and do not accept the `row_condition` argument:

* `expect_column_to_exist`
* `expect_table_columns_to_match_ordered_list`
* `expect_table_columns_to_match_set`
* `expect_table_column_count_to_be_between`
* `expect_table_column_count_to_equal`
* `unexpected_rows_expectation`
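
If you generate Expectations programmatically, the list above can be encoded as a simple lookup so that `row_condition` is only passed where it is accepted. The `supports_row_condition` helper below is hypothetical and not part of the GX API; the names in the set are copied from the list above.

```python
# Expectations that, per the list above, do not accept row_condition.
UNCONDITIONAL_ONLY = frozenset(
    {
        "expect_column_to_exist",
        "expect_table_columns_to_match_ordered_list",
        "expect_table_columns_to_match_set",
        "expect_table_column_count_to_be_between",
        "expect_table_column_count_to_equal",
        "unexpected_rows_expectation",
    }
)


def supports_row_condition(expectation_type: str) -> bool:
    """Hypothetical helper: True unless the type is in the list above."""
    return expectation_type not in UNCONDITIONAL_ONLY


print(supports_row_condition("expect_column_values_to_be_in_set"))
print(supports_row_condition("expect_column_to_exist"))
```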