
feat: Delta table partition watermarks #1694

Merged
merged 19 commits into amundsen-io:main from more-delta-lake on Mar 9, 2022

Conversation

@harmw (Contributor) commented Feb 1, 2022

Summary of Changes

Introducing watermark detection on Delta tables. I lost track of the original attempt [1] around the time of the move to the monorepo, so here it is once more 🙈

[1] amundsen-io/amundsendatabuilder#427
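For context, the core idea is to take the min and max of a table's partition column and report them as low/high watermarks. A minimal sketch, assuming a running SparkSession and illustrative table/column names (this is not the extractor's actual code):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

# Hypothetical partitioned Delta table and partition column, for illustration only.
df = spark.table("my_schema.my_delta_table")
partition_column = "ds"

# Low/high watermarks are simply the min/max values of the partition column.
bounds = df.select(
    F.min(partition_column).alias("low"),
    F.max(partition_column).alias("high"),
).first()

if bounds is not None:
    print(bounds["low"], bounds["high"])  # these values would feed the Watermark models downstream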

Tests

Testing for watermarks.

Documentation

Nothing.

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR contain docstrings that explain what they do

@harmw requested a review from a team as a code owner February 1, 2022 13:52
@boring-cyborg bot added the area:databuilder and category:models labels Feb 1, 2022
@boring-cyborg bot commented Feb 1, 2022

Congratulations on your first Pull Request and welcome to the Amundsen community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/amundsen-io/amundsen/blob/main/CONTRIBUTING.md)

* Install pyspark for dev work

So now we can run pytest on a fresh clone. Due to the rather old version this will throw some DeprecationWarning messages, but we can upgrade to 3.1 at a later stage.

* Select the partition_column

Going with the first item of the returned list will return the same column, which is not deterministic at all (given there are multiple partitions).

* Only process partitions of a workable type

Since watermarking strings doesn't make much sense, keep to checking integer/float/date/datetime types.

Signed-off-by: Harm Weites <[email protected]>
@@ -13,3 +13,4 @@ pytest-cov>=2.12.0
pytest-env>=0.6.2
pytest-mock>=3.6.1
typed-ast>=1.4.3
pyspark==3.0.1

Contributor:

does it need to be a hard pin or will this work with >= ?
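For illustration, a relaxed constraint could look like the line below (a sketch, not what was merged; the upper bound is an assumption to stay on the tested 3.0.x line):

pyspark>=3.0.1,<3.1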

Member:

I thought we had pyspark deps in setup.py?

@@ -425,3 +431,96 @@ def is_array_type(self, delta_type: Any) -> bool:

def is_map_type(self, delta_type: Any) -> bool:
    return isinstance(delta_type, MapType)

def create_table_watermarks(self, table: ScrapedTableMetadata) -> Union[List[Tuple[Optional[Watermark],

Contributor:

I think it doesn't really matter which watermark is high and which is low, so this method could just return Optional[List[Watermark]], which would be more readable and simpler. Wdyt?
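A sketch of the simpler signature being suggested (illustrative only; the docstring and elided body are assumptions, and Watermark/ScrapedTableMetadata are assumed to be imported in the surrounding module):

from typing import List, Optional

def create_table_watermarks(self, table: ScrapedTableMetadata) -> Optional[List[Watermark]]:
    """Return the low and high watermarks for the table's partition column(s),
    or None if there is no usable partition column."""
    ...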

@@ -206,6 +207,11 @@ def _get_extract_iter(self) -> Iterator[Union[TableMetadata, TableLastUpdated, N
                continue
            else:
                yield self.create_table_metadata(scraped_table)
                watermarks = self.create_table_watermarks(scraped_table)
                if watermarks:

Contributor:

Regarding the comment on line 435, I would just do:

for watermark in watermarks:
    yield watermark
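(Equivalently, Python's yield from watermarks collapses that loop into a single statement.)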

@mgorsk1 (Contributor) commented Feb 1, 2022

Going through this code, I realized this is actually just a SparkCatalogExtractor, not particularly a DeltaExtractor. Wdyt about generalizing it, @harmw @feng-tao @samshuster?

@feng-tao (Member) commented Feb 1, 2022

@mgorsk1 I don't think so, given the extractor is for Delta only, which requires a Databricks cluster to execute. I think we could have a separate extractor for standalone Spark, but let's not change the scope of this PR.

@mgorsk1 (Contributor) commented Feb 1, 2022

I didn't mean it to be in the scope of this PR; there just isn't much code here that makes it Delta-specific, only iterating over sparkCatalog databases and tables, which might as well serve an alternative Hive metastore extraction.

@samshuster (Contributor) commented Feb 1, 2022 via email

@harmw (Contributor, author) commented Feb 9, 2022

Hm, cool, interesting 🤔 I simply picked up where I left off over a year ago: get the watermarks in and make it (Amundsen) more valuable for how we're using it in our shop 😂

Dropping the changes in this PR in favour of something new (extracting-without-spark) sounds pretty reasonable, but I personally don't have time to do that in a timely fashion :(

Would it make sense to get these changes in (after resolving the code review notes) and work on the break-out in a separate PR? And if yes, would that necessitate an RFC of some sort?

@samshuster (Contributor) commented Feb 9, 2022 via email

@feng-tao (Member) commented:

Agree, we shouldn't block this PR; instead we should just file a GitHub issue for the enhancement.

btw, could you fix the lint:

flake8 .
./tests/unit/extractor/test_deltalake_extractor.py:62: [E501] line too long (125 > 120 characters)
./tests/unit/extractor/test_deltalake_extractor.py:366: [E501] line too long (128 > 120 characters)
./tests/unit/extractor/test_deltalake_extractor.py:392: [E501] line too long (127 > 120 characters)
./databuilder/extractor/delta_lake_metadata_extractor.py:435: [C901] 'DeltaLakeMetadataExtractor.create_table_watermarks' is too complex (12)
make: *** [lint] Error 1
Makefile:11: recipe for target 'lint' failed

I will take a look

* Wrap this extraction in a try/except

There are scenarios where a dataset exists but is empty; in this case .first() will fail.

* Revert "Simplicity"

This reverts commit 06b9fc3. Working with this as part of job.launch() brings errors, where the original code would bring the desired result.

Signed-off-by: Harm Weites <[email protected]>
@feng-tao (Member) commented:

flake8 .
mypy .
databuilder/extractor/delta_lake_metadata_extractor.py:213: error: Value of type "Watermark" is not indexable
databuilder/extractor/delta_lake_metadata_extractor.py:214: error: Value of type "Watermark" is not indexable
databuilder/extractor/delta_lake_metadata_extractor.py:503: error: Argument 2 to "filter" has incompatible type "Optional[List[ScrapedColumnMetadata]]"; expected "Iterable[ScrapedColumnMetadata]"
databuilder/extractor/delta_lake_metadata_extractor.py:530: error: Incompatible return value type (got "List[Tuple[Watermark, Watermark]]", expected "Optional[List[Watermark]]")
tests/unit/extractor/test_deltalake_extractor.py:369: error: Argument 1 to "create_table_watermarks" of "DeltaLakeMetadataExtractor" has incompatible type "Optional[ScrapedTableMetadata]"; expected "ScrapedTableMetadata"
tests/unit/extractor/test_deltalake_extractor.py:370: error: Argument 1 to "len" has incompatible type "Optional[List[Watermark]]"; expected "Sized"
tests/unit/extractor/test_deltalake_extractor.py:371: error: Value of type "Optional[List[Watermark]]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:371: error: Argument 1 to "len" has incompatible type "Union[Watermark, Any]"; expected "Sized"
tests/unit/extractor/test_deltalake_extractor.py:372: error: Value of type "Optional[List[Watermark]]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:372: error: Value of type "Union[Watermark, Any]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:396: error: Argument 1 to "create_table_watermarks" of "DeltaLakeMetadataExtractor" has incompatible type "Optional[ScrapedTableMetadata]"; expected "ScrapedTableMetadata"
tests/unit/extractor/test_deltalake_extractor.py:397: error: Argument 1 to "len" has incompatible type "Optional[List[Watermark]]"; expected "Sized"
tests/unit/extractor/test_deltalake_extractor.py:398: error: Value of type "Optional[List[Watermark]]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:398: error: Argument 1 to "len" has incompatible type "Union[Watermark, Any]"; expected "Sized"
tests/unit/extractor/test_deltalake_extractor.py:399: error: Value of type "Optional[List[Watermark]]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:399: error: Value of type "Union[Watermark, Any]" is not indexable
tests/unit/extractor/test_deltalake_extractor.py:439: error: Argument 1 to "create_table_watermarks" of "DeltaLakeMetadataExtractor" has incompatible type "Optional[ScrapedTableMetadata]"; expected "ScrapedTableMetadata"
Found 17 errors in 2 files (checked 361 source files)
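For illustration, the return-type error on line 530 goes away once the tuple pairs are flattened into a single list that matches Optional[List[Watermark]] (a sketch inside create_table_watermarks; pairs, low and high are hypothetical names):

watermarks: List[Watermark] = []
for low, high in pairs:  # pairs: hypothetical list of (low, high) Watermark tuples
    watermarks.extend([low, high])
return watermarks or None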

@harmw (Contributor, author) commented Mar 1, 2022

@feng-tao should be all good now, finally 😅

First-time contributors need a maintainer to approve running workflows.

Is this new?

@feng-tao (Member) commented Mar 2, 2022

@harmw , almost there :) :

 from collections import namedtuple
 from datetime import datetime
 from typing import (  # noqa: F401
-    Any, Dict, Iterator, List, Optional, Union, Tuple,
+    Any, Dict, Iterator, List, Optional, Tuple, Union,
 )
 
 from pyhocon import ConfigFactory, ConfigTree  # noqa: F401
Skipped 2 files
make: *** [isort_check] Error 1

@harmw (Contributor, author) commented Mar 2, 2022

Completely ignored the contribution guidelines, my bad - should be good now, the steps in the Makefile succeeded. A future addition could be make all support, but whatever, all good now 🙂

@feng-tao (Member) commented Mar 8, 2022

Triggered the CI again, will merge once it is green.

@feng-tao merged commit f9c0eeb into amundsen-io:main Mar 9, 2022
@boring-cyborg bot commented Mar 9, 2022

Awesome work, congrats on your first merged pull request!

@harmw deleted the more-delta-lake branch March 10, 2022 08:22
ozandogrultan pushed a commit to deliveryhero/amundsen that referenced this pull request Apr 28, 2022
* Install pyspark for dev work

So now we can run pytest on a fresh clone.

Due to the rather old version this will throw some DeprecationWarning
messages, but we can upgrade to 3.1 at a later stage.

Signed-off-by: Harm Weites <[email protected]>

* Read watermarks for Delta tables

Signed-off-by: Harm Weites <[email protected]>

* Include tests

Signed-off-by: Harm Weites <[email protected]>

* More proper watermark yielding

Signed-off-by: Harm Weites <[email protected]>

* Select the partition_column

Going with the first item of the returned list will return the same
column, which is not deterministic at all (given there are multiple
partitions).

Signed-off-by: Harm Weites <[email protected]>

* Cut the line length

Signed-off-by: Harm Weites <[email protected]>

* Only process partitions of a workable type

Since watermarking strings doesn't make much sense, keep to checking
integer/float/date/datetime types.

Signed-off-by: Harm Weites <[email protected]>

* Updated tests

Signed-off-by: Harm Weites <[email protected]>

* Oops, the .first() returns a Row object

Signed-off-by: Harm Weites <[email protected]>

* Wrap this extraction in a try/except

There are scenarios where a dataset exists, but is empty. In this case
.first() will fail.

Signed-off-by: Harm Weites <[email protected]>

* Flake8 fixes

Signed-off-by: Harm Weites <[email protected]>

* Simplicity

Signed-off-by: Harm Weites <[email protected]>

* Revert "Simplicity"

This reverts commit 06b9fc3.

Working with this as part of job.launch() brings errors, where the
original code would bring the desired result.

Signed-off-by: Harm Weites <[email protected]>

* Simplicity in return typing

Signed-off-by: Harm Weites <[email protected]>

* There is no complexity here :jedi_hand_wave:

Signed-off-by: Harm Weites <[email protected]>

* Pass the mypy

Signed-off-by: Harm Weites <[email protected]>

* Fix the return type here, finally

Signed-off-by: Harm Weites <[email protected]>

* Fix import sorting order

Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Ozan Dogrultan <[email protected]>
zacr pushed a commit to SaltIO/amundsen that referenced this pull request May 3, 2022
zacr pushed a commit to SaltIO/amundsen that referenced this pull request May 13, 2022
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this pull request Jun 30, 2022
Labels: area:databuilder
Projects: None yet
Participants: 4