feat: Delta table partition watermarks #1694
Conversation
Congratulations on your first Pull Request and welcome to the Amundsen community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/amundsen-io/amundsen/blob/main/CONTRIBUTING.md)
So now we can run pytest on a fresh clone. Due to the rather old version this will throw some DeprecationWarning messages, but we can upgrade to 3.1 at a later stage. Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Going with the first item of the returned list will return the same column, which is not deterministic at all (given there are multiple partitions). Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Since watermarking strings doesn't make much sense, keep to checking integer/float/date/datetime types. Signed-off-by: Harm Weites <[email protected]>
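The type restriction described in that commit can be sketched in plain Python. This is an illustrative stand-in, not the extractor's actual code; the function and the set of type names are assumptions for the example:

```python
# Hedged sketch: decide whether a partition column's type is worth watermarking.
# Min/max watermarks only make sense for orderable types; strings are skipped.
WATERMARKABLE_TYPES = {"int", "bigint", "float", "double", "date", "timestamp"}

def is_watermarkable(column_type: str) -> bool:
    """Return True when min/max watermarks are meaningful for this type."""
    return column_type.lower() in WATERMARKABLE_TYPES

print(is_watermarkable("date"))    # True
print(is_watermarkable("string"))  # False
```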
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Force-pushed from 049e2a9 to 200e64b
Signed-off-by: Harm Weites <[email protected]>
@@ -13,3 +13,4 @@ pytest-cov>=2.12.0
pytest-env>=0.6.2
pytest-mock>=3.6.1
typed-ast>=1.4.3
pyspark==3.0.1
does it need to be a hard pin or will this work with >=?
I thought we had pyspark deps in setup.py?
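If the hard pin were relaxed as the reviewer suggests, a bounded range is a common middle ground. This is an illustrative requirements.txt fragment, not what the PR ultimately merged:

```text
# Looser alternative to pyspark==3.0.1: accept minor/patch updates
# but guard against a future major release.
pyspark>=3.0.1,<4.0.0
```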
@@ -425,3 +431,96 @@ def is_array_type(self, delta_type: Any) -> bool:

    def is_map_type(self, delta_type: Any) -> bool:
        return isinstance(delta_type, MapType)

    def create_table_watermarks(self, table: ScrapedTableMetadata) -> Union[List[Tuple[Optional[Watermark],
I think it doesn't really matter which watermark is high and which is low, so this method could just return Optional[List[Watermark]], which would be more readable and simpler. wdyt?
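The reviewer's suggested simplification could look like the sketch below. The Watermark dataclass and the function body are illustrative stand-ins for the databuilder models, not the actual PR code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Watermark:
    # Stand-in for databuilder's Watermark model.
    part_type: str  # "low_watermark" or "high_watermark"
    value: str

def create_table_watermarks(partition_values: List[str]) -> Optional[List[Watermark]]:
    """Return both watermarks in one flat list, or None when there is nothing to report."""
    if not partition_values:
        return None
    ordered = sorted(partition_values)
    return [
        Watermark("low_watermark", ordered[0]),
        Watermark("high_watermark", ordered[-1]),
    ]

print(create_table_watermarks([]))  # None
print(create_table_watermarks(["2022-01", "2021-01"]))
```

Callers can then simply test the return value for truthiness, which is what the extractor loop below does.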
@@ -206,6 +207,11 @@ def _get_extract_iter(self) -> Iterator[Union[TableMetadata, TableLastUpdated, N
        continue
    else:
        yield self.create_table_metadata(scraped_table)
        watermarks = self.create_table_watermarks(scraped_table)
        if watermarks:
in regards to comment on line 435 I would just do:

    for watermark in watermarks:
        yield watermark

by going through this code I realized that this is actually just
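For illustration, the explicit loop the reviewer suggests is equivalent to Python's `yield from` delegation; this generic sketch is mine, not code from the PR:

```python
def emit(watermarks):
    # Explicit loop, as suggested in the review:
    for watermark in watermarks:
        yield watermark

def emit_terse(watermarks):
    # Equivalent one-liner using generator delegation:
    yield from watermarks

print(list(emit([1, 2])))        # [1, 2]
print(list(emit_terse([1, 2])))  # [1, 2]
```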
@mgorsk1 I don't think so, given the extractor is Delta-only, which requires a Databricks cluster to execute. I think we could have a separate extractor for standalone Spark, but let's not change the scope of this PR.
I didn't mean it to be in scope for this PR; there is just not much code here that makes it Delta-specific, just iterating over sparkCatalog databases and tables, which might as well be used for alternative Hive metastore extraction.
Yeah, all good points. At the time that I wrote this extractor, I wasn't aware of the Python libraries that supported standalone Delta operations without Spark. Something like this: https://github.com/delta-io/delta-rs

I think ideally this extractor is rewritten to use that library, which would also enable getting rid of the Spark dependency and should greatly speed up this implementation.

Unfortunately, as you know, the Hive metastore extractor will not work with Delta tables because most of the Delta table metadata is not stored in the metastore but rather on cloud storage.
hm, cool, interesting 🤔 I simply picked up from where I left off over a year ago: get the watermarks in and have it (Amundsen) be more valuable to how we're using it in our shop 😂 Dropping the changes in this PR in favour of something new (extracting without Spark) sounds pretty reasonable, but I personally don't have time to do that in a timely fashion :( Would it make sense to get these changes in (after resolving the code review notes) and work on the break-out in a separate PR? And if yes, would that necessitate an RFC of some sort?
Oh yes, I definitely agree! I think that is a separate issue on its own and that shouldn't stop your PR.
Agree, we shouldn't block this PR; instead we should just file a GitHub issue for the enhancement. btw, could you fix the lint?
I will take a look
There are scenarios where a dataset exists, but is empty. In this case .first() will fail. Signed-off-by: Harm Weites <[email protected]>
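The guard described in that commit follows a common defensive pattern. This generic sketch is mine (plain Python; the Spark-specific `.first()` call is replaced with a stand-in), not the PR's code:

```python
from typing import Any, Iterable, Optional

def first_or_none(rows: Iterable[Any]) -> Optional[Any]:
    """Return the first row of a result set, or None when it is empty,
    instead of letting the lookup raise on an empty dataset."""
    try:
        return next(iter(rows))
    except StopIteration:
        return None

print(first_or_none([]))           # None
print(first_or_none([("2022",)]))  # ('2022',)
```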
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Force-pushed from 89f9fa0 to 06b9fc3
This reverts commit 06b9fc3. Working with this as part of job.launch() brings errors, where the original code would bring the desired result. Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
Force-pushed from 6b7adc3 to 7f2c191
Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Harm Weites <[email protected]>
@feng-tao should be all good now, finally 😅
Is this new?
@harmw, almost there :) :
Signed-off-by: Harm Weites <[email protected]>
Completely ignored the contribution guidelines, my bad - should be good now, the steps in the
trigger the CI button again, will merge once it is green
Awesome work, congrats on your first merged pull request!
* Install pyspark for dev work - So now we can run pytest on a fresh clone. Due to the rather old version this will throw some DeprecationWarning messages, but we can upgrade to 3.1 at a later stage.
* Read watermarks for Delta tables
* Include tests
* More proper watermark yielding
* Select the partition_column - Going with the first item of the returned list will return the same column, which is not deterministic at all (given there are multiple partitions).
* Cut the line length
* Only process partitions of a workable type - Since watermarking strings doesn't make much sense, keep to checking integer/float/date/datetime types.
* Updated tests
* Oops, the .first() returns a Row object
* Wrap this extraction in a try/except - There are scenarios where a dataset exists, but is empty. In this case .first() will fail.
* Flake8 fixes
* Simplicity
* Revert "Simplicity" - This reverts commit 06b9fc3. Working with this as part of job.launch() brings errors, where the original code would bring the desired result.
* Simplicity in return typing
* There is no complexity here :jedi_hand_wave:
* Pass the mypy
* Fix the return type here, finally
* Fix import sorting order

Signed-off-by: Harm Weites <[email protected]>
Signed-off-by: Ozan Dogrultan <[email protected]>
Summary of Changes
Introducing watermark detection on Delta tables. Since [1] I lost track of this one around the time of going monorepo; here it is once more 🙈
[1] amundsen-io/amundsendatabuilder#427
Tests
Testing for watermarks.
Documentation
Nothing.
Checklist
Make sure you have checked all steps below to ensure a timely review.