Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

franloza · 2021-04-13T21:33:56Z

Resolves #93

Description

Use regular expressions to parse the information retrieved by show table extended in [schema] like '*'. In this way, it's no longer necessary to run show tblproperties and describe extended on every single relation.

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

franloza · 2021-04-14T16:20:58Z

@jtcohen6 I think this is ready for review 👍

jtcohen6

@franloza This is incredible! Thank you so much for the contribution.

The changes worked wonderfully for me locally, and docs generate was so much quicker than all those pesky describe extended. I have one ask... if you wanted to take this a step further, and include stats as well... if not no worries! You've done a lot as it is :)

jtcohen6 · 2021-04-14T22:22:41Z

dbt/adapters/spark/impl.py

+    INFORMATION_COLUMNS_REGEX = re.compile(
+        r"\|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)
+    INFORMATION_OWNER_REGEX = re.compile(r"^Owner: (.*)$", re.MULTILINE)


While we're here... any chance I could convince you to regex for stats as well? :)

INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: (.*)$", re.MULTILINE)

From some (anecdotal) testing, delta tables include a line like Statistics: 1109049927 bytes, whereas (e.g.) parquet tables include a line like Statistics: 1109049927 bytes, 14093476 rows.

The SparkColumn object takes a table_stats argument; we'd just need to adjust convert_table_stats to handle the delta table case, which is missing the rows bit after the comma split.

Alternatively, if you wanted to do it all in regex, I guess you could:

INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: ([0-9]+) bytes(, ([0-9]+) rows)?', re.MULTILINE) bytes, _, rows = re.findall(self.INFORMATION_OWNER_REGEX, relation.information)

@jtcohen6 This is a great idea! This is useful information and I think it's worth trying to parse it. I'll give it a try and add an extra test for parquet tables. I'll let you know when I have this implemented. (Switching back to WIP)

franloza · 2021-04-15T18:11:00Z

@jtcohen6 I'm getting an authentication error in CircleCI:

Build-agent version 1.0.55761-e64d5704 (2021-04-13T14:06:42+0000)
System information:
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Kernel Version: 4.15.0-1092-aws
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64

Starting container fishtownanalytics/test-container:10
Warning: No authentication provided, using CircleCI credentials for pulls from Docker Hub.
  image cache not found on this host, downloading fishtownanalytics/test-container:10

Error response from daemon: unauthorized: authentication required

Would you mind to rerun the job after the problem has been solved? I don't have permission for that.

jtcohen6 · 2021-04-15T20:06:39Z

@franloza Just re-ran! I think DockerHub was down momentarily, causing the error above. Now looks like a linting error.

franloza · 2021-04-15T21:34:35Z

@jtcohen6 Thanks for re-running!

we'd just need to adjust convert_table_stats to handle the delta table case, which is missing the rows bit after the comma split.

I didn't have to modify this function to use it in the new parsing method. I don't know if I missed something but I included a couple of unit tests to verify that this function was working as expected.

This PR is ready for a re-review 😃

jtcohen6

Thanks so much for this contribution @franloza! LGTM :)

mmeasic · 2021-05-04T20:01:25Z

This is great work @franloza, thank you!

Parse information returned by show table extended

07e4c55

cla-bot bot added the cla:yes label Apr 13, 2021

Fran Lozano added 7 commits April 13, 2021 23:42

Fix linter errors

d98588d

Add logic when relation is None

fe50b05

Revert previous commit and fix bug

2e307f6

Rename method and add unit test

3b54482

Fix bug in column_index

e1bf654

Add test with view

64870e9

Update CHANGELOG.md

2c6a5d8

franloza changed the title ~~WIP: Parse information returned by show table extended to speed up catalog generation~~ Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 14, 2021

jtcohen6 reviewed Apr 14, 2021

View reviewed changes

franloza changed the title ~~Parse information returned by list_relations_without_caching macro to speed up catalog generation~~ WIP: Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 15, 2021

Parse statistics

7612fcb

Fix inter errors

36367e6

franloza changed the title ~~WIP: Parse information returned by list_relations_without_caching macro to speed up catalog generation~~ Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 15, 2021

franloza requested a review from jtcohen6 April 15, 2021 21:34

jtcohen6 approved these changes Apr 17, 2021

View reviewed changes

jtcohen6 merged commit 27409c4 into dbt-labs:master Apr 17, 2021

franloza deleted the feature/93-catalog-generation branch April 19, 2021 06:39

jtcohen6 mentioned this pull request Feb 24, 2022

[CT-279] Delta tables missing column schema in show table extended #295

Closed

jtcohen6 mentioned this pull request Mar 18, 2022

[CT-385] Limit catalog generation to specific relations #300

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

franloza commented Apr 13, 2021 •

edited

Loading

franloza commented Apr 14, 2021

jtcohen6 left a comment

jtcohen6 Apr 14, 2021 •

edited

Loading

franloza Apr 15, 2021

franloza commented Apr 15, 2021

jtcohen6 commented Apr 15, 2021

franloza commented Apr 15, 2021

jtcohen6 left a comment

mmeasic commented May 4, 2021

Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

Conversation

franloza commented Apr 13, 2021 • edited Loading

Description

Checklist

franloza commented Apr 14, 2021

jtcohen6 left a comment

Choose a reason for hiding this comment

jtcohen6 Apr 14, 2021 • edited Loading

Choose a reason for hiding this comment

franloza Apr 15, 2021

Choose a reason for hiding this comment

franloza commented Apr 15, 2021

jtcohen6 commented Apr 15, 2021

franloza commented Apr 15, 2021

jtcohen6 left a comment

Choose a reason for hiding this comment

mmeasic commented May 4, 2021

franloza commented Apr 13, 2021 •

edited

Loading

jtcohen6 Apr 14, 2021 •

edited

Loading