-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse information returned by list_relations_without_caching macro to speed up catalog generation #160
Parse information returned by list_relations_without_caching macro to speed up catalog generation #160
Conversation
@jtcohen6 I think this is ready for review 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@franloza This is incredible! Thank you so much for the contribution.
The changes worked wonderfully for me locally, and docs generate
was so much quicker than all those pesky describe extended
. I have one ask... if you wanted to take this a step further, and include stats as well... if not no worries! You've done a lot as it is :)
INFORMATION_COLUMNS_REGEX = re.compile( | ||
r"\|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE) | ||
INFORMATION_OWNER_REGEX = re.compile(r"^Owner: (.*)$", re.MULTILINE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're here... any chance I could convince you to regex for stats as well? :)
INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: (.*)$", re.MULTILINE)
From some (anecdotal) testing, delta tables include a line like Statistics: 1109049927 bytes
, whereas (e.g.) parquet tables include a line like Statistics: 1109049927 bytes, 14093476 rows
.
The SparkColumn
object takes a table_stats
argument; we'd just need to adjust convert_table_stats
to handle the delta table case, which is missing the rows
bit after the comma split.
Alternatively, if you wanted to do it all in regex, I guess you could:
INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: ([0-9]+) bytes(, ([0-9]+) rows)?', re.MULTILINE)
bytes, _, rows = re.findall(self.INFORMATION_OWNER_REGEX, relation.information)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtcohen6 This is a great idea! This is useful information and I think it's worth trying to parse it. I'll give it a try and add an extra test for parquet tables. I'll let you know when I have this implemented. (Switching back to WIP)
@jtcohen6 I'm getting an authentication error in CircleCI:
Would you mind to rerun the job after the problem has been solved? I don't have permission for that. |
@jtcohen6 Thanks for re-running!
I didn't have to modify this function to use it in the new parsing method. I don't know if I missed something but I included a couple of unit tests to verify that this function was working as expected. This PR is ready for a re-review 😃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this contribution @franloza! LGTM :)
This is great work @franloza, thank you! |
Resolves #93
Description
Use regular expressions to parse the information retrieved by
show table extended in [schema] like '*'
. In this way, it's no longer necessary to runshow tblproperties
anddescribe extended
on every single relation.Checklist
CHANGELOG.md
and added information about my change to the "dbt next" section.