Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index metadata for PostgreSQL #144

Merged
merged 2 commits into from
Mar 1, 2021

Conversation

peterthesling
Copy link
Contributor

@peterthesling peterthesling commented Feb 15, 2021

This PR adds the ability to scrape index metadata of a PostgreSQL instance, like this:
image

Structure of the code:

  • I add add_indexes to configure_bigquery_extractor, which adds index information to the table metadata.
  • add_indexes adds PostgresIndexExtractor as an additional extractor. This is done via a dict, meaning it will be easy to add a e.g. BigQueryIndexExtractor.
  • PostgresIndexExtractor inherits from IndexExtractor and expands on IndexExtractor by a SQL query that gets all the index information we need, e.g. a list of indexes incl. name & which columns are included in an index.
  • IndexExtractor itself contains mostly boilerplate code to initiate a connection and interact with the iterator and SQLAlchemyEngine.
  • Inside WhaleLoader PostgresIndexExtractor is run, since it was added to the list of extractors through add_indexes.
  • IndexMetadata is the index class, which stores all information contained inside an index.
  • Similar to Moving formatting of columns and table to corresponding classes and outside of WhaleLoader #143 I added a format_for_markdown function to IndexMetadata, where the index is transformed into a string like [unique] `index_name` [`column_1`, `column_2`]

What will be left for future PRs:

  • I think it would be helpful to also add information if an index is a primary key.
  • This PR only adds the ability for PostgreSQL to see indexes. All other SQL dialects should be worked on next.

@peterthesling peterthesling changed the title Add index meta data for Postgres Add index metadata for Postgres Feb 15, 2021
@peterthesling peterthesling changed the title Add index metadata for Postgres Add index metadata for PostgreSQL Feb 15, 2021
@@ -52,6 +55,25 @@ def add_ugc_runner(extractors: list, conf: ConfigTree, connection):
return extractors, conf


def add_indexes(extractors: list, conf: ConfigTree, connection):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. Thanks for following my pattern here - it is a bit weird though, isn't it? Probably would make sense to refactor at some point...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I am sure it could be improved, but I don't think it should block us from making progress. Let me see if I can come up with a refactor.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no didn't mean to block the review. Just got bogged down checking things before merging :)

@rsyi
Copy link
Owner

rsyi commented Feb 25, 2021

Okay @peterthesling one thing we should discuss - running this multiple times against a single index just appends those indices, which means that the index list just gets longer and longer with each pull.

My gut says the best way to get around this is to modify the extractor to return a IndexSet object instead, then process this all at once in the loader, overwriting the previous indices. Hackier options probably also exist.

Let me know what you think @peterthesling. :) I'm happy to take this on, btw, just let me know.

@codecov-io
Copy link

Codecov Report

Merging #144 (1c7b95a) into master (6760c72) will increase coverage by 22.00%.
The diff coverage is 98.46%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #144       +/-   ##
===========================================
+ Coverage   53.67%   75.68%   +22.00%     
===========================================
  Files          49       38       -11     
  Lines        2541     2044      -497     
===========================================
+ Hits         1364     1547      +183     
+ Misses       1177      497      -680     
Impacted Files Coverage Δ
pipelines/whale/utils/parsers.py 94.80% <66.66%> (-1.15%) ⬇️
...elines/whale/extractor/postgres_index_extractor.py 95.65% <95.65%> (ø)
pipelines/whale/extractor/base_index_extractor.py 100.00% <100.00%> (ø)
pipelines/whale/loader/whale_loader.py 64.00% <100.00%> (+4.38%) ⬆️
pipelines/whale/models/index_metadata.py 100.00% <100.00%> (ø)
pipelines/whale/utils/extractor_wrappers.py 75.53% <100.00%> (+2.10%) ⬆️
pipelines/whale/utils/markdown_delimiters.py 100.00% <100.00%> (ø)
cli/src/filesystem.rs
cli/src/skimmer.rs
cli/src/warehouse/buildscript.rs
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2c3da7...1c7b95a. Read the comment docs.

if not sections[INDEX_SECTION]:
sections[INDEX_SECTION] = INDEX_DELIMITER + "\n"

sections[INDEX_SECTION] = INDEX_DELIMITER + "\n"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this almost works, except if there are multiple indices, only the last one shows up :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Will take another look today.

@peterthesling peterthesling force-pushed the pthesling/index-info-132 branch 2 times, most recently from aec05cb to d7c84e4 Compare February 28, 2021 23:34
@peterthesling peterthesling force-pushed the pthesling/index-info-132 branch from d7c84e4 to ed41557 Compare February 28, 2021 23:49
@rsyi
Copy link
Owner

rsyi commented Mar 1, 2021

Love it! Thanks, Peter. :) Merging now!

@rsyi rsyi merged commit ce95bba into rsyi:master Mar 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants