Discuss Data endpoint strategy and data model plan #3165

Closed
vrajmohan opened this issue May 29, 2018 · 6 comments

@vrajmohan

@wjiangFEC, @rjayasekera, @fecjjeng and I had a meeting this morning (05-29-2018) to discuss:

  1. How we ended up with the proliferation of materialized views and indexes
  2. Strategy for the data model going forward
  3. How we should distribute work between direct SQL and SQLAlchemy

I didn't have answers for these, as some of the decisions were made before my time and I do not understand the considerations that led to the current state. We should have a meeting with @ccostino to kick off this discussion.

@ccostino commented Jun 4, 2018

Filling in a bit of detail and history here as best I can!

The proliferation of materialized views came about for three reasons: not having a good window into the structure of the data to begin with (in other words, not knowing which pieces serve as the source of truth for which things), the data model not fully leveraging the power and value of a relational database, and performance. Each of these contributed to the need to find a solution for crafting the API endpoints that we have, which provide the data in a usable and understandable manner for end-users and combine disparate sources of information that belong together.

Digging into these three facets a bit deeper: over the course of the project it has been hard to get information about the data itself and what truly belongs where and with what. Schedule E is a great example of this, because the data comes from at least three different source tables. We had to revisit the implementation several times before settling on something that munged data from multiple tables together in a way that was deemed accurate. It has been hard to get this information without a fully documented data model that clearly indicates what pieces of information live where and what they represent. Furthermore, the data model itself is more a digital representation of the physical world than a relational design, meaning the data is fairly denormalized. Things like committee information and disclosure data details live in multiple places instead of there being one source of truth (e.g., a single committee entity) that is joined with other pieces of information when necessary.
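To make that concrete (the table and column names below are hypothetical, not the real FECP schema), an endpoint like Schedule E effectively has to stitch records together from several source tables before they look like one dataset:

```sql
-- Hypothetical sketch only: combining Schedule E records that live in
-- several source tables; none of these names match the real FECP schema.
CREATE OR REPLACE VIEW sched_e_combined AS
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'processed'  AS data_source
  FROM sched_e_processed
UNION ALL
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'form_24'    AS data_source
  FROM form_24_raw
UNION ALL
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'form_f57'   AS data_source
  FROM form_f57_raw;
```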

Speaking of joins, these can be expensive operations in a query, especially one run many times per minute over a large volume of data, even with indexes. Since we also want the API to be performant, materialized views seemed to be the optimal route. As we learned more about the underlying data and how denormalized it can be in some cases, we also learned which pieces need to go together. Materialized views allow us to craft those relationships without disrupting the underlying data model (which is used by many of the data ingestion and import systems, e.g. Golden Gate), while also treating them as if they were separate tables, which means we can do things like put indexes on them. This is also why we have a nightly refresh process: the bulk of what it does is go through and update the data in each of the materialized views so that we're always displaying the most current information.
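As a rough sketch of that pattern (the view, table, and column names are hypothetical, not the actual openFEC schema), a materialized view can join denormalized pieces together, carry its own indexes, and be rebuilt by the nightly refresh:

```sql
-- Hypothetical sketch of the materialized-view pattern described above;
-- the names are illustrative and do not match the real openFEC schema.
CREATE MATERIALIZED VIEW committee_totals_mv AS
SELECT c.committee_id,
       c.committee_name,
       SUM(d.disbursement_amount) AS total_disbursements
  FROM committees c
  JOIN disbursements d ON d.committee_id = c.committee_id
 GROUP BY c.committee_id, c.committee_name;

-- Because the view behaves like a table, it can carry its own indexes
-- to keep API queries fast.
CREATE UNIQUE INDEX idx_committee_totals_mv ON committee_totals_mv (committee_id);

-- The nightly refresh re-runs the defining query; CONCURRENTLY requires a
-- unique index but avoids blocking readers while the data is rebuilt.
REFRESH MATERIALIZED VIEW CONCURRENTLY committee_totals_mv;
```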

I hope this answers the question and provides a bit more context as to why things are the way they are. If anything is still unclear, please ask! If I can't answer it, we'll try and find someone who can. :-)

@wjiangFEC

Thank you, Carlo, for filling in some history regarding the current MV structures. The data in the FEC Disclosure database is not easy to understand. The table design has been patched over time to accommodate how the forms were designed in the first place, how the data was collected and processed, timing, etc.

I think the goal of this ticket is to kick off discussion and review among FEC team members (developers and DBAs). Through that process, team members will come to understand more about the data and database structures, which will help with future development as well as consolidation of some of the current MVs.

@wjiangFEC

We had a great meeting yesterday to discuss the current MV situation. @vrajmohan made the following excellent suggestions:

  1. Distribute work between direct SQL and SQLAlchemy.
  2. Review MVs gradually, attacking some low-hanging fruit first (do the consolidation and modification for the easy ones first).
  3. Look into unit testing data with pgTAP. Learn pgTAP functionality; it might come in handy in some cases.
  4. Look into PostgreSQL aggregate functionality for some data roll-up needs (a small sketch covering items 3 and 4 follows below).

I will create some new tickets based on @vrajmohan's suggestions and close this ticket soon.
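As a rough illustration of items 3 and 4 (the receipts table, candidate ID, and expected total below are made up, not actual FEC data or existing tests), a pgTAP script can assert that a roll-up computed with a PostgreSQL aggregate returns the expected totals:

```sql
-- Hypothetical sketch only; none of these names or values come from the
-- real FEC schema or test suite.
BEGIN;
SELECT plan(2);

-- Sanity check that the (hypothetical) base table we roll up from exists.
SELECT has_table('receipts', 'receipts base table exists');

-- Item 4: a roll-up computed with a PostgreSQL aggregate,
-- asserted here with pgTAP (item 3).
SELECT results_eq(
    $$ SELECT candidate_id::text, SUM(receipt_amount)::numeric
         FROM receipts
        WHERE election_cycle = 2018
          AND candidate_id = 'C00000001'
        GROUP BY candidate_id $$,
    $$ VALUES ('C00000001'::text, 12345.00::numeric) $$,
    'Receipts roll up to the expected cycle total'
);

SELECT * FROM finish();
ROLLBACK;
```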

@wjiangFEC

Paul has opened some tickets to address incorrect data (#2068, #2069, #2070, #2071, #2072, #2073, #2074). These tickets will lead to some query redesign to use base FECP tables, and will result in the replacement of some of these MVs. So I will not open additional tickets for items 2 and 4 in @vrajmohan's suggestions above.

@wjiangFEC

A new ticket (#3202) has been opened for item 3 in @vrajmohan's suggestions above.

I am closing out this Data endpoint strategy and data model plan discussion ticket.

@wjiangFEC

All the items that came out of the discussion have been addressed.
