Discuss Data endpoint strategy and data model plan #3165

Closed
vrajmohan opened this issue May 29, 2018 · 6 comments

@vrajmohan

@wjiangFEC, @rjayasekera, @fecjjeng and I had a meeting this morning (05-29-2018) to discuss:

  1. How we ended up with the proliferation of materialized views and indexes
  2. Strategy for the data model going forward
  3. How we should distribute work between direct SQL and SQLAlchemy

I didn't have answers for these, as some of the decisions were made before my time and I do not understand the considerations that led to the current state. We should have a meeting with @ccostino to kick off this discussion.

@ccostino commented Jun 4, 2018

Filling in a bit of detail and history here as best I can!

The proliferation of materialized views came about for three reasons: not having a good window into the structure of the data to begin with (in other words, not knowing which pieces serve as the source of truth for which things), the data model not fully leveraging the power and value of a relational database, and performance. Each of these contributed to the need to find a solution for crafting the API endpoints that we have, which provide the data in a usable and understandable manner for end-users and combine disparate sources of information that belong together.

Digging into these three facets a bit deeper: over the course of the project it has been hard to get information about the data itself and what truly belongs where and with what. Schedule E is a great example of this, because the data comes from at least three different source tables. We had to revisit the implementation several times before settling on something that munged data from multiple tables together in a way that was deemed accurate. It has been hard to get this information without a fully documented data model that clearly indicates what pieces of information live where and what they represent. Furthermore, the data model itself is more a digital representation of the physical world than a relational design, meaning the data is fairly denormalized. Things like committee information and disclosure data details live in multiple places instead of there being one source of truth (e.g., a single committee entity) that is joined with other pieces of information when necessary.
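To make that concrete (the table and column names below are hypothetical, not the real FECP schema), an endpoint like Schedule E effectively has to stitch records together from several source tables before they look like one dataset:

```sql
-- Hypothetical sketch only: combining Schedule E records that live in
-- several source tables; none of these names match the real FECP schema.
CREATE OR REPLACE VIEW sched_e_combined AS
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'processed'  AS data_source
  FROM sched_e_processed
UNION ALL
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'form_24'    AS data_source
  FROM form_24_raw
UNION ALL
SELECT filing_id, committee_id, expenditure_amount, expenditure_date,
       'form_f57'   AS data_source
  FROM form_f57_raw;
```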

Speaking of joins, these can be expensive operations in a query, especially one run many times per minute over a large volume of data, even with indexes. Since we also want the API to be performant, materialized views seemed to be the optimal route. As we learned more about the underlying data and how denormalized it can be in some cases, we also learned which pieces need to go together. Materialized views allow us to craft those relationships without disrupting the underlying data model (which is used by many of the data ingestion and import systems, e.g. Golden Gate), while also treating them as if they were separate tables, which means we can do things like put indexes on them. This is also why we have a nightly refresh process: the bulk of what it does is go through and update the data in each of the materialized views so that we're always displaying the most current information.
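As a rough sketch of that pattern (the view, table, and column names are hypothetical, not the actual openFEC schema), a materialized view can join denormalized pieces together, carry its own indexes, and be rebuilt by the nightly refresh:

```sql
-- Hypothetical sketch of the materialized-view pattern described above;
-- the names are illustrative and do not match the real openFEC schema.
CREATE MATERIALIZED VIEW committee_totals_mv AS
SELECT c.committee_id,
       c.committee_name,
       SUM(d.disbursement_amount) AS total_disbursements
  FROM committees c
  JOIN disbursements d ON d.committee_id = c.committee_id
 GROUP BY c.committee_id, c.committee_name;

-- Because the view behaves like a table, it can carry its own indexes
-- to keep API queries fast.
CREATE UNIQUE INDEX idx_committee_totals_mv ON committee_totals_mv (committee_id);

-- The nightly refresh re-runs the defining query; CONCURRENTLY requires a
-- unique index but avoids blocking readers while the data is rebuilt.
REFRESH MATERIALIZED VIEW CONCURRENTLY committee_totals_mv;
```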

I hope this answers the question and provides a bit more context as to why things are the way they are. If anything is still unclear, please ask! If I can't answer it, we'll try and find someone who can. :-)

@wjiangFEC

Thank you, Carlo, for filling in some history regarding the current MV structures. The data in the FEC Disclosure database is not easy to understand. The table design has been patched over time to accommodate how the forms were designed in the first place, how the data was collected and processed, timing, etc.

I think the goal of this ticket is to kick off discussion and review among FEC team members (developers and DBAs). Through that process, team members will come to understand more about the data and database structures, which will help with future development as well as consolidation of some of the current MVs.

@wjiangFEC

We had a great meeting yesterday to discuss the current MV situation. @vrajmohan made the following excellent suggestions:

  1. Distribute work between direct SQL and SQLAlchemy.
  2. Review MVs gradually, attacking some low-hanging fruit first (do the consolidation and modification for the easy ones first).
  3. Look into unit testing data with pgTAP. Learn pgTAP functionality; it might come in handy in some cases.
  4. Look into PostgreSQL aggregate functionality for some data roll-up needs (a small sketch covering items 3 and 4 follows below).

I will create some new tickets based on @vrajmohan's suggestions and close this ticket soon.
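As a rough illustration of items 3 and 4 (the receipts table, candidate ID, and expected total below are made up, not actual FEC data or existing tests), a pgTAP script can assert that a roll-up computed with a PostgreSQL aggregate returns the expected totals:

```sql
-- Hypothetical sketch only; none of these names or values come from the
-- real FEC schema or test suite.
BEGIN;
SELECT plan(2);

-- Sanity check that the (hypothetical) base table we roll up from exists.
SELECT has_table('receipts', 'receipts base table exists');

-- Item 4: a roll-up computed with a PostgreSQL aggregate,
-- asserted here with pgTAP (item 3).
SELECT results_eq(
    $$ SELECT candidate_id::text, SUM(receipt_amount)::numeric
         FROM receipts
        WHERE election_cycle = 2018
          AND candidate_id = 'C00000001'
        GROUP BY candidate_id $$,
    $$ VALUES ('C00000001'::text, 12345.00::numeric) $$,
    'Receipts roll up to the expected cycle total'
);

SELECT * FROM finish();
ROLLBACK;
```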

@wjiangFEC

Paul has opened some tickets to address incorrect data (#2068, #2069, #2070, #2071, #2072, #2073, #2074). These tickets will lead to some query redesign to use base FECP tables, and will result in the replacement of some of these MVs. So I will not open additional tickets for items 2 and 4 in @vrajmohan's suggestions above.

@wjiangFEC

A new ticket (#3202) has been opened for item 3 in @vrajmohan's suggestions above.

I am closing out this Data endpoint strategy and data model plan discussion ticket.

@wjiangFEC

All the items that came out of the discussion have been addressed.
