Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example 1: Disbursements missing from /schedule_b/by_recipient/ #3390

Closed
1 of 2 tasks
lbeaufort opened this issue Sep 19, 2018 · 12 comments
Closed
1 of 2 tasks

Example 1: Disbursements missing from /schedule_b/by_recipient/ #3390

lbeaufort opened this issue Sep 19, 2018 · 12 comments
Assignees
Labels
Milestone

Comments

@lbeaufort
Copy link
Member

lbeaufort commented Sep 19, 2018

Some disbursements are missing from /schedule_b/by_recipient/.

From an API user:

There are a few cases where the vendor names I’m searching with yield no results through the API but yield results through the manual search on the FEC website.

Example 1:
First, considering the vendor DRL Publications

The following request, which is trying to get a list of every committee that spent money on DRL Publications yields no results:
https://api.open.fec.gov/v1/schedules/schedule_b/by_recipient/?api_key=DEMO_KEY&cycle=2016&per_page=20&recipient_name=DRL%20PUBLICATIONS&page=1

However, the same search on /schedule_b/ and the front end returns one result:
https://api.open.fec.gov/v1/schedules/schedule_b/?api_key=DEMO_KEY&sort_hide_null=true&data_type=processed&recipient_name=drl+publications&two_year_transaction_period=2016&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016&sort=-disbursement_date&per_page=30

https://www.fec.gov/data/disbursements/?two_year_transaction_period=2016&data_type=processed&recipient_name=drl+publications&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016

Example 2 has a different cause: #3429

@lbeaufort lbeaufort added the Bug label Sep 19, 2018
@lbeaufort lbeaufort added this to the Sprint 7.3 milestone Sep 27, 2018
@lbeaufort
Copy link
Member Author

The logic preventing this from appearing seems to be in disclosure.dsc_sched_b_aggregate_recipient and I'm looking into getting that logic (It's currently on the Oracle side).

I'm guessing at least for example 1, we may be excluding memos. Example 2 is on line 29 - I want to see what other filters are on the aggregate table.

@lbeaufort
Copy link
Member Author

One possible solution @fecjjeng and I discussed is showing $0 totals for recipients that have only memoed transactions.

Example:
$100 Itemized:
Jean $50
Laura $50
Paul $50 X Memo

$100 By recipient aggregate:
Name, Amount, "Memo/Description"
Jean $50
Laura $50
Paul $0 Memo

logic for disclosure.dsc_sched_b_aggregate_recipient:

select cmte_id, rpt_yr + mod(rpt_yr, 2) as cycle,
              name as recipient_nm, sum(TRANSACTION_AMT) as total, count(TRANSACTION_AMT) as count
from DISCLOSURE.F_ITEM_RECEIPT_OR_EXP
where SCHED_TP_CD = 'SB' AND cmte_id = r1.cmte_id
  and (rpt_yr + mod(rpt_yr, 2)) = r1.cycle
  and TRANSACTION_AMT is not null and TRANSACTION_AMT <> 0
  and (memo_cd != 'X' or memo_cd is null)
  --TODO: only sum where not memoed
group by cmte_id, (rpt_yr + mod(rpt_yr, 2)), name);

@lbeaufort
Copy link
Member Author

lbeaufort commented Oct 9, 2018

After giving it more thought, a separate column for memo_total and memo_count makes the most sense here.

SQL:

SELECT
   cmte_id,
   cycle,
   recipient_nm,
   sum(non_memo_amount) as total,
   sum(non_memo_count) as count,
   sum(memo_amount) as memo_total,
   sum(memo_count) as memo_count
FROM (
   SELECT cmte_id, rpt_yr + mod(rpt_yr, 2) as cycle, name as recipient_nm,
   CASE WHEN memo_cd = 'X' or memo_cd is not null then TRANSACTION_AMT ELSE 0 END AS memo_amount,
   CASE WHEN memo_cd = 'X' or memo_cd is not null then 1 ELSE 0 END AS memo_count,
   CASE WHEN memo_cd !='X' or memo_cd is null then TRANSACTION_AMT ELSE 0 END AS non_memo_amount,
   CASE WHEN memo_cd !='X' or memo_cd is null then 1 ELSE 0 END AS non_memo_count
            from DISCLOSURE.F_ITEM_RECEIPT_OR_EXP
           where SCHED_TP_CD = 'SB'
                 --and (rpt_yr + mod(rpt_yr, 2)) = r1.cycle
                 and TRANSACTION_AMT is not null and TRANSACTION_AMT <> 0
                 and cmte_id='C00575795'
                 and name='DRL PUBLICATIONS'
)
group by cmte_id, cycle, recipient_nm;

Output looks like:

cycle cmte_id recipient_nm total count memo_total memo_count
2016 C00575795 DRL PUBLICATIONS 0 0 1138.55 1

@lbeaufort
Copy link
Member Author

For the second example, we're using filter_multi which is requiring an exact match. A python change to use filter_fulltext allows for partial text searches. I also added test coverage for this behavior.

@fecjjeng
Copy link
Contributor

fecjjeng commented Oct 10, 2018

The original logic is to only include the non-memo entries (i.e. memo_cd != ‘X’ or memo_cd is null).

The proposed business logic updates separated the entries into non-memo and memo items, with two additional columns of memo_total and memo_count.
Non-memo:
memo_cd !='X' or memo_cd is null

Memo:
memo_cd = 'X' or memo_cd is not null
I will suggest to change the second logic (for memo entries) to memo_cd = ‘X’, omit the memo_cd is not null part. Otherwise if memo_cd is a not NULL value, but != ‘X’, it will be count in both places (I know it is extremely rare, but I seen several rows like that).

@lbeaufort @PaulClark2 @jwchumley This change will involve updates in Oracle procedures, in Java script, and in Postgresql database table structure, re-load the data, and API, we would like to be sure before we proceed.

  1. We want to separate them into memo/non-memo and have extra columns with the business logic described above (with my minor modification).

If #1 holds true,
2. All the scheduled_b aggregate (sched_b_aggregate_recipient, schedule_b_aggregate_purpose, sched_b_aggregate_receipient_id) currently use the logic of only include the non-memo entries (i.e. memo_cd != ‘X’ or memo_cd is null). Do we want to follow the same pattern for all three of them?

If #2 holds true,
3. Should we implement all changes for these three tables in this ticket?
Or separate into 3 tickets, each one handle one table?
Or a separate ticket to do the API work and use the original ticket to do the API work?

@lbeaufort
Copy link
Member Author

lbeaufort commented Oct 10, 2018

Thanks so much for taking a look, @fecjjeng! I agree with your SQL edit. It also makes sense for me to keep all three Schedule B aggregates consistent. I have bandwidth to write SQL for the other two tables, get testing examples, and make the API and SQL migration changes necessary this sprint.

@fecjjeng
Copy link
Contributor

fecjjeng commented Oct 10, 2018

@lbeaufort

  • I can make the sql change, don't worry about it. I was just confirming that this is the business logic we want to follow.
  • Yes, some testing example would be helpful when you have time. Thank you.
  • I will open a ticket to do all the DB side works (including Oracle, Java, Postgresql) for the 3 SB aggregates tables.
  • Since the two problems mentioned in this ticket has different nature, do you think maybe we should split Example 1: Disbursements missing from /schedule_b/by_recipient/ #3390 into two tickets, one deal with the full_text search which has no data dependency and you already implement the API change; the other one that relying on this memo data. That way one of these two problems that does not need to wait for this database change can go in this Sprint. Thoughts?

@lbeaufort lbeaufort changed the title Disbursements missing from /schedule_b/by_recipient/ Example 1: Disbursements missing from /schedule_b/by_recipient/ Oct 10, 2018
@lbeaufort
Copy link
Member Author

@fecjjeng I agree that this issue should be split into examples 1 and 2 because they have entirely different causes.

I created example 2 here: #3429 and linked it to my PR: #3423

@fecjjeng
Copy link
Contributor

fecjjeng commented Oct 17, 2018

@lbeaufort
Updates on issue #3431:

Three "new" tables had been created in the DEV database:
disclosure.dsc_sched_b_aggregate_recipient_new
disclosure.dsc_sched_b_aggregate_recipient_id_new
disclosure.dsc_sched_b_aggregate_purpose_new

These tables had been loaded with data for the past three days. It can be used as template for the API work for issue #3390. Once API tasks had been worked out, a complete reload (which will take some time) will be executed to provide full sets of data.

These tables had renamed the following two columns:
total -> non_memo_total
count -> non_memo_count

And added two new columns
memo_total
memo_count

@fecjjeng
Copy link
Contributor

We plan to do dual daily updates on both new and current sched_b_agg_xxxxx tables after the initial loading until API in all environments start using the new tables. Then another issue will be used to drop the "old" tables.

@hcaofec
Copy link
Contributor

hcaofec commented Nov 2, 2018

API change is completed but can't be merged until the CMS work is done. Because the variable names returned by endpoint have been changed, the CMS needs to capture this change. Otherwise, the page will broke.

http://127.0.0.1:8000/data/committee/C00580100/?tab=spending
screen shot 2018-11-02 at 9 49 32 am

@fec-jli fec-jli changed the title Example 1: Disbursements missing from /schedule_b/by_recipient/ [Don't Merge] Example 1: Disbursements missing from /schedule_b/by_recipient/ Nov 2, 2018
@hcaofec hcaofec changed the title [Don't Merge] Example 1: Disbursements missing from /schedule_b/by_recipient/ Example 1: Disbursements missing from /schedule_b/by_recipient/ Nov 6, 2018
@JonellaCulmer
Copy link
Contributor

Issue has been merged, so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants