Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zipcode search - limit match to left(5) of parameter value. #2980

Closed
2 tasks
lbeaufort opened this issue Mar 5, 2018 · 18 comments
Closed
2 tasks

Fix zipcode search - limit match to left(5) of parameter value. #2980

lbeaufort opened this issue Mar 5, 2018 · 18 comments
Assignees
Milestone

Comments

@lbeaufort
Copy link
Member

lbeaufort commented Mar 5, 2018

See related cms issue: fecgov/fec-cms#1391

When a 9-digit zip is passed to the /schedules/schedule_a/ endpoint, we should truncate it to 5 characters. Not all filers report 9-digit zips, so searching by zip+4 returns unreliable data. In addition, currently searching by 5-digit zips is excluding records that have 9-digit zips. Also, we should add some automated tests to test_itemized to confirm this behavior.

  • schedule_a.py - match the inputted zip value to left 5 characters to the left(5) of contributor_zip column to in the ScheduleA model - do this in the resource with if kwargs.get('contributor_zip'):...
  • test_itemized.py - add some automated tests for zip+4 searches

Should all schedules work this way? only the ones that search by zip. Maybe next look at ScheduleAByZip - looks like it doesn't return results when you pass +4.
https://api.open.fec.gov/v1/schedules/schedule_a/by_zip/?api_key=DEMO_KEY&zip=200162409&page=1&per_page=20

Example

@lbeaufort lbeaufort added this to the Sprint 5.2 milestone Mar 5, 2018
@lbeaufort lbeaufort assigned lbeaufort and hcaofec and unassigned lbeaufort Mar 5, 2018
@PaulClark2
Copy link
Contributor

PaulClark2 commented Mar 5, 2018

For context:

The ZIP code data in the itemized transactions of scheudles is inconsistent. Some filers provide 5-digit ZIP codes and some filers provide 9-digit ZIP codes. Currently, if a user is searching for ZIP code 20009-2415, the app's behavior varies depending if the user searches for 20009 or for 20009-2415.

The underlying data might be 200092415 or 20009. We do not standardize ZIP codes to make them all ZIP+4. When a user enters a 5-digit ZIP code, we do not return any transactions with 9-digit ZIP codes even if first 5-digits match. If a user enters a 9-digit ZIP code, we do not return any transactions with 5-digit ZIP codes even if first 5-digits match. To make the results returned to users predictable, we plan to match left(5) of the requested ZIP code to left(5) of the ZIP code in the database.

@hcaofec
Copy link
Contributor

hcaofec commented Mar 6, 2018

My thoughts:

first, make sure any non numeric character is removed from zip argument ( by endpoint );

when user enter a 5-digit zip ( or 6, 7, 8 by error ), the endpoint should return all transactions with first 5-digit match no matter how many digits of the zip code they have in database ( could be 5 ,6,7,8, 9, or more in table );

when user enter a 9-digit zip, the endpoint should return all transactions with exact 9-digit match. The reason for this approach is in this scenario, the user actually is looking for contributors from a more narrow geographical area. If endpoint returns a more broad results, it does not serve what user wants here.

@lbeaufort @PaulClark2

@jwchumley
Copy link
Contributor

Is it simply a matter of always using "starts like" and requiring at least 5 characters?

22655 would get any zip that starts with 22655.
226551 would return any zip that starts 226551 etc.

@hcaofec
Copy link
Contributor

hcaofec commented Mar 6, 2018

This could be another approach to consider. So 22655 gets most result, 226551111 gets the least result. others would get number between the two.

The database probably has most zip code with either 5 or 9 digits since these two are the standard.

@LindsayYoung
Copy link
Contributor

We could make a new multi-starts-with filter or use the standard filters and have a 5 digit column for matching.

Also, we should be able to handle non-numeric zips because there are citizens and green card holders that can legally donate when they live in other countries.

@hcaofec
Copy link
Contributor

hcaofec commented Mar 9, 2018

I think we should make a new multi-starts-with filter, and in sqlalchemy, there is like() method

like method

@lbeaufort @LindsayYoung

@lbeaufort
Copy link
Member Author

lbeaufort commented Mar 9, 2018

@hcaofec - that won't fix the problem with the existing search not working properly. Searching for a 5-digit zip should return all zipcodes that start with those 5 digits. Currently, it's excluding 9-digit zips that match on the first 5, which is problematic.

Searching for 20463 should return:
Laura 20463
Laura 204630923

Instead, it's only returning:
Laura 20463

@PaulClark2
Copy link
Contributor

Let's start with matching on the first 5 digits (characters). Matching on the first 5 digits might return more results than a user is expecting but that's better then returning nothing because currently the search is doing an exact match.

@jwchumley
Copy link
Contributor

Right. I think there are a number of ways we can consider improving on that later but we really need to fix it now and that is the easiest approach.

@hcaofec
Copy link
Contributor

hcaofec commented Mar 9, 2018

column.like() method works similar as the like operator in sql, so when searching for 5-digit zip, the sql will be
where zip_code like '20463%',
which will return all zipcodes that start with those 5 digits, including 9-digit zips that match on the first 5

match() method serves the purpose here too.

@lbeaufort @LindsayYoung

@lbeaufort
Copy link
Member Author

lbeaufort commented Mar 9, 2018

@hcaofec that would solve our 5-digit search problem, but I'm not sure it's a good solution for our 9-digit search problem. It's my understanding that when a user searches "204630923" they're looking for all contributions from that area. But the data isn't consistent.

Data
1 | Laura | 999 E St | 20463
2 | Laura | 999 E St | 204630923

User searching for 204630923 would expect it to return:
1 | Laura | 999 E St | 204630923
2 | Laura | 999 E St | 204630923

..But because of inconsistent data, it only returns:
2 | Laura | 999 E St | 204630923

and is missing
1 | Laura | 999 E St | 20463
...even though that address is in that area

@PaulClark2
Copy link
Contributor

Agreed. Even if a user enters a 9-digit ZIP code we should return rows that match on the first 5 digits. Our data is too inconsistent and the results returned would be misleading. We can explain this in with a tool tip on the user interface.

@hcaofec
Copy link
Contributor

hcaofec commented Mar 9, 2018

when a user enter 9-digit, the query would return rows that match on the first 5 digits using sql like this:
where zip like 20463% --- which will return two rows in Laura's example

How about foreign zip code like Canada? If a user enter ON N0R 1A0, should query return where zip like ON N0%?

@PaulClark2 @jwchumley @lbeaufort

@jwchumley
Copy link
Contributor

For now I would say don't worry about anything that doesn't conform. We can take another pass at this later. I think best to make it always like 20163%

@AmyKort
Copy link

AmyKort commented Mar 19, 2018

Should we merge fecgov/fec-cms#1391 without this? Or wait?

@hcaofec
Copy link
Contributor

hcaofec commented Mar 19, 2018

Thanks @AmyKort for linking this to the new PR. Since @patphongs marked this as a blocker for fecgov/fec-cms#1391, it's probably better to wait till this is merged.

@AmyKort
Copy link

AmyKort commented Mar 19, 2018

Thanks @hcaofec !

@PaulClark2 PaulClark2 modified the milestones: Sprint 5.2, Sprint 5.3 Mar 22, 2018
@hcaofec
Copy link
Contributor

hcaofec commented Mar 22, 2018

PR has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants