Serialize finite data columns #189

Closed
4 tasks done
gregpawin opened this issue Jan 30, 2021 · 13 comments
Labels: complexity: large (assumes a high level of knowledge about how to approach and what tools to use), feature: database, missing: feature, role: data scientist, role: dev, size: 2pt (can be done in 7-12 hours)

Comments

gregpawin commented Jan 30, 2021

Overview

The amount of data transferred to the frontend is currently very inefficient. Beyond changes on the frontend itself, there are several backend optimizations we can take advantage of, including: keying the categorical data, extracting the coordinates from the geometry, making SQL queries that return only the pertinent data, and extracting day of week/hour/etc. from the datetime.
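As an illustration of the last two points (querying only the pertinent columns and extracting plain coordinates from the geometry), here is a minimal psycopg2/PostGIS sketch. The table name, column names, and DATABASE_URL environment variable are placeholders, not the project's actual schema.

```python
import os

import psycopg2
from dotenv import load_dotenv

# Load the connection string from a .env file (placeholder variable name).
load_dotenv()
conn = psycopg2.connect(os.environ["DATABASE_URL"])

with conn, conn.cursor() as cur:
    # Ask PostGIS for plain lon/lat instead of shipping the whole geometry,
    # and select only the columns the frontend actually needs.
    cur.execute(
        """
        SELECT make,
               fine_amount,
               ST_X(geom) AS longitude,
               ST_Y(geom) AS latitude
        FROM citations
        WHERE issue_datetime >= %s
        """,
        ("2021-01-01",),
    )
    rows = cur.fetchall()

conn.close()
```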

Action items

  • Analyze datasets to see which columns can be dropped or made finite by keeping only the most frequent values
  • Create data cleaning rules to reduce misspellings and group similar codes/values
  • Create a new dataset that caps each categorical column at a threshold number of possible values
  • Create key/value pairs for each categorical column and encode a new column using that mapping (see the sketch after this list)
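As a rough illustration of the keying step, here is a minimal pandas sketch that caps a categorical column at its most frequent values and encodes it as integers. The file path, the value of TOP_N, and the use of pandas are assumptions for illustration; the "make"/"make_ind"/"MISC." names follow the fields that appear later in this issue.

```python
import pandas as pd

# Hypothetical input path; assumes the raw citations data is available as CSV.
df = pd.read_csv("parking_citations_sample.csv")

TOP_N = 70  # keep the most frequent makes; everything else becomes "MISC."

top_makes = df["make"].value_counts().nlargest(TOP_N).index
df["make"] = df["make"].where(df["make"].isin(top_makes), "MISC.")

# Build the label -> index mapping and encode a new integer column, so only
# small integer codes (plus one shared lookup table) need to be transferred.
make_key = {label: i for i, label in enumerate(sorted(df["make"].unique()))}
df["make_ind"] = df["make"].map(make_key)
```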

Resources/Instructions

Confer with Greg

gregpawin commented Apr 5, 2021

Created a new GeoJSON dataset and uploaded it to the PostGIS database, with the top 70 car makes as keyed values and everything else as "MISC." Updated table references to point to the new dataset. Code is now needed to swap the values on the frontend. If all goes well, I will key all the other values, including the column names.

Notes: Used a 20% sample of the data. While trying to upload, I ran out of memory on my local computer. Had to create an EC2 instance, set up the environment, download and clean the data, make sure the PostGIS database security settings were open to receiving data from the new EC2 instance, and finally upload via psycopg2. There was a slight problem with the dotenv and psycopg2 dependencies.
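For reference, a minimal sketch of that upload path using python-dotenv and psycopg2; the table name, column list, .env variable, and file path are hypothetical placeholders, and a keyed GeoJSON sample like the one described above is assumed.

```python
import json
import os

import psycopg2
from dotenv import load_dotenv

# Credentials come from a .env file so they stay out of the script.
load_dotenv()
conn = psycopg2.connect(os.environ["DATABASE_URL"])

with conn, conn.cursor() as cur, open("citations_keyed_sample.geojson") as f:
    for feature in json.load(f)["features"]:
        props = feature["properties"]
        lon, lat = feature["geometry"]["coordinates"]
        cur.execute(
            """
            INSERT INTO citations_keyed (make_ind, fine_amount, geom)
            VALUES (%s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
            """,
            (props["make_ind"], props["fine_amount"], lon, lat),
        )

conn.close()
```

For a full-size upload, psycopg2.extras.execute_values or a COPY would be considerably faster than row-by-row inserts.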

gregpawin commented

Matthew was able to integrate serialization of car makes. I will serialize the other data columns.

gregpawin commented

Update: will serialize body type next. Need to find out whether the day of week can be pulled out of the date/time format in JS.
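If pulling the weekday out in JS turns out to be awkward, one option (consistent with the weekday field that shows up in the final data format later in this issue) is to precompute it during data prep. A minimal pandas sketch, assuming the same DataFrame as in the earlier sketch and a datetime column formatted like "2015/12/14 11:30:00":

```python
import pandas as pd

# Parse the text timestamps, then derive the day of week.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y/%m/%d %H:%M:%S")

# pandas convention: Monday = 0 ... Sunday = 6.
df["weekday"] = df["datetime"].dt.weekday
```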

MichelleRT commented

@gregpawin This issue has not had an update since June 7, 2021. If you are no longer working on this issue, please let us know.
We would appreciate any closing comments about why work on this issue stopped, or any other notes that never got added to the issue. If you are still working on the issue, please provide an update using these guidelines:

  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."

gregpawin commented

Progress: Serialized the car makes, and that change has been implemented. Will serialize the other categorical columns.
Blockers: Will need to have discussions with Dev team about implementation
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin commented

Progress: Added more code to introduce serialization. Will continue to work on it.
Blockers: Will need to have discussions with Dev team about implementation
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin commented

Progress: Analyzing other data columns to see which labels to keep.
Blockers: None.
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin commented

Progress: Made changes to make_serial_data.py. Removed location, violation_code.
Blockers: None.
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin commented

Progress: Did some data exploration; will keep state_plate, body_style, color, and fine_amount as raw text data. Will serialize violation codes and possibly remove the lat/lon columns.
Blockers: None.
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin added the role: data scientist, role: dev, complexity: large, size: 2pt, and feature: database labels and removed the missing: size, missing: complexity, missing: role, and missing: feature labels on Apr 26, 2022
gregpawin commented

Progress: Added regex rules to clean up violation codes. Will take the top 20-30 codes and serialize them (see the sketch after this update).
Blockers: Devs are working out how to convert all code to use the geometry column instead of the lat/lon columns.
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.
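For illustration, a sketch of the cleaning-plus-top-N step. The actual regex rules live in make_serial_data.py; the patterns below are hypothetical examples of that kind of normalization, and the same DataFrame as in the earlier sketches is assumed.

```python
import re

# Hypothetical normalization: uppercase, strip punctuation/whitespace,
# and drop leading zeros so near-duplicate codes collapse together.
def clean_violation_code(code: str) -> str:
    code = code.strip().upper()
    code = re.sub(r"[^A-Z0-9]", "", code)
    return re.sub(r"^0+(?=\d)", "", code)

df["violation_code"] = df["violation_code"].astype(str).map(clean_violation_code)

# Keep the ~30 most frequent cleaned codes, lump the rest together, then
# encode as integers (vio_code_ind matches the field in the data format below).
top_codes = df["violation_code"].value_counts().nlargest(30).index
df["violation_code"] = df["violation_code"].where(
    df["violation_code"].isin(top_codes), "MISC."
)
code_key = {c: i for i, c in enumerate(sorted(df["violation_code"].unique()))}
df["vio_code_ind"] = df["violation_code"].map(code_key)
```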

gregpawin changed the title from "Optimize data efficiency on backend" to "Serialize finite data columns" on Apr 26, 2022
gregpawin commented

Progress: Fixed bugs in the code that cleans violation codes (af016a1).
Blockers: Need to discuss with devs
Availability: I'll have a couple hours this week
ETA: Next month
Pictures (if necessary): None.

gregpawin commented

Fixed bugs in the data cleaning code and added more serial data columns: https://github.com/hackforla/lucky-parking/tree/e6a750db47fcbe8e61d0f47e40e8910785f48d17. Will hand it off to the devs to test.

gregpawin commented

New data format:

{ "type": "Feature", "properties": { "index": 1, "state_plate": "CA", "make": "Toyota", "body_style": "PA", "color": "SI", "location": "7610 WOODLEY", "violation_code": "4000A1", "violation_description": "MISC.", "fine_amount": 50, "datetime": "2015/12/14 11:30:00", "make_ind": 0, "vio_desc_ind": 59, "vio_code_ind": 24, "latitude": 34.208643535440721, "longitude": -118.48366825453003, "weekday": 0 }, "geometry": { "type": "Point", "coordinates": [ -118.483668254530031, 34.208643535440721 ] } }
