⚙️ Address Data Divergence and Time Series Analysis #41

Open
colelloa opened this issue Sep 22, 2023 · 0 comments
Project Background:

In this project, you will work with two versions of address data, each containing addresses from the same source but with slight differences in coordinates. These variations simulate data updates or corrections that occur over time. Your goal is to develop metrics to quantify these changes and create a time series step in the pipeline to visualize how the two datasets diverge over time.

The data for this project is accessible here. Some descriptions of the files:

  • ms_hinds_location: A CSV file from Hinds County containing a list of location addresses in the county. These addresses include their un-enhanced coordinates, which typically fall outside of buildings.

  • ms_hinds_location_v2: Another version of the above file with the same schema but not necessarily the same data content.

  • ms_hinds_buildings: A GeoJSON file containing all the building footprints in Hinds County.

  • ms_hinds_parcels: A GeoJSON file containing information about all the parcels available in Hinds County. This data is optional but may be used to enhance your analysis. It can be downloaded

Input Data:

You will work with the following input data:

  • LOCATIONS (CSV): This CSV file contains location addresses in Hinds County drawn from three sources: U.S. county-collected parcels, U.S. secondary addresses, and Points of Interest (POI). Each location's raw address has been verified and standardized using the Lob platform. The results are provided in the following columns:

    • lob_addr1: The primary address line.
    • lob_lat: The latitude coordinate.
    • lob_lon: The longitude coordinate.
    • lob_zipcode: The ZIP code obtained from address verification.
  • Additional Columns: Columns starting with "f_" contain the most current, verified data and are the ones you should focus on for your analysis. Additionally:

    • lob_addr1 holds the full address (the primary address line plus any secondary unit information). It is split into the f_addr1 and f_unit columns.
    • f_ziploc is a ZIP code or location identifier. By default it is inherited from lob_zipcode and truncated to 5 digits.
    • When Lob fails to return a ZIP code for an address (i.e., lob_zipcode is NULL), f_ziploc takes its value from the raw source ZIP code (also limited to 5 digits).
    • If no raw ZIP code is available, f_ziploc contains Uber's H3 hexID, which is 15 characters long and computed from the location's raw coordinates (latitude and longitude).
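Because f_ziploc can hold either a 5-digit ZIP code or a 15-character H3 hexID, it may help to tag each value before grouping or joining on it. The following is a minimal sketch, not part of the provided data tooling: the function name classify_ziploc is hypothetical, and it assumes the H3 hexID is stored as a hexadecimal string.

```python
import re

# Assumption: ZIP values are exactly 5 ASCII digits, as described above.
ZIP_PATTERN = re.compile(r"[0-9]{5}")
# Assumption: the H3 hexID is a 15-character hexadecimal string.
H3_PATTERN = re.compile(r"[0-9a-fA-F]{15}")


def classify_ziploc(value):
    """Return 'zip', 'h3', or 'unknown' for a single f_ziploc value.

    classify_ziploc is a hypothetical helper name used for illustration.
    """
    if value is None:
        return "unknown"
    text = str(value).strip()
    if ZIP_PATTERN.fullmatch(text):
        return "zip"
    if H3_PATTERN.fullmatch(text):
        return "h3"
    return "unknown"


if __name__ == "__main__":
    for sample in ["39201", "8844c0a305fffff", "", None]:
        print(repr(sample), "->", classify_ziploc(sample))
```

When loading the CSV, reading f_ziploc as text rather than as a number keeps the two kinds of value distinguishable.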

Tasks:

  1. Metric Development:

    • Devise metrics that quantify the differences between the two address versions. Consider metrics such as:
      • Geospatial Drift: Calculate the average distance or displacement between corresponding addresses in the two versions.
      • Accuracy Improvement: Measure the percentage of addresses whose coordinates are more accurate in the newer version (for example, points that now fall inside a building footprint).
      • Temporal Trends: Explore how coordinates change over time by calculating the rate of change and identifying addresses that consistently change or remain stable.
      • Coverage Change: Analyze how the coverage of building footprints has changed over time in each version.
  2. Time Series Step:

    • Create a time series step within the data pipeline to demonstrate how the two address datasets diverge over time. This involves:
      • Processing both versions of the data.
      • Applying the metrics you developed at each time point.
      • Generating visualizations or reports that illustrate the changing geospatial patterns.
  3. Automation and Reporting:

    • Ensure that the pipeline automates the entire process of comparing and visualizing the address versions. Set up the pipeline to run at regular intervals or whenever new data versions are available.
  4. Deliverables:

    • Your entire codebase must be in Python, and your database must use PostgreSQL. You may choose to containerize your application, but it is not required. We must be able to run your application given a README and some requirements.
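
As a starting point for the Geospatial Drift metric from task 1, the sketch below computes per-address displacement with the haversine formula and summarizes it for one pair of versions. It is an illustration under stated assumptions, not a required design: the names haversine_m and drift_summary are hypothetical, each version is assumed to be already loaded into a dict mapping an address key (for example, a tuple of f_addr1, f_unit, and f_ziploc) to a (latitude, longitude) pair, and the 1-meter "moved" threshold is arbitrary.

```python
import math
from statistics import mean, median

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drift_summary(v1, v2, moved_threshold_m=1.0):
    """Summarize coordinate drift between two address versions.

    v1, v2: dicts mapping a hypothetical address key to (lat, lon).
    moved_threshold_m: assumed cutoff for counting an address as moved.
    """
    shared = v1.keys() & v2.keys()
    distances = [haversine_m(*v1[key], *v2[key]) for key in shared]
    moved = sum(1 for d in distances if d > moved_threshold_m)
    return {
        "matched": len(distances),
        "only_in_v1": len(v1.keys() - v2.keys()),
        "only_in_v2": len(v2.keys() - v1.keys()),
        "mean_drift_m": mean(distances) if distances else 0.0,
        "median_drift_m": median(distances) if distances else 0.0,
        "share_moved": moved / len(distances) if distances else 0.0,
    }


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    v1 = {("100 MAIN ST", "", "39201"): (32.2990, -90.1850),
          ("200 OAK AVE", "", "39202"): (32.3100, -90.1700)}
    v2 = {("100 MAIN ST", "", "39201"): (32.2991, -90.1850),
          ("300 ELM RD", "", "39203"): (32.3200, -90.1600)}
    print(drift_summary(v1, v2))
```

For the time series step, one option is to store each summary in a PostgreSQL table keyed by the run timestamp, so that later runs can plot how the metrics evolve as new versions arrive.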

Note:

This project will not only test your geospatial data engineering skills but also your ability to develop meaningful metrics and create a dynamic time series analysis component within a data pipeline. You should aim to showcase your problem-solving, analytical, and automation capabilities throughout the project.

Feel free to reach out if you have any questions or need further clarification on the project requirements. Good luck!
