Project Background:
In this project, you will work with two versions of address data, each containing addresses from the same source but with slight differences in coordinates. These variations simulate data updates or corrections that occur over time. Your goal is to develop metrics to quantify these changes and create a time series step in the pipeline to visualize how the two datasets diverge over time.
The data for this project is accessible here. Some descriptions of the files:
ms_hinds_location: A CSV file from Hinds County containing a list of location addresses in the county. These addresses include their un-enhanced coordinates, which typically fall outside of buildings.
ms_hinds_location_v2: Another version of the above file with the same schema but not necessarily the same data content.
ms_hinds_buildings: A GeoJSON file containing all the building footprints in Hinds County.
ms_hinds_parcels: A GeoJSON file containing information about all the parcels available in Hinds County. This data is optional but may be used to enhance your analysis. It can be downloaded
Input Data:
You will work with the following input data:
LOCATIONS (CSV): This CSV file contains location addresses in Hinds County drawn from three sources: U.S. county-collected parcels, U.S. secondary addresses, and Points of Interest (POI). Each location's raw address has been verified and standardized using the Lob platform. The results are provided in the following columns:
lob_addr1: The primary address line.
lob_lat: The latitude coordinate.
lob_lon: The longitude coordinate.
lob_zipcode: The ZIP code obtained from address verification.
Additional Columns: Columns starting with "f_" contain the most current, verified data and are the ones you should focus on for your analysis. Additionally:
lob_addr1 actually contains full addresses, including the primary address line and secondary unit information. It's further divided into f_addr1 and f_unit columns.
f_ziploc is a zipcode or location identifier inherited from lob_zipcode by default, truncated to a length of 5 digits.
When Lob fails to return zipcodes for an address (i.e., lob_zipcode is NULL), f_ziploc takes values from raw source zipcodes (also limited to 5 digits).
If raw zipcodes are not available, f_ziploc contains Uber's H3 hexID, which is 15 characters long and computed based on the location's raw coordinates (latitude and longitude).
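Because f_ziploc can hold either a 5-digit ZIP code or a 15-character H3 hexID, it helps to tell the two apart before grouping or joining on it. The following is a minimal sketch, not part of the brief: the function name `classify_ziploc` is hypothetical, and it assumes that H3 IDs appear as lowercase or uppercase hexadecimal strings.

```python
import re

# Assumption: ZIP-based values are exactly 5 digits, as described above.
ZIP_RE = re.compile(r"\d{5}")
# Assumption: H3 hexIDs are 15 hexadecimal characters.
H3_RE = re.compile(r"[0-9a-f]{15}")


def classify_ziploc(value):
    """Return 'zip', 'h3', or 'unknown' for a single f_ziploc value."""
    if value is None:
        return "unknown"
    text = str(value).strip().lower()
    if ZIP_RE.fullmatch(text):
        return "zip"
    if H3_RE.fullmatch(text):
        return "h3"
    return "unknown"
```

Counting the three categories per version is one cheap way to see whether address verification coverage differs between the two files.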
Tasks:
Metric Development:
Your task is to devise innovative metrics that can effectively quantify the differences between the two address versions. Consider metrics such as:
Geospatial Drift: Calculate the average distance or displacement between corresponding addresses in the two versions.
Accuracy Improvement: Measure the percentage of addresses that have improved in accuracy between the two versions.
Temporal Trends: Explore how coordinates change over time by calculating the rate of change and identifying addresses that consistently change or remain stable.
Coverage Change: Analyze how the coverage of building footprints has changed over time in each version.
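As an illustration of the Geospatial Drift idea, the sketch below computes the great-circle (haversine) distance between matching addresses in two versions. It is one possible approach under stated assumptions: each version has already been reduced to a dictionary mapping an address key to a `(lat, lon)` pair, and the choice of key (for example a combination of f_addr1, f_unit, and f_ziploc) is left to the caller. The names `haversine_m` and `drift_summary` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def drift_summary(v1, v2):
    """Summarise displacement between two versions.

    v1, v2: dict mapping an address key -> (lat, lon).
    """
    common = v1.keys() & v2.keys()
    distances = [haversine_m(*v1[k], *v2[k]) for k in common]
    return {
        "matched": len(common),
        "only_in_v1": len(v1.keys() - v2.keys()),
        "only_in_v2": len(v2.keys() - v1.keys()),
        "mean_drift_m": sum(distances) / len(distances) if distances else None,
        "max_drift_m": max(distances, default=None),
    }
```

Reporting the unmatched counts alongside the mean keeps additions and removals from being silently hidden by an average over matched rows only.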
Time Series Step:
Create a time series step within the data pipeline to demonstrate how the two address datasets diverge over time. This involves:
Processing both versions of the data.
Applying the metrics you developed at each time point.
Generating visualizations or reports that illustrate the changing geospatial patterns.
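The steps above can be sketched as a loop that applies one metric to each version in order and records the change from the previous version. The example uses the share of points that fall inside a building footprint as the metric. It is a simplified sketch under several assumptions: footprints are given as exterior rings of `(lon, lat)` pairs (the GeoJSON coordinate order), holes and multi-part geometries are ignored, and all function names are hypothetical. With PostgreSQL, a spatial extension such as PostGIS (for example its `ST_Contains` function) would likely be a better fit for data of this size than the brute-force test shown here.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon ring?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def inside_building_rate(points, rings):
    """Fraction of (lon, lat) points that fall inside any ring."""
    if not points:
        return None
    hits = sum(
        1 for lon, lat in points
        if any(point_in_ring(lon, lat, ring) for ring in rings)
    )
    return hits / len(points)


def build_time_series(snapshots, rings):
    """snapshots: list of (label, [(lon, lat), ...]) ordered by time."""
    rows = []
    previous = None
    for label, points in snapshots:
        rate = inside_building_rate(points, rings)
        change = None if previous is None or rate is None else rate - previous
        rows.append({"version": label, "inside_rate": rate, "change": change})
        previous = rate
    return rows
```

Each row can then be written to a table keyed by version, which gives a plotting or reporting step a stable input as further versions arrive.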
Automation and Reporting:
Ensure that the pipeline automates the entire process of comparing and visualizing the address versions. Set up the pipeline to run at regular intervals or whenever new data versions are available.
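One way to trigger the pipeline "whenever new data versions are available" is to compare content hashes of the input files against a small state file. The sketch below uses only the standard library. The function names, the JSON state file, and the `ms_hinds_location*.csv` glob pattern are assumptions for illustration; a scheduler such as cron could call it at regular intervals.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large CSVs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_new_versions(data_dir, state_file, pattern="ms_hinds_location*.csv"):
    """Return files that are new or changed since the last call.

    The state file (JSON: file name -> digest) is updated as a side effect.
    """
    state_path = Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}
    new_files = []
    for path in sorted(Path(data_dir).glob(pattern)):
        digest = file_digest(path)
        if seen.get(path.name) != digest:
            new_files.append(path)
            seen[path.name] = digest
    state_path.write_text(json.dumps(seen, indent=2))
    return new_files
```

Hashing content rather than relying on modification times means a re-uploaded but identical file does not cause a redundant run.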
Deliverables:
Your entire codebase must be in Python, and your database must use PostgreSQL. You may choose to containerize your application, but it is not required. We must be able to run your application given a README and some requirements.
Note:
This project will not only test your geospatial data engineering skills but also your ability to develop meaningful metrics and create a dynamic time series analysis component within a data pipeline. You should aim to showcase your problem-solving, analytical, and automation capabilities throughout the project.
Feel free to reach out if you have any questions or need further clarification on the project requirements. Good luck!