Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-28] Proposal for Geocoding #8544

Closed
Sascha-Gschwind opened this issue Nov 12, 2019 · 14 comments
Closed

[SIP-28] Proposal for Geocoding #8544

Sascha-Gschwind opened this issue Nov 12, 2019 · 14 comments
Assignees
Labels
enhancement:request Enhancement request submitted by anyone from the community inactive Inactive for >= 30 days sip Superset Improvement Proposal

Comments

@Sascha-Gschwind
Copy link

Sascha-Gschwind commented Nov 12, 2019

[SIP] Proposal for Geocoding

Motivation

A Superset user wants to use address-based data to generate charts like deck.gl Scatterplot. They then first need to convert their address based data using an external source so it has the required latitude/longitude columns.

Many other BI tools can convert addresses automatically.

Proposed Change

We want to implement a feature using the GeoPy package with the Mapbox Geocoding API that can convert addresses to latitude/longitude and save those values as additional columns or overwrite certain columns in the same table.

To make the API calls we plan to use the same API-Key that is already used for the background maps (Mapbox API Key).

The feature will be available under the menu "Sources" > "Geocode Addresses" and will be implemented asynchronously. Only one geocoding can be in progress at once though. There are multiple reasons for this decision:

  • Most geocoding API's limit the amount of requests per second
  • Most geocoding API's limit the amount of requests that can be made over a certain time period
  • Depending on the amount of data in the table the process can take a very long time

If the geocoding is in process and the user navigates to the "Geocode Addresses" URL he will see a progressbar and will have the ability to cancel the process. If no process is ongoing the geocoding form will be shown.

The user can decide what happens if anything goes wrong (for ex. call limit reached, connection issues, etc.) or the process is interrupted. He can choose to save the already converted data or discard it.

New or Changed Public Interfaces

  • There will be a new form for the Geocoding in React
    image
  • There will be a new REST API that geocodes the address based data on a specific table and adds or overwrites columns
  • There will be a new REST API that informs the caller if a geocoding is already progress (boolean, and an integer representing the progress (%))
  • There will be a new REST API with wich a user can interrupt a geocoding progress
  • There will be a new REST API where you can get a list of columns for a selected table

New dependencies

  • We do not need a new dependency, because Superset is already using GeoPy

Migration Plan and Compatibility

The documentation will likely need to be added which describes the usage of this new feature once this is merged into master

Rejected Alternatives

  • We thought about on-the-fly geocoding and accepting address-data in certain charts like the deck.gl Scatterplot but rejected the idea since geocoding itself is an expensive operation and should only be done once on a specific dataset.
@issue-label-bot issue-label-bot bot added the enhancement:request Enhancement request submitted by anyone from the community label Nov 12, 2019
@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the label #enhancement to this issue, with a confidence of 0.92. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@willbarrett
Copy link
Member

Hi @Sascha-Gschwind - thanks for submitting the SIP. Do you plan on implementing the actual geocoding workflow as a Celery job?

@mistercrunch
Copy link
Member

https://www.mapbox.com/legal/tos/#[GAGA]
Screen Shot 2019-11-12 at 10 11 41 AM

Are you assuming doing atomic updates against the database or processing a block of data in-memory?

I'm not sure whether data munging / enrichment is in scope for Superset. We spoke about a geo service in the past, but enrichment wasn't in scope.

@Sascha-Gschwind
Copy link
Author

Thanks @willbarrett and @mistercrunch for your feedback! We have to admit we did not check the mapbox ToS. We are now looking into other geocoding services that we could use where we do not infringe any ToS. One possible geocoding service would be Maptiler: https://www.maptiler.com/blog/2019/09/announcing-maptiler-cloud-geocoding-api.html. Our bachelor thesis examinator has a very good contact to the people in charge of Maptiler. Maptiler Geocoding is not yet implemented in GeoPy though, but could definetely be achieved meaning in a first step there would be an implementation without GeoPy which later can be changed.

Regarding implementing the actual geocoding workflow as a Celery job: Yes it is planned that the geocoding workflow is implemented this way in the final version. First we want to create a proof of concept and since we are not very familiar with Celery yet we planned to implement that later in the process.

@willbarrett
Copy link
Member

Hi @Sascha-Gschwind. Thanks for the response. It sounds to me like this isn't a great fit for inclusion in Superset at this point. As Max brought up, data enrichment isn't part of the scope of this project. You may want to check out one of our related projects, Apache Airflow. Airflow is designed for building data pipelines, and you could implement what you need for data enrichment on top of that platform without needing to extend the core. After enriching your data, Superset can be used to visualize and explore the data. Were I tasked with a similar problem in my day to day work, that's how I would implement it. I hope this recommendation helps!

@Sascha-Gschwind
Copy link
Author

@willbarrett Thanks for clarifying! We really thought this is a useful feature for Superset since many other BI tools support this kind of functionality, but can totally see where you guys are coming from since there seems to be an already well established workflow using Apache tools!
Thanks again for the feedback!

@sfkeller
Copy link

sfkeller commented Nov 13, 2019

@willbarrett: (I'm the advisor of Sascha-Geschwind and his two student colleagis).

I respect the principle to keep a narrow scope of this nice BI tool. But I'm unsure why you insist calling Geocoding an "enrichment", since "Geocoding is the computational process of transforming a physical address description to a location on the Earth's surface (spatial representation in numerical coordinates)." Without geocoding you miss many opportunities of location intelligence to visualize spatial data.

Obviously, behind an adress Geocoding API there's a data set of building addresses. But involving other data is also the case for many other geocoding systems, like plus code (see https://en.wikipedia.org/wiki/Geocode ). And the result of this "transformation" is being stored just for computational reasons - very much like aggregations are also often materialized.

Are there other reasons of this kind of geocoding which bother you?

@suddjian
Copy link
Member

Data Enrichment is the process of supplementing incomplete raw data with additional data from an authoritative source. Geocoding is a classic example. Data enrichment isn't something that Superset currently does at all, so it's a big proposal and should be thought about in the context of other possible data enrichment adds.

I am generally of the opinion that whatever format people's data is in, Superset should make an attempt to "just work". While the changes proposed above are probably not a great fit, there might be another approach that would fit better. Maybe adding a Geocode option to the deck.gl charts could work? This wouldn't include the optimizations around things like requests per second or caching, but it would be a simpler implementation, and would stay within the MapBox TOS.

@willbarrett
Copy link
Member

I agree that adding some form of geocoding to the front-end would be more acceptable than a back-end solution. I concur with @suddjian's definition of data enrichment, and concur that geocoding a list of addresses in a database is a classic example of data enrichment. It is also an activity that is against the terms of service of the majority of map vendors, or those vendors provide rules for the allowed duration of retention. I have a few concerns:

  1. Creating a back-end system for data enrichment is a large increase in scope for Superset.
  2. There are other projects in the Apache ecosystem (primarily one from the founder of this project) that are specifically designed for this type of task, and are frequently used in conjunction with Superset.
  3. There are a wide variety of data vendors that provide geocoding services, and I do not want us to closely integrate this open source project with one specific vendor.

@thunter009
Copy link
Contributor

Geocoding using GeoPy and the public Nominatim (OpenStreetMaps geocoding service) should work without the TOS limitations of Mapbox or other commercial providers.

That said, I agree with @willbarrett that it seems outside of Superset's core features and scope. I've prepared a PR which adds a geocoder component to some of the Deck.gl chart types but it is not exactly what you're looking for. It simply takes in user text input and geocodes that input to the current map but does not persist this data anywhere. The existing PR also needs to be refactored to the new plugins approach before it can be merged in.

You can find that here

@Sascha-Gschwind Sascha-Gschwind changed the title [SIP] Proposal for Geocoding [SIP-28] Proposal for Geocoding Nov 20, 2019
@stale
Copy link

stale bot commented Jan 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Jan 19, 2020
@stale stale bot closed this as completed Jan 26, 2020
@villebro
Copy link
Member

@Sascha-Gschwind adding support for this in the regular viz UI flow is still some ways off, but geocoding functionality was recently added to the backend as a chart data request post processing operation: #9661 . Currently only gehoash decoding/encoding and parsing of geodesic point strings is supported, but we could potentially add geocoding of addresses to the flow once this feature matures.

@rusackas rusackas added the sip Superset Improvement Proposal label Jun 13, 2022
@rusackas rusackas moved this to INACTIVE DISCUSSION in SIPs (Superset Improvement Proposals) Dec 8, 2022
@rusackas
Copy link
Member

rusackas commented Dec 8, 2022

@Sascha-Gschwind is there any intention to carry this proposal forward in discussion/voting or implementation? If not, shall we close it out?

@Sascha-Gschwind
Copy link
Author

@rusackas We don't plan to carry on with the proposal since we all moved on to different companies etc. where we are working. You can definitely close it out.

@rusackas rusackas moved this from INACTIVE DISCUSSION to DENIED / CLOSED in SIPs (Superset Improvement Proposals) Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement:request Enhancement request submitted by anyone from the community inactive Inactive for >= 30 days sip Superset Improvement Proposal
Projects
Status: Denied / Closed / Discarded
Development

No branches or pull requests

8 participants