
feat: use geopandas for GEOGRAPHY columns if geopandas is installed #792

Closed · tswast opened this issue Jul 21, 2021 · 18 comments · Fixed by #848
Labels: api: bigquery · semver: major · type: feature request

Comments

tswast commented Jul 21, 2021

This would technically be a breaking change, but it might make sense to do while we are changing default dtypes in #786 for https://issuetracker.google.com/144712110

If the GeoPandas library is installed (meaning, GeoPandas should be considered an optional "extra"), it may make sense to use the extension dtypes provided by GeoPandas by default on GEOGRAPHY columns.

tswast commented Jul 21, 2021

Since GeoSeries uses the extension type (https://jorisvandenbossche.github.io/blog/2019/08/13/geopandas-extension-array-refactor/), I believe it can be used in a normal pandas DataFrame (ignoring the GeoDataFrame class).

tswast commented Jul 26, 2021

This should be easier once #786 is in, as that pull request adds logic to introspect the BigQuery schema to determine which dtype to use and to perform any conversion steps on the dataframe (e.g. time zone updates) before returning it.

tswast commented Jul 26, 2021

Oh, and I did some more reading regarding GeoDataFrame vs GeoSeries. Using GeoDataFrame requires selecting one specific column to be the "geometry" column. I don't really think it makes sense for us to add arguments just for that. Instead, we can create a regular DataFrame with any GEOGRAPHY columns converted to GeoSeries. The developer should then be able to construct a GeoDataFrame from that if they desire.

jimfulton commented:
+1

jimfulton commented Jul 30, 2021

> Oh, and I did some more reading regarding GeoDataFrame vs GeoSeries. Using GeoDataFrame requires selecting one specific column to be the "geometry" column. I don't really think it makes sense for us to add arguments just for that. Instead, we can create a regular DataFrame with any GEOGRAPHY columns converted to GeoSeries. The developer should then be able to construct a GeoDataFrame from that if they desire.

Actually, not so much. If you try to put a GeoSeries in a regular Pandas data frame, it gets converted to a Pandas series.

GeoDataFrames go through a bunch of hijinks to give the illusion of storing a GeoSeries.
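
For example, a quick check along these lines (a hypothetical snippet, assuming a recent geopandas with the extension-array refactor) shows the column coming back as a plain pandas Series:

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

gs = geopandas.GeoSeries([Point(0, 0), Point(1, 1)])
df = pd.DataFrame({"name": ["a", "b"], "geog": gs})

# The values and the "geometry" dtype survive, but the column is handed back
# as a plain pandas.Series, so the GeoSeries methods are no longer available.
print(type(df["geog"]))  # <class 'pandas.core.series.Series'>
print(df["geog"].dtype)  # geometry
```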

jimfulton commented:
Digging a little deeper:

  • A GeoSeries is a wrapper for a Series containing shapely objects (of possibly various kinds) and provides lots of geographic analytical capabilities. A GeoSeries also carries along a projection (CRS).
  • A GeoDataFrame lets you store a GeoSeries along with a bunch of other attributes. It provides analytic functions that effectively delegate to the series. (I'm sure I'm simplifying somewhat, but not a lot.)
    • GeoDataFrames intrinsically only allow one series. The model is that a row in a table models a feature consisting of a geography and some other attributes.
  • If we don't want to deal with picking a single geography column, we could still provide some value by converting GEOGRAPHY values to shapely objects.
  • I imagine that it's very common for a table to have a single geography column, in which case, we could determine the geographic highlander ourselves.

I have a suggestion, but I'll ponder it a bit more. :)
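
As a concrete illustration of that single-geometry-column model (hypothetical column names; assumes geopandas and shapely are installed), a GeoDataFrame designates exactly one geometry column and treats any other shapely-valued columns as ordinary object columns:

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

df = pd.DataFrame(
    {
        "name": ["a", "b"],
        "boundary": [Point(0, 0).buffer(1), Point(2, 2).buffer(1)],
        "centroid": [Point(0, 0), Point(2, 2)],
    }
)

# Exactly one column backs the geographic analysis methods; "centroid" stays
# an ordinary object column of shapely values.
gdf = geopandas.GeoDataFrame(df, geometry="boundary", crs="EPSG:4326")
print(gdf.geometry.name)  # boundary
```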

jimfulton commented Aug 2, 2021

IMO the leveraging of geography types should be explicit. (The behavior should not change based on what's installed other than providing a clear error when what's requested isn't possible due to a lack of a dependency.)

Some options (not all mutually exclusive):

  1. Add a to_geodataframe query-job method.

    • This would have the same arguments as to_dataframe.
    • There must be at least one GEOGRAPHY column.
    • If there's more than one GEOGRAPHY column, then the geography_column option must be used to control which column is used to support the geographic analysis methods.
      • Other GEOGRAPHY columns are output as shapely objects and can be easily converted to GeoSeries.
    • Error if geopandas isn't installed.
  2. Add a geodataframe option to to_dataframe. If there is only one GEOGRAPHY column, then this can be True. If there is more than one, then this must be the name of the column to use, in which case the others are converted to shapely objects.

    • Error if geopandas isn't installed.
  3. Add a geography_as_object option to to_dataframe. If true, then geography objects are output as shapely objects.

    • Error if shapely isn't installed.
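
To make the trade-off concrete, here is a rough sketch of what options (1) and (3) might look like from the caller's side; the table and column names are made up, and the exact signatures are whatever the eventual implementation settles on:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT name, boundary, centroid FROM `my-project.my_dataset.regions`"
)

# Option (1): a dedicated method returning a geopandas.GeoDataFrame. With
# more than one GEOGRAPHY column, geography_column picks the one that backs
# the geometry; the others come back as shapely objects.
gdf = job.to_geodataframe(geography_column="boundary")

# Option (3): keep a plain pandas.DataFrame, but decode GEOGRAPHY values
# into shapely objects rather than WKT strings.
df = job.to_dataframe(geography_as_object=True)
```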

BTW, there should be some non-reference documentation on working with pandas and geopandas. It's elegant but non-obvious, IMO, that you get data into Pandas via job objects, which are elegant and strange in themselves. :)

jimfulton commented Aug 2, 2021

For loading data with load_table_from_dataframe, I propose:

  1. When using parquet format and loading a geography column, inspect the first value (if there are any):

    • If it's binary, use the parquet binary type. If the values are WKB, this will just work. I tested it. :)
    • If shapely is installed and it's a shapely object, convert the pandas column to WKB and use a pandas binary column.
  2. When using CSV and loading a geography column, inspect the first value (if there are any):

    • If shapely is installed and it's a shapely object, convert the pandas column to WKT.

I think this will make load_table_from_dataframe just work with GeoDataFrames.
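
A minimal sketch of that first-value inspection and conversion (assuming shapely is installed; the frame and column names here are made up for illustration):

```python
import pandas as pd
from shapely import wkb, wkt
from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry

df = pd.DataFrame(
    {"name": ["a", "b"], "geog": [Point(-122.4, 37.8), Point(-73.9, 40.7)]}
)

valid = df["geog"].dropna()
first = valid.iloc[0] if len(valid) else None

if isinstance(first, bytes):
    # Already WKB: the parquet binary type can carry it as-is.
    geog_for_parquet = df["geog"]
elif isinstance(first, BaseGeometry):
    # Shapely objects: serialize to WKB for parquet, or to WKT for CSV.
    geog_for_parquet = df["geog"].apply(wkb.dumps)
    geog_for_csv = df["geog"].apply(wkt.dumps)
```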

jimfulton commented:
Do we want to_geodataframe_iterable? It seems that to_dataframe_iterable is missing at least one to_dataframe feature, date_as_object=False.

jimfulton commented:
FTR: I've implemented the geography_as_object argument to to_dataframe, as well as to_geodataframe.

tswast commented Aug 3, 2021

Re: #792 (comment)

> IMO the leveraging of geography types should be explicit.

Good point. Explicit is better than implicit.

(1) to_geodataframe or (3) geography_as_object appeal to me, as they don't change the return type of to_dataframe. I suspect GeoPandas will optimize GeoDataFrames at some point, so I prefer (1) to_geodataframe.

Re: load_table_from_dataframe #792 (comment)

This all sounds good to me. I wonder how we will detect shapely objects. Do GeoDataFrames give us a nice dtype for these columns?

Re: to_geodataframe_iterable #792 (comment)

We might get this for "free" depending on how we implement pagination, but I'd prefer to leave this private until we get demand for it. I don't know how many of our users want smaller dataframes for streaming-style processing.

Let's file an issue for date_as_object=False. With our use of Arrow to concatenate into a larger dataframe, to_dataframe_iterable has a risk of getting out of sync with to_dataframe.

jimfulton commented:
> Re: load_table_from_dataframe #792 (comment)
>
> This all sounds good to me. I wonder how we will detect shapely objects. Do GeoDataFrames give us a nice dtype for these columns?

I noticed today that GeoSeries have a dtype of "geometry", however, I think we also want to handle regular series that contain shapely or wkb data. I was thinking of inspecting the first non-null value in a series for bytes or shapely objects. Of course, if the dtype is "geometry", we can skip that.
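
Something along these lines, for example (a sketch only, not the eventual implementation; assumes shapely is importable):

```python
import pandas as pd
from shapely.geometry.base import BaseGeometry

def looks_like_geography(series: pd.Series) -> bool:
    """Guess whether a series holds GEOGRAPHY data."""
    # GeoSeries (and geometry-dtype columns generally) announce themselves,
    # so no value inspection is needed.
    if str(series.dtype) == "geometry":
        return True
    # Otherwise peek at the first non-null value for a shapely object.
    valid = series.dropna()
    return len(valid) > 0 and isinstance(valid.iloc[0], BaseGeometry)
```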

jimfulton commented:
> We might get this for "free" depending on how we implement pagination

I don't understand this. There's no pagination involved in to_dataframe or to_geodataframe AFAICT.

jimfulton commented:
> I'd prefer to leave this private until we get demand for it.

I'm not sure what this means. Does this mean, don't implement to_geodataframe_iterable?

jimfulton commented:
> I don't know how many of our users want smaller dataframes for streaming-style processing.

When you say "streaming-style processing", I imagine getting new chunks continuously as new data are streamed in. That isn't what you mean, is it?

jimfulton commented:
All of this talk of breaking data up makes me think of BQ DASK and Spark support.

Is BQ DASK still an unsolved problem?

Does the Spark connector leverage BQ storage's multiple streams?

tswast commented Aug 4, 2021

> I noticed today that GeoSeries have a dtype of "geometry", however, I think we also want to handle regular series that contain shapely or wkb data. I was thinking of inspecting the first non-null value in a series for bytes or shapely objects. Of course, if the dtype is "geometry", we can skip that.

Excellent news.

There are load / insert from dataframe cases where we have a BigQuery schema available too, in which case we can skip any discovery steps.

Bytes might be tough, since there's also a BYTES data type in BigQuery. If a shapely object is found, it does make sense to treat the column as GEOGRAPHY.

> I don't understand this. There's no pagination involved in to_dataframe or to_geodataframe AFAICT.

There used to be. Now it's been moved to to_arrow. See:

```python
def _download_table_bqstorage_stream(
    download_state, bqstorage_client, session, stream, worker_queue, page_to_item
):
    reader = bqstorage_client.read_rows(stream.name)
    # Avoid deprecation warnings for passing in unnecessary read session.
    # https://github.com/googleapis/python-bigquery-storage/issues/229
    if _helpers.BQ_STORAGE_VERSIONS.is_read_session_optional:
        rowstream = reader.rows()
    else:
        rowstream = reader.rows(session)
    for page in rowstream.pages:
        if download_state.done:
            return
        item = page_to_item(page)
        worker_queue.put(item)
```

and

```python
def download_dataframe_row_iterator(pages, bq_schema, dtypes):
    """Use HTTP JSON RowIterator to construct a DataFrame.

    Args:
        pages (Iterator[:class:`google.api_core.page_iterator.Page`]):
            An iterator over the result pages.
        bq_schema (Sequence[Union[ \
            :class:`~google.cloud.bigquery.schema.SchemaField`, \
            Mapping[str, Any] \
        ]]):
            A description of the fields in result pages.
        dtypes (Mapping[str, numpy.dtype]):
            The types of columns in result data to hint construction of the
            resulting DataFrame. Not all column types have to be specified.

    Yields:
        :class:`pandas.DataFrame`
            The next page of records as a ``pandas.DataFrame`` record batch.
    """
    bq_schema = schema._to_schema_fields(bq_schema)
    column_names = [field.name for field in bq_schema]
    for page in pages:
        yield _row_iterator_page_to_dataframe(page, column_names, dtypes)
```

> I'm not sure what this means. Does this mean, don't implement to_geodataframe_iterable?

Correct

> When you say "streaming-style processing", I imagine getting new chunks continuously as new data are streamed in. That isn't what you mean, is it?

Nothing that sophisticated. I mean writing an ETL job that processes data one chunk at a time, compared to trying to process the table as a whole.

> Is BQ DASK still an unsolved problem?

There's some community progress here: dask/dask#3121. It would be nice to have a to_dask method like there is for to_dataframe (pandas). I was hoping for something more sophisticated, like pushing down filters to the backend API, but I don't think the Dask interface supports that in the way I thought it might.

> Does the Spark connector leverage BQ storage's multiple streams?

Looks like there is support for that: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/StreamCombiningIterator.java

jimfulton commented:
> Bytes might be tough, since there's also a BYTES data type in BigQuery. If a shapely object is found, it does make sense to treat the column as GEOGRAPHY.

When determining BQ column types (because there's no existing table or load-job schema), I only answer GEOGRAPHY if the column is a GeoSeries or the first valid value is a shapely object.

Later, when computing arrow arrays, I'll use a binary arrow array if the BQ type is GEOGRAPHY and the series is a GeoSeries, contains shapely objects, or contains bytes.

Supporting bytes when the BQ type is GEOGRAPHY lets you round-trip queries like: select name, st_asbinary(geog) from foo.
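
A sketch of that Arrow conversion rule (not the library's actual code; assumes pyarrow and shapely are available):

```python
import pyarrow as pa
from shapely import wkb
from shapely.geometry.base import BaseGeometry

def geography_to_arrow(values):
    """Serialize a GEOGRAPHY column to an Arrow binary array.

    Accepts shapely objects or WKB bytes (e.g. the result of ST_ASBINARY)
    and passes None through for NULLs.
    """
    def to_wkb(value):
        if value is None:
            return None
        if isinstance(value, BaseGeometry):
            return wkb.dumps(value)
        return bytes(value)  # assume the value is already WKB
    return pa.array([to_wkb(v) for v in values], type=pa.binary())
```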
