
feat: use geopandas for GEOGRAPHY columns if geopandas is installed #792

Closed · tswast opened this issue Jul 21, 2021 · 18 comments · Fixed by #848
Labels: api: bigquery · semver: major · type: feature request

Comments

tswast commented Jul 21, 2021

This would technically be a breaking change, but it might make sense to do while we are changing default dtypes in #786 for https://issuetracker.google.com/144712110

If the GeoPandas library is installed (meaning, GeoPandas should be considered an optional "extra"), it may make sense to use the extension dtypes provided by GeoPandas by default on GEOGRAPHY columns.

tswast commented Jul 21, 2021

Since GeoSeries uses the extension type (https://jorisvandenbossche.github.io/blog/2019/08/13/geopandas-extension-array-refactor/), I believe it can be used in a normal pandas DataFrame (ignoring the GeoDataFrame class).

tswast commented Jul 26, 2021

This should be easier once #786 is in, as that pull request adds logic to introspect the BigQuery schema to determine which dtype to use and to perform any conversion steps on the dataframe (e.g. time zone updates) before returning it.

tswast commented Jul 26, 2021

Oh, and I did some more reading regarding GeoDataFrame vs GeoSeries. Using GeoDataFrame requires selecting one specific column to be the "geometry" column. I don't really think it makes sense for us to add arguments just for that. Instead, we can create a regular DataFrame with any GEOGRAPHY columns converted to GeoSeries. The developer should then be able to construct a GeoDataFrame from that if they desire.

jimfulton commented:
+1

jimfulton commented Jul 30, 2021

> Oh, and I did some more reading regarding GeoDataFrame vs GeoSeries. Using GeoDataFrame requires selecting one specific column to be the "geometry" column. I don't really think it makes sense for us to add arguments just for that. Instead, we can create a regular DataFrame with any GEOGRAPHY columns converted to GeoSeries. The developer should then be able to construct a GeoDataFrame from that if they desire.

Actually, not so much. If you try to put a GeoSeries in a regular Pandas data frame, it gets converted to a Pandas series.

GeoDataFrames go through a bunch of hijinks to give the illusion of storing a GeoSeries.
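
For example, a quick check along these lines (a hypothetical snippet, assuming a recent geopandas with the extension-array refactor) shows the column coming back as a plain pandas Series:

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

gs = geopandas.GeoSeries([Point(0, 0), Point(1, 1)])
df = pd.DataFrame({"name": ["a", "b"], "geog": gs})

# The values and the "geometry" dtype survive, but the column is handed back
# as a plain pandas.Series, so the GeoSeries methods are no longer available.
print(type(df["geog"]))  # <class 'pandas.core.series.Series'>
print(df["geog"].dtype)  # geometry
```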

jimfulton commented:
Digging a little deeper:

  • A GeoSeries is a wrapper for a Series containing shapely objects (of possibly various kinds) and provides lots of geographic analytical capabilities. A GeoSeries also carries along a projection (CRS).
  • A GeoDataFrame lets you store a GeoSeries along with a bunch of other attributes. It provides analytic functions that effectively delegate to the series. (I'm sure I'm simplifying somewhat, but not a lot.)
    • GeoDataFrames intrinsically only allow one series. The model is that a row in a table models a feature consisting of a geography and some other attributes.
  • If we don't want to deal with picking a single geography column, we could still provide some value by converting GEOGRAPHY values to shapely objects.
  • I imagine that it's very common for a table to have a single geography column, in which case, we could determine the geographic highlander ourselves.

I have a suggestion, but I'll ponder it a bit more. :)
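
As a concrete illustration of that single-geometry-column model (hypothetical column names; assumes geopandas and shapely are installed), a GeoDataFrame designates exactly one geometry column and treats any other shapely-valued columns as ordinary object columns:

```python
import geopandas
import pandas as pd
from shapely.geometry import Point

df = pd.DataFrame(
    {
        "name": ["a", "b"],
        "boundary": [Point(0, 0).buffer(1), Point(2, 2).buffer(1)],
        "centroid": [Point(0, 0), Point(2, 2)],
    }
)

# Exactly one column backs the geographic analysis methods; "centroid" stays
# an ordinary object column of shapely values.
gdf = geopandas.GeoDataFrame(df, geometry="boundary", crs="EPSG:4326")
print(gdf.geometry.name)  # boundary
```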

jimfulton commented Aug 2, 2021

IMO the leveraging of geography types should be explicit. (The behavior should not change based on what's installed other than providing a clear error when what's requested isn't possible due to a lack of a dependency.)

Some options (not all mutually exclusive):

  1. Add a to_geodataframe query-job method.

    • This would have the same arguments as to_dataframe.
    • There must be at least one GEOGRAPHY column.
    • If there's more than one GEOGRAPHY column, then the geography_column option must be used to control which column is used to support the geographic analysis methods.
      • Other GEOGRAPHY columns are output as shapely objects and can be easily converted to GeoSeries.
    • Error if geopandas isn't installed.
  2. Add a geodataframe option to to_dataframe. If there is only one GEOGRAPHY column, then this can be True. If there is more than one, then this must be the name of the column to use, in which case the others are converted to shapely objects.

    • Error if geopandas isn't installed.
  3. Add a geography_as_object option to to_dataframe. If true, then geography objects are output as shapely objects.

    • Error if shapely isn't installed.
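
To make the trade-off concrete, here is a rough sketch of what options (1) and (3) might look like from the caller's side; the table and column names are made up, and the exact signatures are whatever the eventual implementation settles on:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT name, boundary, centroid FROM `my-project.my_dataset.regions`"
)

# Option (1): a dedicated method returning a geopandas.GeoDataFrame. With
# more than one GEOGRAPHY column, geography_column picks the one that backs
# the geometry; the others come back as shapely objects.
gdf = job.to_geodataframe(geography_column="boundary")

# Option (3): keep a plain pandas.DataFrame, but decode GEOGRAPHY values
# into shapely objects rather than WKT strings.
df = job.to_dataframe(geography_as_object=True)
```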

BTW, there should be some non-reference documentation on working with pandas and geopandas. It's elegant but non-obvious, IMO, that you get data into Pandas via job objects, which are elegant and strange in themselves. :)

jimfulton commented Aug 2, 2021

For loading data with load_table_from_dataframe, I propose:

  1. When using parquet format and loading a geography column, inspect the first value (if there are any):

    • If it's binary, use the parquet binary type. If the values are WKB, this will just work. I tested it. :)
    • If shapely is installed and it's a shapely object, convert the pandas column to WKB and use a pandas binary column.
  2. When using CSV and loading a geography column, inspect the first value (if there are any):

    • If shapely is installed and it's a shapely object, convert the pandas column to WKT.

I think this will make load_table_from_dataframe just work with GeoDataFrames.
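
A minimal sketch of that first-value inspection and conversion (assuming shapely is installed; the frame and column names here are made up for illustration):

```python
import pandas as pd
from shapely import wkb, wkt
from shapely.geometry import Point
from shapely.geometry.base import BaseGeometry

df = pd.DataFrame(
    {"name": ["a", "b"], "geog": [Point(-122.4, 37.8), Point(-73.9, 40.7)]}
)

valid = df["geog"].dropna()
first = valid.iloc[0] if len(valid) else None

if isinstance(first, bytes):
    # Already WKB: the parquet binary type can carry it as-is.
    geog_for_parquet = df["geog"]
elif isinstance(first, BaseGeometry):
    # Shapely objects: serialize to WKB for parquet, or to WKT for CSV.
    geog_for_parquet = df["geog"].apply(wkb.dumps)
    geog_for_csv = df["geog"].apply(wkt.dumps)
```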

jimfulton commented:
Do we want to_geodataframe_iterable? It seems that to_dataframe_iterable is missing at least one to_dataframe feature, date_as_object=False.

jimfulton commented:
FTR: I've implemented the geography_as_object argument to to_dataframe, as well as to_geodataframe.

tswast commented Aug 3, 2021

Re: #792 (comment)

> IMO the leveraging of geography types should be explicit.

Good point. Explicit is better than implicit.

(1) to_geodataframe or (3) geography_as_object appeal to me, as they don't change the return type of to_dataframe. I suspect GeoPandas will optimize GeoDataFrames at some point, so I prefer (1) to_geodataframe.

Re: load_table_from_dataframe #792 (comment)

This all sounds good to me. I wonder how we will detect shapely objects. Do GeoDataFrames give us a nice dtype for these columns?

Re: to_geodataframe_iterable #792 (comment)

We might get this for "free" depending on how we implement pagination, but I'd prefer to leave this private until we get demand for it. I don't know how many of our users want smaller dataframes for streaming-style processing.

Let's file an issue for date_as_object=False. With our use of Arrow to concatenate into a larger dataframe, to_dataframe_iterable has a risk of getting out of sync with to_dataframe.

jimfulton commented:
> Re: load_table_from_dataframe #792 (comment)
>
> This all sounds good to me. I wonder how we will detect shapely objects. Do GeoDataFrames give us a nice dtype for these columns?

I noticed today that GeoSeries have a dtype of "geometry", however, I think we also want to handle regular series that contain shapely or wkb data. I was thinking of inspecting the first non-null value in a series for bytes or shapely objects. Of course, if the dtype is "geometry", we can skip that.
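
Something along these lines, for example (a sketch only, not the eventual implementation; assumes shapely is importable):

```python
import pandas as pd
from shapely.geometry.base import BaseGeometry

def looks_like_geography(series: pd.Series) -> bool:
    """Guess whether a series holds GEOGRAPHY data."""
    # GeoSeries (and geometry-dtype columns generally) announce themselves,
    # so no value inspection is needed.
    if str(series.dtype) == "geometry":
        return True
    # Otherwise peek at the first non-null value for a shapely object.
    valid = series.dropna()
    return len(valid) > 0 and isinstance(valid.iloc[0], BaseGeometry)
```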

jimfulton commented:
> We might get this for "free" depending on how we implement pagination

I don't understand this. There's no pagination involved in to_dataframe or to_geodataframe AFAICT.

jimfulton commented:
> I'd prefer to leave this private until we get demand for it.

I'm not sure what this means. Does this mean, don't implement to_geodataframe_iterable?

jimfulton commented:
> I don't know how many of our users want smaller dataframes for streaming-style processing.

When you say "streaming-style processing", I imagine getting new chunks continuously as new data are streamed in. That isn't what you mean, is it?

jimfulton commented:
All of this talk of breaking data up makes me think of BQ DASK and Spark support.

Is BQ DASK still an unsolved problem?

Does the Spark connector leverage BQ storage's multiple streams?

tswast commented Aug 4, 2021

> I noticed today that GeoSeries have a dtype of "geometry", however, I think we also want to handle regular series that contain shapely or wkb data. I was thinking of inspecting the first non-null value in a series for bytes or shapely objects. Of course, if the dtype is "geometry", we can skip that.

Excellent news.

There are load / insert from dataframe cases where we have a BigQuery schema available too, in which case we can skip any discovery steps.

Bytes might be tough, since there's also a BYTES data type in BigQuery. If a shapely object is found, it does make sense to treat the column as GEOGRAPHY.

> I don't understand this. There's no pagination involved in to_dataframe or to_geodataframe AFAICT.

There used to be. Now it's been moved to to_arrow. See:

```python
def _download_table_bqstorage_stream(
    download_state, bqstorage_client, session, stream, worker_queue, page_to_item
):
    reader = bqstorage_client.read_rows(stream.name)
    # Avoid deprecation warnings for passing in unnecessary read session.
    # https://github.com/googleapis/python-bigquery-storage/issues/229
    if _helpers.BQ_STORAGE_VERSIONS.is_read_session_optional:
        rowstream = reader.rows()
    else:
        rowstream = reader.rows(session)
    for page in rowstream.pages:
        if download_state.done:
            return
        item = page_to_item(page)
        worker_queue.put(item)
```

and

```python
def download_dataframe_row_iterator(pages, bq_schema, dtypes):
    """Use HTTP JSON RowIterator to construct a DataFrame.

    Args:
        pages (Iterator[:class:`google.api_core.page_iterator.Page`]):
            An iterator over the result pages.
        bq_schema (Sequence[Union[ \
            :class:`~google.cloud.bigquery.schema.SchemaField`, \
            Mapping[str, Any] \
        ]]):
            A description of the fields in result pages.
        dtypes (Mapping[str, numpy.dtype]):
            The types of columns in result data to hint construction of the
            resulting DataFrame. Not all column types have to be specified.

    Yields:
        :class:`pandas.DataFrame`
            The next page of records as a ``pandas.DataFrame`` record batch.
    """
    bq_schema = schema._to_schema_fields(bq_schema)
    column_names = [field.name for field in bq_schema]
    for page in pages:
        yield _row_iterator_page_to_dataframe(page, column_names, dtypes)
```

> I'm not sure what this means. Does this mean, don't implement to_geodataframe_iterable?

Correct

> When you say "streaming-style processing", I imagine getting new chunks continuously as new data are streamed in. That isn't what you mean, is it?

Nothing that sophisticated. I mean writing an ETL job that processes data one chunk at a time, compared to trying to process the table as a whole.

> Is BQ DASK still an unsolved problem?

There's some community progress here: dask/dask#3121. It would be nice to have a to_dask method like there is for to_dataframe (pandas). I was hoping for something more sophisticated, like pushing down filters to the backend API, but I don't think the Dask interface supports that in the way I thought it might.

> Does the Spark connector leverage BQ storage's multiple streams?

Looks like there is support for that: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/StreamCombiningIterator.java

jimfulton commented:
> Bytes might be tough, since there's also a BYTES data type in BigQuery. If a shapely object is found, it does make sense to treat the column as GEOGRAPHY.

When determining BQ column types (because there's no existing table or load-job schema), I only answer GEOGRAPHY if the column is a GeoSeries or the first valid value is a shapely object.

Later, when computing arrow arrays, I'll use a binary arrow array if the BQ type is GEOGRAPHY and the series is a GeoSeries, contains shapely objects, or contains bytes.

Supporting bytes when the BQ type is GEOGRAPHY lets you round-trip queries like: select name, st_asbinary(geog) from foo.
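
A sketch of that Arrow conversion rule (not the library's actual code; assumes pyarrow and shapely are available):

```python
import pyarrow as pa
from shapely import wkb
from shapely.geometry.base import BaseGeometry

def geography_to_arrow(values):
    """Serialize a GEOGRAPHY column to an Arrow binary array.

    Accepts shapely objects or WKB bytes (e.g. the result of ST_ASBINARY)
    and passes None through for NULLs.
    """
    def to_wkb(value):
        if value is None:
            return None
        if isinstance(value, BaseGeometry):
            return wkb.dumps(value)
        return bytes(value)  # assume the value is already WKB
    return pa.array([to_wkb(v) for v in values], type=pa.binary())
```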
