feat: use geopandas for GEOGRAPHY columns if geopandas is installed #792
Comments
Since GeoSeries now uses the extension-array mechanism (https://jorisvandenbossche.github.io/blog/2019/08/13/geopandas-extension-array-refactor/), I believe that means it can be used in a normal pandas DataFrame (ignoring the GeoDataFrame class).
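A minimal sketch of that point, assuming geopandas and shapely are installed (the column names are just illustrative):

```python
import pandas as pd
import geopandas
from shapely.geometry import Point

# A GeoSeries uses the "geometry" extension dtype, so it can live inside a
# plain pandas DataFrame without wrapping everything in a GeoDataFrame.
df = pd.DataFrame(
    {
        "name": ["origin", "corner"],
        "location": geopandas.GeoSeries([Point(0, 0), Point(1, 1)]),
    }
)

print(type(df))    # <class 'pandas.core.frame.DataFrame'>
print(df.dtypes)   # "location" reports dtype "geometry"
```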
This should be easier once #786 is in, as that pull request adds logic to introspect the BigQuery schema to determine what
Oh, and I did some more reading regarding
+1
Actually, not so much. If you try to put a
Digging a little deeper:
I have a suggestion, but I'll ponder it a bit more. :)
IMO, leveraging geography types should be explicit. (The behavior should not change based on what's installed, other than providing a clear error when what's requested isn't possible due to a missing dependency.) Some options (not all mutually exclusive):
BTW, there should be some non-reference documentation on working with pandas and geopandas. It's elegant but non-obvious, IMO, that you get data into Pandas via job objects, which are elegant and strange in themselves. :)
For loading data with
I think this will make
Do we want
FTR: I've implemented the
Re: #792 (comment)
Good point. Explicit is better than implicit. (1)

Re: This all sounds good to me. I wonder how we will detect

Re: We might get this for "free" depending on how we implement pagination, but I'd prefer to leave this private until we get demand for it. I don't know how many of our users want smaller dataframes for streaming-style processing. Let's file an issue for
I noticed today that GeoSeries have a dtype of "geometry"; however, I think we also want to handle regular series that contain shapely or WKB data. I was thinking of inspecting the first non-null value in a series for bytes or shapely objects. Of course, if the dtype is "geometry", we can skip that.
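A rough sketch of that detection logic; the helper name and the exact checks below are hypothetical, not the code under review:

```python
import pandas as pd

try:
    from shapely.geometry.base import BaseGeometry
except ImportError:  # shapely is optional
    BaseGeometry = None


def looks_like_geography(series: pd.Series) -> bool:
    """Guess whether a plain pandas Series holds GEOGRAPHY-like values."""
    # A GeoSeries is the easy case: its dtype is already "geometry".
    if str(series.dtype) == "geometry":
        return True
    # Otherwise inspect the first non-null value for a shapely object or
    # WKB bytes.
    non_null = series.dropna()
    if non_null.empty:
        return False
    first = non_null.iloc[0]
    if BaseGeometry is not None and isinstance(first, BaseGeometry):
        return True
    return isinstance(first, (bytes, bytearray))
```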
I don't understand this. There's no pagination involved in
I'm not sure what this means. Does this mean, don't implement
When you say "streaming-style processing", I imagine getting new chunks continuously as new data are streamed in. That isn't what you mean, is it?
All of this talk of breaking data up makes me think of BQ Dask and Spark support. Is BQ Dask still an unsolved problem? Does the Spark connector leverage BQ storage's multiple streams?
Excellent news. There are load / insert from dataframe cases where we have a BigQuery schema available too, in which case we can skip any discovery steps. Bytes might be tough, since there's also a BYTES data type in BigQuery. If
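A sketch of that shortcut, assuming a mapping of column names to SchemaField objects is available whenever a target table or an explicit load-job schema exists; the helper name is invented, and `looks_like_geography` is the value-sniffing sketch above:

```python
def column_is_geography(field_by_name, column_name, series) -> bool:
    """Prefer the declared BigQuery schema; only sniff values without one."""
    field = field_by_name.get(column_name)
    if field is not None:
        # The schema wins: no discovery needed, and a BYTES column that
        # happens to contain WKB-looking bytes stays BYTES.
        return field.field_type == "GEOGRAPHY"
    # No schema available, so fall back to inspecting the values.
    return looks_like_geography(series)
```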
There used to be. Now it's been moved to google/cloud/bigquery/_pandas_helpers.py, lines 593–609 at commit 7016f69.
google/cloud/bigquery/_pandas_helpers.py, lines 559–580 at commit 7016f69
Correct
Nothing that sophisticated. I mean writing an ETL job that processes data one chunk at a time, compared to trying to process the table as a whole.
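As a sketch of that style, where `dataframes` stands in for any iterable of DataFrame chunks (not a specific API) and the `category` column is made up:

```python
def run_etl(dataframes):
    """Aggregate a large table one DataFrame chunk at a time instead of
    materializing the whole result in memory."""
    totals = {}
    for chunk in dataframes:  # each chunk is an ordinary pandas DataFrame
        for key, count in chunk["category"].value_counts().items():
            totals[key] = totals.get(key, 0) + int(count)
    return totals
```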
There's some community progress here: dask/dask#3121. It would be nice to have a
Looks like there is support for that: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/connector/src/main/java/com/google/cloud/bigquery/connector/common/StreamCombiningIterator.java
When determining BQ column types (because there's no existing table or load-job schema), I only answer GEOGRAPHY if the column is a GeoSeries or the first valid value is a shapely object. Later, when computing Arrow arrays, I'll use a binary Arrow array if the BQ type is GEOGRAPHY and the series is a GeoSeries, contains shapely objects, or contains bytes. Supporting bytes when the BQ type is GEOGRAPHY lets you round-trip queries like:
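A sketch of that Arrow-array step, assuming pyarrow and shapely; geometries are serialized with shapely.wkb.dumps, and values that are already bytes (for example WKB returned by a query) pass through unchanged:

```python
import pandas as pd
import pyarrow
import shapely.wkb
from shapely.geometry.base import BaseGeometry


def geography_to_arrow(series) -> pyarrow.Array:
    """Build a binary Arrow array for a column headed to a GEOGRAPHY field.

    Accepts a GeoSeries, a series of shapely objects, or a series of WKB
    bytes, converting everything to WKB.
    """
    def to_wkb(value):
        if pd.isna(value):
            return None
        if isinstance(value, BaseGeometry):
            return shapely.wkb.dumps(value)
        return bytes(value)  # assume the value is already WKB bytes

    return pyarrow.array([to_wkb(v) for v in series], type=pyarrow.binary())
```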
This would technically be a breaking change, but it might make sense to do while we are changing default dtypes in #786 for https://issuetracker.google.com/144712110
If the GeoPandas library is installed (meaning, GeoPandas should be considered an optional "extra"), it may make sense to use the extension dtypes provided by GeoPandas by default on GEOGRAPHY columns.
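A sketch of the optional-dependency behavior the issue describes, assuming GEOGRAPHY values arrive as WKT strings; the function name is illustrative, not the library's actual implementation:

```python
import pandas as pd

try:  # geopandas (and its shapely dependency) are an optional "extra"
    import geopandas
    from shapely import wkt
except ImportError:
    geopandas = None


def wrap_geography_column(values):
    """Wrap raw GEOGRAPHY values for a result DataFrame.

    With geopandas installed, parse WKT into shapely geometries and return a
    GeoSeries (dtype "geometry"); without it, keep a plain object-dtype
    Series so nothing changes for users who skip the extra.
    """
    if geopandas is not None:
        geoms = [wkt.loads(v) if v is not None else None for v in values]
        return geopandas.GeoSeries(geoms)
    return pd.Series(values, dtype="object")
```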