Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved WKB geometry parsing #16

Open
kylebarron opened this issue Jun 1, 2022 · 5 comments
Open

Improved WKB geometry parsing #16

kylebarron opened this issue Jun 1, 2022 · 5 comments

Comments

@kylebarron
Copy link
Member

It seems that the geozero API only supports Vec<u8> input. It would be ideal to find a way that geozero can read an Arrow u8 array without first copying into a Vec<u8>

@kylebarron kylebarron changed the title Zero copy WKB geometry parsing Improved WKB geometry parsing Jun 7, 2022
@kylebarron
Copy link
Member Author

This is partially solved in georust/geozero#39, where it removes one copy by parsing WKB from the arrow array reference directly instead of a Vec<u8>... but it still copies data into geo structs. So it's still not fully zero-copy.

@jrasband-dev
Copy link

jrasband-dev commented Dec 26, 2023

I noticed that when I read a geojson file into geopolars that the shapely geometry is missing. it just says [binary data]. Is this the same issue?

If so, I think I've built a successful way to parse the geometry (see image below). I might need some help with creating a pull request. I'm not a pro with git.

image

@kylebarron
Copy link
Member Author

There's been a lot of progress in https://github.com/geoarrow/geoarrow-rs that hasn't quite been integrated into here, including a new WKB parser. So I have most of the groundwork done, but haven't had time yet to integrate it into geopolars

@jrasband-dev
Copy link

Gald to hear there's a solution on the way! I know you're planning to implement the geoarrow parser that you've put together, but I was wondering if you could provide some feedback on this solution as well. I'm assuming the main reason you are making the change is because pyogrio is maintained by geopandas and you want to remove that dependency.

from __future__ import annotations
from typing import TYPE_CHECKING, cast
import polars as pl
from polars import DataFrame
from geopolars.geodataframe import GeoDataFrame


if TYPE_CHECKING:
    from pathlib import Path

def read_file(
    path_or_buffer: Path | str | bytes,
    /,
    layer: int | str | None = None,
    engine='arrow',
    encoding: str | None = None,
    columns=None,
    read_geometry: bool = True,
    force_2d: bool = False,
    skip_features: int = 0,
    max_features: int | None = None,
    where: str | None = None,
    bbox: tuple[float, float, float, float] | None = None,
    fids=None,
    sql=None,
    sql_dialect=None,
    return_fids=False,
    **kwargs
) -> DataFrame | GeoDataFrame:

if engine == 'arrow':
        from pyogrio.raw import read_arrow as _read_arrow
        from shapely.wkb import loads
        import pyarrow as pa

        metadata, table = _read_arrow(
            path_or_buffer,
            layer=layer,
            encoding=encoding,
            columns=columns,
            read_geometry=read_geometry,
            force_2d=force_2d,
            skip_features=skip_features,
            max_features=max_features,
            where=where,
            bbox=bbox,
            fids=fids,
            sql=sql,
            sql_dialect=sql_dialect,
            return_fids=return_fids,
        )
        # # TODO: check for metadata['geometry_type'] not Unknown 
        geom=[]
        for i in table['wkb_geometry']:
            wkb_geometry_binary = pa.array([i])
            wkb_hex = wkb_geometry_binary[0].as_py().hex()
            # Convert WKB to Shapely geometry
            shapely_geometry = loads(bytes.fromhex(wkb_hex))
            geom.append(shapely_geometry)
            # print(shapely_geometry)
        df = pl.from_arrow(table)
        df = df.replace('wkb_geometry',pl.Series(geom))
        

        return GeoDataFrame(df)

@kylebarron
Copy link
Member Author

kylebarron commented Dec 29, 2023

I'm assuming the main reason you are making the change is because pyogrio is maintained by geopandas and you want to remove that dependency.

That's not exactly right

  • In this initial prototype, geometries were stored as WKB. Storing geometries as WKB is very inefficient, because you need O(N) search to access any coordinate, rather than constant-time O(1). I didn't know at the time 18 months ago how to implement it, but I've since done it in geoarrow-rs.
  • The goal is actually to integrate closely with pyogrio (see Python GeoArrow Module Proposal geoarrow/geoarrow-python#38 for background), but we still need a conversion from WKB to arrow-native geometries. Integrating with pyogrio means that geopolars doesn't need to integrate with GDAL directly.
  • When you use shapely.loads, you're converting from WKB to a GEOS geometry, which later has to be converted back to GeoArrow. It's not so much that we want to remove the GEOS dependency, as much as converting from WKB to GEOS doesn't really help that much when we really want WKB -> GeoArrow.

There will most likely be an initial release in January of the python bindings to geoarrow-rs, and then I can focus on geopolars

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants