-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zarr serialisation #48
Comments
I think your first proposal makes it harder to read & write Zarr, as the de/encoder needs to read Zarr AND GeoParquet; we had this discussion before when the idea came up to use WKB. I don't think the GDAL Multidimensional Array API can do that, can Given the geoarrow proposal you point to, there seem to be a strong reasons not to interleave ( The CF convention arrays are not ragged but contiguous in memory/on disk: only when you're cutting them into the polygons or linestrings they (obviously) become ragged of some sort. CF seems the most natural to me, because those who write Zarr are aware of that spec. |
A good way to approach this would be to mirror Xarray's existing encoding / decoding mechanism. The vector indexes could exist in two states:
It would be nice if xarray allowed plugins to define custom coders. Until that happens, the coders could be implemented following the existing coding API and called manually from xvec, e.g. ds = xr.open_dataset(netcdf_or_zarr_file)
ds = ds.xvec.decode() # -> create xvec GeometryIndex from x, y, offset arrays
ds = ds.xvec.encode() # -> back to raw form
ds.to_zarr(new_file) |
Yes we can but should we? I wonder what the motivation is for drawing in non-native en/decoders: the payload is really the data cube cells, the vector geometries are only associated with (typically) a single dimension, meaning orders of magnitude smaller data than the array. Is this bikeshedding? |
I'm not sure what you mean by "non native". Could you elaborate? I'm proposing to mirror xarray's existing CF encoding / decoding mechanisms. |
Maybe I'm overthinking things.
Then we're on the same page! |
Agree with both of you.
Yes, we just mentioned that it is theoretically possible but went quickly over to GeoArrow and CF conventions as a preferable solution.
This would also be my preference. The NetCDF serialisation to match {stars} is still on the roadmap in parallel to Zarr and having both using either the same or very close representation of geometries would allow us to reuse big parts of the machinery for one in the other.
+1 |
I was taking a look at this at SciPy sprints today and tried to extend the I was eventually pointed to https://github.com/twhiteaker/CFGeom! I'm not sure what the maintenance situation is on that, but it could either be revived or used as a reference of how to convert back and forth between cf and shapely. And they already have a |
We could probably use it to ensure that the CF representation of geometries is according to spec but due to its age, the implementation itself will likely be very different now that we have shapely 2. I assume that their code may still work but it will not be as efficient as we can be these days. We could also generate reference NetCDF files using |
Ok yeah that makes sense. I'll open a PR with what I've got. Not sure when I'll get a chance to come back to it. |
I have tried to go the simple path and encode geometries using geo-arrow-like encoding to store in Zarr and it seems to work nicely (plus it was approved by @MSanKeys963). Below is the draft of def decode(ds):
"""Decode Dataset with Xvec indexes into GeoArrow arrays
Parameters
----------
ds : xarray.Dataset
Dataset with Xvec-backed coordinates
Returns
-------
xarray.Dataset
Dataset with geometries represented as GeoArrow-like arrays
"""
decoded = ds.copy()
decoded = decoded.assign_attrs(geometry={})
for geom_coord in ds.xvec.geom_coords:
geom_type, coords, offsets = shapely.to_ragged_array(ds[geom_coord])
decoded.attrs["geometry"][geom_coord] = {
"geometry_type": geom_type.name,
"crs": ds[geom_coord].crs.to_json_dict(),
}
to_assign = {
geom_coord: numpy.arange(ds[geom_coord].shape[0]),
f"{geom_coord}_coords_x": coords[:, 0],
f"{geom_coord}_coords_y": coords[:, 1],
}
if coords.shape[1] == 3:
to_assign[f"{geom_coord}_coords_z"] = coords[:, 2]
for i, offsets_arr in enumerate(offsets):
to_assign[f"{geom_coord}_offsets_{i}"] = offsets_arr
return decoded.assign(to_assign)
def encode(ds):
"""Encode Dataset indexed by geometries represented as GeoArrow arrays into Xvec indexes
Parameters
----------
ds : xarray.Dataset
Dataset with geometries represented as GeoArrow-like arrays
Returns
-------
xarray.Dataset
Dataset with Xvec-backed coordinates
"""
encoded = ds.copy()
for geom_coord in encoded.attrs["geometry"].keys():
offsets = [c for c in encoded.coords if f"{geom_coord}_offsets_" in c]
geoms = shapely.from_ragged_array(
getattr(
shapely.GeometryType,
encoded.attrs["geometry"][geom_coord]["geometry_type"],
),
coords=numpy.vstack(
[encoded[f"{geom_coord}_coords_x"], encoded[f"{geom_coord}_coords_y"]]
).T,
offsets=tuple([encoded[c] for c in offsets]),
)
encoded = encoded.drop_vars(
[
f"{geom_coord}_coords_x",
f"{geom_coord}_coords_y",
]
+ offsets
)
encoded[geom_coord] = geoms
encoded = encoded.xvec.set_geom_indexes(
geom_coord, crs=encoded.attrs["geometry"][geom_coord]["crs"]
)
del encoded.attrs["geometry"]
return encoded Using the example from our docs: import geopandas as gpd
import xarray as xr
import xvec
import shapely
import numpy
from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
data_vars=dict(
population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
),
coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
arrow_like = decode(cube)
arrow_like
This form can be saved using encode(arrow_like)
The only change during the roundtrip is casting of the Polygon to MultiPolygon since GeoArrow does not support mixed single and multi-part geometry types and all are casted to multi-type if there's at least one multi-type. It does deviate from CF conventions but uses readily available tooling in shapely. And the format is derived from GeoArrow, so there is a spec to follow. If you find this plausible, I will convert it to a PR. |
Another option would be to encode geometries as WKB (well-known binary) instead of a Geo-Arrow-like ragged array. That allows us to keep the dimensionality and have the decoded object look like the original one. USing the cube["county"] = shapely.to_wkb(cube["county"])
cube["county"].attrs["crs"] = cube["county"].attrs["crs"].to_json()
cube.to_zarr("wkb.zarr")
cube
from_wkb = xr.open_zarr("wkb.zarr")
crs = from_wkb["county"].attrs["crs"]
from_wkb["county"] = shapely.from_wkb(from_wkb["county"])
from_wkb = from_wkb.xvec.set_geom_indexes("county", crs=crs) Anyone has a reason why we should not use WKB here? The one reason I can think of is the need to have the WKB geometry reader but if you want to create geometries you probably already have that (it is part of GEOS). |
I understand why this might not be ideal / optimal but I don't see any reason not to support it. It is convenient and it "scales well" (i.e., the number variables, dimensions & attributes doesn't explode) in the case of a dataset with several n-d geometry arrays (data variables), assuming that this case is plausible. Looking at the geoparquet spec, WKB support doesn't seem to go anywhere anytime soon and will likely co-exist with other encoding options (geoarrow) in the near future. Unless it represents a big burden to maintain, IMHO it would make sense to support those options here too (and in the geozarr-spec if it includes vector data) in addition to the CF option (via cf-xarray).
Or Xvec could provide an I/O backend, which Xarray already allows as plugin. One thing I'm not sure is whether it could be built on top of Xarray's built-in zarr backend (I mean reusing the implementation of raw zarr I/O in a new backend, not extend the API of xarray's zarr backend). |
#69 builds on the hard work of @jsignell @aulemahal and I to allow encoding of vector cubes into CF compliant form that can be serialised to netCDF/Zarr or any other format that Xarray can write to really. The core geometry encoding logic lives in cf-xarray: (docs) and CRS encoding logic lives here in xvec. It isn't very configurable at the moment but seems to work, and even includes support for multiple geometry arrays! |
Here is an interesting detail.
but AFAICT shapely supports either order for the exterior ring, and that's the order we end up writing to disk. So the files may be non-compliant if the input geometries are non-compliant. |
GEOS does not care but if we want to ensure the ordering, we can check it via |
That is some progress at least; the clockwise/anticlockwise mess exists because shapely (and GEOS, and simple features) only works with two-dimensional flat spaces (projected data): there, every valid polygon has an unambiguous inside, regardless node ordering. When considered on the sphere or ellipsoid, every polygon divides the area in two parts, and what is considered inside the polygon needs to be found out by node ordering.
This requirement still doesn't work on the sphere: suppose the polygon is the equator or a meridian: where is "above"? The logic used in s2geometry is to state that when traversing the nodes, the area to the left is "inside". That works on bounded spaces like spheres. Another issue CF would need to clarify is how edges are defined for geodetic coordinates: are they great circles (like s2geometry does) or straight Cartesian lines (like GeoJSON rfc)? |
Nice. I think the current approach is probably fine, it ensures roundtripping of in-memory objects.The CF checker will complain, and anyone that wants fully compliant files can manually reorder
hah nice. |
Splitting this from #26 for a more focused discussion.
During a discussion with @rabernat earlier today, we talked about the possibility to serialise vector data cubes into a Zarr format, so I want to share a few notes and links.
Zarr is flexible enough to support this in many ways, so a few that has been discussed so far:
[x, y, x, y, ...]
, pay also attention to Allow storing points/coordinates as struct arrays in addition to interleaved layout geoarrow/geoarrow#26.It may be worth ensuring whatever we prototype is done in sync with the GeoZarr efforts.
The text was updated successfully, but these errors were encountered: