ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

martinfleis · 2022-11-23T23:00:53Z

Closes #10 and resolves #5.

This is very much WIP. Implementing .xvec accessor and a coords_to_crs method, that practically copies the code from GeoPandas. We may consider going through GeometryArray.to_crs instead but I'd wait until shapely 2.0 is the only supported engine. Right now, it may cause a bit of trouble and unnecessary conversions.

It currently supports only a single coords. While drop_indexes support dropping multiple indexes at the same time, set_xindex seems to support only a single one, so we will need a workaround. I think that it may be useful to use a single coords_to_crs call to consolidate projections of all geometries instead of going one by one.

To-do:

support multiple coords in coords_to_crs
check that the index is GeometryIndex
tests
documentation

codecov · 2022-11-23T23:03:46Z

Codecov Report

Base: 97.72% // Head: 98.53% // Increases project coverage by +0.80% 🎉

Coverage data is based on head (b8ef560) compared to base (5cc1a26).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   97.72%   98.53%   +0.80%     
==========================================
  Files           1        2       +1     
  Lines         132      205      +73     
==========================================
+ Hits          129      202      +73     
  Misses          3        3

Impacted Files	Coverage Δ
xvec/accessor.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

benbovy · 2022-11-24T08:16:55Z

Great!

A signature very familiar to Xarray users is this:

Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)

# e.g.
ds.xvec.coords_to_crs(geom1=crs, geom2=other_crs)
# or
ds.xvec.coords_to_crs({"geom1": crs, "geom2": other_crs})

But I also agree that setting a common CRS for all geometry coordinates may be convenient and is consistent with geopandas. What about a to_crs() method that automatically transforms all geometry coordinates found in a Dataset / DataArray, just like in geopandas? This might co-exist with the coords_to_crs() signature suggested above.

set_xindex seems to support only a single one, so we will need a workaround.

I think it is fine to loop through the coordinates and chain the set_xindex calls, it is reasonably cheap to build new DataArray / Dataset objects. Otherwise, we usually call Xarray's internal (fastpath) constructors with new or copied variables, indexes, coordinate names and metadata.

martinfleis · 2022-12-01T21:21:56Z

I am not sure about the coords_to_crs and to_crs living next to each other. I can see it quite confusing for newcomers. I'll use the syntax you suggest above and we can later think about many-to-one option. I don't expect many use cases with a lot of coordinates in different projections, so it may not be a priority.

benbovy · 2022-12-02T09:14:39Z

I don't expect many use cases with a lot of coordinates in different projections, so it may not be a priority.

Since I wrote https://github.com/martinfleis/xvec/pull/11#issuecomment-1326097578 I read from discussions in GeoPandas and other places that a single CRS is used for the whole Dataframe / object. Rioxarray / odc-geo also consider one global CRS per DataArray / Dataset.

So I agree and changed my mind regarding https://github.com/martinfleis/xvec/pull/7#discussion_r1029284508, we might consider a single CRS for the whole Dataset / DataArray too.

How to enforce that, though? Unlike GeoPandas, there's no concept of an active geometry coordinate in xvec and it probably shouldn't be one? I'm uncertain on whether we want to add a coords parameter to pretty much all methods of the xvec accessor or simply detect all geometry coordinates and iterating through it. We could also add a coords: Hashable | Interable[Hashable] | None = None parameter: apply the operation on all geometry coordinates by default or only on the explicitly provided coordinates.. I'd rather lean on the latter, with something like this for the accessor API:

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# read (write?) property returning a single CRS
# or raising an error if conflicting CRS are found
Dataset.xvec.crs

# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)

# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.to_crs(coords=None, crs=None...)

martinfleis · 2022-12-02T09:47:32Z

I tend to disagree here. GeoPandas does not have a concept of a single CRS for the whole DataFrame. You can easily have many GeoSeries in the gdf each with its own CRS and there are use cases for that I've used myself. The feeling that there is one per gdf si coming from the active geometry concept only. Treating each coords as a single GeoSeries, hence allowing it to have a different CRS makes sense to me. I am not sure how the single CRS per DataArray here would work and why we should do it.

I think that with the transformation, we will have two use cases - 1) change a CRS of a single coords array, likely to match the CRS or the other or to use 3857 for easy plotting (that is the exact use case where you may want to have multiple projections in one DataArray and use one coords for geometry ops and the other for plotting); 2) harmonise all projections among all geometry coords.

Having an API that supports both at the same time feels a bit unclean and can be confusing.

So I would use the API you suggested above for the first case and figure out another one for the second one. Resulting CRS api would then be:

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)

# transform one of more coords to a CRS set for each
Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)

# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.harmonize_crs(coords=None, crs=None...)

benbovy · 2022-12-02T10:24:58Z

Ah ok, I must have misunderstood something (so many great discussions happening in a lot a different places!) and I guess I was confused by the description (header) of GeoDataFrame.set_crs. You are re-changing my mind then 🙂.

I like harmonize_crs, it makes sense.

I'm confused by set_crs, though. It makes the three accessor CRS methods confusing when coming from the GeoPandas API. I'd either closely match GeoPandas method names or have a more distinct API.

I guess one of the following would confuse me less:

change set_crs to coords_set_crs(coord_crs=None, **coord_crs_kwargs)
remove set_crs in favor of harmonize_crs(coords=None, crs=None, reproject=False)
rename harmonize_crs to to_crs (back to my original suggestion)

martinfleis · 2022-12-02T10:30:29Z

Ah ok, I must have misunderstood something (so many great discussions happening in a lot a different places!) and I guess I was confused by the description (header) of GeoDataFrame.set_crs. You are re-changing my mind then 🙂.

Is see. It seems to be an artefact from the past when more geometry columns were not supported. That line should change to "Set the Coordinate Reference System (CRS) of the active geometry column". But the line below slightly specifies it:

If there are multiple geometry columns within the GeoDataFrame, only the CRS of the active geometry column is set.

I see the confusion on set_crs

change set_crs to coords_set_crs(coord_crs=None, **coord_crs_kwargs)

this seems to be a reasonable and consistent solution. Removing it in favour of harmonising will not work as for the transformation you need to know the original CRS and you need a way to set it if you don't have one.

keewis · 2022-12-02T10:34:40Z

As far as I understand it, geometry / geography variables don't have to be coordinates (though most often they will be, to be able to use indexes). As such, I wonder if it would make sense to change the api to something like this:

Dataset.xvec.assign_crs(variables=None, **variable_kwargs)
Dataset.xvec.to_crs(variables=None, **variable_kwargs)

This is what pint-xarray is using and has worked out great so far. Basically, you can either pass a mapping of variable names to crs values, or do the same as keyword arguments. And if we want to allow converting all data vars, all coords, or just every variable, we can add special keyword arguments for that (all_data_vars, all_coords, and all).

Edit: I now see that this is somewhat implied in the suggestions for coords_to_crs, but I guess I'm proposing this for both set_crs / assign_crs and to_crs. harmonize_crs might be a special case for reproject(all=crs)?

edzer · 2022-12-02T10:38:42Z

I am not sure how the single CRS per DataArray here would work and why we should do it.

I would try to avoid taking this complexity (which I believe R's sf and geopandas inherited from PostGIS) into vector data cubes. Are there other cases where for data cube dimension labels you have two sets, e.g. a time dimension that has expressions for two locales or two time zones?

There should be the ability to convert geometries associated with a dimension from one CRS to another, but I would try to avoid the complexity of catering for (and having to manage) multiple of them and have the user specify which one is active here. The geometry data associated with the dimension is typically small compared to the payload of the values in the data cube.

martinfleis · 2022-12-02T11:06:25Z

As far as I understand it, geometry / geography variables don't have to be coordinates (though most often they will be, to be able to use indexes).

theoretically, they can be also data but it is currently unclear how operations on N-D array of shapely geometries would work. Shapely currently assumes one dimension and we would likely need to flatten and then reshape the arrays to do spatial ops on top. And given we wouldn't have a GeometryIndex on top of data, we would need to find another way of storing the CRS.

That said, I think it is in scope of xvec so it make sense to think forward and set the API future-proof.

benbovy · 2022-12-02T11:48:02Z

@edzer I agree there's no need to specify an active geometry coordinate for vector data cubes, we can just treat those cubes as collections with none, one or more geometry variables and iterate over all or a given subset of them.

which I believe R's sf and geopandas inherited from PostGIS

Is it because there is usually only one (spatial) index in a Table or DataFrame? Xarray supports well co-existing indexes for different dimensions (and now even for different coordinates of the same dimension), so having an active geometry coordinate is not very relevant indeed.

Shapely currently assumes one dimension and we would likely need to flatten and then reshape the arrays to do spatial ops on top.

@martinfleis this is only for certain features like STRTree, right?

For all element-wise operations (predicates, measurement, etc.), there's no particular reason not to support them for any n-dimensional coordinate or data variable.

And given we wouldn't have a GeometryIndex on top of data, we would need to find another way of storing the CRS.

Two possible ways are (1) as an Xarray variable attribute and (2) by wrapping the variable data in a "duck" nd-array with a CRS attribute. We could define some general rules to extract the CRS from coordinates or data variables (https://github.com/martinfleis/xvec/issues/5#issuecomment-1316603760).

we can add special keyword arguments for that (all_data_vars, all_coords, and all).

I would rather avoid methods with too many arguments. If we want to update the CRS for all (data | coordinate) variables, perhaps harmonize_crs would be enough? I would even provide a harmonize_crs(crs=None) without a coords argument to make it simpler. I guess it is mainly useful for harmonizing CRS across the whole Dataset / DataArray.

martinfleis · 2022-12-02T12:12:24Z

this is only for certain features like STRTree, right?

Yeah, it is a small subset but not only STRtree. Functions like get_coordinates return flatten array though so we would need to do some reshaping when doing coordinate transformation on N-D arrays.

So, compiling all the suggestion made above, does this API look reasonable?

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# (re)assign the CRS one of more variables without re-projection (allow_override to match GeoPandas API)
# in case of coordinates this updates GeometryIndex
Dataset.xvec.set_crs(variables=None, allow_override=False, **variable_kwargs)

# transform one of more variables to a CRS set for each
# in case of coordinates this updates GeometryIndex
Dataset.xvec.to_crs(variables=None, **variable_kwargs)

I've removed harmonize_crs because that can be done as Dataset.xvec.to_crs({name: 4326 for name in Dataset.xvec.geom_coords) which is easy enough. We can potentially also add Dataset.xvec.geom_variables property to track also data, not only coords and use that.

benbovy · 2022-12-02T12:22:52Z

So, compiling all the suggestion made above, does this API look reasonable?

👍 (nit: I'd rather use crss=None, ..., **crs_kwargs or variable_crs=None, ..., **variable_crs_kwargs, like Dataset.pint.to).

martinfleis · 2022-12-03T18:44:40Z

This should now be ready for a review. Thanks all for the fruitful discussion!

benbovy

Looks great @martinfleis! I left a couple of comments.

benbovy · 2022-12-05T11:15:33Z

xvec/accessor.py

+    def __init__(self, xarray_obj: Union[xr.Dataset, xr.DataArray]):
+        """xvec init, nothing to be done here."""
+        self._obj = xarray_obj
+        self._geom_coords_names = [


Couldn't we return all geometry coordinates even those which don't have a GeometryIndex (yet)? This would be convenient, e.g., for setting a new GeometryIndex for each geometry coordinate.

I guess it could be possible to check for valid coordinates at a reasonable cost?

is_geom_variable(name, obj): if isinstance(obj.xindexes.get(name), GeometryIndex): return True elif obj[name].dtype is np.dtype("O"): # try first on a small subset subset = obj[name].data[0:10] if np.all(shapely.is_valid_input(subset)): return np.all(shapely.is_valid_input(obj[name].data)) return False

possibly, but I also find it useful to return just those backed by the GeometryIndex. I am using it below myself. What about having geom_coords returning all and geom_indexes returning the list (or a collection as you suggest below) of those backed by the index?

xvec/accessor.py

doc/source/api.rst

Co-authored-by: Benoit Bovy <[email protected]>

martinfleis · 2022-12-05T14:54:27Z

@benbovy I went ahead and implemented you suggestions.

renamed geom_coords_names to geom_coords and returning a collection instead of a list - showing all coords backed by shapely.Geometry
added geom_coords_indexed doing the same but filtering only for those coords backed by the GeometryIndex
added .xvec.is_geom_variable() doing the underlying check for those above because there's no reason to keep it private I guess and may be useful to others.

benbovy · 2022-12-06T12:39:25Z

LGTM! Thanks @martinfleis

martinfleis added 2 commits November 23, 2022 23:52

WIP: add xvec accessor and coords_to_crs

b774e63

fix the shortcut if crs is equal

394bc97

martinfleis mentioned this pull request Dec 1, 2022

assign_coords downgrades GeometryIndex to PandasIndex #13

Closed

adapt for standard syntax

88c8379

benbovy mentioned this pull request Dec 2, 2022

Accessor #10

Closed

martinfleis added 6 commits December 2, 2022 14:41

implement the agreed API

468fba8

use the fix from xarray 2022.12

6d6e143

test coverage

543e76a

Merge branch 'main' into accessor

2625e88

remove match due to change in py3.11

67a6535

add proj to CI from conda-forge

0bfaa45

martinfleis changed the title ~~WIP: add xvec accessor and coords_to_crs~~ WIP: add xvec accessor and to_crs, set_crs, geom_coords_names Dec 3, 2022

docs

823425f

martinfleis changed the title ~~WIP: add xvec accessor and to_crs, set_crs, geom_coords_names~~ ENH: add xvec accessor and to_crs, set_crs, geom_coords_names Dec 3, 2022

martinfleis requested a review from benbovy December 3, 2022 18:44

benbovy reviewed Dec 5, 2022

View reviewed changes

martinfleis and others added 2 commits December 5, 2022 14:23

Update doc/source/api.rst

c25839f

Co-authored-by: Benoit Bovy <[email protected]>

use is_geom_variable, geom_coords and geom_coords_indexed

b8ef560

martinfleis merged commit 94d1a8c into main Dec 6, 2022

martinfleis deleted the accessor branch December 6, 2022 12:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

martinfleis commented Nov 23, 2022 •

edited

Loading

codecov bot commented Nov 23, 2022 •

edited

Loading

benbovy commented Nov 24, 2022 •

edited

Loading

martinfleis commented Dec 1, 2022

benbovy commented Dec 2, 2022

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022

martinfleis commented Dec 2, 2022

keewis commented Dec 2, 2022 •

edited

Loading

edzer commented Dec 2, 2022

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022 •

edited

Loading

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022 •

edited

Loading

martinfleis commented Dec 3, 2022

benbovy left a comment

benbovy Dec 5, 2022

martinfleis Dec 5, 2022

martinfleis commented Dec 5, 2022

benbovy commented Dec 6, 2022

ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

Conversation

martinfleis commented Nov 23, 2022 • edited Loading

codecov bot commented Nov 23, 2022 • edited Loading

Codecov Report

benbovy commented Nov 24, 2022 • edited Loading

martinfleis commented Dec 1, 2022

benbovy commented Dec 2, 2022

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022

martinfleis commented Dec 2, 2022

keewis commented Dec 2, 2022 • edited Loading

edzer commented Dec 2, 2022

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022 • edited Loading

martinfleis commented Dec 2, 2022

benbovy commented Dec 2, 2022 • edited Loading

martinfleis commented Dec 3, 2022

benbovy left a comment

Choose a reason for hiding this comment

benbovy Dec 5, 2022

Choose a reason for hiding this comment

martinfleis Dec 5, 2022

Choose a reason for hiding this comment

martinfleis commented Dec 5, 2022

benbovy commented Dec 6, 2022

martinfleis commented Nov 23, 2022 •

edited

Loading

codecov bot commented Nov 23, 2022 •

edited

Loading

benbovy commented Nov 24, 2022 •

edited

Loading

keewis commented Dec 2, 2022 •

edited

Loading

benbovy commented Dec 2, 2022 •

edited

Loading

benbovy commented Dec 2, 2022 •

edited

Loading