Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11

Merged
merged 12 commits into from
Dec 6, 2022

Conversation

martinfleis
Copy link
Member

@martinfleis martinfleis commented Nov 23, 2022

Closes #10 and resolves #5.

This is very much WIP. Implementing .xvec accessor and a coords_to_crs method, that practically copies the code from GeoPandas. We may consider going through GeometryArray.to_crs instead but I'd wait until shapely 2.0 is the only supported engine. Right now, it may cause a bit of trouble and unnecessary conversions.

It currently supports only a single coords. While drop_indexes support dropping multiple indexes at the same time, set_xindex seems to support only a single one, so we will need a workaround. I think that it may be useful to use a single coords_to_crs call to consolidate projections of all geometries instead of going one by one.

To-do:

  • support multiple coords in coords_to_crs
  • check that the index is GeometryIndex
  • tests
  • documentation

@codecov
Copy link

codecov bot commented Nov 23, 2022

Codecov Report

Base: 97.72% // Head: 98.53% // Increases project coverage by +0.80% 🎉

Coverage data is based on head (b8ef560) compared to base (5cc1a26).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   97.72%   98.53%   +0.80%     
==========================================
  Files           1        2       +1     
  Lines         132      205      +73     
==========================================
+ Hits          129      202      +73     
  Misses          3        3              
Impacted Files Coverage Δ
xvec/accessor.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@benbovy
Copy link
Member

benbovy commented Nov 24, 2022

Great!

A signature very familiar to Xarray users is this:

Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)

# e.g.
ds.xvec.coords_to_crs(geom1=crs, geom2=other_crs)
# or
ds.xvec.coords_to_crs({"geom1": crs, "geom2": other_crs})

But I also agree that setting a common CRS for all geometry coordinates may be convenient and is consistent with geopandas. What about a to_crs() method that automatically transforms all geometry coordinates found in a Dataset / DataArray, just like in geopandas? This might co-exist with the coords_to_crs() signature suggested above.

set_xindex seems to support only a single one, so we will need a workaround.

I think it is fine to loop through the coordinates and chain the set_xindex calls, it is reasonably cheap to build new DataArray / Dataset objects. Otherwise, we usually call Xarray's internal (fastpath) constructors with new or copied variables, indexes, coordinate names and metadata.

@martinfleis
Copy link
Member Author

I am not sure about the coords_to_crs and to_crs living next to each other. I can see it quite confusing for newcomers. I'll use the syntax you suggest above and we can later think about many-to-one option. I don't expect many use cases with a lot of coordinates in different projections, so it may not be a priority.

@benbovy
Copy link
Member

benbovy commented Dec 2, 2022

I don't expect many use cases with a lot of coordinates in different projections, so it may not be a priority.

Since I wrote https://github.com/martinfleis/xvec/pull/11#issuecomment-1326097578 I read from discussions in GeoPandas and other places that a single CRS is used for the whole Dataframe / object. Rioxarray / odc-geo also consider one global CRS per DataArray / Dataset.

So I agree and changed my mind regarding https://github.com/martinfleis/xvec/pull/7#discussion_r1029284508, we might consider a single CRS for the whole Dataset / DataArray too.

How to enforce that, though? Unlike GeoPandas, there's no concept of an active geometry coordinate in xvec and it probably shouldn't be one? I'm uncertain on whether we want to add a coords parameter to pretty much all methods of the xvec accessor or simply detect all geometry coordinates and iterating through it. We could also add a coords: Hashable | Interable[Hashable] | None = None parameter: apply the operation on all geometry coordinates by default or only on the explicitly provided coordinates.. I'd rather lean on the latter, with something like this for the accessor API:

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# read (write?) property returning a single CRS
# or raising an error if conflicting CRS are found
Dataset.xvec.crs

# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)

# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.to_crs(coords=None, crs=None...)

@benbovy benbovy mentioned this pull request Dec 2, 2022
@martinfleis
Copy link
Member Author

I tend to disagree here. GeoPandas does not have a concept of a single CRS for the whole DataFrame. You can easily have many GeoSeries in the gdf each with its own CRS and there are use cases for that I've used myself. The feeling that there is one per gdf si coming from the active geometry concept only. Treating each coords as a single GeoSeries, hence allowing it to have a different CRS makes sense to me. I am not sure how the single CRS per DataArray here would work and why we should do it.

I think that with the transformation, we will have two use cases - 1) change a CRS of a single coords array, likely to match the CRS or the other or to use 3857 for easy plotting (that is the exact use case where you may want to have multiple projections in one DataArray and use one coords for geometry ops and the other for plotting); 2) harmonise all projections among all geometry coords.

Having an API that supports both at the same time feels a bit unclean and can be confusing.

So I would use the API you suggested above for the first case and figure out another one for the second one. Resulting CRS api would then be:

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)

# transform one of more coords to a CRS set for each
Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)

# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.harmonize_crs(coords=None, crs=None...)

@benbovy
Copy link
Member

benbovy commented Dec 2, 2022

Ah ok, I must have misunderstood something (so many great discussions happening in a lot a different places!) and I guess I was confused by the description (header) of GeoDataFrame.set_crs. You are re-changing my mind then 🙂.

I like harmonize_crs, it makes sense.

I'm confused by set_crs, though. It makes the three accessor CRS methods confusing when coming from the GeoPandas API. I'd either closely match GeoPandas method names or have a more distinct API.

I guess one of the following would confuse me less:

  • change set_crs to coords_set_crs(coord_crs=None, **coord_crs_kwargs)
  • remove set_crs in favor of harmonize_crs(coords=None, crs=None, reproject=False)
  • rename harmonize_crs to to_crs (back to my original suggestion)

@martinfleis
Copy link
Member Author

Ah ok, I must have misunderstood something (so many great discussions happening in a lot a different places!) and I guess I was confused by the description (header) of GeoDataFrame.set_crs. You are re-changing my mind then 🙂.

Is see. It seems to be an artefact from the past when more geometry columns were not supported. That line should change to "Set the Coordinate Reference System (CRS) of the active geometry column". But the line below slightly specifies it:

If there are multiple geometry columns within the GeoDataFrame, only the CRS of the active geometry column is set.

I see the confusion on set_crs

change set_crs to coords_set_crs(coord_crs=None, **coord_crs_kwargs)

this seems to be a reasonable and consistent solution. Removing it in favour of harmonising will not work as for the transformation you need to know the original CRS and you need a way to set it if you don't have one.

@keewis
Copy link

keewis commented Dec 2, 2022

As far as I understand it, geometry / geography variables don't have to be coordinates (though most often they will be, to be able to use indexes). As such, I wonder if it would make sense to change the api to something like this:

Dataset.xvec.assign_crs(variables=None, **variable_kwargs)
Dataset.xvec.to_crs(variables=None, **variable_kwargs)

This is what pint-xarray is using and has worked out great so far. Basically, you can either pass a mapping of variable names to crs values, or do the same as keyword arguments. And if we want to allow converting all data vars, all coords, or just every variable, we can add special keyword arguments for that (all_data_vars, all_coords, and all).

Edit: I now see that this is somewhat implied in the suggestions for coords_to_crs, but I guess I'm proposing this for both set_crs / assign_crs and to_crs. harmonize_crs might be a special case for reproject(all=crs)?

@edzer
Copy link

edzer commented Dec 2, 2022

I am not sure how the single CRS per DataArray here would work and why we should do it.

I would try to avoid taking this complexity (which I believe R's sf and geopandas inherited from PostGIS) into vector data cubes. Are there other cases where for data cube dimension labels you have two sets, e.g. a time dimension that has expressions for two locales or two time zones?

There should be the ability to convert geometries associated with a dimension from one CRS to another, but I would try to avoid the complexity of catering for (and having to manage) multiple of them and have the user specify which one is active here. The geometry data associated with the dimension is typically small compared to the payload of the values in the data cube.

@martinfleis
Copy link
Member Author

As far as I understand it, geometry / geography variables don't have to be coordinates (though most often they will be, to be able to use indexes).

theoretically, they can be also data but it is currently unclear how operations on N-D array of shapely geometries would work. Shapely currently assumes one dimension and we would likely need to flatten and then reshape the arrays to do spatial ops on top. And given we wouldn't have a GeometryIndex on top of data, we would need to find another way of storing the CRS.

That said, I think it is in scope of xvec so it make sense to think forward and set the API future-proof.

@benbovy
Copy link
Member

benbovy commented Dec 2, 2022

@edzer I agree there's no need to specify an active geometry coordinate for vector data cubes, we can just treat those cubes as collections with none, one or more geometry variables and iterate over all or a given subset of them.

which I believe R's sf and geopandas inherited from PostGIS

Is it because there is usually only one (spatial) index in a Table or DataFrame? Xarray supports well co-existing indexes for different dimensions (and now even for different coordinates of the same dimension), so having an active geometry coordinate is not very relevant indeed.

Shapely currently assumes one dimension and we would likely need to flatten and then reshape the arrays to do spatial ops on top.

@martinfleis this is only for certain features like STRTree, right?

For all element-wise operations (predicates, measurement, etc.), there's no particular reason not to support them for any n-dimensional coordinate or data variable.

And given we wouldn't have a GeometryIndex on top of data, we would need to find another way of storing the CRS.

Two possible ways are (1) as an Xarray variable attribute and (2) by wrapping the variable data in a "duck" nd-array with a CRS attribute. We could define some general rules to extract the CRS from coordinates or data variables (https://github.com/martinfleis/xvec/issues/5#issuecomment-1316603760).

we can add special keyword arguments for that (all_data_vars, all_coords, and all).

I would rather avoid methods with too many arguments. If we want to update the CRS for all (data | coordinate) variables, perhaps harmonize_crs would be enough? I would even provide a harmonize_crs(crs=None) without a coords argument to make it simpler. I guess it is mainly useful for harmonizing CRS across the whole Dataset / DataArray.

@martinfleis
Copy link
Member Author

this is only for certain features like STRTree, right?

Yeah, it is a small subset but not only STRtree. Functions like get_coordinates return flatten array though so we would need to do some reshaping when doing coordinate transformation on N-D arrays.

So, compiling all the suggestion made above, does this API look reasonable?

# internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names

# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords

# (re)assign the CRS one of more variables without re-projection (allow_override to match GeoPandas API)
# in case of coordinates this updates GeometryIndex
Dataset.xvec.set_crs(variables=None, allow_override=False, **variable_kwargs)

# transform one of more variables to a CRS set for each
# in case of coordinates this updates GeometryIndex
Dataset.xvec.to_crs(variables=None, **variable_kwargs)

I've removed harmonize_crs because that can be done as Dataset.xvec.to_crs({name: 4326 for name in Dataset.xvec.geom_coords) which is easy enough. We can potentially also add Dataset.xvec.geom_variables property to track also data, not only coords and use that.

@benbovy
Copy link
Member

benbovy commented Dec 2, 2022

So, compiling all the suggestion made above, does this API look reasonable?

👍 (nit: I'd rather use crss=None, ..., **crs_kwargs or variable_crs=None, ..., **variable_crs_kwargs, like Dataset.pint.to).

@martinfleis martinfleis changed the title WIP: add xvec accessor and coords_to_crs WIP: add xvec accessor and to_crs, set_crs, geom_coords_names Dec 3, 2022
@martinfleis martinfleis changed the title WIP: add xvec accessor and to_crs, set_crs, geom_coords_names ENH: add xvec accessor and to_crs, set_crs, geom_coords_names Dec 3, 2022
@martinfleis martinfleis requested a review from benbovy December 3, 2022 18:44
@martinfleis
Copy link
Member Author

This should now be ready for a review. Thanks all for the fruitful discussion!

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @martinfleis! I left a couple of comments.

xvec/accessor.py Outdated
def __init__(self, xarray_obj: Union[xr.Dataset, xr.DataArray]):
"""xvec init, nothing to be done here."""
self._obj = xarray_obj
self._geom_coords_names = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we return all geometry coordinates even those which don't have a GeometryIndex (yet)? This would be convenient, e.g., for setting a new GeometryIndex for each geometry coordinate.

I guess it could be possible to check for valid coordinates at a reasonable cost?

is_geom_variable(name, obj):
    if isinstance(obj.xindexes.get(name), GeometryIndex):
        return True
    elif obj[name].dtype is np.dtype("O"):
        # try first on a small subset
        subset = obj[name].data[0:10]
        if np.all(shapely.is_valid_input(subset)):
            return np.all(shapely.is_valid_input(obj[name].data))
    return False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly, but I also find it useful to return just those backed by the GeometryIndex. I am using it below myself. What about having geom_coords returning all and geom_indexes returning the list (or a collection as you suggest below) of those backed by the index?

xvec/accessor.py Outdated Show resolved Hide resolved
doc/source/api.rst Outdated Show resolved Hide resolved
@martinfleis
Copy link
Member Author

@benbovy I went ahead and implemented you suggestions.

  • renamed geom_coords_names to geom_coords and returning a collection instead of a list - showing all coords backed by shapely.Geometry
  • added geom_coords_indexed doing the same but filtering only for those coords backed by the GeometryIndex
  • added .xvec.is_geom_variable() doing the underlying check for those above because there's no reason to keep it private I guess and may be useful to others.

@benbovy
Copy link
Member

benbovy commented Dec 6, 2022

LGTM! Thanks @martinfleis

@martinfleis martinfleis merged commit 94d1a8c into main Dec 6, 2022
@martinfleis martinfleis deleted the accessor branch December 6, 2022 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accessor Support re-projecting of coordinates
4 participants