Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geopandas: Mapping int/int64 to int32 for OGR_GMT format #2592

Merged
merged 13 commits into from
Aug 21, 2023
13 changes: 11 additions & 2 deletions pygmt/helpers/tempfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -126,18 +126,27 @@ def tempfile_from_geojson(geojson):
E.g. '1a2b3c4d5e6.gmt'.
"""
with GMTTempFile(suffix=".gmt") as tmpfile:
# pylint: disable=import-outside-toplevel
import geopandas as gpd

os.remove(tmpfile.name) # ensure file is deleted first
ogrgmt_kwargs = {"filename": tmpfile.name, "driver": "OGR_GMT", "mode": "w"}
try:
# Map int/int64 to int32 since OGR_GMT only supports 32-bit integer
# https://github.com/geopandas/geopandas/issues/967#issuecomment-842877704
# https://github.com/GenericMappingTools/pygmt/issues/2497
schema = gpd.io.file.infer_schema(geojson)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looking into this infer_schema function and geopandas/geopandas#967 (comment) a bit more, I almost think part of the fix should go upstream into modifying infer_schema to output int32 values when the column is of int32 dtype, instead of forcing things to int/int64.

Of course, we'll still need to force the OGR_GMT output to int32 in PyGMT, but the code would be much simplified with some upstream fixes.

Copy link
Member Author

@seisman seisman Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looking into this infer_schema function and geopandas/geopandas#967 (comment) a bit more, I almost think part of the fix should go upstream into modifying infer_schema to output int32 values when the column is of int32 dtype, instead of forcing things to int/int64.

There is already an open PR geopandas/geopandas#2265, but it has been inactive for one year.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already an open PR geopandas/geopandas#2265, but it has been inactive for one year.

Ok, and I saw geopandas/geopandas#2464 as well, also inactive 😆

Started a small PR at geopandas/geopandas#2950 to just handle int32, let's see if their test suite passes, I took a look at https://fiona.readthedocs.io/en/latest/manual.html#keeping-schemas-simple and fiona.FIELD_TYPES_MAP seem to have the following mapping:

{'int32': int,
 'float': float,
 'str': str,
 'date': fiona.rfc3339.FionaDateType,
 'time': fiona.rfc3339.FionaTimeType,
 'datetime': fiona.rfc3339.FionaDateTimeType,
 'bytes': bytes,
 'int64': int,
 'int': int,
 'List[str]': typing.List[str]}

So not sure if the geopandas default driver (ESRI Shapefile?) supports int32 schema. It it doesn't and the test suite breaks, then we might need to handle everything fully in PyGMT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still unclear if there will be any changes in GDAL, GeoPandas, Fiona or GMT, but as we are still using old versions of these packages, I believe we still need these workarounds in the next few years, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the patch in geopandas has been merged at geopandas/geopandas#2950, but will need to wait for geopandas v0.13.3 or newer to be released. Even so, int64 will still need to be converted to int32, so yes, we'll need to keep this workaround for a while.

for col, dtype in schema["properties"].items():
if dtype in ("int", "int64"):
schema["properties"][col] = "int32"
ogrgmt_kwargs["schema"] = schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only write schema if schema["properties"] is not None. Should fix that error about ValueError: Record does not match collection schema: KeysView(<fiona.model.Properties object at 0x7f3265f03a10>) != []

Suggested change
ogrgmt_kwargs["schema"] = schema
if schema["properties"]:
ogrgmt_kwargs["schema"] = schema

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not the root cause of the error.

Here is a modified version of the test_geopandas_info_geodataframe test to understand why it fails:

import geopandas as gpd
import shapely

linestring = shapely.geometry.LineString([(20, 15), (30, 15)])
polygon = shapely.geometry.Polygon([(20, 10), (23, 10), (23, 14), (20, 14)])
multipolygon = shapely.geometry.shape(
    {
        "type": "MultiPolygon",
        "coordinates": [
            [
                [[0, 0], [20, 0], [10, 20], [0, 0]],  # Counter-clockwise
                [[3, 2], [10, 16], [17, 2], [3, 2]],  # Clockwise
            ],
            [[[6, 4], [14, 4], [10, 12], [6, 4]]],  # Counter-clockwise
            [[[25, 5], [30, 10], [35, 5], [25, 5]]],
        ],
    }
)
# Multipolygon first so the OGR_GMT file has @GMULTIPOLYGON in the header
gdf = gpd.GeoDataFrame(
    index=["multipolygon", "polygon", "linestring"],
    geometry=[multipolygon, polygon, linestring],
)

schema = gpd.io.file.infer_schema(gdf)
print(gdf)
print(schema)

gdf.to_file("test.gmt", driver="OGR_GMT", mode="w")

The output of the two print statements is:

                                                       geometry
multipolygon  MULTIPOLYGON (((0.00000 0.00000, 20.00000 0.00...
polygon       POLYGON ((20.00000 10.00000, 23.00000 10.00000...
linestring    LINESTRING (20.00000 15.00000, 30.00000 15.00000)
{'geometry': ['MultiPolygon', 'Polygon', 'LineString'], 'properties': OrderedDict()}

To understand what geopandas/fiona does, I also added the same two print statements to source code geopandas/io/file.py at line https://github.com/geopandas/geopandas/blob/b4b10313ab57bf2c55592a28fb99687c9a538fc2/geopandas/io/file.py#L591.

The two print statements in the geopandas source code give:

          index                                           geometry
0  multipolygon  MULTIPOLYGON (((0.00000 0.00000, 20.00000 0.00...
1       polygon  POLYGON ((20.00000 10.00000, 23.00000 10.00000...
2    linestring  LINESTRING (20.00000 15.00000, 30.00000 15.00000)
{'geometry': ['MultiPolygon', 'Polygon', 'LineString'], 'properties': OrderedDict([('index', 'str')])}

As you can see, both the gdf object and the schema are different. This is because geopandas by defaults modifies the index column (see lines https://github.com/geopandas/geopandas/blob/b4b10313ab57bf2c55592a28fb99687c9a538fc2/geopandas/io/file.py#L551-L556).

In our case, our inferred schema has a empty property dict OrderedDict(), but geopandas/fiona expects a property dict like OrderedDict([('index', 'str')]).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html#geopandas.GeoDataFrame.to_file

index: bool, default None

If True, write index into one or more columns (for MultiIndex). Default None writes the index into one or more columns only if the index is named, is a MultiIndex, or has a non-integer data type. If False, no index is written.

I think writing index into one column or not has no effects to the output of OGR_GMT driver, so maybe we should just set index=False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think writing index into one column or not has no effects to the output of OGR_GMT driver, so maybe we should just set index=False.

Setting index=True (the default) will produce this in the output *.gmt file:

# @VGMT1.0
# @R0/35/0/20                                                             
# @GMULTIPOLYGON
# @Nindex
# @Tstring
# FEATURE_DATA
>
# @Dmultipolygon
# @P
0 0
20 0
10 20
0 0
>
# @H
3 2
10 16
17 2
3 2
>
# @P
6 4
14 4
10 12
6 4
>
# @P
25 5
30 10
35 5
25 5
>
# @Dpolygon
# @P
20 10
23 10
23 14
20 14
20 10
>
# @Dlinestring
20 15
30 15

Notice the lines with the comments # @Nindex and # @Dmultipolygon which are from the index. So the 'index' is output as just a regular column in the OGR_GMT file, and could potentially be used by someone doing aspatial="Z=index". We probably do need to keep it just in case.

Is there another way to handle this? E.g. cast the 'index' column explicitly to a string type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another way to handle this? E.g. cast the 'index' column explicitly to a string type?

This won't always work, because the index property in a GeoDataFrame object can be integers. For example, in the above example, the object can be defined like

gdf = gpd.GeoDataFrame(
    #index=["multipolygon", "polygon", "linestring"],
    index=[0, 1, 2],
    geometry=[multipolygon, polygon, linestring],
)

Copy link
Member

@weiji14 weiji14 Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    if index:
        df = df.reset_index(drop=False)

Probably just need these two lines, call reset_index if an index exists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html#geopandas.GeoDataFrame.to_file

index has a default value of None. So df.reset_index is only called if the GeoDataFrame object's index is named or is non-integer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I thought index was referring to the df.index attribute. We could do something like this to force an index name:

if gdf.index.name is None:
    gdf.index.name = "index"
    # print(gdf.index.names) 
    # FrozenList(['index'])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good.

Copy link
Member Author

@seisman seisman Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I thought index was referring to the df.index attribute. We could do something like this to force an index name:

if gdf.index.name is None:
    gdf.index.name = "index"
    # print(gdf.index.names) 
    # FrozenList(['index'])

Patch applied in 2097add.

# Using geopandas.to_file to directly export to OGR_GMT format
geojson.to_file(**ogrgmt_kwargs)
except AttributeError:
# pylint: disable=import-outside-toplevel
# Other 'geo' formats which implement __geo_interface__
import json

import fiona
import geopandas as gpd

with fiona.Env():
jsontext = json.dumps(geojson.__geo_interface__)
Expand Down
4 changes: 4 additions & 0 deletions pygmt/tests/baseline/test_geopandas_plot_int_dtypes.png.dvc
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
outs:
- md5: c1c6eda2a88adf4d96f18f4e0c5db4d5
size: 43582
path: test_geopandas_plot_int_dtypes.png
54 changes: 53 additions & 1 deletion pygmt/tests/test_geopandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,9 @@
Test integration with geopandas.
"""
import numpy.testing as npt
import pandas as pd
import pytest
from pygmt import Figure, info
from pygmt import Figure, info, makecpt, which

gpd = pytest.importorskip("geopandas")
shapely = pytest.importorskip("shapely")
Expand Down Expand Up @@ -131,3 +132,54 @@ def test_geopandas_plot3d_non_default_circle():
style="c0.2c",
)
return fig


@pytest.mark.parametrize(
"dtype",
[
"int32",
"int64",
# Enable Int32/Int64 dtypes when geopandas>=0.13.3 is released with
# patch https://github.com/geopandas/geopandas/pull/2950
# pd.Int32Dtype(),
# pd.Int64Dtype(),
Comment on lines +141 to +144
Copy link
Member

@weiji14 weiji14 Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E TypeError: Cannot interpret 'Int32Dtype()' as a data type

Disabling testing of these nullable dtypes for now, until geopandas>=0.13.3 is out with the patch from geopandas/geopandas#2950.

],
)
@pytest.mark.mpl_image_compare(filename="test_geopandas_plot_int_dtypes.png")
def test_geopandas_plot_int_dtypes(dtype):
"""
Check that plotting a geopandas GeoDataFrame with integer columns works,
including int32 and int64 (non-nullable), Int32 and Int64 (nullable).

This is a regression test for
https://github.com/GenericMappingTools/pygmt/issues/2497
"""
# Read shapefile in geopandas.GeoDataFrame
shapefile = which(
fname="@RidgeTest.shp @RidgeTest.shx @RidgeTest.dbf @RidgeTest.prj",
download="c",
)
gdf = gpd.read_file(shapefile[0])

# Reproject geometry and change dtype of NPOINTS column
gdf["geometry"] = (
gdf.to_crs(crs="EPSG:3857")
.buffer(distance=100000)
.to_crs(crs="OGC:CRS84") # convert to lon/lat to prevent @null in PROJ CRS
)
gdf["NPOINTS"] = gdf.NPOINTS.astype(dtype=dtype)

# Plot figure with three polygons colored based on NPOINTS value
fig = Figure()
makecpt(cmap="lisbon", series=[10, 60, 10], continuous=True)
fig.plot(
data=gdf,
frame=True,
pen="1p,black",
close=True,
fill="+z",
cmap=True,
aspatial="Z=NPOINTS",
)
fig.colorbar()
return fig