
GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131

Merged
seisman merged 12 commits into main from dataset/empty_dataframe on Mar 29, 2024

Conversation

@seisman (Member) commented Mar 21, 2024

Sometimes, GMT may output nothing to a virtual file, for example, gmt select may find no data points that satisfy the specified criteria.

In such cases, GMT_DATASET.to_dataframe raises an exception, as shown below:

In [1]: import pandas as pd

In [2]: pd.concat(objs=[], axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 pd.concat(objs=[], axis=1)

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File ~/opt/miniconda/envs/pygmt/lib/python3.12/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate

Instead of raising exceptions, GMT_DATASET.to_dataframe should return a reasonable value. I think an empty DataFrame (i.e., pd.DataFrame()) makes more sense than None.

This PR updates the GMT_DATASET.to_dataframe() method to return an empty DataFrame in such cases and also adds two tests (one for benchmarking and one for checking the empty DataFrame).
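The fix boils down to guarding the pd.concat call when the list of column vectors is empty. A minimal standalone sketch of the pattern (the helper name vectors_to_dataframe is invented for illustration; it is not the actual PyGMT code):

```python
import pandas as pd


def vectors_to_dataframe(vectors: list[pd.Series]) -> pd.DataFrame:
    """Concatenate column vectors into a DataFrame.

    Returns an empty DataFrame instead of letting pd.concat raise
    ValueError("No objects to concatenate") when the list is empty.
    """
    if not vectors:
        return pd.DataFrame()
    return pd.concat(objs=vectors, axis=1)
```

With this guard, callers that produce no output rows get back an object that supports .empty checks instead of an exception.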

@seisman seisman added the enhancement Improving an existing feature label Mar 21, 2024
@seisman seisman added this to the 0.12.0 milestone Mar 21, 2024
@seisman seisman added the needs review This PR has higher priority and needs review. label Mar 21, 2024
@seisman seisman force-pushed the dataset/empty_dataframe branch 2 times, most recently from 9d4abf9 to 3f3c0c5 Compare March 21, 2024 05:41
@seisman seisman added the run/benchmark Trigger the benchmark workflow in PRs label Mar 21, 2024
@seisman seisman force-pushed the dataset/empty_dataframe branch from 3f3c0c5 to 175ba3c Compare March 21, 2024 05:46
return df


def dataframe_from_gmt(fname):
seisman (Member, Author):

For reference, GMT provides two special/undocumented modules, read and write (their sources are gmt/src/gmtread.c and gmt/src/gmtwrite.c), that can read a file into a GMT object (e.g., reading a tabular file as GMT_DATASET, or a grid as GMT_GRID). Currently, we frequently use the special read module in the doctests of the pygmt.clib.session module (similar to lines 46-50 below). We may want to make it public in the future, as already done in GMT.jl (https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtread-Tuple{String} and https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtwrite).

Co-authored-by: Yvonne Fröhlich <[email protected]>
@seisman seisman changed the title GMT_DATASET: Return an empty DataFrame if a file has no data GMT_DATASET: Return an empty DataFrame if a file contains no data Mar 21, 2024
@@ -211,5 +213,5 @@ def to_dataframe(self) -> pd.DataFrame:
pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
)

-        df = pd.concat(objs=vectors, axis=1)
+        df = pd.concat(objs=vectors, axis=1) if vectors else pd.DataFrame()
weiji14 (Member) commented Mar 22, 2024:
An empty pd.DataFrame() won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117 for example?

seisman (Member, Author) replied Mar 22, 2024:
A DataFrame with columns but no rows is still empty. So I guess it's fine.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: []

In [4]: df = pd.DataFrame(columns=None)

In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df = pd.DataFrame(columns=["col1", "col2"])

In [7]: df
Out[7]:
Empty DataFrame
Columns: [col1, col2]
Index: []

In [8]: df.empty
Out[8]: True

weiji14 (Member):
Column names are set here like so:

pygmt/pygmt/clib/session.py

Lines 1853 to 1861 in 1eb6dec

# Read the virtual file as a GMT dataset and convert to pandas.DataFrame
result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe()
if output_type == "numpy": # numpy.ndarray output
return result.to_numpy()
# Assign column names
if column_names is not None:
result.columns = column_names
return result # pandas.DataFrame output

So we would do something like:

import pandas as pd

df = pd.DataFrame(columns=None)
assert df.empty
df.columns = ["col1", "col2"]

But setting column names to ["col1", "col2"] errors with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.columns = ["col1", "col2"]

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:6310, in NDFrame.__setattr__(self, name, value)
   6308 try:
   6309     object.__getattribute__(self, name)
-> 6310     return object.__setattr__(self, name, value)
   6311 except AttributeError:
   6312     pass

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:813, in NDFrame._set_axis(self, axis, labels)
    808 """
    809 This is called from the cython code when we set the `index` attribute
    810 directly, e.g. `series.index = [1, 2, 3]`.
    811 """
    812 labels = ensure_index(labels)
--> 813 self._mgr.set_axis(axis, labels)
    814 self._clear_item_cache()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
    236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
    237     # Caller is responsible for ensuring we have an Index object.
--> 238     self._validate_set_axis(axis, new_labels)
    239     self.axes[axis] = new_labels

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
     95     pass
     97 elif new_len != old_len:
---> 98     raise ValueError(
     99         f"Length mismatch: Expected axis has {old_len} elements, new "
    100         f"values have {new_len} elements"
    101     )

ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements

seisman (Member, Author):
I think we need to refactor GMT_DATASET.to_dataframe() to accept more pandas-specific parameters (e.g., column_names, index_col). Then virtualfile_to_dataset would be called like:

result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe(columns=column_names)
if output_type == "numpy":  # numpy.ndarray output
    return result.to_numpy()

return result  # pandas.DataFrame output
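A standalone sketch of how a to_dataframe that accepts a column_names parameter might behave (the signature and behavior here are assumptions for illustration, not the actual PyGMT method):

```python
import pandas as pd


def to_dataframe(vectors, column_names=None):
    """Hypothetical sketch: build a DataFrame from column vectors,
    applying the column names inside the conversion itself."""
    if not vectors:
        # Keep the column labels even with zero rows, so downstream
        # code sees the expected columns on an empty result.
        return pd.DataFrame(columns=column_names)
    df = pd.concat(objs=vectors, axis=1)
    if column_names is not None:
        df.columns = column_names
    return df
```

Handling the rename inside the conversion avoids the length-mismatch ValueError shown above, since the empty case never reaches the columns assignment.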

weiji14 (Member):
Right, so moving these lines that handle the column names from virtualfile_to_dataset to to_dataframe:

pygmt/pygmt/clib/session.py

Lines 1861 to 1863 in 62eb5d6

# Assign column names
if column_names is not None:
result.columns = column_names

Do you want to just do that in #3117? Or have a separate PR to handle this?

seisman (Member, Author):
Better to do it in a separate PR so that #3117 can focus on parsing the column names from the header.

seisman (Member, Author):

> Do you want to just do that in #3117? Or have a separate PR to handle this?

Done in #3140, so this PR needs to be refactored after #3140 is merged.

@seisman seisman changed the title GMT_DATASET: Return an empty DataFrame if a file contains no data GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data Mar 27, 2024
@@ -226,8 +228,13 @@ def to_dataframe(
pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
)

+        # Return an empty DataFrame if no columns are found.
+        if len(vectors) == 0:
+            return pd.DataFrame()
seisman (Member, Author):
Currently, it returns an empty DataFrame without columns or rows, but an empty DataFrame with columns is also allowed, e.g.,

return pd.DataFrame(columns=column_names)

I guess either is fine. I think we can use return pd.DataFrame() for now and make changes if necessary.

weiji14 (Member) commented Mar 29, 2024:
I think we should return the column_names, so that users who want to, e.g., pd.concat multiple pd.DataFrame outputs from running an algorithm like pygmt.select in a for-loop can do so more straightforwardly. Note that we should also set the dtypes of the columns properly even if there are no rows; otherwise the dtypes will all become object:

df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
print(df1.dtypes)
# x    int64
# y    int64
# z    int64
# dtype: object
df2 = pd.DataFrame(columns=["x", "y", "z"])
print(df2.dtypes)
# x    object
# y    object
# z    object
# dtype: object

pd.concat(objs=[df1, df2]).dtypes
# x    object
# y    object
# z    object
# dtype: object

See my other suggestion at #3131 (comment) on not returning an empty pd.DataFrame() early, until the dtype is set with df.astype(dtype) below.
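One way to keep the dtypes on an empty result, along the lines suggested above, is to cast the empty frame explicitly with astype before any concat; a minimal sketch (column names and dtypes chosen for illustration):

```python
import numpy as np
import pandas as pd

# An empty DataFrame built from column names alone gets object dtype
# everywhere, which would poison the dtypes of a later pd.concat.
# Casting the empty frame explicitly preserves the numeric dtypes.
dtypes = {"x": np.float64, "y": np.float64, "z": np.float64}
empty = pd.DataFrame(columns=list(dtypes)).astype(dtypes)

df1 = pd.DataFrame(data=[[0.0, 1.0, 2.0]], columns=["x", "y", "z"])
combined = pd.concat(objs=[df1, empty])
print(combined.dtypes)  # x, y, z all stay float64, not object
```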

@seisman seisman removed the run/benchmark Trigger the benchmark workflow in PRs label Mar 27, 2024

@seisman seisman force-pushed the dataset/empty_dataframe branch from d5283cd to dbfc2ae Compare March 29, 2024 03:30
weiji14 (Member) left a comment:
Should probably find a way to test the case where an empty pd.DataFrame is returned with column names and set dtypes, but it looks a bit tricky 🙂

@weiji14 weiji14 added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Mar 29, 2024
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Mar 29, 2024
@seisman seisman merged commit 85d4ed2 into main Mar 29, 2024
19 checks passed
@seisman seisman deleted the dataset/empty_dataframe branch March 29, 2024 05:52