-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131
Conversation
9d4abf9
to
3f3c0c5
Compare
3f3c0c5
to
175ba3c
Compare
return df | ||
|
||
|
||
def dataframe_from_gmt(fname): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, GMT provides two special/undocumented modules read
and write
(their source codes are gmt/src/gmtread.c
/gmt/src/gmtwrite.c
) that can read a file into a GMT object (e.g, reading a tabular file as GMT_DATASET, or reading a grid as GMT_GRID). Currently, we're frequently using the special read
module in the doctest of the pygmt.clib.session
module (similar to lines 46-50 below). We may want to make it public in the future as already done in GMT.jl (https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtread-Tuple{String} and https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtwrite).
Co-authored-by: Yvonne Fröhlich <[email protected]>
pygmt/datatypes/dataset.py
Outdated
@@ -211,5 +213,5 @@ def to_dataframe(self) -> pd.DataFrame: | |||
pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype()) | |||
) | |||
|
|||
df = pd.concat(objs=vectors, axis=1) | |||
df = pd.concat(objs=vectors, axis=1) if vectors else pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty pd.DataFrame()
won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117 for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A DataFrame with columns but no rows is still empty. So I guess it's fine.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: []
In [4]: df = pd.DataFrame(columns=None)
In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: []
In [6]: df = pd.DataFrame(columns=["col1", "col2"])
In [7]: df
Out[7]:
Empty DataFrame
Columns: [col1, col2]
Index: []
In [8]: df.empty
Out[8]: True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column names are set here like so:
Lines 1853 to 1861 in 1eb6dec
# Read the virtual file as a GMT dataset and convert to pandas.DataFrame | |
result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe() | |
if output_type == "numpy": # numpy.ndarray output | |
return result.to_numpy() | |
# Assign column names | |
if column_names is not None: | |
result.columns = column_names | |
return result # pandas.DataFrame output |
So we would do something like:
import pandas as pd
df = pd.DataFrame(columns=None)
assert df.empty
df.columns = ["col1", "col2"]
But setting column names to ["col1", "col2"]
errors with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.columns = ["col1", "col2"]
File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:6310, in NDFrame.__setattr__(self, name, value)
6308 try:
6309 object.__getattribute__(self, name)
-> 6310 return object.__setattr__(self, name, value)
6311 except AttributeError:
6312 pass
File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()
File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:813, in NDFrame._set_axis(self, axis, labels)
808 """
809 This is called from the cython code when we set the `index` attribute
810 directly, e.g. `series.index = [1, 2, 3]`.
811 """
812 labels = ensure_index(labels)
--> 813 self._mgr.set_axis(axis, labels)
814 self._clear_item_cache()
File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
237 # Caller is responsible for ensuring we have an Index object.
--> 238 self._validate_set_axis(axis, new_labels)
239 self.axes[axis] = new_labels
File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
95 pass
97 elif new_len != old_len:
---> 98 raise ValueError(
99 f"Length mismatch: Expected axis has {old_len} elements, new "
100 f"values have {new_len} elements"
101 )
ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to refactor GMT_DATASET.to_dataframe()
to accept more panda-specific parameters (e.g., column_names
, index_col
). The the virtualfile_to_dataset
will be called like:
result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe(columns=column_names)
if output_type == "numpy": # numpy.ndarray output
return result.to_numpy()
return result # pandas.DataFrame output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to do it in a separate PR so that #3117 can focus on parsing the column names from header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so moving these lines that handle the column names from
virtualfile_to_dataset
toto_dataframe
:Lines 1861 to 1863 in 62eb5d6
# Assign column names if column_names is not None: result.columns = column_names Do you want to just do that in #3117? Or have a separate PR to handle this?
Done in #3140, so need to refactor this PR after #3140 is merged.
pygmt/datatypes/dataset.py
Outdated
@@ -226,8 +228,13 @@ def to_dataframe( | |||
pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype()) | |||
) | |||
|
|||
# Return an empty DataFrame if no columns are found. | |||
if len(vectors) == 0: | |||
return pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it returns an empty DataFrame without columns and rows, but an empty DataFrame with columns is also allowed, e.g.,
return pd.DataFrame(column=column_names)
I guess either is fine. I think we can use return pd.DataFrame()
now and make changes if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return the column_names, so that users who want to e.g. do pd.concat
on multiple pd.DataFrame
outputs from running an algorithm like pygmt.select
in a for-loop can do so in a more straightforward way. Note that we should also set the dtypes of the columns properly, even if the rows are empty, otherwise the dtypes will all become object
:
df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
print(df1.dtypes)
# x int64
# y int64
# z int64
# dtype: object
df2 = pd.DataFrame(columns=["x", "y", "z"])
print(df2.dtypes)
# x object
# y object
# z object
# dtype: object
pd.concat(objs=[df1, df2]).dtypes
# x object
# y object
# z object
# dtype: object
See my other suggestion at #3131 (comment) on not returning an empty pd.DataFrame()
early, until the dtype is set with df.astype(dtype)
below.
pygmt/datatypes/dataset.py
Outdated
@@ -226,8 +228,13 @@ def to_dataframe( | |||
pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype()) | |||
) | |||
|
|||
# Return an empty DataFrame if no columns are found. | |||
if len(vectors) == 0: | |||
return pd.DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should return the column_names, so that users who want to e.g. do pd.concat
on multiple pd.DataFrame
outputs from running an algorithm like pygmt.select
in a for-loop can do so in a more straightforward way. Note that we should also set the dtypes of the columns properly, even if the rows are empty, otherwise the dtypes will all become object
:
df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
print(df1.dtypes)
# x int64
# y int64
# z int64
# dtype: object
df2 = pd.DataFrame(columns=["x", "y", "z"])
print(df2.dtypes)
# x object
# y object
# z object
# dtype: object
pd.concat(objs=[df1, df2]).dtypes
# x object
# y object
# z object
# dtype: object
See my other suggestion at #3131 (comment) on not returning an empty pd.DataFrame()
early, until the dtype is set with df.astype(dtype)
below.
Co-authored-by: Wei Ji <[email protected]>
d5283cd
to
dbfc2ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably find a way to test the case where an empty pd.DataFrame
is returned with column names and set dtypes, but it looks a bit tricky 🙂
Co-authored-by: Wei Ji <[email protected]>
Sometimes, GMT may output nothing to a virtual file, for example,
gmt select
may find no data points that satisfy the specified criteria.In such cases,
GMT_DATASET.to_dataframe
raises an exception. As shown below:Instead of raising exceptions,
GMT_DATASET.to_dataframe
should return a reasonable value. I think an empty DataFrame (i.e.,pd.DataFrame()
) makes more sense thanNone
.This PR updates the
GMT_DATASET.to_dataframe()
method to return an empty DataFrame in such cases and also add two tests (one for benchmark and one for testing the empty DataFrame).