Consistent table-like output for PyGMT functions/methods #1318
Comments
Just wanted to say, this is a similar issue to that in the RAPIDS AI
As you can see, this involves more levels of I/O complexity going from phase 1 to phase 2. I would advocate to move to phase '1' for PyGMT v0.5.0 or v0.6.0, and keep PyGMT's default output type for table-like outputs as That said, I'm open to other suggestions, if someone has a good idea on how to deal with I/O in PyGMT. |
Thanks for the helpful reference, @weiji14! I like the mirroring behavior and agree that it would be a good milestone for v0.5.0+. In my opinion, a parameter to control the output type seems a bit simpler for users than a context manager (examples below). The options for Context manager example:
Parameter example:
|
You're right that controlling the output type via a parameter (similar to what @willschlitzer did in bf58dc2) would be simpler than a context manager. I don't have a preference for either style, but after doing a bit of digging, it seems like Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it |
Yes, I agree that a helper function is a good idea for the implementation of an output format choice. |
I think the helper function would be our best bet (although I'm not too strong with writing decorators). Borrowing/stealing entirely from my commits mentioned above, I think something like this could be helpful:

    import numpy as np
    import pandas as pd
    from pygmt.exceptions import GMTInvalidInput

    def return_table(result, data_format, format_parameter):
        # Reject unknown format codes before doing any parsing.
        if data_format not in ("a", "d", "s"):
            raise GMTInvalidInput(
                f"Must specify {format_parameter} as either a, d, or s."
            )
        # "s": return the raw string output unchanged.
        if data_format == "s":
            return result
        # Parse each whitespace-delimited line into a row of floats.
        data_list = []
        for string_entry in result.strip().split("\n"):
            data_list.append([float(i) for i in string_entry.strip().split()])
        data_array = np.array(data_list)
        # "a": NumPy array; "d": pandas.DataFrame.
        if data_format == "a":
            return data_array
        return pd.DataFrame(data_array)

I know it's a little redundant accepting the |
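For comparison, the decorator form mentioned earlier in the thread might look roughly like the sketch below. It reuses the a/d/s codes from the helper above; the decorator name `with_table_format`, the `data_format` keyword, and the stand-in function are all assumptions made for illustration, and `ValueError` stands in for PyGMT's own exception so that the example runs without PyGMT installed.

```python
import functools
import io

import numpy as np
import pandas as pd


def with_table_format(func):
    """Hypothetical decorator: convert a function's text output on request."""

    @functools.wraps(func)
    def wrapper(*args, data_format="d", **kwargs):
        # Validate the code before running the wrapped function.
        if data_format not in ("a", "d", "s"):
            raise ValueError("data_format must be one of 'a', 'd', or 's'.")
        text = func(*args, **kwargs)
        if data_format == "s":
            return text
        # ndmin=2 keeps a single-row table two-dimensional.
        array = np.loadtxt(io.StringIO(text), ndmin=2)
        return array if data_format == "a" else pd.DataFrame(array)

    return wrapper


@with_table_format
def fake_blockmean():
    """Stand-in producing whitespace-delimited text; not a real PyGMT call."""
    return "1.0 2.0 3.0\n4.0 5.0 6.0\n"


as_string = fake_blockmean(data_format="s")
as_array = fake_blockmean(data_format="a")
as_frame = fake_blockmean()  # defaults to a DataFrame in this sketch
```

A decorator keeps the conversion logic out of each wrapped function, at the cost of hiding the extra keyword argument from the function's own signature.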
The return_table function looks to be on the right track. We can discuss this in the meeting tomorrow (https://forum.generic-mapping-tools.org/t/pygmt-community-zoom-meeting-tuesday-15-june-2021-utc-1700/1762) if there's time. In terms of your PRs #1284 and #1299 though @willschlitzer, I'd suggest sticking to a Lines 273 to 281 in a10af5d
You're welcome to start a proof of concept PR, but realistically speaking, I don't expect it to be ready for PyGMT v0.4.0. See this blog post by the RAPIDS AI team (https://medium.com/rapids-ai/making-rapids-cudf-io-readers-and-writers-more-reliable-with-fuzz-testing-5d595aa97389) which talks about the complexity of testing different I/O format combinations. Just want to manage expectations a bit 🙂 |
@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not getting the table output options for |
Thanks for the detailed explanation and for submitting the proof-of-concept PR. I understand your points. I'll add some comments on #1336, but I agree with you that it's not likely we will reach consensus and update the other methods/functions with table output in the next few days. So, it seems #1318, #1336, #1299, and #1284 will be on track for v0.5.0. |
I support Will wanting to merge in #1299 and #1284 before addressing this issue to keep things moving. At the same time, I would prefer that the argument format for the
When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'. Here's my question - how do we decide on the implementation when there are differences in the preferred method between developers? Can we have a vote? |
@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one! |
Agree with having longer names. In this case, I would recommend re-using an already existing convention in the PyData ecosystem (as hinted at #1318 (comment)). Namely that from Specifically, they have an
Happy to vote on whether short or long names are preferred. Let's go with 🎉 for short (a/d/s) and 🚀 for long (numpy/pandas/file). |
@weiji14 No fair, rockets are way cooler so people will vote that way by default! But seriously, I think you and @meghanrjones are making better points than I am; I think long names are the way to go. |
@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names. |
Yes, I think we can settle it for long names. |
Yes, I also support the long names. |
I think we have reached an agreement (at least partially) about the expected behavior of table-like output. I've revised the top post to better track the progress. |
Just noting that Pandas 2.0 has not just NumPy-backed arrays, but also PyArrow (C++)-backed arrays (see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes). Also according to PDEP10, Pandas 3.0 will use Haven't had time yet to work out what this would mean for our implementation of |
Now most modules already have consistent behaviors for table-like output. There are still four exceptions: Closing the issue. Great to have consistent table-like output behavior across the whole project! |
Sounds great! Thanks @seisman for all your efforts on this! |
Originally posted by @weiji14 in #1284 (comment):
I am opening this issue to find out what the output format/options should be with the addition of x/y/z in blockmean, blockmedian, and grdtrack, as well as for new methods such as grd2xyz.
Edit by @seisman on Oct 9, 2023:
The PyGMT team has reached an agreement that PyGMT functions/methods should have consistent behavior for table-like outputs, i.e., the output depends on the output_type parameter.
Valid output_type values are:
- file: output to a file directly (usually need to set outfile)
- numpy: return a NumPy array (may not always be possible because data must be in the same dtype in a 2-D numpy array)
- pandas: return a pandas.DataFrame (sometimes need to set newcolname)
Here is a list of functions/methods that have table-like outputs:
- blockm*: pygmt.blockm*: Add 'output_type' parameter for output in pandas/numpy/file formats #3103
- filter1d: pygmt.filter1d: Improve performance by storing output in virtual files #3085
- grdinfo: [Will be tracked in Better return values for grdinfo #593]
- grdtrack: pygmt.grdtrack: Add 'output_type' parameter for output in pandas/numpy/file formats #3106
- triangulate: pygmt.triangulate.delaunay_triples: Improve performance by storing output in virtual files #3107
- grdvolume: pygmt.grdvolume: Refactor to store output in virtual files instead of temporary files #3102
- select: pygmt.select: Improve performance by storing output in virtual files #3108
- which: [Will be tracked in Finalize the pygmt.which wrapper #3003]
- grdhisteq: pygmt.grdhisteq.compute_bins: Refactor to store output in virtual files instead of temporary files #3109
- x2sys_cross: [Will be tracked in x2sys_cross: Refactor to get rid of temporary files and have consistent table-like output behavior #3160]
- info: [Will be tracked in Better return value for pygmt.info #3159]
- project: pygmt.project: Add 'output_type' parameter for output in pandas/numpy/file formats #3110
- grd2xyz: pygmt.grd2xyz: Improve performance by storing output in virtual files #3097
We need to make sure they behave consistently.
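The agreed output_type behavior could be sketched as a single dispatch function like the one below. This is only an illustration under stated assumptions: `to_output` and its `column_names` argument are hypothetical names, the input is assumed to be whitespace-delimited numeric text, and none of this is PyGMT's actual implementation (which the linked PRs describe as using virtual files).

```python
import io

import numpy as np
import pandas as pd


def to_output(text, output_type="pandas", outfile=None, column_names=None):
    """Hypothetical dispatcher for table-like text output."""
    if output_type not in ("file", "numpy", "pandas"):
        raise ValueError("output_type must be 'file', 'numpy', or 'pandas'.")
    if output_type == "file":
        # Writing to a file only makes sense when a path is given.
        if outfile is None:
            raise ValueError("outfile is required when output_type='file'.")
        with open(outfile, "w") as handle:
            handle.write(text)
        return None
    if output_type == "numpy":
        # All columns must share one dtype to fit a 2-D NumPy array.
        return np.loadtxt(io.StringIO(text), ndmin=2)
    # "pandas": optional column names, otherwise integer labels.
    return pd.read_csv(
        io.StringIO(text), sep=r"\s+", header=None, names=column_names
    )


table = "1 2 3\n4 5 6\n"
array = to_output(table, output_type="numpy")
frame = to_output(table, output_type="pandas", column_names=["x", "y", "z"])
```

Validating output_type up front means an invalid value fails before any parsing or file writing takes place.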