Consistent table-like output for PyGMT functions/methods #1318

maxrjones · 2021-06-04T16:01:26Z

Originally posted by @weiji14 in #1284 (comment):

Currently, the blockm*/grdtrack modules simply outputs pandas.DataFrame/filenames (see #1099) automatically depending on the input type (2 options). The implementation @willschlitzer made at bf58dc2 and 68fab7c returns either numpy/pandas/str depending on the user input (3 options). Best to be consistent across PyGMT on the types of outputs users can expect.

I am opening this issue to find out what the output format/options should be with the additional of x/y/z in blockmean, blockmedian, and grdtrack, as well as for new methods such as grd2xyz.

Edit by @seisman on Oct 9, 2023:

The PyGMT team has reach an agreement that PyGMT functions/methods should have consistent behavior for table-like outputs, i.e., the output depends on the output_type parameter.

Valid output_type values are:

file: output to a file directly (usually need to set outfile)
numpy: return a NumPy array (may not always possible because data must be in the same dtype in a 2-D numpy array)
pandas: return a pandas.DataFrame (sometimes need to set newcolname)

Here are a list of functions/methods that have table-like outputs:

We need to make sure they behave consistently.

The text was updated successfully, but these errors were encountered:

weiji14 · 2021-06-06T00:53:22Z

Just wanted to say, this is a similar issue to that in the RAPIDS AI cuml library. See https://docs.rapids.ai/api/cuml/0.19/api.html#output-data-type-configuration. Their solution is to:

By default: 'mirror' the input data format to be the same as the output format as much as possible, i.e. numpy in/numpy out, or pandas.DataFrame in/pandas.DataFrame out.
Override the default behaviour above by having a using_output_type context manager. E.g. set using_output_type=numpy to return numpy regardless of input type.

As you can see, this involves more levels of I/O complexity going from phase 1 to phase 2. I would advocate to move to phase '1' for PyGMT v0.5.0 or v0.6.0, and keep PyGMT's default output type for table-like outputs as pandas.DataFrame (or None if outfile is set) for PyGMT v0.4.0 as per #1099 (i.e phase '0' if you will). Longer term, as PyGMT matures and integrates more with the PyData stack, we may want to consider mimicking cuml's configurable I/O behaviour.

That said, I'm open to other suggestions, if someone has a good idea on how to deal with I/O in PyGMT.

maxrjones · 2021-06-08T01:14:40Z

Thanks for the helpful reference, @weiji14! I like the mirroring behavior and agree that it would be a good milestone for v0.5.0+.

In my opinion, a parameter to control the output type seems a bit simpler for users than a context manager (examples below). The options for pygmt.using_output_type may also depend on the specific function/method, which could become difficult to document. The context manager would be useful if users want to easily specify the output type for a block of code. Are there other benefits to implementing the configurable I/O behavior with a context manager rather than with function/method parameters?

Context manager example:

with pygmt.using_output_type("numpy"):
   output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30])

Parameter example:

output = blockmean(table=dataframe, spacing="5m", region=[245, 255, 20, 30], output_format="numpy")

weiji14 · 2021-06-08T01:49:49Z

You're right that controlling the output type via a parameter (similar to what @willschlitzer did in bf58dc2) would be simpler than a context manager. I don't have a preference for either style, but after doing a bit of digging, it seems like cuml had an output_type parameter at one point, but they deprecated it in favour of the context manager option in rapidsai/cuml#3040 for some reason (though it still seems to be mentioned in places like at https://github.com/rapidsai/cuml/blob/branch-21.08/wiki/python/ESTIMATOR_GUIDE.md#specifying-the-array-output-type).

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2, but have it in a single standardized place.

maxrjones · 2021-06-10T18:12:49Z

Anyways, implementation-wise, I'm wondering if we could have a reusable decorator or helper function that can convert any table-like output to the desired format (be it pandas.DataFrame or otherwise) instead of having to have a big block of code for each PyGMT method that returns table-like outputs. I.e. reuse the if-then block from bf58dc2, but have it in a single standardized place.

Yes, I agree that a helper function is a good idea for the implementation of an output format choice.

willschlitzer · 2021-06-15T10:59:44Z

I think the helper function would be our best bet (although I'm not too strong with writing decorators). Borrowing/stealing entirely from my commits mentioned above, I think something like this would could be helpful:

def return_table(result, data_format, format_parameter):
    if data_format == "s":
        return result
    data_list = []
    for string_entry in result.strip().split("\n"):
        float_entry = []
        string_list = string_entry.strip().split()
        for i in string_list:
            float_entry.append(float(i))
        data_list.append(float_entry)
    data_array = np.array(data_list)
    if data_format == "a":
        result = data_array
    elif data_format == "d":
        result = pd.DataFrame(data_array)
    else:
        raise GMTInvalidInput(f"""Must specify {format_parameter} as either a, d, or s.""")
    return result

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

weiji14 · 2021-06-15T11:26:09Z

The return_table function looks to be on the right track. We can discuss this in the meeting tomorrow (https://forum.generic-mapping-tools.org/t/pygmt-community-zoom-meeting-tuesday-15-june-2021-utc-1700/1762) if there's time.

In terms of your PRs #1284 and #1299 though @willschlitzer, I'd suggest sticking to a str or pandas.DataFrame only output for PyGMT v0.4.0 if you want to get those PRs in on time (and not wait until v0.5.0). I.e. follow the code used by grdtrack and blockm*:

pygmt/pygmt/src/grdtrack.py

Lines 273 to 281 in a10af5d

    
           # Read temporary csv output to a pandas table 
        
           if outfile == tmpfile.name:  # if user did not set outfile, return pd.DataFrame 
        
               try: 
        
                   column_names = points.columns.to_list() + [newcolname] 
        
                   result = pd.read_csv(tmpfile.name, sep="\t", names=column_names) 
        
               except AttributeError:  # 'str' object has no attribute 'columns' 
        
                   result = pd.read_csv(tmpfile.name, sep="\t", header=None, comment=">") 
        
           elif outfile != tmpfile.name:  # return None if outfile set, output in outfile 
        
               result = None

I know it's a little redundant accepting the result argument only to immediately return it if data_format is set as a string, but it avoids the reuse of if statements. Additionally, I put an f-string into the GMTInvalidInput error in the case of different format parameter names in different functions (as is the case between #1284 and #1299). I can submit it in a PR if that's the way we agree that this should be handled.

You're welcome to start a proof of concept PR, but realistically speaking, I don't expect it to be ready for PyGMT v0.4.0. See this blog post by the RAPIDS AI team (https://medium.com/rapids-ai/making-rapids-cudf-io-readers-and-writers-more-reliable-with-fuzz-testing-5d595aa97389) which talks about the complexity of testing different I/O format combinations. Just want to manage expectations a bit 🙂

willschlitzer · 2021-06-16T06:00:51Z

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not getting the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them for v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot accept different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be able to be converted to floating numbers, and be used in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof of concept pull request as discussed above.

maxrjones · 2021-06-17T15:26:27Z

@weiji14 @meghanrjones I know this is different than what we discussed at the PyGMT meeting last night, but after giving it some more thought, I would prefer not getting the table output options for grd2xyz and grdvolume in line with grdtrack for v0.4.0 if the plan is to move towards their current implementation for v0.5.0 (along with the use of a helper function). I don't think grdvolume and grd2xyz are particularly in-demand modules, so I don't think holding off on them for v0.5.0 is especially problematic. My understanding is that the issue with the numpy array is that it cannot accept different data types, while a pandas DataFrame and list can, but it seems like all of the table outputs should be able to be converted to floating numbers, and be used in a numpy array. If I'm wrong on this one, please let me know and I'll make the changes, but I would prefer not to add these modules with the intention of making them backwards incompatible in the next release. I'll make a proof of concept pull request as discussed above.

Thanks for the detailed explanation and for submitting the proof-of-concept PR. I understand your points. I'll add some comments on #1336, but I agree with you that it's not likely we will reach consensus and update the other methods/functions with table output in the next few days. So, it seems #1318, #1336, #1299, and #1284 will be on track for v0.5.0.

maxrjones · 2021-08-10T14:08:30Z

I support Will wanting to merge in #1299 and #1284 before addressing this issue to keep things moving. At the same time, I would prefer that the argument format for the output_type parameter is final (i.e., will be used for all applicable functions) before merging these PRs to avoid deprecations. Here's the current section:

    output_type : str
        Determine the format the xyz data will be returned in:
            **a**: numpy array
            **d**: pandas DataFrame [Default option]
            **s**: string

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

willschlitzer · 2021-08-10T19:39:10Z

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.

Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

weiji14 · 2021-08-10T21:44:16Z

When testing out the PR, I found it hard to remember 'a', 'd', and 's' (I kept instinctively trying 'p' for pandas.DataFrame). I would prefer longer names like 'numpy', 'pandas', and 'str'.
Here's my question - how do we decide on the implementation when there's differences in the preferred method between developers? Can we have a vote?

@meghanrjones I'm fine with a vote, but as the developer with the least experience I'm happy to defer to your opinion on this one!

Agree with having longer names. In this case, I would recommend re-using an already existing convention in the PyData ecosystem (as hinted at #1318 (comment)). Namely that from cuml at https://github.com/rapidsai/cuml/blob/e977f3e4194313a857da96fdd810ee5ef1d742b3/python/cuml/common/memory_utils.py#L346-L374:

Specifically. they have an output_type parameter (similar to the data_format used by @willschlitzer in #1336) that can be either ‘input’, ‘cudf’, ‘cupy’, ‘numpy’ (default = ‘input’). For PyGMT, I would recommend the following output_type names:

'input' (default, which mirrors the input data type whenever possible)
'numpy' (i.e. numpy.array)
'pandas' (i.e. pandas.DataFrame)
'file' (i.e. output to a text file, but we need to tie this with the 'outfile' parameter)

Happy to vote on whether short or long names is preferred. Let's go with 🎉 for short (a/d/s) and 🚀 for long (numpy/pandas/file).

willschlitzer · 2021-08-11T05:41:42Z

@weiji14 No fair, rockets are way cooler so people will vote that way by default!

But seriously, I think you and @meghanrjones are making better points than I am; I think long names are the way to go.

willschlitzer · 2021-09-13T10:50:47Z

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

maxrjones · 2021-09-13T11:48:25Z

@GenericMappingTools/pygmt-maintainers Has this issue been settled yet? The only two votes are for the long names.

Yes, I think we can settle it for long names.

seisman · 2021-09-13T12:26:35Z

Yes, I also support the long names.

seisman · 2023-10-09T15:50:06Z

I think we have reached an agreement (at least partially) about the expected behavior of table-like output. I've revised the top post to better track the progress.

weiji14 · 2023-10-10T22:28:57Z

Just noting that Pandas 2.0 has not just a NumPy-backed arrays, but also PyArrow(C++)-backed arrays (see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes). Also according to PDEP10, Pandas 3.0 will use ArrowDtype instead of object dtype for string columns.

Haven't had time yet to work out what this would mean for our implementation of GMT_DATASET at #2729, but we might need some matrix tests to check that both NumPy and PyArrow-backed pandas.DataFrames work.

seisman · 2024-04-08T05:21:20Z

Now most modules already have consistent behaviors for table-like output. There are still four exceptions: x2sys_cross is complicated and need more work; for which/info/grdinfo, file/numpy/pandas output types may make no sense, so they need more discussions and will be tracked in separate issues.

Closing the issue. Great to have consistent table-like output behavior across the whole project!

yvonnefroehlich · 2024-04-09T16:46:54Z

Closing the issue. Great to have consistent table-like output behavior across the whole project!

Sounds great! Thanks @seisman for all your efforts on this!

maxrjones added the question Further information is requested label Jun 4, 2021

maxrjones mentioned this issue Jun 4, 2021

Allow x/y/z input for blockmedian and blockmean #1319

Merged

5 tasks

maxrjones changed the title ~~What should the default table-like output be for PyGMT methods?~~ What should the default table-like output be for PyGMT functions/methods? Jun 4, 2021

willschlitzer mentioned this issue Jun 16, 2021

WIP: Add return_table helper function #1336

Closed

5 tasks

maxrjones added enhancement Improving an existing feature and removed question Further information is requested labels Jun 25, 2021

maxrjones changed the title ~~What should the default table-like output be for PyGMT functions/methods?~~ Consistent table-like output for PyGMT functions/methods Jun 25, 2021

maxrjones mentioned this issue Jul 1, 2021

Wrap grd2xyz #1284

Merged

5 tasks

weiji14 mentioned this issue Aug 2, 2021

Reduce repeated code for managing the outgrid parameter #1400

Closed

weiji14 mentioned this issue Aug 10, 2021

Wrapper for gmtselect #1427

Closed

weiji14 pinned this issue Aug 11, 2021

seisman added this to the 0.5.0 milestone Sep 13, 2021

This was referenced Sep 20, 2021

Wrap grdhisteq #1433

Merged

Handling data processing functions that output to a grid or table #1536

Open

weiji14 mentioned this issue Oct 5, 2021

Wrap gmtselect #1429

Merged

19 tasks

seisman added this to the 0.7.0 milestone Mar 18, 2022

seisman modified the milestones: 0.7.0, 0.8.0 Jun 23, 2022

seisman modified the milestones: 0.8.0, 0.9.0 Dec 11, 2022

seisman modified the milestones: 0.9.0, 0.10.0 Mar 16, 2023

weiji14 removed this from the 0.10.0 milestone Aug 24, 2023

This was referenced Oct 9, 2023

Consistent table-like input for PyGMT functions/methods #2731

Open

pygmt.grdtrack: Support consistent table-like outputs #2733

Merged

weiji14 mentioned this issue Oct 29, 2023

pyarrow: Check compatibility of pyarrow-backed pandas objects with numeric dtypes #2774

Merged

8 tasks

weiji14 pinned this issue Oct 29, 2023

weiji14 mentioned this issue Nov 7, 2023

Support PyArrow arrays and dataframes #2800

Open

seisman unpinned this issue Dec 21, 2023

seisman added this to the 0.12.0 milestone Feb 26, 2024

seisman pinned this issue Feb 26, 2024

seisman mentioned this issue Feb 29, 2024

Improve input/output options for blockm* and grdtrack #1099

Closed

2 tasks

seisman added help wanted Helping hands are appreciated and removed help wanted Helping hands are appreciated labels Mar 13, 2024

seisman removed this from the 0.12.0 milestone Mar 28, 2024

seisman added this to the 0.12.0 milestone Apr 8, 2024

seisman closed this as completed Apr 8, 2024

seisman unpinned this issue Apr 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent table-like output for PyGMT functions/methods #1318

Consistent table-like output for PyGMT functions/methods #1318

maxrjones commented Jun 4, 2021 •

edited by seisman

Loading

weiji14 commented Jun 6, 2021 •

edited

Loading

maxrjones commented Jun 8, 2021

weiji14 commented Jun 8, 2021 •

edited

Loading

maxrjones commented Jun 10, 2021

willschlitzer commented Jun 15, 2021 •

edited by weiji14

Loading

weiji14 commented Jun 15, 2021

willschlitzer commented Jun 16, 2021

maxrjones commented Jun 17, 2021

maxrjones commented Aug 10, 2021

willschlitzer commented Aug 10, 2021

weiji14 commented Aug 10, 2021

willschlitzer commented Aug 11, 2021

willschlitzer commented Sep 13, 2021

maxrjones commented Sep 13, 2021

seisman commented Sep 13, 2021

seisman commented Oct 9, 2023

weiji14 commented Oct 10, 2023

seisman commented Apr 8, 2024

yvonnefroehlich commented Apr 9, 2024

Consistent table-like output for PyGMT functions/methods #1318

Consistent table-like output for PyGMT functions/methods #1318

Comments

maxrjones commented Jun 4, 2021 • edited by seisman Loading

weiji14 commented Jun 6, 2021 • edited Loading

maxrjones commented Jun 8, 2021

weiji14 commented Jun 8, 2021 • edited Loading

maxrjones commented Jun 10, 2021

willschlitzer commented Jun 15, 2021 • edited by weiji14 Loading

weiji14 commented Jun 15, 2021

willschlitzer commented Jun 16, 2021

maxrjones commented Jun 17, 2021

maxrjones commented Aug 10, 2021

willschlitzer commented Aug 10, 2021

weiji14 commented Aug 10, 2021

willschlitzer commented Aug 11, 2021

willschlitzer commented Sep 13, 2021

maxrjones commented Sep 13, 2021

seisman commented Sep 13, 2021

seisman commented Oct 9, 2023

weiji14 commented Oct 10, 2023

seisman commented Apr 8, 2024

yvonnefroehlich commented Apr 9, 2024

maxrjones commented Jun 4, 2021 •

edited by seisman

Loading

weiji14 commented Jun 6, 2021 •

edited

Loading

weiji14 commented Jun 8, 2021 •

edited

Loading

willschlitzer commented Jun 15, 2021 •

edited by weiji14

Loading