Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x2sys_cross: Refactor to get rid of temporary files and have consistent table-like output behavior #3160

Closed
seisman opened this issue Apr 8, 2024 · 1 comment
Labels
maintenance Boring but important stuff for the core devs
Milestone

Comments

@seisman
Copy link
Member

seisman commented Apr 8, 2024

Originally posted by @weiji14 in #2730 (comment)

I had a look at refactoring x2sys_cross to use virtualfiles instead of temporary files, but it's a little tricky because:

  • Input: Cannot pass in virtualfiles as input as mentioned at Wrap x2sys_init and x2sys_cross #546 (comment) and Passing in virtual files into the supplementary x2sys modules gmt#3717, since GMT doesn't support virtualfiles to X2SYS modules

  • Output: The virtualfile_to_dataset method from clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset #3083 was able to produce a pandas.DataFrame output, but the column names were missing. The logic for handling x2sys_cross's output is actually complicated:

    # Read temporary csv output to a pandas table
    if outfile == tmpfile.name: # if outfile isn't set, return pd.DataFrame
    # Read the tab-separated ASCII table
    date_format_kwarg = (
    {"date_format": "ISO8601"}
    if Version(pd.__version__) >= Version("2.0.0")
    else {}
    )
    table = pd.read_csv(
    tmpfile.name,
    sep="\t",
    header=2, # Column names are on 2nd row
    comment=">", # Skip the 3rd row with a ">"
    parse_dates=[2, 3], # Datetimes on 3rd and 4th column
    **date_format_kwarg, # Parse dates in ISO8601 format on pandas>=2
    )
    # Remove the "# " from "# x" in the first column
    table = table.rename(columns={table.columns[0]: table.columns[0][2:]})
    elif outfile != tmpfile.name: # if outfile is set, output in outfile only
    table = None

Important things to handle are:

  1. Datetime columns need to be parsed correctly as datetime64 dtype
  2. x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. Unsure how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when pandas.DataFrame output is selected, output to file will preserve the segments though).
  3. Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in x2sys_cross. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and makes it incorrect.

@seisman
Copy link
Member Author

seisman commented Nov 19, 2024

Nothing we can do on the PyGMT side. On the GMT side, I guess no one will touch the x2sys_cross source code in the near future.

In #3182, we now parse the colume names from the 2nd row of the output. So the column names should work fine.

result = lib.virtualfile_to_dataset(
vfname=vouttbl, output_type=output_type, header=2
)

Important things to handle are:

  1. Datetime columns need to be parsed correctly as datetime64 dtype

We now call pd.to_datetime to convert the columns to datetime, so the issue should be solved:

result[columns] = result[columns].apply(pd.to_timedelta, unit=unit)

  1. Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

Since we're parsing column names from the file header, the column names now should automatically set. This can be verified by checking our existing x2sys_cross tests, in which z_X/z_M or z_1/z_2 are used in different cases.

  1. x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. Unsure how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when pandas.DataFrame output is selected, output to file will preserve the segments though).

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in x2sys_cross. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and makes it incorrect.

Currently, the whole PyGMT project is lacking support of multi-segment files. The GMT_DATASET container has all the information about multi-segment files, but all the information are lost when we convert it to pandas.DataFrame. So we need to find another data structure in the Python for holding multi-segment files. And this will be a long-term goal.

In summary, I think we already fix most of the issues mentioned above, so it can be closed. Feel free to reopen if you disagree.

@seisman seisman closed this as completed Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

No branches or pull requests

1 participant