x2sys_cross: Refactor to get rid of temporary files and have consistent table-like output behavior #3160

seisman · 2024-04-08T05:16:01Z

Originally posted by @weiji14 in #2730 (comment)

I had a look at refactoring x2sys_cross to use virtualfiles instead of temporary files, but it's a little tricky because:

Input: Cannot pass in virtualfiles as input, as mentioned at "Wrap x2sys_init and x2sys_cross" #546 (comment) and "Passing in virtual files into the supplementary x2sys modules" gmt#3717, since GMT doesn't support passing virtualfiles to X2SYS modules.

Output: The virtualfile_to_dataset method from "clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset" #3083 was able to produce a pandas.DataFrame output, but the column names were missing. The logic for handling x2sys_cross's output is actually complicated:

pygmt/pygmt/src/x2sys_cross.py

Lines 231 to 250 in bcbbcad

    
    # Read temporary csv output to a pandas table
    if outfile == tmpfile.name:  # if outfile isn't set, return pd.DataFrame
        # Read the tab-separated ASCII table
        date_format_kwarg = (
            {"date_format": "ISO8601"}
            if Version(pd.__version__) >= Version("2.0.0")
            else {}
        )
        table = pd.read_csv(
            tmpfile.name,
            sep="\t",
            header=2,  # Column names are on 2nd row
            comment=">",  # Skip the 3rd row with a ">"
            parse_dates=[2, 3],  # Datetimes on 3rd and 4th column
            **date_format_kwarg,  # Parse dates in ISO8601 format on pandas>=2
        )
        # Remove the "# " from "# x" in the first column
        table = table.rename(columns={table.columns[0]: table.columns[0][2:]})
    elif outfile != tmpfile.name:  # if outfile is set, output in outfile only
        table = None

Important things to handle are:

1. Datetime columns need to be parsed correctly as datetime64 dtype.
2. x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. Unsure how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when pandas.DataFrame output is selected; output to file will preserve the segments though).
3. Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2, since it will involve checking how GMT outputs virtualfiles in x2sys_cross. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and make it incorrect.
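To make the column-name and segment-merging behaviour concrete, here is a minimal standard-library sketch. It is not the PyGMT implementation: the sample text is a hypothetical imitation of the layout described above (two comment lines, a "# "-prefixed column-name row, ">" segment headers), and `parse_crossover_text` is a name invented for illustration.

```python
import csv

# Hypothetical sample imitating the layout discussed above; real x2sys_cross
# output may differ in its comment lines and column set.
SAMPLE = (
    "# Tag: EXAMPLE\n"
    "# Command: x2sys_cross (arguments omitted)\n"
    "# x\ty\ti_1\ti_2\tz_1\tz_2\n"
    "> track_a track_b\n"
    "1.0\t2.0\t10.0\t20.0\t0.5\t0.7\n"
    "> track_a track_c\n"
    "3.0\t4.0\t30.0\t40.0\t0.1\t0.2\n"
)


def parse_crossover_text(text, header_row=2):
    """Return (column_names, rows), merging all segments into one table."""
    lines = text.splitlines()
    # Column names sit on the row with index `header_row`; drop the "# " prefix.
    names = lines[header_row].removeprefix("# ").split("\t")
    # Skip ">" segment headers and "#" comments; keep only data records.
    data_lines = [
        ln for ln in lines[header_row + 1:] if ln and ln[0] not in ">#"
    ]
    rows = [
        dict(zip(names, map(float, record)))
        for record in csv.reader(data_lines, delimiter="\t")
    ]
    return names, rows


names, rows = parse_crossover_text(SAMPLE)
print(names[-2:])  # the last two names come straight from the header row
print(len(rows))   # records from both segments end up in one table
```

Because the names are read from the header row rather than hard-coded, the same function would pick up z_X/z_M or z_1/z_2 without a special case. The segment boundaries, however, are lost in the merged table.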


seisman · 2024-11-19T10:53:26Z

Input: Cannot pass in virtualfiles as input, as mentioned at "Wrap x2sys_init and x2sys_cross" #546 (comment) and "Passing in virtual files into the supplementary x2sys modules" gmt#3717, since GMT doesn't support passing virtualfiles to X2SYS modules.

Nothing we can do on the PyGMT side. On the GMT side, I guess no one will touch the x2sys_cross source code in the near future.

Output: The virtualfile_to_dataset method from "clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset" #3083 was able to produce a pandas.DataFrame output, but the column names were missing.

In #3182, we now parse the column names from the 2nd row of the output, so the column names should work fine.

pygmt/pygmt/src/x2sys_cross.py

Lines 223 to 225 in ef71a9d

    
    result = lib.virtualfile_to_dataset(
        vfname=vouttbl, output_type=output_type, header=2
    )

Important things to handle are:

Datetime columns need to be parsed correctly as datetime64 dtype

We now convert these columns to time types in pandas (the line quoted below uses pd.to_timedelta), so the issue should be solved:

pygmt/pygmt/src/x2sys_cross.py

Line 252 in ef71a9d

result[columns] = result[columns].apply(pd.to_timedelta, unit=unit)
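The quoted conversion can be tried in isolation with a small pandas sketch. The table, the choice of seconds as the unit, and the epoch are all assumptions made for illustration. The epoch step in particular is only one possible way to obtain absolute datetimes and is not taken from the lines quoted in this thread.

```python
import pandas as pd

# Hypothetical crossover table: the 3rd and 4th columns hold times as plain
# floats (assumed here to be seconds relative to some epoch).
result = pd.DataFrame(
    {
        "x": [1.0, 3.0],
        "y": [2.0, 4.0],
        "i_1": [10.0, 30.0],
        "i_2": [20.0, 90.0],
    }
)

columns = result.columns[2:4]
unit = "s"  # assumption: the time unit is seconds

# Same pattern as the quoted line: convert each selected column to timedelta.
result[columns] = result[columns].apply(pd.to_timedelta, unit=unit)
print(result.dtypes)

# Assumption: absolute datetimes could be derived by adding an epoch.
epoch = pd.Timestamp("1970-01-01")
absolute = result[columns].apply(lambda col: epoch + col)
print(absolute)
```

Keeping the relative times as timedelta and adding an epoch only when absolute times are meaningful avoids inventing a calendar date for dummy times.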

Last two column names can either be z_X/z_M or z_1/z_2 depending on whether trackvalues/-Z argument is set.

Since we're parsing column names from the file header, the column names should now be set automatically. This can be verified by checking our existing x2sys_cross tests, in which z_X/z_M or z_1/z_2 are used in different cases.

x2sys_cross may output multi-segment parts (see https://docs.generic-mapping-tools.org/6.5/supplements/x2sys/x2sys_cross.html#remarks) when multiple tracks are passed in and -Qe (external COEs) is selected. Unsure how this is handled in GMT virtualfiles (note that we actually just merge all the multi-segments into one table when pandas.DataFrame output is selected, output to file will preserve the segments though).

It should be possible to handle 1 and 3 somehow, but I'm not so sure about 2 since it will involve checking how GMT outputs virtualfiles in x2sys_cross. We'll need to do some careful checking to ensure the refactoring doesn't modify the output and makes it incorrect.

Currently, the whole PyGMT project lacks support for multi-segment files. The GMT_DATASET container has all the information about multi-segment files, but that information is lost when we convert it to a pandas.DataFrame. So we need to find another data structure in Python for holding multi-segment files, and this will be a long-term goal.
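As a rough illustration of what a segment-preserving structure might look like, here is a standard-library sketch. It is purely hypothetical: `Segment`, `split_segments`, and `flatten` are not PyGMT or GMT APIs, and the sample text only imitates a multi-segment table with ">" headers.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One segment: its header text plus its numeric data rows."""

    header: str
    rows: list = field(default_factory=list)


def split_segments(text):
    """Group data rows under the preceding ">" segment header."""
    segments = []
    for line in text.splitlines():
        if line.startswith(">"):
            segments.append(Segment(header=line[1:].strip()))
        elif line and not line.startswith("#"):
            if not segments:  # data before any header: open an unnamed segment
                segments.append(Segment(header=""))
            segments[-1].rows.append([float(v) for v in line.split("\t")])
    return segments


def flatten(segments):
    """Merge segments into one table, keeping the segment index as a column."""
    return [[i, *row] for i, seg in enumerate(segments) for row in seg.rows]


# Hypothetical two-segment table.
TEXT = (
    "# x\ty\n"
    "> track_a track_b\n"
    "1.0\t2.0\n"
    "3.0\t4.0\n"
    "> track_a track_c\n"
    "5.0\t6.0\n"
)

segments = split_segments(TEXT)
table = flatten(segments)
print([seg.header for seg in segments])
print(table)
```

Carrying the segment index as an extra column is one way to keep a flat table while still being able to regroup rows by segment later. Whether that fits PyGMT's needs is an open design question.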

In summary, I think we have already fixed most of the issues mentioned above, so this issue can be closed. Feel free to reopen if you disagree.

seisman mentioned this issue Apr 8, 2024

Consistent table-like output for PyGMT functions/methods #1318

Closed

13 tasks

This was referenced Apr 16, 2024

Get rid of temporary files from pygmt functions and plotting methods #2730

Closed

pygmt.x2sys_cross: Refactor to use virtualfiles for output tables [BREAKING CHANGE: Dummy times in 3rd and 4th columns now have np.timedelta64 type] #3182

Merged

seisman added the maintenance Boring but important stuff for the core devs label Apr 20, 2024

seisman added this to the 0.14.0 milestone Nov 19, 2024

seisman closed this as completed Nov 19, 2024
