Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full support of numpy dtypes #2848

Open
13 of 24 tasks
seisman opened this issue Dec 4, 2023 · 11 comments
Open
13 of 24 tasks

Full support of numpy dtypes #2848

seisman opened this issue Dec 4, 2023 · 11 comments
Labels
feature request New feature wanted

Comments

@seisman
Copy link
Member

seisman commented Dec 4, 2023

Before starting to support PyArrow arrays (#2800), I think we should first ensure that PyGMT has complete support for NumPy dtypes. Currently, the following NumPy dtypes are supported:

# Dictionary for mapping numpy dtypes to GMT data types.
DTYPES_NUMERIC = {
np.int8: "GMT_CHAR",
np.int16: "GMT_SHORT",
np.int32: "GMT_INT",
np.int64: "GMT_LONG",
np.uint8: "GMT_UCHAR",
np.uint16: "GMT_USHORT",
np.uint32: "GMT_UINT",
np.uint64: "GMT_ULONG",
np.float32: "GMT_FLOAT",
np.float64: "GMT_DOUBLE",
np.timedelta64: "GMT_LONG",
}
DTYPES_TEXT = {
np.str_: "GMT_TEXT",
np.datetime64: "GMT_DATETIME",
}

Here is a simple way to list all the available NumPy dtypes (xref: https://numpy.org/doc/stable/reference/arrays.scalars.html):

>>> import numpy as np
>>> set(np.sctypeDict.values())

As you can see, most NumPy types are already supported in PyGMT. It's clear that GMT and PyGMT can't support float128 and any complex floating-point types, but at least it's possible to support np.float16 (np.half), np.bool_ and np.timedelta.

  • For np.float16: maybe we have to cast the array into np.float32?
  • For np.bool_: maybe convert to 0 and 1 (e.g., array.astype("int8")
  • For np.timedelta: Not sure about this yet.
@seisman seisman added the discussions Need more discussion before taking further actions label Dec 4, 2023
@weiji14
Copy link
Member

weiji14 commented Dec 5, 2023

  • For np.float16: maybe we have to cast the array into np.float32?
  • For np.bool_: maybe convert to 0 and 1 (e.g., array.astype("int8")

Personally use float16 a lot, but if we're just going to cast the array to float32, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8. Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

For np.timedelta: Not sure about this yet.

This is an interesting one. I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:

rmse_huv500

Note though, the 1e14 scaling at the bottom right. I couldn't quite find a way to remove that...

I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert np.timedelta into something GMT understands.

@seisman
Copy link
Member Author

seisman commented Dec 5, 2023

Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

The current issue is, the error message is helpless and very confusing, so users don't know that np.float16 is not supported:

>>> import pygmt
>>> import numpy as np
>>> x = np.array([1, 2, 3, 4], dtype=np.float16)
>>>> pygmt.info(x)
File ~/OSS/gmt/pygmt/pygmt/clib/session.py:1275, in Session.virtualfile_from_vectors(self, *vectors)
   1273 # Use put_vector for columns with numerical type data
   1274 for col, array in enumerate(arrays[:columns]):
-> 1275     self.put_vector(dataset, column=col, vector=array)
   1277 # Use put_strings for last column(s) with string type data
   1278 # Have to use modifier "GMT_IS_DUPLICATE" to duplicate the strings
   1279 string_arrays = arrays[columns:]

File ~/OSS/gmt/pygmt/pygmt/clib/session.py:892, in Session.put_vector(self, dataset, column, vector)
    889 vector_pointer = (ctp.c_char_p * len(vector))()
    890 if gmt_type == self["GMT_DATETIME"]:
    891     vector_pointer[:] = np.char.encode(
--> 892         np.datetime_as_string(array_to_datetime(vector))
    893     )
    894 else:
    895     vector_pointer[:] = np.char.encode(vector)

ValueError: Cannot convert a NumPy datetime value other than NaT with generic units

Personally use float16 a lot, but if we're just going to cast the array to float32, that would result in an increase in memory usage, and the user should be warned or know about that. Same for bool -> int8.

I don't think users should be warned for a memory usage increase. GMT does a lot of memory duplication internally and no one cares about it. Instead, we should warn users if we make any changes (e.g., casting types) to the input data.

@seisman
Copy link
Member Author

seisman commented Dec 7, 2023

Considering that GMT doesn't yet support these dtypes, maybe it's better to have the user do the dtype casting themselves rather than PyGMT doing it automatically?

After improving the error messages in PR #2856, I'm convinced that it's better to NOT support dtypes like np.float16, np.bytes_ and np.bool_.

@weiji14
Copy link
Member

weiji14 commented Dec 7, 2023

Ok, shall we rescope this issue to just focus on supporting np.timedelta then?

@seisman seisman changed the title Support more NumPy dtypes Support numpy.timedelta Dec 7, 2023
@seisman seisman changed the title Support numpy.timedelta Support numpy.timedelta64 Dec 7, 2023
@seisman seisman added feature request New feature wanted and removed discussions Need more discussion before taking further actions labels Dec 7, 2023
@seisman
Copy link
Member Author

seisman commented Dec 8, 2023

Here is a GMT CLI script showing how GMT deal with relative times:

gmt begin times png
gmt set TIME_UNIT d TIME_EPOCH 2023-01-01T00:00:00
gmt basemap -R0t/10t/0/1 -JX10c/5c -Bxa2D -Byaf
gmt plot -Sc0.2c -f0t,1y << EOF
1 0.2
2 0.5
3 0.8
EOF
gmt end show

It produces:
times

-f0t tells GMT that the first column contains relative times. Here, 1 means 1 day (TIME_UNIT) since 2023-01-01T00:00:00 (TIME_EPOCH). Then, GMT converts the relative times to absolute times before plotting.

To support np.timedelta64 dtype in PyGMT, we need to:

  1. Get the current value of TIME_EPOCH.
  2. Add the TIME_EPOCH to the np.deltatime64 object.

However, in the Python world, converting np.timedelta64 object to np.datetime64 object is easy and straightforward. Thus, I feel it makes little sense to support np.timedelta64 in PyGMT. Thus, I feel it makes no sense to do the conversion internally in PyGMT.

@seisman
Copy link
Member Author

seisman commented Dec 8, 2023

I've had to make plots for showing how a variable like RMSE changes over time from Day 0 to Day 10. The below one is using seaborn:

rmse_huv500

Note though, the 1e14 scaling at the bottom right. I couldn't quite find a way to remove that...

I think I did try to do this with GMT/PyGMT at first, but couldn't work out if GMT actually supports plotting relative time. Maybe there's a way to do it by changing the configs at https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#calendar-time-parameters? But I'm not sure how we can convert np.timedelta into something GMT understands.

For you case, I think you're expecting a relative-time axis.

>>> import pandas as pd
>>> import numpy as np
>>> data = pd.timedelta_range(start="1 day", periods=10)
>>> data
TimedeltaIndex([ '1 days',  '2 days',  '3 days',  '4 days',  '5 days',
                 '6 days',  '7 days',  '8 days',  '9 days', '10 days'],
               dtype='timedelta64[ns]', freq='D')
>>> data.to_numpy()
array([ 86400000000000, 172800000000000, 259200000000000, 345600000000000,
       432000000000000, 518400000000000, 604800000000000, 691200000000000,
       777600000000000, 864000000000000], dtype='timedelta64[ns]')
>>> data.to_numpy(dtype="timedelta64[D]")
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype='timedelta64[D]')

I feel we should convert np.timedelta64 dtype to the most suitable int type.

@seisman
Copy link
Member Author

seisman commented Dec 8, 2023

It's straightforward to support np.timedelta64 type. We just need to add np.timedelta64 : "GMT_LONG" to

DTYPES = {
np.int8: "GMT_CHAR",
np.int16: "GMT_SHORT",
np.int32: "GMT_INT",
np.int64: "GMT_LONG",
np.uint8: "GMT_UCHAR",
np.uint16: "GMT_USHORT",
np.uint32: "GMT_UINT",
np.uint64: "GMT_ULONG",
np.float32: "GMT_FLOAT",
np.float64: "GMT_DOUBLE",
np.str_: "GMT_TEXT",
np.datetime64: "GMT_DATETIME",
}

Then the following codes work:

>>> import pygmt
>>> import numpy as np
>>> data = np.arange(np.timedelta64(1, "D"), np.timedelta64(10, "D"))
>>> data
array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='timedelta64[D]')
>>> pygmt.info(data)
'<vector memory>: N = 9 <1/9>\n'

>>> data = data.astype("timedelta64[s]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <86400/777600>\n'

>>> data = data.astype("timedelta64[ns]")
>>> pygmt.info(data)
'<vector memory>: N = 9 <8.64e+13/7.776e+14>\n'

@weiji14
Copy link
Member

weiji14 commented Dec 16, 2023

Ok, started a PR for this at #2884, and managed to recreate my example plot above. Example code:

import numpy as np
import pygmt

fig = pygmt.Figure()
fig.basemap(
    projection="X8c/5c",
    region=[0, 8, 0, 10],
    frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
    x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
    y=np.geomspace(start=0.1, stop=9, num=8),
    style="c0.2c",
    pen="1p",
)
fig.show()

produces

forecast_rmse

However, passing np.timedelta64 to the region parameter of basemap doesn't work. It gets interpreted as a datetime.

fig = pygmt.Figure()
fig.basemap(
    projection="X8c/5c",
    region=[np.timedelta64(0, "D"), np.timedelta64(8, "D"), 0, 10],
    frame=["WSne", "xaf+lForecast Days", "yaf+lRMSE"],
)
fig.plot(
    x=np.arange(np.timedelta64(0, "D"), np.timedelta64(8, "D")),
    y=np.geomspace(start=0.1, stop=9, num=8),
    style="c0.2c",
    pen="1p",
)
fig.show()

produces

forecast_rmse2

Not sure why timedelta64 gets cast to a datetime dtype in region. Looks like we might need to do some extra handling?

@weiji14
Copy link
Member

weiji14 commented Dec 21, 2023

Not sure why timedelta64 gets cast to a datetime dtype in region. Looks like we might need to do some extra handling?

Figured out how we could use timedelta64 in region/-R at d5bedc2, but the code is not very nice. Best to handle that logic separately after #2884 is merged, as mentioned in #2884 (comment).


Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from #2884 (comment):

Apparently Python has a built-in datetime.timedelta objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.

import datetime

td = datetime.timedelta(hours=24)

print(np.timedelta64(td))
# numpy.timedelta64(86400000000,'us')
print(np.timedelta64(td).astype(int))
# 86400000000

Should we still support datetime.timedelta? Or only np.timedelta64?

@seisman
Copy link
Member Author

seisman commented Dec 21, 2023

Another thing to consider - do we want to also support Python's built-in datetime.timedelta object? Quoting from #2884 (comment):

Apparently Python has a built-in datetime.timedelta objects, but it's super hard to cast it to an integer type while preserving the orignal unit (e.g. day, hour, minute, etc). E.g.

import datetime

td = datetime.timedelta(hours=24)

print(np.timedelta64(td))
# numpy.timedelta64(86400000000,'us')
print(np.timedelta64(td).astype(int))
# 86400000000

Should we still support datetime.timedelta? Or only np.timedelta64?

I guess not.

@seisman
Copy link
Member Author

seisman commented Oct 31, 2024

I've updated the top post (#2848 (comment)) based on what I learned recently about NumPy dtypes. After PR #3566, there are still a few unsupported dtypes that can be supported.

  1. np.float16: Cast to np.float32 and warn about memory increase
  2. np.longdouble: Cast to np.float64 and warn about data loss
  3. np.bool: Cast to np.uint8 and warn about memory increase
  4. np.datetime64: Already supported, but unsure about the support of various date/time units
  5. np.timedelta64: Partially supported in Support timedelta64 dtype as input #2884, but there are two remaining issues:
    • data are passed as 64-bit integers to the GMT C API, and we can't (at least we don't know how) tell GMT that the data is relative times.
    • Support timedelta/np.timedelta64 in the region parameter

For np.float16/np.longdouble/np.bool, currently, we raise an exception saying that the dtype is unsupported but we don't mention what to do. Instead of raising an exception and exiting, I still think we should cast them into recognized dtypes and raise a warning to let users know what PyGMT is doing and what they can do to suppress the error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature wanted
Projects
None yet
Development

No branches or pull requests

2 participants