
Timeseries data check failing - coercion issue #169

Closed

lwasser opened this issue Jan 13, 2020 · 10 comments
Labels: bug, high priority, time-series

@lwasser

lwasser commented Jan 13, 2020

import matplotcheck.notebook as nb
import matplotcheck.timeseries as mts
import matplotcheck.raster as ra
# BEGIN SOLUTION
import os
import urllib

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import seaborn as sns
import numpy as np
import pandas as pd
import earthpy as et
import hydrofunctions as hf

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# prettier plotting with seaborn
sns.set(font_scale=1.5)
sns.set_style("whitegrid")

# get the data and set the working directory
data = et.data.get_data("colorado-flood")
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# import file
f = "data/colorado-flood/discharge/06730200-discharge-daily-1986-2013.txt"
discharge = pd.read_csv(f,
                        skiprows=23,
                        header=[1, 2],
                        sep='\t',
                        parse_dates=[2])
# drop the second level of the column index
discharge.columns = discharge.columns.droplevel(1)
# set the date column as the index
discharge = discharge.set_index(["datetime"])

monthly_max_all = discharge.resample("M").max()
monthly_max = monthly_max_all['1990':'2014']

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(x=monthly_max.index,
           y=monthly_max["17663_00060_00003"],
           color="purple")
ax.set_title(
    "HW Plot 1: Stream Discharge - Monthly Max Value\n be sure to add x and y labels (not shown here)")

ax.set(xlabel="Date")

# END SOLUTION

### DO NOT REMOVE LINE BELOW ###
plot_1_ts = nb.convert_axes(plt, which_axes="current")

mpc_plot_1_ts = mts.TimeSeriesTester(plot_1_ts)
results = []


TEST

mts.TimeSeriesTester(plot_1_ts)
mpc_plot_1_ts.assert_xydata(xy_expected=precip.reset_index(),
                            xtime=True,
                            xcol="DATE",
                            ycol="HPCP")

Output error (I suspect this has something to do with NaN values, but I am not sure):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-1b4d8c2d48dc> in <module>
      4                            xtime=True,
      5                            xcol="DATE",
----> 6                            ycol="HPCP")
      7 
      8 #precip

~/Documents/github/0-python/matplotcheck/matplotcheck/base.py in assert_xydata(self, xy_expected, xcol, ycol, points_only, xtime, xlabels, tolerence, message)
    938             try:
    939                 np.testing.assert_array_max_ulp(
--> 940                     np.array(xy_data["x"]), np.array(xy_expected[xcol])
    941                 )
    942             except AssertionError:

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_array_max_ulp(a, b, maxulp, dtype)
   1617     __tracebackhide__ = True  # Hide traceback for py.test
   1618     import numpy as np
-> 1619     ret = nulp_diff(a, b, dtype)
   1620     if not np.all(ret <= maxulp):
   1621         raise AssertionError("Arrays are not almost equal up to %g ULP" %

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/testing/_private/utils.py in nulp_diff(x, y, dtype)
   1658         y = np.array(y)
   1659 
-> 1660     t = np.common_type(x, y)
   1661     if np.iscomplexobj(x) or np.iscomplexobj(y):
   1662         raise NotImplementedError("_nulp not implemented for complex array")

<__array_function__ internals> in common_type(*args, **kwargs)

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/lib/type_check.py in common_type(*arrays)
    719             p = array_precision.get(t, None)
    720             if p is None:
--> 721                 raise TypeError("can't get common type for non-numeric array")
    722         precision = max(precision, p)
    723     if is_complex:

TypeError: can't get common type for non-numeric array
  1. I suspect this has to do with NA values; we need to clear those out.
@lwasser lwasser added bug Something isn't working time-series labels Jan 14, 2020
@nkorinek
Member

Update: This bug is very complicated.

The initial suspicion that this was due to NA values in the dataset turned out to be incorrect. I've tested the dataset against itself, and there are no NA values in it to start with. The bug still comes up when the datasets contain no NA values and are identical.

The next guess was related to how the data is transformed during the assert_xydata check. I noticed that the datetime objects were not timezone aware, i.e. the datatype was datetime64[ns], which turned into datetime64[ns, UTC] during the check. Seeing as the error was TypeError: can't get common type for non-numeric array, this seemed like it might be the culprit. Right after importing the data, I added the line dataframe.index = dataframe.index.tz_localize('UTC'). This made the index timezone aware and put it in the same timezone as the data being checked.

Side note: I tried to localize the data to MST, since that's the timezone it was collected in; however, this did not do what was intended. It shifted the times to be 7 hours earlier than UTC. Additionally, even when the data was timezone aware, it still got changed to the UTC timezone during the check.

Even with all data timezone aware and in the same timezone, both the warning and the TypeError are still present.

I'll update this further with what else I try to get around this error.
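
For reference, here is a minimal sketch of the localization step described above, using a small made-up frame (the column name and values are hypothetical):

import pandas as pd

# a small frame with a naive datetime64[ns] index (hypothetical data)
dataframe = pd.DataFrame(
    {"discharge": [30.0, 31.5, 29.8]},
    index=pd.date_range("2013-09-01", periods=3, freq="D"),
)
print(dataframe.index.dtype)  # datetime64[ns] -- timezone naive

# localize the index to UTC so it matches the tz-aware data in the check
dataframe.index = dataframe.index.tz_localize('UTC')
print(dataframe.index.dtype)  # datetime64[ns, UTC] -- timezone aware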

@nkorinek
Member

Alright, we found the issue. It turns out the bug and the warning are not as related as we previously suspected. However, they are fixed by the same change to the code.

The warning was caused by trying to convert a pandas DataFrame column with a datetime64[ns] datatype into a numpy array. So the line in the assert_xydata function in matplotcheck that converts the datetime columns into numpy arrays is what was causing the warning, i.e. line 940: np.testing.assert_array_max_ulp(np.array(xy_data["x"]), np.array(xy_expected[xcol])).

The error we were getting, TypeError: can't get common type for non-numeric array, was caused by the datatype as well. The function np.testing.assert_array_max_ulp checks the datatype of the arrays passed in; if the datatype is not float or integer, it fails outright. We were under the impression that if xtime=True is set, then assert_xydata would automatically convert the dates to numeric values. Upon reviewing the code, this did not seem to be the case.

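A minimal sketch that reproduces this failure mode (assuming, as described above, that it comes from passing datetime64 arrays into the ULP comparison):

import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=3)

# passing datetime64 arrays into the ULP-based comparison raises
# TypeError: can't get common type for non-numeric array,
# because numpy cannot find a common numeric type for them
np.testing.assert_array_max_ulp(np.array(dates), np.array(dates))
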
So the fix is relatively straightforward. We will be adding in the following chunk of code:

if xtime and 'datetime' in str(xy_data.dtypes['x']):
    xy_data["x"] = mdates.date2num(xy_data["x"])
if xtime and 'datetime' in str(xy_expected.dtypes[xcol]):
    xy_expected[xcol] = mdates.date2num(xy_expected[xcol])

This converts the dates to matplotlib's numeric date format, which np.testing.assert_array_max_ulp can then check properly. And since the columns are now numeric, converting them to numpy arrays no longer throws a warning. The added check also verifies that the datatypes are indeed datetime objects before making the transformation.

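For illustration, a small sketch of what mdates.date2num does to a datetime column (the exact float values depend on which epoch your matplotlib version uses):

import matplotlib.dates as mdates
import pandas as pd

dates = pd.Series(pd.date_range("2013-01-01", periods=3))

# date2num converts datetimes to floats (days since matplotlib's epoch),
# which np.testing.assert_array_max_ulp can compare without a TypeError
nums = mdates.date2num(dates)
print(nums.dtype)  # float64
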
I'll make a PR with these changes and a few tests to make sure they behave properly. @lwasser

@lwasser
Author

lwasser commented Jan 17, 2020

awesome @nkorinek thank you!!!

@nkorinek
Member

Gist explaining my current dilemma: https://gist.github.com/nkorinek/accc2b905145763dc2035300a895204c

@lwasser
Author

lwasser commented Feb 3, 2020

it looks like we have a fix for this in #185!! yay! @ryla5068 will test this fix against homework 1, and if it works, he will update the tests there that don't work now... and then we can merge this PR!! Next step will be writing tests for the time series module!!

@lwasser
Author

lwasser commented Apr 6, 2020

notes about this issue

""""
matplotlib stores datetime data in a very... unique way. It stores everything
as the number of days since some epoch. If you plot a dataframe containing
datetime.datetime objects or pandas.Timestamp objects, it will convert them to
days since the epoch. Sometimes matplotlib chooses Jan 1, 1970 as the epoch. Other
times it chooses Jan 1, 0001. If your data contains time data (i.e. higher
precision than just dates), matplotlib will store it as fractional days since
epoch, down to millisecond precision (or whatever precision your data is in).
For datetime data between these epochs, sometimes it will choose to store it as
negative days since 1970, other times it will store it as positive days since
the year 0001.

matplotlib DOES provide functions for converting data from this weird format
back to datetime.datetime or pandas.Timestamp. However, these functions
always assume that the 1970 epoch was used.

matplotlib's documentation claims that negative values for datetime data are
not supported, and therefore data representing dates before 1970 are not
supported. However, matplotlib will happily plot data before 1970 and its
conversion functions will happily accept negative numbers and try to convert
them.

As you might imagine, this presents a number of issues for comparing datetime
data. Most obviously, it gets unreliable when we have to guess which epoch
matplotlib chose to use. We have tried a few different methods here: different
ways of converting the data, converting using both epochs and comparing both,
etc. All of them were pretty messy.

Additionally, there is the issue of floating point truncation error. matplotlib
stores this data with numpy.float64, which has 52 mantissa bits, or about 15
base-10 digits of accuracy. Since the number of days since epoch is often in
the tens-of-thousands, this means that matplotlib may not be able to accurately
represent data with millisecond precision. (Basically, the datatype isn't able
to store such a huge number with such small precision.) The actual available
precision will depend on the dates being used and the epoch matplotlib chooses.

So to solve these problems, we have done two things:

First, we don't bother trying to convert matplotlib's data back ourselves.
Instead, we require that instructors provide the expected data in matplotlib's
format when using assert_xydata(). The easiest way for instructors to do this
is for them to plot the data themselves, create a matplotcheck object from it,
and then extract the data using get_xy(). One weird quirk is that matplotlib
seems to consistently choose the same epoch when plotting the same dataset.
(However, we are unable to predict which epoch this will be for a given
dataset, and matplotlib's conversion functions don't always choose the same
epoch as when the data is plotted.) This solves the problem of being able to
convert the data.

Second, we use numpy.testing.assert_array_max_ulp() for comparing datetime data
(or any other type of numeric data). This method of comparison ensures that
floating-point roundoff error does not cause the assertion to erroneously fail.
However, this cannot prevent truncation error, and therefore cannot prevent a
loss of precision. Practically, what this means is that assert_xydata() cannot
tell the difference between times with differences of tens of milliseconds. If
it can't tell the difference, it will err on the side of passing.

For more info about the issues we've faced with this, take a look at PR #185
"""

@lwasser
Author

lwasser commented Apr 27, 2020

this will be closed via #219, which is almost done!!

@marty-larocque
Contributor

@lwasser now that #219 is ready to be merged, this can be closed!

@nkorinek
Member

nkorinek commented May 14, 2020

This can be closed @lwasser

@lwasser
Author

lwasser commented May 14, 2020

closing!!

@lwasser lwasser closed this as completed May 14, 2020