Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NaN handling in calc_wd_mean_radial #68

Merged
merged 4 commits into from
Feb 24, 2023

Conversation

misi9170
Copy link
Collaborator

Issue

calc_wd_mean_radial() produces some inconsistent behavior. When the angles_array_deg input is a numpy array that contains NaNs, the output mean wind direction is NaN. However, when angles_array_deg is a dataframe, with only a few NaN values, the output ignores the NaNs, and when angles_array_deg contains entire rows/columns of NaNs, the output is 0.

Solution

I propose the following approach:

  • If some, but not all, wind directions in the mean are NaNs, those values are ignored and the mean is over the non-NaN values
  • If all wind directions in the mean are NaNs, NaN is returned.

Note that this is slightly different than current behavior, which:

  • returns NaN when any NaNs are present if angles_array_deg is a numpy array; and returns the mean over non-NaN values if angles_array_deg is a dataframe.
  • returns NaN when all wind directions are NaN if angles_array_deg is a numpy array; and returns 0 when all wind directions are NaN if angles_array_deg is a dataframe.

Steps to recreate issue

Run the following code on the current develop branch and compare to this bugfix branch:

from flasc import circular_statistics as cs
import numpy as np
import pandas as pd

print(np.nansum(np.array([3, 2, np.nan]))) # Intuitive
print(np.nansum(np.array([np.nan, np.nan, np.nan]))) # Not particularly intuitive

# generate data centered around 0
np.random.seed(1)
X = 360 + np.random.randn(5, 3)*5

## Behavior for numpy arrays

print(X)
# Normal behavior
print(cs.calc_wd_mean_radial(X))
print(cs.calc_wd_mean_radial(X, 1))

# Behavior when NaNs present
X_nans = X.copy()
X_nans[:,1] = np.nan
X_nans[1,:] = np.nan

print(X_nans)
# All nans due to presence of at least one nan
print(cs.calc_wd_mean_radial(X_nans)) 
print(cs.calc_wd_mean_radial(X_nans, 1)) 

## Behavior when data is a pandas dataframe
df = pd.DataFrame(X)

print(df)
# Runs as expected when NaNs not present
print(cs.calc_wd_mean_radial(df)) 
print(cs.calc_wd_mean_radial(df, 1)) 

df_nans = pd.DataFrame(X_nans)
print(df_nans)
# Runs as expected when not all values are NaN, but produces zeros when all 
# values are NaN
print(cs.calc_wd_mean_radial(df_nans)) 
print(cs.calc_wd_mean_radial(df_nans, 1)) 

I could also turn the above (or something similar) into a unit test for calc_wd_mean_radial().

@codecov-commenter
Copy link

Codecov Report

Base: 33.18% // Head: 30.31% // Decreases project coverage by -2.87% ⚠️

Coverage data is based on head (b9a783d) compared to base (3794509).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #68      +/-   ##
===========================================
- Coverage    33.18%   30.31%   -2.87%     
===========================================
  Files           38       40       +2     
  Lines         3779     4262     +483     
===========================================
+ Hits          1254     1292      +38     
- Misses        2525     2970     +445     
Impacted Files Coverage Δ
flasc/circular_statistics.py 97.56% <100.00%> (+0.12%) ⬆️
flasc/energy_ratio/energy_ratio_suite.py 8.74% <0.00%> (-1.46%) ⬇️
...sc/energy_ratio/energy_ratio_wd_bias_estimation.py 14.54% <0.00%> (-0.27%) ⬇️
flasc/turbine_analysis/find_sensor_faults.py 8.64% <0.00%> (-0.06%) ⬇️
flasc/energy_ratio/energy_ratio_gain.py 7.16% <0.00%> (ø)
flasc/turbine_analysis/northing_offset.py 9.72% <0.00%> (ø)
flasc/turbine_analysis/ws_pow_filtering.py 7.50% <0.00%> (+0.19%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Bartdoekemeijer
Copy link
Collaborator

Nice find @misi9170! Perhaps we can add an option for NaN handling, similar to scipy.stats.circmean, defaulting to omit as you suggest? See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.circmean.html.

@misi9170
Copy link
Collaborator Author

@Bartdoekemeijer thanks for the suggestion! I had been meaning to test it for speed against scipy.stats.circmean in any case. I found scipy.stats.circmean to be slightly faster than the current implementation. Moreover, setting nan_policy to 'omit' as you suggest provides the expected behavior (ignores NaNs when non-NaNs are present; returns NaN if all NaNs) for both numpy arrays and dataframes. I've therefore updated this function to simply wrap scipy.stats.circmean.

The default nan_policy for circmean is 'propagate', but I've set the default to 'omit' for calc_wd_mean_radial().

@misi9170
Copy link
Collaborator Author

@paulf81 @Bartdoekemeijer OK to merge?

@paulf81
Copy link
Collaborator

paulf81 commented Feb 23, 2023

I was just about to say, but won't this be slower? But happy to hear it's actually faster! Good to go from my end

@misi9170
Copy link
Collaborator Author

misi9170 commented Feb 23, 2023

@paulf81 Thanks for calling that out---I was wrong. I went back and checked, and I had been comparing the times incorrectly. In fact, this proposed implementation (using scipy's circmean) is about 15% slower than the the current implementation. In my opinion, that's a reasonable price to pay for simplicity, but I'm also happy to revert to the explicit calculation.

For reference, on my laptop this operation takes about 3.5 seconds to calculate 1,000,000 means, each containing 100 points, which I think is still fast enough to not be a major hold up for dealing with SCADA records.

@Bartdoekemeijer
Copy link
Collaborator

15% slower for the sake of simplicity seems fine to me!

@misi9170
Copy link
Collaborator Author

Great---squashing and merging.

@misi9170 misi9170 merged commit 17da9bb into NREL:develop Feb 24, 2023
@misi9170 misi9170 deleted the bugfix/circ-mean-nans branch February 24, 2023 15:06
@paulf81
Copy link
Collaborator

paulf81 commented Feb 24, 2023

Thanks guys, and I think agree, simplicity is a good goal. Also was thinking the more we have a practice of using tools versus reinventing probably then we'll also set ourselves up to benefit from improvements made to those tools, so probably in addition to simplicity, this is a good decision from the point of view of good practice, like, don't write any functions that exist already in numpy/scipy. I know this is already merged was just thinking of adding this point to maybe nudge future decisions in this same way. Thanks @misi9170 and @Bartdoekemeijer !

@Bartdoekemeijer
Copy link
Collaborator

Totally agree @paulf81! I think this ties in with #64 too. The more we can rely on OpenOA, the leaner this code becomes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants