Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement expected analysis #236

Draft
wants to merge 35 commits into
base: develop
Choose a base branch
from
Draft

Implement expected analysis #236

wants to merge 35 commits into from

Conversation

paulf81
Copy link
Collaborator

@paulf81 paulf81 commented Nov 6, 2024

Implement expected analysis

Feature or improvement description
This branch will implement the methods of uplift analysis described in AWC validation methodology. It also makes some changes to names of existing functions to better clarify the operation of the seperate methods for analyzing data and quantifying uplift.

Changes to be included:

  • Renaming EnergyRatioInput to AnalysisInput to make clear can be used for both
  • Providing an alias to EnergyRatioInput with deprecation warning to avoid breaking changes
  • Renaming ComputeTotalUplift to TotalUpliftPowerRatio to clarify the method of analysis
  • Add an alias with deprecation warning
  • Updating all references in code
  • Adding new module expected_power_analyisis.py which will implement the methods in AWC validation methodology
  • Add new module expected_power_analysis_output.py for catching the output of functions in expected_power_analysis and provided post-processing operations
  • Adding tests for any new function
  • Add an example
  • Adding documentation

@paulf81 paulf81 added the enhancement An improvement of an existing feature label Nov 6, 2024
@ejsimley
Copy link
Collaborator

@paulf81, I was going to add a test to make sure the uplift variance is correct using _total_uplift_expected_power_with_standard_error but then saw a couple potential issues with the way uncertainty gets calculated.

First, if setting variance_only = True or fill_cov_with_var = True so we can still get an uncertainty estimate even when we have missing covariance terms, what happens when some variance terms are Null (because there is only one sample in a bin for the turbine)? Then it seems we still wouldn't be able to get a non-Null uplift variance. Should we remove bins where there is only one sample (or remove individual turbines in the bin from the rest of the analysis if there is only 1 sample)?

The other issue is when remove_any_null_turbine_bins = False, which is the default case. If a turbine is missing in a particular bin we can still compute the expected farm power in that bin, we just ignore the missing turbines. But when computing the farm power variance for that bin, I think the variance terms for all of the turbine pairs still get used. So the farm power variance will be Null for all of these cases, even though we intended to just not include the missing turbines for those bins. I think this could be addressed relatively easily when computing farm variance, but wanted to see what you think.

Also, let me know if I'm wrong about any of the above. I didn't get a chance to thoroughly check this behavior yet.

test_cols=test_cols,
bin_cols_without_df_name=bin_cols_without_df_name,
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking this could be a good place to call a new utility function that synchronizes nulls between df_cov and df_bin (something like "_synchronize_mean_power_cov_nulls"). Specifically, for each row, if there are any turbines in df_cov with undefined variances or covariances (because count < 2), then the mean power for those turbines would get set to Null as well. This way, we would always be able to return a standard error for the uplift by excluding turbines in a given bin with undefined covariance from both the expected uplift and uncertainty calculations. I like this because we wouldn't end up returning a NaN uncertainty value when calling this function. In most cases, though, we would probably just want to set variance_only = True or fill_cov_with_var = True to maximize the number of turbines that can be used to compute the uplift in each bin.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I added _synchronize_mean_power_cov_nulls to the utilities and test_synchronize_mean_power_cov_nulls to the tests and maybe check the test first but I think it's doing what you describe. I agree this seems to the right approach! Just had a little time so started here, and now the tests of the full function fail but I think that's right. But let me know if this matches what you expected and then I'll got to your later comments and fix the overall result to match


# If any of the cov_cols are null, set pow_farm_var to null
df_bin = df_bin.with_columns(
pl.when(pl.all_horizontal([pl.col(c).is_not_null() for c in cov_cols]))
Copy link
Collaborator

@ejsimley ejsimley Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connected to the comments above, I think we could get rid of this requirement that all cov_cols are defined. When computing the "pow_farm_var" column a few lines above, if my understanding is correct, the summation over cov_cols will just ignore any Null values. So, we'd be calculating the variance of the farm power considering only the turbines that are valid, in the same way that "pow_farm" will only sum the power of the turbines that are not Null. Therefore, the farm power variance will correspond to the same set of turbiens used to compute farm power.

if remove_any_null_turbine_bins:
df_bin = df_bin.filter(
pl.all_horizontal([pl.col(f"{c}_mean").is_not_null() for c in test_cols])
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connected to the comment above, at this point, regardless of the value of "remove_any_null_turbine_bins," we might need to remove rows where all test_cols are Null. Although rare, I think it is possible that if count < 2 for all turbines in a bin, then after synchronizing Nulls between df_cov and df_bin, we could be left with rows that are all Null that should get filtered out from the analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants