Rank Data, Correlation, Covariance, R Squared #3484

Merged on May 30, 2022 (14 commits)
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -131,6 +131,7 @@
and made it the default.][3472]
- [Implemented a `Table.from Text` conversion allowing to parse strings
representing `Delimited` files without storing them on the filesystem.][3478]
- [Added rank data, correlation and covariance statistics for `Vector`][3484]

[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -204,6 +205,7 @@
[3472]: https://github.com/enso-org/enso/pull/3472
[3486]: https://github.com/enso-org/enso/pull/3486
[3478]: https://github.com/enso-org/enso/pull/3478
[3484]: https://github.com/enso-org/enso/pull/3484

#### Enso Compiler

146 changes: 142 additions & 4 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Original file line number Diff line number Diff line change
@@ -1,16 +1,25 @@
from Standard.Base import Boolean, True, False, Nothing, Vector, Number, Any, Error, Array, Panic, Illegal_Argument_Error, Unsupported_Argument_Types

from Standard.Base.Data.Vector import Empty_Error

import Standard.Base.Data.Ordering.Comparator

import Standard.Base.Data.Statistics.Rank_Method

polyglot java import org.enso.base.statistics.Moments
polyglot java import org.enso.base.statistics.CountMinMax
polyglot java import org.enso.base.statistics.CorrelationStatistics
polyglot java import org.enso.base.statistics.Rank

polyglot java import java.lang.IllegalArgumentException
polyglot java import java.lang.ClassCastException
polyglot java import java.lang.NullPointerException

type Statistic
## PRIVATE
Convert the Enso Statistic into Java equivalent.
to_java : SingleValue
to_java = case this of
to_moment_statistic : SingleValue
to_moment_statistic = case this of
Sum -> Moments.SUM
Mean -> Moments.MEAN
Variance p -> if p then Moments.VARIANCE_POPULATION else Moments.VARIANCE
@@ -52,6 +61,32 @@ type Statistic
## The sample kurtosis of the values.
type Kurtosis

## Calculate the Covariance between data and series.

Arguments:
- series: the series to compute the covariance with.
type Covariance (series:Vector)

## Calculate the Pearson Correlation between data and series.

Arguments:
- series: the series to compute the correlation with.
type Pearson (series:Vector)

## Calculate the Spearman Rank Correlation between data and series.

Arguments:
- series: the series to compute the correlation with.
type Spearman (series:Vector)

## Calculate the coefficient of determination between data and predicted
series.

Arguments:
- predicted: the series to compute the r_squared with.
type R_Squared (predicted:Vector)


## Compute a single statistic on a vector like object.

Arguments:
@@ -69,11 +104,11 @@ compute data statistic=Count =
- statistics: Set of statistics to calculate.
compute_bulk : Vector -> [Statistic] -> [Any]
compute_bulk data statistics=[Count, Sum] =

count_min_max = statistics.any s->((s.is_a Count) || (s.is_a Minimum) || (s.is_a Maximum))

java_stats = statistics.map .to_java
java_stats = statistics.map .to_moment_statistic
skip_java_stats = java_stats.all s->s.is_nothing

report_invalid _ =
statistics.map_with_index i->v->
if java_stats.at i . is_nothing then Nothing else
@@ -97,8 +132,88 @@ compute_bulk data statistics=[Count, Sum] =
Maximum ->
if count_min_max_values.comparatorError then (Error.throw Vector.Incomparable_Values_Error) else
count_min_max_values.maximum
Covariance s -> here.calculate_correlation_statistics data s . covariance
Pearson s -> here.calculate_correlation_statistics data s . pearsonCorrelation
Spearman s -> here.calculate_spearman_rank data s
R_Squared s -> here.calculate_correlation_statistics data s . rSquared
_ -> stats_array.at i


## Calculate a variance-covariance matrix between the input series.

Arguments:
- data: The input data sets
covariance_matrix : [Vector] -> [Vector]
covariance_matrix data =
stats_vectors = here.calculate_correlation_statistics_matrix data
stats_vectors.map v->(v.map .covariance)


## Calculate a Pearson correlation matrix between the input series.

Arguments:
- data: The input data sets
pearson_correlation : [Vector] -> [Vector]
pearson_correlation data =
stats_vectors = here.calculate_correlation_statistics_matrix data
stats_vectors.map v->(v.map .pearsonCorrelation)


## Calculate a Spearman Rank correlation matrix between the input series.

Arguments:
- data: The input data sets
spearman_correlation : [Vector] -> [Vector]
spearman_correlation data =
Panic.handle_wrapped_dataflow_error <|
output = Vector.new_builder data.length

0.up_to data.length . each i->
output.append <|
Vector.new data.length j->
if j == i then 1 else
if j < i then (output.at j . at i) else
Panic.throw_wrapped_if_error <|
here.calculate_spearman_rank (data.at i) (data.at j)

output.to_vector


## PRIVATE
wrap_java_call : Any -> Any
wrap_java_call ~function =
report_unsupported _ = Error.throw (Illegal_Argument_Error ("Can only compute correlations on numerical data sets."))
handle_unsupported = Panic.catch Unsupported_Argument_Types handler=report_unsupported

report_illegal caught_panic = Error.throw (Illegal_Argument_Error caught_panic.payload.cause.getMessage)
handle_illegal = Panic.catch IllegalArgumentException handler=report_illegal

handle_unsupported <| handle_illegal <| function


## PRIVATE
Given two series, get a computed CorrelationStatistics object
calculate_correlation_statistics : Vector -> Vector -> CorrelationStatistics
calculate_correlation_statistics x_data y_data =
here.wrap_java_call <| CorrelationStatistics.compute x_data.to_array y_data.to_array


## PRIVATE
Given two series, compute the Spearman Rank correlation.
calculate_spearman_rank : Vector -> Vector -> Decimal
calculate_spearman_rank x_data y_data =
here.wrap_java_call <| CorrelationStatistics.spearmanRankCorrelation x_data.to_array y_data.to_array


## PRIVATE
Given a set of series get CorrelationStatistics objects
calculate_correlation_statistics_matrix : [Vector] -> [CorrelationStatistics]
calculate_correlation_statistics_matrix data =
data_array = Vector.new data.length i->(data.at i).to_array . to_array
stats_array = here.wrap_java_call <| CorrelationStatistics.computeMatrix data_array
Vector.new stats_array.length i->(Vector.Vector (stats_array.at i))


## Compute a single statistic on the vector.

Arguments:
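The `Covariance`, `Pearson` and `R_Squared` statistics in the hunk above delegate to the Java class `CorrelationStatistics`, whose source is not shown on this page. As a rough illustration of what these quantities mean, here is a minimal Python sketch using the textbook formulas. Everything in it is an assumption: the function name `correlation_statistics` is hypothetical, the covariance uses the sample (n - 1) normalisation, and R squared treats the first series as observed and the second as predicted. The PR's Java code may differ on each of these points.

```python
import math


def correlation_statistics(observed, predicted):
    """Return (covariance, pearson, r_squared) for two numeric series.

    Textbook formulas only; this is NOT the PR's Java implementation.
    Constant series (zero variance) are not handled in this sketch.
    """
    if len(observed) != len(predicted):
        raise ValueError("Series must have the same length.")
    n = len(observed)
    if n < 2:
        raise ValueError("At least two data points are required.")

    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n

    # Sums of squared deviations and cross deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in predicted)

    # Assumption: sample covariance, dividing by n - 1.
    covariance = sxy / (n - 1)

    # Pearson correlation: covariance scaled by both standard deviations.
    pearson = sxy / math.sqrt(sxx * syy)

    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    ss_res = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    r_squared = 1 - ss_res / sxx

    return covariance, pearson, r_squared
```

Note that Pearson correlation and R squared answer different questions: `[1, 2, 3, 4]` against `[2, 4, 6, 8]` is perfectly correlated, yet as a prediction of the first series the second is poor, so R squared under this definition is negative.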
@@ -115,3 +230,26 @@ Vector.Vector.compute statistic=Count =
Vector.Vector.compute_bulk : [Statistic] -> [Any]
Vector.Vector.compute_bulk statistics=[Count, Sum] =
here.compute_bulk this statistics


## Assigns a rank to each value of data, dealing with equal values according to the method.

Arguments:
- data: Input data to rank.
- method: Method used to deal with equal values.
rank_data : Vector -> Rank_Method -> Vector
rank_data input method=Rank_Method.Average =
java_method = case method of
Rank_Method.Minimum -> Rank.Method.MINIMUM
Rank_Method.Maximum -> Rank.Method.MAXIMUM
Rank_Method.Average -> Rank.Method.AVERAGE
Rank_Method.Ordinal -> Rank.Method.ORDINAL
Rank_Method.Dense -> Rank.Method.DENSE

report_nullpointer caught_panic = Error.throw (Illegal_Argument_Error caught_panic.payload.cause.getMessage)
handle_nullpointer = Panic.catch NullPointerException handler=report_nullpointer
handle_classcast = Panic.catch ClassCastException handler=(Error.throw Vector.Incomparable_Values_Error)

handle_classcast <| handle_nullpointer <|
java_ranks = Rank.rank input.to_array Comparator.new java_method
Vector.Vector java_ranks
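`spearman_correlation` in this file builds its matrix row by row: the diagonal is 1, entries below the diagonal are copied from rows already computed, and only entries above the diagonal are calculated. The same idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's code: the helper names are hypothetical, and Spearman is taken to be the Pearson correlation of average ranks (the usual definition), which the Java `spearmanRankCorrelation` is not confirmed here to follow exactly.

```python
import math


def average_ranks(values):
    """Rank values ascending, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (no zero-variance guard)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Assumed definition: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def spearman_matrix(series):
    """Build the symmetric matrix, computing each off-diagonal pair only once."""
    rows = []
    for i in range(len(series)):
        row = []
        for j in range(len(series)):
            if j == i:
                row.append(1.0)            # a series correlates perfectly with itself
            elif j < i:
                row.append(rows[j][i])     # reuse the mirrored, already computed entry
            else:
                row.append(spearman(series[i], series[j]))
        rows.append(row)
    return rows
```

Reusing the mirrored entry halves the number of pairwise computations, which matters because each one requires ranking both series.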
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@

## Specifies how to handle ranking of equal values.
type Rank_Method
## Use the mean of all ranks for equal values.
type Average

## Use the lowest of all ranks for equal values.
type Minimum

## Use the highest of all ranks for equal values.
type Maximum

## Use the same rank for equal values; the next group gets the
immediately following rank.
type Dense

## Equal values are assigned consecutive ranks in the order that they occur.
type Ordinal
Comment on lines +2 to +18

This explanation is much better, but I still wonder whether it would be worth offering some examples, either here or next to a method that uses this type. I'm still not sure I correctly understand how Dense or Average work.
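To make the five tie-handling methods concrete, here is a minimal Python sketch of the usual definitions. It is an illustration only, not the PR's Java `Rank` implementation: the function signature and the string method names are hypothetical, and it assumes ascending ranking (the smallest value gets rank 1), which may not match the direction used by `rank_data`.

```python
def rank_data(values, method="average"):
    """Rank values ascending, resolving ties according to `method`."""
    methods = ("average", "minimum", "maximum", "dense", "ordinal")
    if method not in methods:
        raise ValueError(f"Unknown method: {method}")

    # Stable sort of indices, so tied values keep their order of occurrence.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    dense_rank = 0
    start = 0
    while start < len(order):
        end = start
        # Find the end of the group of values tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        dense_rank += 1
        for pos in range(start, end + 1):
            if method == "minimum":
                rank = start + 1              # lowest rank in the tied group
            elif method == "maximum":
                rank = end + 1                # highest rank in the tied group
            elif method == "average":
                rank = (start + end) / 2 + 1  # mean of the group's ranks
            elif method == "dense":
                rank = dense_rank             # groups numbered 1, 2, 3, ... with no gaps
            else:
                rank = pos + 1                # ordinal: ties broken by occurrence
            ranks[order[pos]] = rank
        start = end + 1
    return ranks
```

For the input `[10, 20, 20, 30]` this sketch gives `[1.0, 2.5, 2.5, 4.0]` for average, `[1, 2, 2, 4]` for minimum, `[1, 3, 3, 4]` for maximum, `[1, 2, 2, 3]` for dense and `[1, 2, 3, 4]` for ordinal. Dense differs from minimum in that the group after a tie continues with the next integer instead of skipping ranks.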

36 changes: 24 additions & 12 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Data/Vector.enso
Original file line number Diff line number Diff line change
@@ -55,19 +55,22 @@ fill length ~item =
A vector allows storing an arbitrary number of elements in linear memory. It
is the recommended data structure for most applications.

Arguments:
- capacity: Initial capacity of the Vector.Builder

> Example
Construct a vector using a builder that contains the items 1 to 10.

example_new_builder =
builder = Vector.new_builder
builder = Vector.new_builder 10
do_build start stop =
builder.append start
if start >= stop then Nothing else
@Tail_Call do_build start+1 stop
do_build 1 10
builder.to_vector
new_builder : Builder
new_builder = Builder.new
new_builder : Integer -> Builder
new_builder (capacity=1) = Builder.new capacity

## ADVANCED

@@ -141,13 +144,7 @@ type Vector
at : Integer -> Any ! Index_Out_Of_Bounds_Error
at index =
actual_index = if index < 0 then this.length + index else index
## TODO [RW] Ideally we do not want an additional check here, but we
should catch a Invalid_Array_Index_Error panic. However, such a catch
should still properly forward any other panics or dataflow errors
which is not fully possible until the approach to handling Panics is
improved, as described in the following Pivotal ticket:
https://www.pivotaltracker.com/n/projects/2539304/stories/181029230
if actual_index>=0 && actual_index<this.length then this.unsafe_at actual_index else
Panic.catch Invalid_Array_Index_Error (this.unsafe_at actual_index) _->
Error.throw (Index_Out_Of_Bounds_Error index this.length)

## ADVANCED
@@ -1015,12 +1012,15 @@ type Builder

## Creates a new builder.

Arguments:
- capacity: Initial capacity of the Vector.Builder

> Example
Make a new builder

Vector.new_builder
new : Builder
new = Builder (Array.new 1) 0
new : Integer->Builder
new (capacity=1) = Builder (Array.new capacity) 0

## Returns the current capacity (i.e. the size of the underlying storage)
of this builder.
@@ -1088,6 +1088,18 @@ type Builder
this.append item
Nothing

## Gets an element from the vector at a specified index (0-based).

Arguments:
- index: The location in the vector to get the element from. The index is
also allowed to be negative, in which case the elements are indexed from
the back of the vector, i.e. -1 will correspond to the last element.
at : Integer -> Any ! Index_Out_Of_Bounds_Error
at index =
actual_index = if index < 0 then this.length + index else index
Panic.catch Invalid_Array_Index_Error (this.to_array.at actual_index) _->
Error.throw (Index_Out_Of_Bounds_Error index this.length)

## Checks whether a predicate holds for at least one element of this builder.

Arguments:
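Both `at` implementations in this file now normalise a negative index, attempt the access, and translate the low-level `Invalid_Array_Index_Error` panic into an `Index_Out_Of_Bounds_Error` instead of checking bounds up front. A rough Python analogue of that pattern is sketched below. The class and function names are hypothetical, and one difference is forced by the language: Python lists accept negative indices themselves, so the sketch has to reject an index that is still negative after normalisation.

```python
class IndexOutOfBoundsError(Exception):
    """Hypothetical stand-in for Enso's Index_Out_Of_Bounds_Error."""

    def __init__(self, index, length):
        super().__init__(f"Index {index} is out of bounds for length {length}.")
        self.index = index
        self.length = length


def at(items, index):
    """Get an element by 0-based index; negative indices count from the back."""
    actual_index = len(items) + index if index < 0 else index
    try:
        if actual_index < 0:
            # Python would wrap around a second time here, so fail explicitly.
            raise IndexError(actual_index)
        return items[actual_index]
    except IndexError:
        # Report the index the caller passed, not the normalised one.
        raise IndexOutOfBoundsError(index, len(items)) from None
```

Reporting the original index rather than the normalised one mirrors the diff, where the error is constructed from `index` and `this.length`.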
17 changes: 17 additions & 0 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Error/Common.enso
Original file line number Diff line number Diff line change
@@ -386,6 +386,23 @@ type Panic
True -> caught_panic.convert_to_dataflow_error
False -> Panic.throw caught_panic

## If a dataflow error has occurred, wrap it in a `Wrapped_Dataflow_Error` and promote it to a Panic.

Arguments:
- value: value to return if not an error, or rethrow as a Panic.
throw_wrapped_if_error : Any -> Any
throw_wrapped_if_error ~value =
if value.is_error then Panic.throw (Wrapped_Dataflow_Error value.catch) else value

## Catch any `Wrapped_Dataflow_Error` Panic and rethrow it as a dataflow error.

Arguments:
- action: The code to execute, which may raise a `Wrapped_Dataflow_Error`.
handle_wrapped_dataflow_error : Any -> Any
handle_wrapped_dataflow_error ~action =
Panic.catch Wrapped_Dataflow_Error action caught_panic->
Error.throw caught_panic.payload.payload

## The runtime representation of a syntax error.

Arguments:
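`throw_wrapped_if_error` and `handle_wrapped_dataflow_error` work as a pair: the first promotes an error value to a panic so that a nested computation stops immediately, and the second catches that panic at the outer boundary and turns it back into an error value. The Python sketch below imitates this with an error value class and an exception. All names are hypothetical, and Python has no direct equivalent of Enso's dataflow errors, so this only models the control flow.

```python
class DataflowError:
    """Hypothetical model of an error carried as an ordinary value."""

    def __init__(self, payload):
        self.payload = payload


class WrappedDataflowError(Exception):
    """Exception used to carry an error value out of nested code."""

    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload


def throw_wrapped_if_error(value):
    """Return the value unchanged, or raise if it is an error value."""
    if isinstance(value, DataflowError):
        raise WrappedDataflowError(value.payload)
    return value


def handle_wrapped_dataflow_error(action):
    """Run `action`; convert a wrapped error back into an error value."""
    try:
        return action()
    except WrappedDataflowError as caught:
        return DataflowError(caught.payload)


def safe_div(a, b):
    """Example computation that reports failure as an error value."""
    return DataflowError("division by zero") if b == 0 else a / b


def divide_all(pairs):
    """Stop at the first failing pair and return its error value."""
    return handle_wrapped_dataflow_error(
        lambda: [throw_wrapped_if_error(safe_div(a, b)) for a, b in pairs]
    )
```

This is the shape used by `spearman_correlation` in the PR: the inner calculation is guarded by `Panic.throw_wrapped_if_error`, and the whole loop runs inside `Panic.handle_wrapped_dataflow_error`, so the first failing pair aborts the matrix build and surfaces as a regular dataflow error.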
Original file line number Diff line number Diff line change
@@ -50,8 +50,7 @@ type Vector_Builder
array = Array.new this.length
go ix elem = case elem of
Leaf vec ->
vec.map_with_index vi-> elem->
array.set_at ix+vi elem
Array.copy vec.to_array 0 array ix vec.length
ix + vec.length
Append l r _ ->
ix2 = go ix l