Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spreadmissings #122

Open
wants to merge 27 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 13 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
*.jl.cov
*.jl.*.cov
*.jl.mem
Manifest.toml
300 changes: 292 additions & 8 deletions src/Missings.jl
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ module Missings

export allowmissing, disallowmissing, ismissing, missing, missings,
Missing, MissingException, levels, coalesce, passmissing, nonmissingtype,
skipmissings
skipmissings, spreadmissings

using Base: ismissing, missing, Missing, MissingException
using Base: @deprecate
Expand Down Expand Up @@ -208,6 +208,290 @@ missing
"""
passmissing(f) = PassMissing{Core.Typeof(f)}(f)

struct SpreadMissings{F} <: Function
f::F
spread::Symbol
function SpreadMissings(f, spread)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function SpreadMissings(f, spread)
function SpreadMissings(f, spread::Symbol)

if !(spread in (:default, :nonmissing, :none))
throw(ArgumentError("spread must be either :default, :nonmissing, or :none"))
end
new{Core.Typeof(f)}(f, spread)
end
end

function non_spreadabble_check(t::Union{AbstractDict, NamedTuple, Tuple})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function non_spreadabble_check(t::Union{AbstractDict, NamedTuple, Tuple})
function non_spreadable_check(t::Union{AbstractDict, NamedTuple, Tuple})

T = typeof(t)
s = "Spreadmissings on $T is reserved. Please wrap in Ref to be " *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s = "Spreadmissings on $T is reserved. Please wrap in Ref to be " *
s = "spreadmissings on $T is reserved. Please wrap in Ref to be " *

Also it is true that wrapping it in a Ref will work? We don't unwrap Refs before passing them to the function, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will not work... I will make it error for now.

"treated as a scalar."
throw(ArgumentError(s))
end
non_spreadabble_check(x) = nothing

"""
pdeffebach marked this conversation as resolved.
Show resolved Hide resolved
Given an input vector `a` where `nonmissinginds` is guaranteed
to not include any missing values, return a `SubArray` referencing
the `nonmissinginds`. The element type of the returned output
does not include `missing`.
"""
function nomissing_subarray(a::AbstractVector, nonmissinginds::AbstractVector)
T = nonmissingtype(eltype(a)) # Element type
N = 1 # Dimension of view
P = typeof(a) # Type of parent array
I = Tuple{typeof(nonmissinginds)} # Type of the non-missing indices
L = Base.IndexStyle(a) == IndexLinear # If the type supports fast linear indexing (assumed true)
pdeffebach marked this conversation as resolved.
Show resolved Hide resolved
SubArray{T, N, P, I, L}(a, (nonmissinginds,), 0, 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for SubArray makes no mention of contructors, and in fact says "Construct SubArrays using the view function." So this direct usage of a SubArray constructor is usage of undocumented Base internals that could change in any minor or patch release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's probably true.

I think it will be worthwhile to make our own array type that does this better... hopefully it's not too hard.

Copy link

@aplavin aplavin Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be relevant – my SentinelViews.jl package does kinda the opposite. It defines a view type that propagates the sentinel value from indices to the results, sentinelview([10, 20, 30], [1, nothing, 3]) == [10, nothing, 30].
It was reasonably straightforward to implement, as should be the "typesubtract view" needed here. With the obvious downside that ::SubArray function dispatches won't be triggered.

end

function new_args_subarray(args::Tuple, nonmissinginds::AbstractVector)
newargs = ntuple(length(args)) do i
a = args[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent, right?

Suggested change
newargs = ntuple(length(args)) do i
a = args[i]
newargs = map(args) do a

Also maybe better call a x as it's not always an array.

if a isa AbstractVector
nomissing_subarray(a, nonmissinginds)
else
a
end
end
end

function maybespread_missing(f, newargs, new_kwargs, vecs, nonmissinginds, nonmissingmask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make signature more specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

spread = f.spread
res = f.f(newargs...; new_kwargs...)

if res isa AbstractVector
# Default and spread have the same behavior if
# output is a vector
if spread === :default || spread === :nonmissing
if length(res) != length(nonmissinginds)
s = "When spreading a vector result, " *
"length of output must match number of jointly non-"
"missing indices in inputs. Currently spread = :$(spread)."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s = "When spreading a vector result, " *
"length of output must match number of jointly non-"
"missing indices in inputs. Currently spread = :$(spread)."
s = "When spreading a vector result with `spread=$(spread)`, " *
"length of output must match number of jointly non-"
"missing values in inputs "
"(got $(length(res)) and $(length(nonmissinginds))).".

throw(DimensionMismatch(s))
end
out = similar(res, Union{eltype(res), Missing}, length(vecs[1]))
fill!(out, missing)
out[nonmissingmask] .= res
elseif spread === :none
out = res
else
throw(ArgumentError("Should not reach 1"))
end
else
if spread === :nonmissing
out = Vector{Union{typeof(res), Missing}}(undef, length(vecs[1]))
fill!(out, missing)
for ind in nonmissinginds
out[ind] = res
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for ind in nonmissinginds
out[ind] = res
end
out[nonmissinginds] .= Ref(res)

elseif spread === :default || spread === :none
out = res
else
throw(ArgumentError("Should not reach 2"))
end
end

return out
end

function maybespread_nomissing(f, args, kwargs, vecs)
spread = f.spread
res = f.f(args...; kwargs...)

if res isa AbstractVector
# Default and spread have the same behavior if
# output is a vector
if spread === :default || spread === :nonmissing
if length(res) != length(first(vecs))
s = "When spreading a vector result, " *
"length of output must match number of jointly non-"
"missing indices in inputs. Currently spread = :$(spread)."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s = "When spreading a vector result, " *
"length of output must match number of jointly non-"
"missing indices in inputs. Currently spread = :$(spread)."
s = "When spreading a vector result with `spread=$(spread)`, " *
"length of output must match number of jointly non-"
"missing values in inputs "
"(got $(length(res)) and $(length(nonmissinginds))).".

throw(DimensionMismatch(s))
end
out = res
elseif spread === :none
out = res
else
throw(ArgumentError("Should not reach 1"))
end
else
if spread === :nonmissing
out = Vector{typeof(res)}(undef, length(vecs[1]))
fill!(out, res)
elseif spread === :default || spread === :none
out = res
else
throw(ArgumentError("Should not reach 2"))
end
end

return out
end

function check_indices_match(vecs...)
Base.require_one_based_indexing(vecs...)
findex = eachindex(first(vecs))
# If vectors don't have the same indices, throw a
# nice error for the user.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you require one-based indexing, checking whether arrays have the same indices simply requires checking that they have the same length.

Also, instead of checking this manually, you could just call eachindex(vecs...) as it already throws an explicit error when indices are not compatible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I didn't know about it. I'll keep the check_indices_match function for readability.

if !(all(x -> eachindex(x) == findex, vecs[2:end]))
d = Dict()
for i in 1:length(vecs)
e = eachindex(vecs[i])
if eachindex(e) in keys(d)
push!(d[i], i)
else
d[e] = [i]
end
end
s = "The indices of vector-input arguments are not all " *
"the same.\n"

for k in keys(d)
inds = join(d[k], ", ", " and ")
ind_msg = "Vector inputs $inds have indices $k\n"
s = s * ind_msg
end

throw(DimensionMismatch(s))
end
end


function (f::SpreadMissings{F})(args...; kwargs...) where {F}
kwargs_vals = values(values(kwargs))
xs = tuple(args..., kwargs_vals...)
foreach(non_spreadabble_check, xs)

# Detect vector inputs which contain missing in
# either the main arguments or keyword arguments
if any(x -> x isa AbstractVector{>:Missing}, xs)
# Check that all vector inputs have the
# same indices. Collect these vector inputs
# into a single object.
#
# TODO: Allow users to protect vector inputs
vecs = Base.filter(x -> x isa AbstractVector, xs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vecs = Base.filter(x -> x isa AbstractVector, xs)
vecs = filter(x -> x isa AbstractVector, xs)

check_indices_match(vecs...)
# Determine which indices in our collection of
# vector inputs have no missing values in
# all our inputs.
nonmissingmask = fill(true, length(vecs[1]))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

for v in vecs
nonmissingmask .&= .!ismissing.(v)
end
nonmissinginds = findall(nonmissingmask)
# Construct new versions of arguments
# with no Vector{Union{T, Missing}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# with no Vector{Union{T, Missing}}
# with SubArrays whose eltypes do not allow Missing

newargs = new_args_subarray(args, nonmissinginds)
new_kwargs_vals = new_args_subarray(kwargs_vals, nonmissinginds)

new_kwargs = NamedTuple{keys(kwargs)}(new_kwargs_vals)
maybespread_missing(f, newargs, new_kwargs, vecs, nonmissinginds, nonmissingmask)
# There is at least one vector, but none of the vectors can contain missing
elseif any(x -> x isa AbstractVector, xs)
vecs = Base.filter(x -> x isa AbstractVector, xs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vecs = Base.filter(x -> x isa AbstractVector, xs)
vecs = filter(x -> x isa AbstractVector, xs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still applies.

check_indices_match(vecs...)
maybespread_nomissing(f, args, kwargs, vecs)
else
f.f(args...; kwargs...)
end
end

"""
spreadmissings(f; spread = :default)

Given a function `f`, function `f` but performs a transformation
on arguments to remove missing values before executing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Given a function `f`, function `f` but performs a transformation
on arguments to remove missing values before executing.
Return a function which calls function `f` after skipping entries
corresponding to missing values in `AbstractVector` arguments.
All input vectors must have the same length. Non-`AbstractVector`
arguments are left untouched.
If `spread` is `:default` or `:nonmissing` and `f` returns a vector,
its length must be equal to the number of jointly non-missing
entries in the vector inputs. A vector of the same length as
vector inputs is returned, filling positions corresponding
to missing values with `missing`.
If `spread` is `:none`, or if `f` returns a value other than a vector,
it is returned as-is.
For each vector argument, `f` is passed a `SubArray`
view with an element type equal to `nonmissingtype(T)`,
with `T` the element type of the original argument.


### Initial example
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Initial example
# Examples


```julia-repl
julia> using Statistics;
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
julia> using Statistics;
julia> using Statistics


julia> xmiss = [1, 2, 3, missing];

julia> ymiss = [missing, 200, 300, 400];

julia> summeans(x, y) = mean(x) + mean(y);

julia> spreadmissings(summeans)(xmiss, ymiss)
252.5
```

### Details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Details
# Extended help


Given the call
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Given the call
The behavior of `spreadmissing` can be illustrated using an example.
The call


```
spreadmissings(f)(x::AbstractVector, y::Integer, z::AbstractVector)
```

finds the indices which corresond to `missing` values in *both*
`x` and `z`. Then apply `f` on the `SubArray`s of `x` and `z` which
contain non-missing values. In essense:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
finds the indices which corresond to `missing` values in *both*
`x` and `z`. Then apply `f` on the `SubArray`s of `x` and `z` which
contain non-missing values. In essense:
finds the indices which correspond to `missing` values in *both*
`x` and `z`. Then `f` is applied on the `SubArray`s of `x` and `z` which
contain non-missing values. This is essentially equivalent to:


```
inds = .!missing.(x) .& .!missing.(z)
sx = view(x, inds); sy = view(y, inds)
f(sx, y, sy)
```

!!! note
`spreadmissings` does not use the default `view` behavior. Rather,
it constructs a `SubArray` directly such that the eltype of the new
inputs do not include `Missing`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
!!! note
`spreadmissings` does not use the default `view` behavior. Rather,
it constructs a `SubArray` directly such that the eltype of the new
inputs do not include `Missing`.
The only difference is that `spreadmissings` does not use the
default `view` behavior. Rather, it constructs `SubArray`s directly
such that the element types of the new inputs do not include `Missing`.


### `spread` keyword argument

Control over how the output from `f` is "spread"
along with respect to missing values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### `spread` keyword argument
Control over how the output from `f` is "spread"
along with respect to missing values.
# `spread` keyword argument
The `spread` keyword argument controls whether the output from
`f` is "spread" over non-missing values.


* `:default`:
* If `output` is a `Vector` with the same length as the number of
jointly non-missing elements of the inputs `output` is "spread"
to match the non-missing elements of the inputs.
* If the `output` is a `Vector` whose length is not the same
as the length of number of non-missing elements of the inputs,
a `DimensionMismatch` error is thrown.
* If the output is not a `Vector`, `output` is simply returned directly
pdeffebach marked this conversation as resolved.
Show resolved Hide resolved
* `:nonmissing`:
* If `output` is a `Vector`, behavior is the same as `:default`
* If `output` is not a `Vector`, `output` is spread along non-missing
elements of the inputs.
* `:none`: `output` is returned directly, whether a `Vector` or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If `output` is a `Vector`, behavior is the same as `:default`
* If `output` is not a `Vector`, `output` is spread along non-missing
elements of the inputs.
* `:none`: `output` is returned directly, whether a `Vector` or not.
* If output is not a vector, it is is spread over non-missing
elements of the inputs.
* If output is a vector, behavior is the same as `:default`.
* `:none`: output is returned directly, whether a vector or not.


A summary of the behavior is given in the table below:

| spread \\ output type | Vector | Non-Vector |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| spread \\ output type | Vector | Non-Vector |
| spread \\ output type | Vector | Non-vector |

|:---------------------- |:---------------- |:---------------- |
| :default | spread and match | return |
| :nonmissing | spread and match | spread and match |
| :none | return | return |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "and match" useful here? This vocabulary isn't used anywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "match" bit is relevant when discussing the "spreading" behavior. Compare the way @rtransform df :m = mean(:x) to @transform df :m = @spreadmissings mean(:x) or something. In the former, you get the mean everywhere, in the latter you get the mean for non-missing indices only. It "matches" the output with the nonmissing indices.

I added a note about this in the docstring. We should consider adding a keyword argument for this behavior eventually.


If there are `AbstractVector` inputs but none of these inputs
`AbstractVector{>:Missing}`, behavior of `spread` is the same as
with inputs which allows for missing values. However the returned
vectors will not allow for `missing`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If there are `AbstractVector` inputs but none of these inputs
`AbstractVector{>:Missing}`, behavior of `spread` is the same as
with inputs which allows for missing values. However the returned
vectors will not allow for `missing`.
If there are `AbstractVector` inputs but none of these inputs are
`AbstractVector{>:Missing}`, the returned vectors will not allow
for `missing`.


If none of the argumets are `AbstractVector`s, `spreadmissings(f)`
behaves the same as `f` regardpess of `spread`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If none of the argumets are `AbstractVector`s, `spreadmissings(f)`
behaves the same as `f` regardpess of `spread`.
If none of the arguments are `AbstractVector`s, `spreadmissings(f)`
behaves the same as `f` regardless of `spread`.


### Limitations

`spreadmissings` currently does not support:

* Different length vector inputs. For

```
spreadmissings(f)([1, 2], [100, 200, 300])
```

will error.

* Full spreading of scalar outputs across the *full* length of the
input vector. That is, there is no `spread = :all` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop this as these are not really limitations. We do not support this on purpose.

"""
spreadmissings(f; spread = :default) = SpreadMissings(f, spread)

"""
skipmissings(args...)

Expand Down Expand Up @@ -258,7 +542,7 @@ struct SkipMissings{V, T}
others::T
end

Base.@propagate_inbounds function _anymissingindex(others::Tuple{Vararg{AbstractArray}}, i)
Base.@propagate_inbounds function _anymissingindex(others::Tuple{Vararg{AbstractArray}}, i)
for oth in others
oth[i] === missing && return true
end
Expand All @@ -267,7 +551,7 @@ Base.@propagate_inbounds function _anymissingindex(others::Tuple{Vararg{Abstract
end

@inline function _anymissingiterate(others::Tuple, state)
for oth in others
for oth in others
y = iterate(oth, state)
y !== nothing && first(y) === missing && return true
end
Expand All @@ -278,7 +562,7 @@ end
const SkipMissingsofArrays = SkipMissings{V, T} where
{V <: AbstractArray, T <: Tuple{Vararg{AbstractArray}}}

function Base.show(io::IO, mime::MIME"text/plain", itr::SkipMissings{V}) where V
function Base.show(io::IO, mime::MIME"text/plain", itr::SkipMissings{V}) where V
print(io, SkipMissings, '{', V, '}', '(', itr.x, ')', " comprised of " *
"$(length(itr.others) + 1) iterators")
end
Expand Down Expand Up @@ -335,7 +619,7 @@ end
@inline function Base.getindex(itr::SkipMissingsofArrays, i)
@boundscheck checkbounds(itr.x, i)
@inbounds xi = itr.x[i]
if xi === missing || @inbounds _anymissingindex(itr.others, i)
if xi === missing || @inbounds _anymissingindex(itr.others, i)
throw(MissingException("the value at index $i is missing for some element"))
end
return xi
Expand Down Expand Up @@ -380,9 +664,9 @@ Base.mapreduce_impl(f, op, A::SkipMissingsofArrays, ifirst::Integer, ilast::Inte
A = itr.x
if ifirst == ilast
@inbounds a1 = A[ifirst]
if a1 === missing
if a1 === missing
return nothing
elseif _anymissingindex(itr.others, ifirst)
elseif _anymissingindex(itr.others, ifirst)
return nothing
else
return Some(Base.mapreduce_first(f, op, a1))
Expand Down Expand Up @@ -436,7 +720,7 @@ end
Return a vector similar to the array wrapped by the given `SkipMissings` iterator
but skipping all elements with a `missing` value in one of the iterators passed
to `skipmissing` and elements for which `f` returns `false`. This method
only applies when all iterators passed to `skipmissings` are arrays.
only applies when all iterators passed to `skipmissings` are arrays.

# Examples
```
Expand Down
Loading