Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Array{Union{T, Missing}} constructors #24939

Closed
nalimilan opened this issue Dec 6, 2017 · 8 comments
Closed

Improve Array{Union{T, Missing}} constructors #24939

nalimilan opened this issue Dec 6, 2017 · 8 comments
Labels
arrays [a, r, r, a, y, s] missing data Base.missing and related functionality

Comments

@nalimilan
Copy link
Member

We need a way to construct Array{Union{T, Missing}} objects filled with missing easily. Here are three proposals which can be seen either as complementary or as alternatives.

  1. The Missings package provides the missings(::Type, dims...) function for that, which is really convenient since it avoids the need to write Union and Missing. I could make sense to add it to Base.

  2. More generally, if would be nice if Array{Union{T, Missing}}(uninitialized, dims...) implemented a simpler behavior. Currently, it returns an array filled with missing for isbits element types, but an array full of #undef for other types:

julia> Array{Union{Int, Missing}}(uninitialized, 2)
2-element Array{Union{Missing, Int64},1}:
 missing
 missing

julia> Array{Union{String, Missing}}(uninitialized, 2)
2-element Array{Union{Missing, String},1}:
 #undef
 #undef

There are technical reasons for the first case of course. But it would sound reasonable to apply the same rule for all types, if only to make it easier to explain and remember. For example, in the section of the manual about missing values, it would be annoying to have to present an example for isbits types, and an example for other types.

A possible rule would be that uninitialized arrays are filled with the first singleton type, "first" being defined not by the user-specified order but by the internal sorting of types. In terms of performance, I'm not sure whether it would have an impact. Maybe the current behavior allows using zeroed memory pages provided by calloc, which can be more efficient than filling them with a custom value?

  1. As an alternative to Array{Union{T, Missing}}(uninitialized, dims...), it seems appealing to be able to write Array{Union{String, Missing}}(missing, dims...). If that was supported, it would be less of an issue that the rules determining the contents of the array created using Array{Union{T, Missing}}(uninitialized, dims...) are complex: it would only be used when you really want an uninitialized array (as the syntax indeed indicates).

PS: before somebody proposes it, using fill is not really an option since fill(missing, 2) gives an Array{Missing}, and the only alternative is very long: fill!(Array{Union{T, Missing}}(uninitialized, dims...), missing).

@nalimilan nalimilan added missing data Base.missing and related functionality arrays [a, r, r, a, y, s] labels Dec 6, 2017
@rfourquet
Copy link
Member

rfourquet commented Dec 6, 2017

I'm surprised to see that Array{Union{Int, Missing}}(uninitialized, 2) is filled with missing, is that documented or guaranteed behavior in general (e.g. also with Union{Int,Void})? What about Array{Union{Int, Missing, Void}}(uninitialized, 2) ? It seems missing has the priority over nothing, is that reliable? I thought the whole point of unitialized is to not make guarantees about the content of the container (just what happens to be the most efficient), so it seems strange to have a consistent filling in some case.

@nalimilan
Copy link
Member Author

AFAIK that behavior is needed because arrays of isbits Unions cannot be left completely uninitialized: at least you need to choose a type for each entry, i.e. you need to set the type tag to a valid type. So you may as well choose a type reliably, by setting the type tag to zero everywhere. I think the order corresponds to what is shown when printing a type, but we may not want to guarantee it.

The problem is always the same: unless you return random contents, people will eventually depend on the implemented behavior even if its not part of the public API. But returning random contents is not always the most efficient solution.

@rfourquet
Copy link
Member

Ok thanks for the clarification! I'm fine then with having consistent filling for isbits Unions with uninitialzied, but it seems much clearer to have another official way to consistenly fill with missing, even for non-isbits, so I support your options 1) or 3) over 2) by far.

@Sacha0
Copy link
Member

Sacha0 commented Dec 6, 2017

Modulo a few details, option three seems mostly in keeping with the direction in #24595? :)

@StefanKarpinski
Copy link
Member

Agree that option 3 seems like the way to go.

@nalimilan
Copy link
Member Author

#25054 implements option 3.

@andyferris
Copy link
Member

I like option 3 because it shares the API of what I feel to be a more general solution here (#24595 (comment)), which works a bit like:

out = Array{T,N}(input, dims)

# is equivalent to

out = Array{T,N}(unintialized, dims)
out .= input

Replace input with missing, and violà.

@fredrikekre
Copy link
Member

Fixed by #25054

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

6 participants