Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surprising behaviors in array initialization #17

Open
knuesel opened this issue Jul 1, 2020 · 11 comments
Open

Surprising behaviors in array initialization #17

knuesel opened this issue Jul 1, 2020 · 11 comments

Comments

@knuesel
Copy link

knuesel commented Jul 1, 2020

The documentation says:

For isbits types, a random/reasonable sentinel will be attempted if none provided.
...
For Integer types, all bit patterns are valid, so a random value is chosen;

But the constructor uses a constant default value. For example the default sentinel for Int is -1, so SentinelArray([-1, -1, -1]) creates an array with three missing values. From the documentation I expected an array of three -1, and a random sentinel different from -1.

The behavior with undef is also surprising: The documentation refers to the "standard undef pattern". As I understand, undef is meant to create uninitialized arrays, for cases where initialization is useless and undesired for performance reasons. But SentinelVector{Int}(undef, 3) actually initializes the array with 3 missing values. I would suggest removing the undef argument from the constructors (so a vector full of 3 missing values can be created with SentinelVector{Int}(3)). A later version could re-introduce the undef pattern with the correct semantics...

@quinnj
Copy link
Member

quinnj commented Oct 8, 2020

I don't think the behavior with undef is surprising; it follows the same pattern as if you did Vector{Union{Float64, Missing}}(undef, 10). Though we are inconsistent with ref types where Vector{Union{String, Missing}}(undef, 10) will give you #undef and SentinelVector{String}(undef, 10) will give you missing, but I think that's permissible since SentinelVector is really about data and we're essentially saying the default for "missing" values is missing, not #undef.

There is an issue with passing in your own arrays to SentinelArray when they contain the sentinel, because then they're treated as missing (or whatever sentinel you want). This is tricky because I think there are use-cases both ways: you have data, that has a sentinel value already (think -999 or something) and so you call SentinelArray(array, -999) and boom; it's fast, easy, and now you're treating -999 as missing. But if you have a data array like [-1, 0, 1, 2, 3] and call SentinelArray(array) then the first element is now treated as missing, probably unintentionally. We should probably have a keyword argument that allows "passing through" the data, which means we'll check the incoming data for any sentinel values and cycle the default sentinel if present. Anyone have any good ideas for what to call that keyword argument? cc: @bkamins @nalimilan

@bkamins
Copy link
Member

bkamins commented Oct 9, 2020

maybe raw::Bool? if true assume no processing, if false then we process the array.

@knuesel
Copy link
Author

knuesel commented Oct 9, 2020

it follows the same pattern as if you did Vector{Union{Float64, Missing}}(undef, 10)

Interesting! This behavior is undocumented though. Here's what I found by looking at discussions in Julia issues:

So what about replacing undef with missing?

@nalimilan
Copy link
Member

I think the behavior with undef is fine, as you noted it's the same as in Base (and e.g. CategoricalArray does the same thing). But it would make sense to support SentinelArray{T,N}(missing, dims) too to parallel the Array constructor.

maybe raw::Bool? if true assume no processing, if false then we process the array.

I agree that would be safer. You could also consider that if the sentinel isn't passed explicitly, then the caller doesn't know it, so you should check whether it's in the data, and if so choose a different random sentinel. OTC when the sentinel is explicitly passed, then the caller knows that such values will be treated as missing if they appear in the data.

Why is -1 used as the sentinel by default? It sounds much more likely to clash with existing values than any other (large) random value.

@quinnj
Copy link
Member

quinnj commented Oct 9, 2020

Why is -1 used as the sentinel by default? It sounds much more likely to clash with existing values than any other (large) random value.

So we can initialize the SentinelArray faster via ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t), pointer(A), 0xff, sizeof(A)); otherwise, we need to iterate each element and set the sentinel value. (i.e. only possible if sentinel has a uniform bit pattern, so 0 or -1)

Yeah, I like the idea of doing it differently if the user is passing the sentinel or not. Let me look into what that change looks like.

@knuesel
Copy link
Author

knuesel commented Oct 10, 2020

I think the behavior with undef is fine, as you noted it's the same as in Base (and e.g. CategoricalArray does the same thing).

It's not terrible, but it seems wasteful to conflate the missing and undef constructors since they have different semantics. One is about initializing an array with missing values, the other is about skipping initialization for performance reasons. I'd rather have an undef constructor that gives the performance boost.

I also see little value in having two ways to do the same thing, one being the "official" way of the core language, and another way that encourages to use an anti-pattern: abusing the undef constructor to have defined (missing) values. I say anti-pattern because in Base it relies on an undocumented implementation detail, and it works almost always but not always (perfect for nasty bugs).

It seems clear that the current design is due to historical reasons (there was no missing Array constructor in Base, and people hadn't realized that the undef-for-missing behavior in Base was not quite reliable). Why not make the API cleaner and more expressive before the 1.0 release? If CategoricalArray does the same, then let's fix it too!

@nalimilan
Copy link
Member

It seems clear that the current design is due to historical reasons (there was no missing Array constructor in Base, and people hadn't realized that the undef-for-missing behavior in Base was not quite reliable).

I don't think that's the case at all. AFAICT the reason why Vector{Union{Missing, Int}}(undef, dims) returns a vector full of missing is that the type tag part of the underlying storage has to be initialized to a valid type (either Int or Missing). The default happens to be Missing in most cases due to how types are sorted, and since for that type the value doesn't matter since it's a singleton, you get missing. But that doesn't affect performance at all so it's perfectly fine.

The same applies to CategoricalArrays, where we also need to fill reference codes with a valid value (otherwise we would have to check that they are in the expected range all the time which would require two comparisons instead of one). 0 is used, which indicates #undef for CategoricalVector{Int} but indicates missing for CategoricalVector{Union{Missing, Int}}.

@bkamins
Copy link
Member

bkamins commented Oct 10, 2020

due to how types are sorted, and since for that type the value doesn't matter since it's a singleton, you get missing

I confirm that this is also my understanding.

@knuesel
Copy link
Author

knuesel commented Oct 10, 2020

I agree 100% (by "current design" I meant that of SentinelArrays, not Base). I'm not arguing against the behavior in Base. And I see now that the behavior of SentinelArrays for undef is not that confusing, so this issue is not the right place to discuss the point I was trying to make in my previous 2 comments. Sorry for the confusion, I'll try to make a clearer case for my point in another issue.

@bkamins
Copy link
Member

bkamins commented Oct 10, 2020

Your comments are valuable - we just try very hard to maintain consistency with Base where possible so it is better to discuss things in detail.

@knuesel
Copy link
Author

knuesel commented Oct 10, 2020

Thanks @bkamins, that makes sense. And I appreciate the time you all take to answer my proposals. I have made a hopefully clearer case in a new issue: #40.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants