-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement relational NULL semantics for Nothing for in-memory Column operations #8816
Conversation
@@ -1600,7 +1600,7 @@ type Column | |||
run_vectorized_binary_op self op_name as_vector expected_result_type=Value_Type.Boolean skip_nulls=False new_name=result_name | |||
False -> | |||
set = Set.from_vector as_vector error_on_duplicates=False | |||
run_unary_op self set.contains new_name=result_name skip_nulls=False expected_result_type=Value_Type.Boolean | |||
run_unary_op self set.contains_relational new_name=result_name skip_nulls=True expected_result_type=Value_Type.Boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@radeusgd I'm not sure how to invoke the fallback implementation of ==
(line 214) -- what case uses this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need more context for this question. Line 214 of which file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 214 of this file -- I can't leave a comment there.
Or here: https://github.com/enso-org/enso/blob/develop/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso#L214
I think this code might need to be modified for handling nulls differently, but I'm not sure what cases use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try using ==
on a Mixed
column, e.g.
t = Table.new [["X", ["a", 42, Nothing]], ["Y", ["b", 42, True]]]
(t.at "X") == (t.at "Y")
I think it should trigger it, but you can add some println to make sure :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I remember now -- ==
uses skip_nulls=True, so the fallback is never used for nulls. So no change is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh :D
if (storage.isNa(i)) { | ||
missing.set(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if arg.size() == 0
? Looking at the contains_relational
semantics, it should result in False
, but looking at this code it seems that it will result in Nothing
.
Do we have a test case for such scenario? If not, please let's add it and ensure the implementation is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* A wrapper around BitSet that implements boolean operations conveniently. Unlike BitSet, | ||
* ImmutableBitSet takes a size parameter, which allows .not to be implemented. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
# [input, had_null, had_true, had_false, output] | ||
cases = [[True, True, True, True, True]] | ||
+ [[True, True, True, False, True]] | ||
+ [[True, False, True, True, True]] | ||
+ [[True, False, True, False, True]] | ||
+ [[True, True, False, True, Nothing]] | ||
+ [[True, True, False, False, Nothing]] | ||
+ [[True, False, False, True, False]] | ||
+ [[True, False, False, False, False]] | ||
+ [[False, True, True, True, True]] | ||
+ [[False, True, False, True, True]] | ||
+ [[False, False, True, True, True]] | ||
+ [[False, False, False, True, True]] | ||
+ [[False, True, True, False, Nothing]] | ||
+ [[False, True, False, False, Nothing]] | ||
+ [[False, False, True, False, False]] | ||
+ [[False, False, False, False, False]] | ||
+ [[Nothing, True, True, True, Nothing]] | ||
+ [[Nothing, True, False, True, Nothing]] | ||
+ [[Nothing, False, True, True, Nothing]] | ||
+ [[Nothing, False, False, True, Nothing]] | ||
+ [[Nothing, True, True, False, Nothing]] | ||
+ [[Nothing, True, False, False, Nothing]] | ||
+ [[Nothing, False, True, False, Nothing]] | ||
+ [[Nothing, False, False, False, False]] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should generate this programmatically, it is very hard to parse and check this huge table for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -60,43 +60,30 @@ public Storage<?> runZip( | |||
} | |||
|
|||
private BoolStorage run(BoolStorage storage, boolean hadNull, boolean hadTrue, boolean hadFalse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just my lack of knowledge of the codebase, but what is the difference between Null, Nothing and Missing? This PR is about Nothing, this function takes a hadNull and generates a newMissing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they all mean the same concept here really, so it is just a bit of chaos that got into the codebase over time.
Nothing
is the Enso-centrict name.Null
probably came from the Java internals naming convention (where theNothing
is often represented as anull
reference).- missing is the naming that stemmed from also old version of Table (we used to have
Column.is_missing
instead ofColumn.is_nothing
; the external API got renamed, but the internals likely stayed. The*Missing
bitsets are used when we are storing values like integers in unboxedlong[]
arrays - such don't takenull
(likeObject[]
could), so there needs to be a separate flag of which values are missing. I guess it could be calledisNothing
though, for consistency.
We could probably rename all of these to nothing
-based names.
} | ||
|
||
public static ImmutableBitSet allFalse(int size) { | ||
return new ImmutableBitSet(new BitSet(), size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
return new ImmutableBitSet(new BitSet(size), size);
going to make the allocation more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the usages, I think to the contrary.
The thing is that new BitSet()
creates an empty bitset, which will answer false
to any bitset.get(ix)
.
So it creates the allFalse
bitset while using the minimal amount of memory. Whereas specifying the size
, will pre-allocate the memory - which in this case just seems unnecessary.
So actually I think that keeping not adding the size
is best for allocation (but I'm only 85% sure about it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, BitSet
is really a set of 1-positions, not a set of boolean values, which is why I needed ImmutableBitSet
to have a size field, so it would really be more like a set of bits. Specifically, you can't use .not()
unless you know what the last bit position is. (The not
-related methods are the only ones that use size
for something other than checking that the size are the same.)
} | ||
|
||
public static ImmutableBitSet allTrue(int size) { | ||
return new ImmutableBitSet(new BitSet(), size).not(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly
return new ImmutableBitSet(new BitSet(size), size).not(); ?
neg_input = negation_case.at 1 | ||
neg_argument = negation_case.at 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick but IMO this is misleading neg_input
may be negate
but it may as well be identity
. So the neg_
is misleading what this is.
Maybe just transform_input
instead?
neg_input = negation_case.at 1 | |
neg_argument = negation_case.at 2 | |
transform_input = negation_case.at 1 | |
transform_argument = negation_case.at 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Pull Request Description
Updates in-memory table column operations to treat Nothing as a relational NULL.
This PR does not include changes to Table.join.
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
./run ide build
.