Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement relational NULL semantics for Nothing for in-memory Column operations #8816

Merged
merged 62 commits into from
Jan 24, 2024

Conversation

GregoryTravis
Copy link
Contributor

@GregoryTravis GregoryTravis commented Jan 19, 2024

Pull Request Description

Updates in-memory table column operations to treat Nothing as a relational NULL.
This PR does not include changes to Table.join.

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@@ -1600,7 +1600,7 @@ type Column
run_vectorized_binary_op self op_name as_vector expected_result_type=Value_Type.Boolean skip_nulls=False new_name=result_name
False ->
set = Set.from_vector as_vector error_on_duplicates=False
run_unary_op self set.contains new_name=result_name skip_nulls=False expected_result_type=Value_Type.Boolean
run_unary_op self set.contains_relational new_name=result_name skip_nulls=True expected_result_type=Value_Type.Boolean
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@radeusgd I'm not sure how to invoke the fallback implementation of == (line 214) -- what case uses this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need more context for this question. Line 214 of which file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 214 of this file -- I can't leave a comment there.
Or here: https://github.com/enso-org/enso/blob/develop/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso#L214

I think this code might need to be modified for handling nulls differently, but I'm not sure what cases use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try using == on a Mixed column, e.g.

t = Table.new [["X", ["a", 42, Nothing]], ["Y", ["b", 42, True]]]
(t.at "X") == (t.at "Y")

I think it should trigger it, but you can add some println to make sure :)

Copy link
Contributor Author

@GregoryTravis GregoryTravis Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I remember now -- == uses skip_nulls=True, so the fallback is never used for nulls. So no change is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh :D

@GregoryTravis GregoryTravis marked this pull request as ready for review January 19, 2024 20:58
Comment on lines 60 to 61
if (storage.isNa(i)) {
missing.set(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if arg.size() == 0? Looking at the contains_relational semantics, it should result in False, but looking at this code it seems that it will result in Nothing.

Do we have a test case for such scenario? If not, please let's add it and ensure the implementation is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +5 to +8
/**
* A wrapper around BitSet that implements boolean operations conveniently. Unlike BitSet,
* ImmutableBitSet takes a size parameter, which allows .not to be implemented.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Comment on lines 152 to 177
# [input, had_null, had_true, had_false, output]
cases = [[True, True, True, True, True]]
+ [[True, True, True, False, True]]
+ [[True, False, True, True, True]]
+ [[True, False, True, False, True]]
+ [[True, True, False, True, Nothing]]
+ [[True, True, False, False, Nothing]]
+ [[True, False, False, True, False]]
+ [[True, False, False, False, False]]
+ [[False, True, True, True, True]]
+ [[False, True, False, True, True]]
+ [[False, False, True, True, True]]
+ [[False, False, False, True, True]]
+ [[False, True, True, False, Nothing]]
+ [[False, True, False, False, Nothing]]
+ [[False, False, True, False, False]]
+ [[False, False, False, False, False]]
+ [[Nothing, True, True, True, Nothing]]
+ [[Nothing, True, False, True, Nothing]]
+ [[Nothing, False, True, True, Nothing]]
+ [[Nothing, False, False, True, Nothing]]
+ [[Nothing, True, True, False, Nothing]]
+ [[Nothing, True, False, False, Nothing]]
+ [[Nothing, False, True, False, Nothing]]
+ [[Nothing, False, False, False, False]]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should generate this programmatically, it is very hard to parse and check this huge table for correctness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@GregoryTravis GregoryTravis added the CI: Clean build required CI runners will be cleaned before and after this PR is built. label Jan 22, 2024
@@ -60,43 +60,30 @@ public Storage<?> runZip(
}

private BoolStorage run(BoolStorage storage, boolean hadNull, boolean hadTrue, boolean hadFalse) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just my lack of knowledge of the codebase, but what is the difference between Null, Nothing and Missing? This PR is about Nothing, this function takes a hadNull and generates a newMissing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they all mean the same concept here really, so it is just a bit of chaos that got into the codebase over time.

  • Nothing is the Enso-centrict name.
  • Null probably came from the Java internals naming convention (where the Nothing is often represented as a null reference).
  • missing is the naming that stemmed from also old version of Table (we used to have Column.is_missing instead of Column.is_nothing; the external API got renamed, but the internals likely stayed. The *Missing bitsets are used when we are storing values like integers in unboxed long[] arrays - such don't take null (like Object[] could), so there needs to be a separate flag of which values are missing. I guess it could be called isNothing though, for consistency.

We could probably rename all of these to nothing-based names.

}

public static ImmutableBitSet allFalse(int size) {
return new ImmutableBitSet(new BitSet(), size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is

return new ImmutableBitSet(new BitSet(size), size);

going to make the allocation more efficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the usages, I think to the contrary.

The thing is that new BitSet() creates an empty bitset, which will answer false to any bitset.get(ix).

So it creates the allFalse bitset while using the minimal amount of memory. Whereas specifying the size, will pre-allocate the memory - which in this case just seems unnecessary.

So actually I think that keeping not adding the size is best for allocation (but I'm only 85% sure about it).

Copy link
Contributor Author

@GregoryTravis GregoryTravis Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, BitSet is really a set of 1-positions, not a set of boolean values, which is why I needed ImmutableBitSet to have a size field, so it would really be more like a set of bits. Specifically, you can't use .not() unless you know what the last bit position is. (The not-related methods are the only ones that use size for something other than checking that the size are the same.)

}

public static ImmutableBitSet allTrue(int size) {
return new ImmutableBitSet(new BitSet(), size).not();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly

return new ImmutableBitSet(new BitSet(size), size).not(); ?

Comment on lines 178 to 179
neg_input = negation_case.at 1
neg_argument = negation_case.at 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick but IMO this is misleading neg_input may be negate but it may as well be identity. So the neg_ is misleading what this is.

Maybe just transform_input instead?

Suggested change
neg_input = negation_case.at 1
neg_argument = negation_case.at 2
transform_input = negation_case.at 1
transform_argument = negation_case.at 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@GregoryTravis GregoryTravis added the CI: Ready to merge This PR is eligible for automatic merge label Jan 24, 2024
@mergify mergify bot merged commit 5eb3f3b into develop Jan 24, 2024
26 of 27 checks passed
@mergify mergify bot deleted the wip/gmt/5156-nothing branch January 24, 2024 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Clean build required CI runners will be cleaned before and after this PR is built. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants