Add size-limited strings and varying bit-width integer Value_Types to in-memory backend and check for ArithmeticOverflow in LongStorage #7557

Merged
merged 40 commits into from
Aug 22, 2023

Conversation

radeusgd
Member

@radeusgd radeusgd commented Aug 10, 2023

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd
Member Author

So I was not completely sure what to do with the case of arithmetic overflow.

We essentially have two cases: 64-bit integers and smaller types.

For 64-bit overflow, we cannot do much as we don't have a bigger type yet. The current behaviour is the standard 'modulus' (wrap-around) overflow, but I think that this is wrong as it can lead to data corruption. So I replace the value with Nothing and report a warning.

Now what if I add two 16-bit values together and they overflow the 16-bit type? Then I have a few more options: I could widen the storage type to 32-bit or 64-bit to make the result fit. However, there is no clear 'heuristic' for when to do so. Do I do this 'on demand'? That would make the result type unpredictable: I start with two 16-bit columns and can end up with whatever.

Or do I do this always? Maybe all operations should return 64-bit integers? But that means I will very quickly 'lose' the smaller bit-width. Even if I had small columns, they will be 'upcast' after any operation is performed on them. If I want them to be small again, I would need to cast them back.

I think that always up-converting to 64-bit could have its merit. It is definitely a bit simpler to implement. But then the smaller types become 'second class citizens', because we immediately 'escape' them. It may be practical though, as we will encounter overflows less often this way and can always re-cast afterwards.
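To make the 64-bit case concrete, here is a minimal Java sketch of checked column addition. It is not the actual `LongStorage` code: the class `CheckedLongAdd`, its `Result` holder and the use of `null` to stand in for Enso's `Nothing` are assumptions made for illustration. Only `Math.addExact` is a real JDK method.

```java
import java.util.Arrays;

// Hypothetical illustration, not the actual LongStorage code.
final class CheckedLongAdd {
    /** Result values (null marks an overflowed row) and how many rows overflowed. */
    static final class Result {
        final Long[] values;
        final int overflowCount;

        Result(Long[] values, int overflowCount) {
            this.values = values;
            this.overflowCount = overflowCount;
        }
    }

    static Result add(long[] xs, long[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        Long[] out = new Long[xs.length];
        int overflows = 0;
        for (int i = 0; i < xs.length; i++) {
            try {
                // Math.addExact throws ArithmeticException instead of wrapping.
                out[i] = Math.addExact(xs[i], ys[i]);
            } catch (ArithmeticException e) {
                out[i] = null; // stands in for Enso's Nothing
                overflows++;   // a real implementation would attach a warning here
            }
        }
        return new Result(out, overflows);
    }

    public static void main(String[] args) {
        long[] xs = {1L, Long.MAX_VALUE};
        long[] ys = {2L, 1L};
        // Plain + wraps around silently: this prints Long.MIN_VALUE.
        System.out.println(xs[1] + ys[1]);
        Result r = add(xs, ys);
        System.out.println(Arrays.toString(r.values) + " overflows=" + r.overflowCount);
    }
}
```

The overflow count is kept separately so that a caller could report a single aggregated warning for the column rather than one per row.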

What do you think @jdunkerley @GregoryTravis?

I'm sure that for 64-bit overflows we want to have warnings and all. But for smaller types, do we also check for overflow, or do we up-cast them to 64 bits on every operation?

I imagine I will implement the 64-bit logic first as it's simpler, but I'm wondering what to do with these smaller types.

@GregoryTravis
Contributor

What do you think @jdunkerley @GregoryTravis?

I agree that widening to 64 bit by default is the better move.

The most common use case is that the original data uses a narrow integer type, but after it's read, the user doesn't need it to stay that way.

The second most common use case is that the user needs to compute something with narrow types, and write the results back to a column with a narrow type. In this case, they'll get a clear warning/error that they tried to write 64-bit integers to a narrow column, and they'll have to cast.

@radeusgd
Member Author

radeusgd commented Aug 10, 2023

What do you think @jdunkerley @GregoryTravis?

I agree that widening to 64 bit by default is the better move.

The most common use case is that the original data uses a narrow integer type, but after it's read, the user doesn't need it to stay that way.

The second most common use case is that the user needs to compute something with narrow types, and write the results back to a column with a narrow type. In this case, they'll get a clear warning/error that they tried to write 64-bit integers to a narrow column, and they'll have to cast.

I think you are right. I will amend the tests tomorrow.

Amended the tests in commit 66508b1, and then in abfbf0b I made even % promote to 64 bits. That is not strictly necessary (there is no chance of overflow with %), but it makes the code simpler and the semantics more consistent: I thought that having just one single operation that does not promote may actually be more confusing. I don't think it matters much in practice, because you hardly ever perform only %; it will usually be surrounded by other operations that would promote to 64 bits anyway.
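The promotion rule agreed on above can be illustrated with a small Java sketch. The class `WidenTo64` and its helper methods are hypothetical names used only for this example. They show that two 16-bit operands whose sum does not fit in 16 bits are unproblematic once the operation is carried out in 64 bits, and that `%` is promoted in the same way purely for consistency.

```java
// Hypothetical illustration of "every operation on small integers returns 64-bit".
final class WidenTo64 {
    /** Adds two 16-bit values in 64-bit arithmetic; the result always fits. */
    static long add(short a, short b) {
        return (long) a + (long) b;
    }

    /** Remainder cannot overflow, but is promoted as well to keep the rule uniform. */
    static long mod(short a, short b) {
        return (long) a % (long) b; // throws ArithmeticException if b == 0
    }

    public static void main(String[] args) {
        short a = 30000;
        short b = 30000;
        // Staying in 16 bits wraps around: 60000 does not fit in a short.
        short wrapped = (short) (a + b);
        System.out.println("16-bit result: " + wrapped);
        System.out.println("64-bit result: " + add(a, b));
        System.out.println("7 % 3 as 64-bit: " + mod((short) 7, (short) 3));
    }
}
```

The trade-off discussed in the thread is visible here: the result type is always `long`, so getting a 16-bit column back requires an explicit cast afterwards.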

Comment on lines 851 to 868
if non_trivial_types_supported then
    src = source_table_builder [["X", [1, 2, 3]], ["Y", ["a", "xyz", "abcdefghijkl"]], ["Z", ["a", "pqrst", "abcdefghijkl"]]]
    ## TODO [RW] figure out what semantics we want here; I think the current one may be OK but it is going to
       be slightly painful, so IMO an auto-conversion could be useful. We could make it so that we do
       auto-conversion (cast), but in a more strict mode such that if anything does not fit
       (even just string padding required) we fail hard and tell the user to fix this.
    Test.specify "fails if the target type is more restrictive than source" <|
        result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
        result.should_fail_with Column_Type_Mismatch
    but_maybe="I'm not sure if we want this automatic restriction. If anything, we should probably report situations like abcdefghijkl being truncated."
    Test.specify "should warn if the target type is more restrictive than source and truncation may occur" pending=but_maybe <|
        result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
        IO.println (Problems.get_attached_warnings result)

        result.column_names . should_equal ["X", "Y", "Z"]
        result.at "X" . to_vector . should_contain_the_same_elements_as [1, 2, 3]
        result.at "Y" . to_vector . should_contain_the_same_elements_as ["a", "xyz", "abc"]
        result.at "Z" . to_vector . should_contain_the_same_elements_as ["a ", "pqrst", "abcde"]
Member Author

So this is not really for this PR, hence I'm not going to block this PR on it.

But I just realised that if we have a target table with e.g. 16-bit integers, and we create a table in Enso, the latter will be 64-bit by default. So, as shown in the test above, the upload will raise an error and require the user to cast.

I'm wondering if it would not be better to try casting automatically. If the cast produces any warnings, we fail hard and get the user to resolve the situation, but if the types fit I imagine we could do this automatically for convenience.

That will only work in-memory though, where we can easily check whether the cast worked without warnings. In the database we'd risk losing data, so we surely cannot do this automatic conversion there.

@jdunkerley do you think such a convenience auto-cast in upload is worth it? If so, I would appreciate it if you created a ticket for it. Or just let me know and I'll create one once I'm back from vacation.
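The strict auto-cast proposed in this comment could look roughly like the following Java sketch. It is an assumption about one possible shape of the idea, not code from this PR: the class `StrictNarrowingCast` and the method `tryCastToInt16` are hypothetical. The cast succeeds only if every value fits the narrower type; otherwise it yields nothing, so the caller can fail hard and ask the user to resolve the mismatch.

```java
import java.util.Optional;

// Hypothetical illustration of an all-or-nothing narrowing cast for in-memory data.
final class StrictNarrowingCast {
    /** Returns the column as 16-bit values, or empty if any value does not fit. */
    static Optional<short[]> tryCastToInt16(long[] column) {
        short[] out = new short[column.length];
        for (int i = 0; i < column.length; i++) {
            long v = column[i];
            if (v < Short.MIN_VALUE || v > Short.MAX_VALUE) {
                // One value out of range is enough to reject the whole cast.
                return Optional.empty();
            }
            out[i] = (short) v;
        }
        return Optional.of(out);
    }

    public static void main(String[] args) {
        System.out.println(tryCastToInt16(new long[] {1, 2, 3}).isPresent());
        System.out.println(tryCastToInt16(new long[] {1, 100_000, 3}).isPresent());
    }
}
```

This kind of check needs access to all values up front, which matches the observation above that it is only feasible for in-memory tables.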

Contributor

I think this is worth doing; if the user has a target table with a narrow type, they are likely trying to write data that they believe will fit, so this is a common case.

@radeusgd radeusgd marked this pull request as ready for review August 11, 2023 12:20
@radeusgd radeusgd self-assigned this Aug 11, 2023
@radeusgd radeusgd force-pushed the wip/radeusgd/5159-new-inmemory-value-types branch from b710f25 to 90f44a9 Compare August 21, 2023 09:52
@radeusgd
Member Author

radeusgd commented Aug 21, 2023

I amended the default iteration counts as I was seeing insufficient warmup leading to inconsistent benchmark results. Results afterwards:

| Scenario | develop | this PR |
|---|---|---|
| Addition | 3.331 ms | 2.880 ms |
| Addition with Overflow | 2.874 ms | 7.619 ms |
| Multiplication | 2.872 ms | 2.844 ms |
| Multiplication with Overflow | 2.796 ms | 7.487 ms |

We can see that as long as there is no overflow, both approaches have comparable performance. Surprisingly, for + the addExact variant was even a bit faster, but I suspect that was just insufficient warmup for the develop variant. I had already extended the warmup, since without it the timings were more like 8 ms for the no-overflow cases vs 19 ms for the overflow cases; I imagine that with even more warmup both results would likely converge. This suggests that on the 'happy path' the check adds little to no overhead (in fact both measurements here were faster).

Of course, in case of overflow the new approach is slower. In my setup 20% of rows overflow, and with that the slowdown is about 2.6x. It will likely depend on the percentage of rows that overflow (<1% will likely cause little overhead, whereas 100% will increase it significantly) and on other factors, like the base stack size, which influences how costly it is to throw the exception. Still, here we are comparing essentially 'incorrect' behaviour (the develop variant just allows the values to overflow) with an error-catching one, which simply has much more 'work to do', so it is expected that it may be slower.
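Since the cost in the overflowing rows is attributed to throwing the exception, it may help to see what the overflow test itself amounts to. The sketch below is an assumption for illustration, not something measured or used in this PR: the hypothetical `BranchOverflowCheck.addOverflows` detects signed 64-bit addition overflow with a sign-bit comparison, without constructing an exception. Whether such an approach would change the benchmark numbers above is untested.

```java
// Hypothetical illustration: detecting addition overflow without an exception.
final class BranchOverflowCheck {
    /** True if a + b does not fit in a signed 64-bit long. */
    static boolean addOverflows(long a, long b) {
        long r = a + b; // wraps around on overflow
        // Overflow happened iff both operands share a sign and the result's sign differs,
        // i.e. the sign bit of r differs from the sign bits of both a and b.
        return ((a ^ r) & (b ^ r)) < 0;
    }

    public static void main(String[] args) {
        System.out.println(addOverflows(1L, 2L));
        System.out.println(addOverflows(Long.MAX_VALUE, 1L));
        System.out.println(addOverflows(Long.MIN_VALUE, -1L));
    }
}
```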

Raw results

develop:

Found 4 cases to execute
Benchmarking 'Column_Arithmetic_1000000.Plus_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15194.5505 ms
Warmup invocations: 1663
Warmup avg time:    9.021 ms
Measurement duration:    15187.6392 ms
Measurement invocations: 4503
Measurement avg time:    3.331 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Fitting' finished in 30391.999 ms
Benchmarking 'Column_Arithmetic_1000000.Plus_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15161.867 ms
Warmup invocations: 3680
Warmup avg time:    4.077 ms
Measurement duration:    15204.4607 ms
Measurement invocations: 5220
Measurement avg time:    2.874 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Overflowing' finished in 30368.108 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15153.7378 ms
Warmup invocations: 3618
Warmup avg time:    4.147 ms
Measurement duration:    15173.2304 ms
Measurement invocations: 5224
Measurement avg time:    2.872 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Fitting' finished in 30328.21 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15093.7514 ms
Warmup invocations: 4072
Warmup avg time:    3.684 ms
Measurement duration:    15135.0264 ms
Measurement invocations: 5366
Measurement avg time:    2.796 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Overflowing' finished in 30229.871 ms

this PR:

Found 4 cases to execute
Benchmarking 'Column_Arithmetic_1000000.Plus_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15197.3365 ms
Warmup invocations: 2184
Warmup avg time:    6.869 ms
Measurement duration:    15191.5517 ms
Measurement invocations: 5208
Measurement avg time:    2.88 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Fitting' finished in 30398.177 ms
Benchmarking 'Column_Arithmetic_1000000.Plus_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15036.2704 ms
Warmup invocations: 863
Warmup avg time:    17.389 ms
Measurement duration:    15029.3346 ms
Measurement invocations: 1969
Measurement avg time:    7.619 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Overflowing' finished in 30067.19 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15137.2899 ms
Warmup invocations: 3695
Warmup avg time:    4.06 ms
Measurement duration:    15163.2409 ms
Measurement invocations: 5274
Measurement avg time:    2.844 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Fitting' finished in 30302.24 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15015.5373 ms
Warmup invocations: 1148
Warmup avg time:    13.069 ms
Measurement duration:    15027.4779 ms
Measurement invocations: 2004
Measurement avg time:    7.487 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Overflowing' finished in 30044.145 ms

@radeusgd radeusgd force-pushed the wip/radeusgd/5159-new-inmemory-value-types branch from a8cf52f to 2659f97 Compare August 22, 2023 09:47
@jdunkerley jdunkerley linked an issue Aug 22, 2023 that may be closed by this pull request
@radeusgd radeusgd changed the title Add size-limited strings and varying bit-width integer Value_Types to in-memory backend Add size-limited strings and varying bit-width integer Value_Types to in-memory backend and check for ArithmeticOverflow in LongStorage Aug 22, 2023
@radeusgd radeusgd requested a review from GregoryTravis August 22, 2023 16:51
@mergify mergify bot merged commit 2385f5b into develop Aug 22, 2023
@mergify mergify bot deleted the wip/radeusgd/5159-new-inmemory-value-types branch August 22, 2023 18:10
Labels
CI: Ready to merge This PR is eligible for automatic merge
3 participants