Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate usages of ArrayData::new (WIP) #813

Closed
wants to merge 1 commit into from

Conversation

jhorstmann
Copy link
Contributor

Which issue does this PR close?

Related/complementary to the validation in #810, this PR investigates our usages of ArrayData::new and whether they are actually safe. Creating primitive or boolean arrays can be done safely with minimal validation. For other data types this introduces an unsafe new_unchecked method for usages in performane critical kernels.

Todo:

  • Add Safety annotations to all usages of new_unchecked or decide whether to use a safe and validating alternative
  • In debug mode we should validate the buffers even in new_unchecked
  • ArrayDataBuilder::build needs to be either unsafe or do validation
  • Investigate whether setting a null_count that differs from the actual null_bitmap violates any invariants

"Closes" #806, #704, #705, #706 and possibly #777 since it would be no longer possible to create invalid ArrayData without unsafe code. We shouldn't close those issues without having a safe alternative.

Rationale for this change

The above mentioned security issues all start by creating ArrayData objects that violate the invariants of those arrays. Marking this creation unsafe thus satisfies the rust guidelines.

What changes are included in this PR?

Are there any user-facing changes?

The ArrayData::new method is removed in this PR but should be reintroduced with full validation of all parameters.

@jhorstmann
Copy link
Contributor Author

@alamb I also started looking into all the ArrayData::new usages and whether there could be safe alternatives for some. This does not replace the validation you are working on, I'm hoping that highlighting these usages helps in coming up with a safer abstraction.

@alamb
Copy link
Contributor

alamb commented Oct 5, 2021

This is a very cool idea @jhorstmann -- thank you.

Do you have any thoughts on when it would be most appropriate to apply the two levels of "validate"?

  • ArrayData::validate: "cheap" validation checks that checks that the lengths + offsets into the data buffers are within bounds but does not actually look at any data (e.g. it doesn't look at the actual values of offsets in Utf8 arrays)
  • ArrayData::validate_full: "expensive" validation checks that actually look at the contents of the buffers to ensure data integrity is maintained

Specifically, I am wondering if you think ArrayData::validate would be ok to be done always (even in ArrayData::new_unchecked) or if new_unchecked should really skip all validation checks.

@jhorstmann
Copy link
Contributor Author

Validating primitive or boolean arrays is probably cheap, so most kernels (arithmetic or comparisons for example) could always use the safe version. The main usecase for the unsafe variant is probably the cast kernels, when there is a match directly before that verifies that the from/to layouts are equal. There are not that many kernels in this repo that return List or Dictionary arrays and which would slow down because of the validation, but in our query engine we use those data types a lot.

Interestingly ArrayData::slice already bypasses any validation since it creates the ArrayData struct directly.

@alamb
Copy link
Contributor

alamb commented Oct 7, 2021

The main usecase for the unsafe variant is probably the cast kernels,

Amusingly (?) the cast kernels is the one place some of these checks have found a bug already by applying the validity checks, resulting #815 :)

@jhorstmann
Copy link
Contributor Author

Changes here were merged via #822

@jhorstmann jhorstmann closed this Oct 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants