Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add 'TypeStrategy' to types #11888

Merged
merged 19 commits into from
Jan 11, 2022
Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 8, 2021

Description

Following up on discussion in #11853, this PR introduces a new TypeStrategy interface which defines binary serde and comparison for values of any TypeSignature.

The binary serialization methods were consolidated from the static methods in Types, and the code I think is overall much more simple.

TypeStrategy deals in objects, so is not optimal for primitive values like longs and doubles, so the static methods for reading/writing these values with nulls remain but now live in TypeStrategies, along with all of the built-in implementations (string, long, double, float, array).

The comparator implementations match the orderings defined in druid-processing... we may wish to consider bringing some of them into druid-core, merging druid-core and druid-processing :trollface: , or something else to try to consolidate things a bit better. The array comparator behaves mostly the same as postgres array comparison afaict, with the exception that we default to "nulls first" for most thing rather than "nulls last" so we do that instead.

ObjectByteStrategy is moved back into ObjectStrategy, it's registry replaced with one of TypeStrategy, and instead ComplexMetricsSerde has a new default implementation of a method called getTypeStrategy which wraps the ObjectStrategy methods, but could be overridden for more optimal implementations.

I think the utility of being able to get binary serde to use for things like buffer aggregators is on full display in the ExprEval serialization methods, so I think overall i'm in favor of this change, since basically anywhere you have a ColumnInspector (which includes RowSignature) you have the means to read, write, and compare values.


Key changed/added classes in this PR
  • TypeStrategy
  • TypeStrategies
  • Types
  • NullableTypeStrategy
  • ComplexMetricsSerde
  • ColumnType
  • ColumnTypeFactory
  • ExpressionType

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

* the values maximum size. If the value is null, the size will be {@link Byte#BYTES}, otherwise it will be
* {@link Byte#BYTES} + {@link #estimateSizeBytes(Object)}
*/
default int estimateSizeBytesNullable(@Nullable T value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about hand the null handling over to the implementation here. The strategy as I understand it is using a whole byte for whether there is a null value. A byte for whether it is null consumes 8x the space than it really needs, this can actually add up in unfortunate ways when there are lots of columns in the result set.

For example, a "better" implementation would be to start every row with a bitmap of all of the columns that are null. This would consume only a single bit per column rather than a byte per column and, if we were so inclined, could also be properly padded to try to enforce word-alignment.

Having an implementation that does it with a whole byte as a stepping stone is maybe okay. But, with the way this interface is built, the interface is forcing rather sub-optimal null-handling on the query engines that use this interface. I think it would be better to perhaps declare that a size of -1 means that the value is equivalent to null and have the thing external from this do something meaningful with that knowledge.

We would likely also need to adjust the signature of write to return an int to indicate the number of bytes written (once again, returning a -1 for "it was null")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I agree, I pulled the null-byte wrapper handling into a special NullableTypeStrategy which can be used to wrap regular TypeStrategy implementations (which are now not expected to process nulls), and added lots of javadocs to suggest not using it unless you've got space to burn or no other options, so hopefully it will limit use.

@gianm
Copy link
Contributor

gianm commented Nov 17, 2021

Some high level thoughts:

  1. estimateSizeBytesNullable appears to be the only reference to nullability. The other methods say they deal with non-null data. That seems inconsistent. Here I think I'm in favor of what @cheddar is suggesting, that this interface should not be null-aware, and null handling should be left to the caller. (Callers shouldn't need to use this interface if they get a null.)

  2. To simplify life for callers that don't want to care about null handling, we could make a NullableTypeStrategy that accepts nulls and stores a null byte. It'd be composable with all the other TypeStrategy implementations, and lets certain callers stay simple while also letting other callers be fancy.

  3. I'm confused about the intent of estimateSizeBytes: is it meant to be a maximum size? Like, can the estimate be an underestimate? Some of the javadocs and comments seem to suggest that is the case. If it's meant to be a max, we should call it specifically "max" instead of "estimate".

  4. I'm not sure if anything could use it today, but, we should design this interface to support variable size serialization lengths, since we'll want it for various future data types. Those could work by having write take a max length, and returning -1 if it would need more than that. Then the caller could allocate more space and call write again. Or, maybe, write could return -the_amount_of_space_it_needs. In this case the "estimate" method really could return an estimate rather than a max.

  5. Even with variable size support like (4) suggest, we will still have callers that can't deal with that. So there will need to be some API that supports those callers. I'm not sure what that'd look like; maybe a getMaxBytes method that returns the max, or -1 if there really isn't a known max? And if the max is -1, then callers that only support constant size serialization can't use the API?

@clintropolis
Copy link
Member Author

  1. estimateSizeBytesNullable appears to be the only reference to nullability. The other methods say they deal with non-null data. That seems inconsistent. Here I think I'm in favor of what @cheddar is suggesting, that this interface should not be null-aware, and null handling should be left to the caller. (Callers shouldn't need to use this interface if they get a null.)

Oops, this was a leftover from the first round of changes from @cheddar 's comment where I moved most of the null byte handling stuff out of TypeStrategy itself, but forgot to move this one out. Since then, I've moved all of the null byte handling into the special NullableTypeStrategy that should be used for callers that want to use the null byte from NullHandling, along with lots of javadocs to encourage other options if available since the byte per value is quite expensive.

  1. To simplify life for callers that don't want to care about null handling, we could make a NullableTypeStrategy that accepts nulls and stores a null byte. It'd be composable with all the other TypeStrategy implementations, and lets certain callers stay simple while also letting other callers be fancy.

Did this, as well as adding a getNullableStrategy to TypeSignature that will automatically wrap the output of getStrategy with the NullableTypeStrategy

  1. I'm confused about the intent of estimateSizeBytes: is it meant to be a maximum size? Like, can the estimate be an underestimate? Some of the javadocs and comments seem to suggest that is the case. If it's meant to be a max, we should call it specifically "max" instead of "estimate".

Before the most recent set of changes, estimate was actually being used to get the exact size of some value in bytes, in order to check that variably sized values did not exceed this maximum. This has been reworked as suggested in 4) to instead pass maxSizeBytes directly into the write family of methods, and now has been relaxed to only needing to be an estimate. The only thing still using it in the current PR is the heap based ExpressionLambdaAggregator, which uses it to check that the aggregator does not explode the heap in an unbounded way.

  1. I'm not sure if anything could use it today, but, we should design this interface to support variable size serialization lengths, since we'll want it for various future data types. Those could work by having write take a max length, and returning -1 if it would need more than that. Then the caller could allocate more space and call write again. Or, maybe, write could return -the_amount_of_space_it_needs. In this case the "estimate" method really could return an estimate rather than a max.

Variably sized serialization was previously already supported (but required externally verifying that there was enough space with the estimate method). But, I liked the ideas here, so I have modified the write methods to now accept a maxSizeBytes parameter and went with -the_amount_of_extra_space_it_needs approach if this value (or buffer size) is exceeded, so that callers could choose to explode if the value was not able to be completely written, as well as know how much additional space would be required to fully write the value.

  1. Even with variable size support like (4) suggest, we will still have callers that can't deal with that. So there will need to be some API that supports those callers. I'm not sure what that'd look like; maybe a getMaxBytes method that returns the max, or -1 if there really isn't a known max? And if the max is -1, then callers that only support constant size serialization can't use the API?

Variably sized support is in place, I'm not sure we need a getMaxBytes method as attempting to call write now requires specifying a max bytes, even for constant sized values, callers with constant sized serialization can either specify the correct size or some size larger than the actual size, which is sort of ugly, but also primitive numerics shouldn't really be using TypeStrategy directly, instead they should be using the static methods to avoid converting the java primitives, so maybe it isn't a huge deal.

@Override
public int compare(Object[] o1, Object[] o2)
{
final int iter = Math.max(o1.length, o2.length);
Copy link
Contributor

@cryptoe cryptoe Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be math.min as might have a scenario we can get indexOutofBoundException no ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops - yes, good catch, will add some tests since there obviously aren't any 😅

final int oldPosition = buffer.position();
buffer.position(offset);
T value = read(buffer);
buffer.position(oldPosition);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider two changes:

  • Using try/finally so the buffer position is still valid if read throws an exception
  • Updating the javadoc to say that even though the buffer position is unchanged upon method exit, it may be changed temporarily while the method is running, so it isn't concurrency-safe.

buffer.position(offset);
write(buffer, value);
final int size = buffer.position() - offset;
buffer.position(oldPosition);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider two changes:

  • Using try/finally so the buffer position is still valid if write throws an exception
  • Updating the javadoc to say that even though the buffer position is unchanged upon method exit, it may be changed temporarily while the method is running, so it isn't concurrency-safe.

/**
* Estimate the size in bytes that writing this value to memory would require. This method is not required to be
* exactly correct, but many implementations might be. Implementations should err on the side of over-estimating if
* exact sizing is not efficient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd include something here about how this estimate is expected to be used. (Like an example)

That'll help people understand when they should use this method, and also understand how to implement it. I'm suggesting this because you gotta be careful with "estimate" methods: people have wildly differing ideas of how accurate they need to be, so it pays to be specific in the doc comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to javadocs to better explain usage



/**
* Read a non-null value from the {@link ByteBuffer} at the current {@link ByteBuffer#position()}. This will move
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-null" isn't true for the NullableTypeStrategy impl, which is still a TypeStrategy, so this method contract should be adjusted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split NullableTypeStrategy out to be standalone so the contract isn't confusing, it wasn't really used outside of tests as a TypeStrategy so it doesn't seem like a loss and there is no longer ambiguity of whether or not nulls are handled so i think is an improvement

* Write a non-null value to the {@link ByteBuffer} at position {@link ByteBuffer#position()}. This will move the
* underlying position by the size of the value written, and returns the number of bytes written.
*
* If writing the value would take more than 'maxSizeBytes' (or the buffer limit, whichever is smaller), this method
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we spec this to be an error if maxSizeBytes is greater than buffer.remaining()? Seems like a weird thing for a caller to do. Or is there a reason a caller might do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had it this way so that callers that don't know or care about size can pass in an arbitrary value like Integer.MAX_VALUE, but it didn't seem very useful so i've made it an exception now, the old behavior was a bit strange coupled with the negative size written return values for when write doesn't have enough room since the caller knows that they definitely need to add that many bytes to maxSizeBytes to have enough space to complete the write

* If writing the value would take more than 'maxSizeBytes' (or the buffer limit, whichever is smaller), this method
* will return a negative value indicating the number of additional bytes that would be required to fully write the
* value. Partial results may be written to the buffer when in this state, and the position may be left at whatever
* point the implementation ran out of space while writing the value.
Copy link
Contributor

@gianm gianm Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the caller is expected to save the position? Or is the caller safe to assume it can roll back by -retval?

i.e., should callers do this?

int written = write(buf, value, maxBytes);
if (written < 0) {
  buf.position(buf.position() + written); // roll back buf, undoing the write
}

or this?

int oldPosition = buf.position();
int written = write(buf, value, maxBytes);
if (written < 0) {
  buf.position(oldPosition); // roll back buf, undoing the write
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latter, caller should save the position before writing to handle an incomplete write, added javadocs to clarify

* to set and reset buffer positions as appropriate for the offset based methods, but may be overridden if a more
* optimized implementation is needed.
*/
public interface TypeStrategy<T> extends Comparator<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will stuff serialized with this interface be persisted to disk, sent between servers that might be running different versions of code, used in caches, etc? Or will it always be in-memory, single-server, & ephemeral?

Implementers are going to want to know the answer, so they can design their formats accordingly. (If persistence is required then they might want to add a version byte, avoid using native endianness, etc.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more javadocs indicating that this is primarily for transient stuffs, and if you want persistence should bring your own version and endian tracking and stuff

@clintropolis clintropolis mentioned this pull request Dec 23, 2021
9 tasks
*
* @return number of bytes written (always 9)
*/
public static int writeNullableLong(ByteBuffer buffer, int offset, long value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should it be called writeLongWithNonNullMark or something rather than NullableLong? It reads like the long parameter can be null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to readNotNullNullable.../writeNotNullNullable... to reduce confusion

*
* @return number of bytes written
*/
int write(ByteBuffer buffer, T value, int maxSizeBytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document in the javadoc that this method can explode when maxSizeBytes > buffer.remaining()?

@Override
public int write(ByteBuffer buffer, T value, int maxSizeBytes)
{
byte[] bytes = objectStrategy.toBytes(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method also call TypeStrategies.checkMaxSize() first?

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI. Thank you @clintropolis!

@clintropolis
Copy link
Member Author

thanks for review all 👍

@clintropolis clintropolis merged commit e583033 into apache:master Jan 11, 2022
@clintropolis clintropolis deleted the type-strategy branch January 11, 2022 01:12
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
sachinsagare pushed a commit to pinterest/druid that referenced this pull request Oct 28, 2022
* add TypeStrategy - value comparators and binary serialization for any TypeSignature

(cherry picked from commit e583033)
sachinsagare pushed a commit to sachinsagare/druid that referenced this pull request Nov 2, 2022
* add TypeStrategy - value comparators and binary serialization for any TypeSignature

(cherry picked from commit e583033)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants