Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated output formats to output BigDecimal fields as numeric values instead of a byte representation. #1640

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

fernst
Copy link
Member

@fernst fernst commented Jun 16, 2022

No description provided.

@fernst fernst added the build Trigger unit test build label Jun 16, 2022
@fernst fernst requested review from tivv and albertshau June 16, 2022 03:15

// Wrap BigDecimal fields in a wrapper which handles the conversion to String.
if (value instanceof BigDecimal) {
value = new BigDecimalWrapper((BigDecimal) value);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wrapper is used because the original JsonWriter class uses toString() to generate the representation of the Number instance.

Becase of internal private methods, I can't simply implement this logic directly, so I'm using the wrapper class defined below as a workaround.

* the License.
*/

package io.cdap.cdap.format.io;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this in this package so I had access to the underlaying JsonWriter instance.

@fernst fernst force-pushed the format-bigdecimals-as-number-text-json-output branch 2 times, most recently from c30d43c to 929b962 Compare June 16, 2022 03:34
* Custom Datum Writer for Structured Records, this class writes BigDecimal values as numbers instead of byte arrays in
* the JSON output.
*/
public class BigDecimalAwareJsonStructuredRecordDatumWriter extends JsonStructuredRecordDatumWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which cases do we need the original represenatation? Also, I am wondering why is it a byte array given io.cdap.cdap.format.io.JsonStructuredRecordDatumWriter#encode(io.cdap.cdap.common.io.Encoder, io.cdap.cdap.api.data.schema.Schema, java.lang.Object) handles DECIMAL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we pass the value directly from the StructuredRecord. So decimals are simply a byte array internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, why this logic does not go into JsonStructuredRecordDatumWriter itself?


if (value != null && logicalType == Schema.LogicalType.DECIMAL) {
BigDecimal bdValue = fromObject(value, schema);
getJsonWriter(encoder).value(bdValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a hack. Why can't we do it without it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the code as implemented in the platform doesn't have logic to handle BigDecimals directly. Only primitive CDAP types are handled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

io.cdap.cdap.format.io.JsonStructuredRecordDatumWriter#encode(io.cdap.cdap.common.io.Encoder, io.cdap.cdap.api.data.schema.Schema, java.lang.Object) has Decimal support. If it does not work, we should fix it over there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it's cdap-formats it's shipped in plugin, so we can have a new plugin released with a cdap-formats fix without new CDAP release

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class only handles Decimals if the logicalTypeAsString flag is set (which we only do for preview). It also writes the value as a String which would wrap the output in quotes

https://github.com/cdapio/cdap/blob/develop/cdap-formats/src/main/java/io/cdap/cdap/format/io/JsonStructuredRecordDatumWriter.java#L135

private static final StructuredRecord RECORD = StructuredRecord.builder(SCHEMA)
.set("i", 1)
.set("f", 256.1f)
.setDecimal("bd", BD)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also test for `.setDecimal(.., null)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

if (fieldSchema.getLogicalType() != null && fieldSchema.getLogicalType() == Schema.LogicalType.DECIMAL) {
BigDecimal decimalValue = record.getDecimal(fieldName);

// Throw exception if the field is expected tu be decimal, but it could not be processed as such.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: tu

Schema.LogicalType logicalType = nonNullableSchema.getLogicalType();

if (value != null && logicalType == Schema.LogicalType.DECIMAL) {
BigDecimal bdValue = fromObject(value, schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: Can we have a BigDecimalWrapper right here? This would remove the need for a special JsonWriter

Copy link
Member Author

@fernst fernst Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still to call the writer directly, as the superclass doesn't expose methods to write number instances.

This is the reason I have to use the writer directly.

Copy link
Contributor

@tivv tivv Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why do you use it directly (and that another question above, we should just expose it). My question is if you do wrapping right here, you would not need BigDecimalAwareJsonWriter and use just vanilla JsonWriter, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the JsonWriter's value(Number number) method uses the Number class toString() method internally to generate the representation of the value that it needs to output.

The problem with this is that BigQuery's toString() method can represent values in scientific notation. In order to avoid this, I create that wrapper which delegates to BigQuery's toPlainString() method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. If you pass this Wrapper to vanilla JsonWriter.value(Number) it would do the job, no? You don't need to create wrapper inside JsonWriter

@fernst fernst force-pushed the format-bigdecimals-as-number-text-json-output branch from 929b962 to efa31f4 Compare June 16, 2022 17:56
@fernst fernst force-pushed the format-bigdecimals-as-number-text-json-output branch from efa31f4 to 14a4992 Compare June 16, 2022 17:59
@fernst fernst requested a review from tivv June 16, 2022 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Trigger unit test build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants