-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle ShortDecimal correctly inside importFromArrow
#8957
Conversation
Hi @boneanxs! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Just added several comments.
velox/vector/arrow/Bridge.cpp
Outdated
if (type->isShortDecimal()) { | ||
values = AlignedBuffer::allocate<int64_t>(arrowArray.length, pool); | ||
auto rawValues = values->asMutable<int64_t>(); | ||
auto oldValues = static_cast<const int128_t*>(arrowArray.buffers[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like arrow array can hold a decimal of a bit-width other than int128_t (int256 is also possible). Maybe we can add a check in Velox importFromArrowImpl to ensure the bit-width is not set or the bit-width is indeed 128. Then it should be safe to cast buffers[1] as int128* here.
https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L348-L356
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since type is firstly called here:
velox/velox/vector/arrow/Bridge.cpp
Line 1443 in 09ba220
auto type = importFromArrow(arrowSchema); |
int128_t
, otherwise importFromArrowImpl could already fail, do we need to add extra check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried below cases in the unit test TEST_F(ArrowBridgeSchemaImportTest, scalar)
, and they both work. Could you help confirm?
EXPECT_EQ(*DECIMAL(10, 4), *testSchemaImport("d:10,4,128"));
EXPECT_EQ(*DECIMAL(10, 4), *testSchemaImport("d:10,4,256"));
@@ -1102,6 +1102,21 @@ class ArrowBridgeArrayImportTest : public ArrowBridgeArrayExportTest { | |||
} | |||
} | |||
|
|||
void testShortDecimalImport() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reuse the method testArrowImport
? Maybe a template parameter TOuput and a default argument for the output values need to be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, @rui-mo I tried to reuse the method but the codes looks lengthy and not elegant... Could you pls give me more hint in case I miss something We have to pass int128_t
as arrowInput while int64_t
as vector raw values, and take these values as Decimal
type...😣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented some code in rui-mo@a2d33ff. Please check if it makes sense, thanks.
9502797
to
ab1d17a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
velox/vector/arrow/Bridge.cpp
Outdated
int bitWidth = std::stoi(&format[idx + sz + 1], &sz); | ||
if (bitWidth != 128) { | ||
VELOX_USER_FAIL( | ||
"Conversion failed for '{}'. Velox decimal does not support custom bitwidth.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_USER_CHECK_EQ(bitWidth, 128, "msg");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Just two nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Let's see if the @mbasmanova @Yuhta have any suggestion.
Gentle ping @Yuhta @majetideepak @pedroerp , hey, could you pls help review this? |
velox/vector/arrow/Bridge.cpp
Outdated
@@ -972,7 +972,17 @@ TypePtr importFromArrowImpl( | |||
// Parse "d:". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for readability, maybe also move the logic to a helper funciton?
case 'd':
return parseDecimalFormat(format);
f1830f0
to
958e6b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Like I pointed out above, the test still access out of bounds memory in the string. We got this with ASAN runs on the new tests:
|
auto firstCommaIdx = formatStr.find(',', 2); | ||
auto secondCommaIdx = formatStr.find(',', firstCommaIdx + 1); | ||
|
||
if (firstCommaIdx == std::string::npos || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add extra check here to avoid ASAN issue, cc @pedroerp
725b988
to
389a6f9
Compare
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
389a6f9
to
02a6ab6
Compare
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ator#8957) Summary: Arrow uses `int128_t` to store ShortDecimal values, while inside velox we use `int64_t`. `ExportToArrow` already handle it specifically, but `ImportFromArrow` misses this. This pr tries to fix it. Pull Request resolved: facebookincubator#8957 Reviewed By: Yuhta Differential Revision: D55019687 Pulled By: pedroerp fbshipit-source-id: 2fe32236a0e17a52ef713cff96836a48a37fec56
Arrow uses
int128_t
to store ShortDecimal values, while inside velox we useint64_t
.ExportToArrow
already handle it specifically, butImportFromArrow
misses this.This pr tries to fix it.