
[SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be wrong #21311

Closed · wants to merge 6 commits

Conversation

cxzl25 (Contributor) commented May 12, 2018

What changes were proposed in this pull request?

In LongToUnsafeRowMap, the new page size is calculated by simply multiplying the old size by 2. The newly allocated page may therefore be too small to hold the appended data; part of the data is silently lost, and the data later read back is corrupted.
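
(Editor's illustration: a minimal sketch of the arithmetic, using constants discussed later in this thread; the names are hypothetical and this is not Spark code.)

// The default page is 1 << 17 longs (1M bytes), per the review comments below.
val oldPageWords = 1 << 17
val doubledBytes = oldPageWords * 2 * 8L  // blind doubling gives 2M bytes
val rowBytes = (1 << 22).toLong           // the big test value used below, 4M bytes
// The row (plus the 8-byte pointer to the next value) does not fit, yet the old
// code copies into the doubled page anyway and silently truncates the tail.
assert(8 + rowBytes > doubledBytes)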

How was this patch tested?

Added a test to HashedRelationSuite: test("LongToUnsafeRowMap with big values")

maropu (Member) commented May 14, 2018

@gatorsmile @hvanhovell Could you trigger tests?

kiszk (Member) commented May 14, 2018

cc @cloud-fan

-      ensureAcquireMemory(used * 8L * 2)
-      val newPage = new Array[Long](used * 2)
+      val multiples = math.max(math.ceil(needSize.toDouble / (used * 8L)).toInt, 2)
+      ensureAcquireMemory(used * 8L * multiples)
kiszk (Member) commented on the diff, May 14, 2018

Should we move the size check to before ensureAcquireMemory()? IIUC, we now have to check `used * multiples <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.

A Member commented:

How about shaping up this logic along with the other similar ones, splitting this func into two parts (grow/append)? e.g., UTF8StringBuilder https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder.java#L43
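
(Editor's note: a minimal, self-contained sketch of the grow/append split being suggested, with hypothetical names; it mirrors the pattern, not the UTF8StringBuilder source.)

// grow() only ensures capacity; append() only writes.
final class LongBuffer {
  private var buf = new Array[Long](16)
  private var cursor = 0

  private def grow(needed: Int): Unit = {
    if (cursor + needed > buf.length) {
      // Double when possible, but never allocate less than what is needed.
      val newLen = math.max(cursor + needed, buf.length * 2)
      buf = java.util.Arrays.copyOf(buf, newLen)
    }
  }

  def append(values: Array[Long]): Unit = {
    grow(values.length)
    System.arraycopy(values, 0, buf, cursor, values.length)
    cursor += values.length
  }
}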

A Contributor commented:

+1 on grow/append

cxzl25 (Author) replied:

OK. I will split the append func into two parts: grow/append.

-    if (cursor + 8 + row.getSizeInBytes > page.length * 8L + Platform.LONG_ARRAY_OFFSET) {
+    val needSize = cursor + 8 + row.getSizeInBytes
+    val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+    if (needSize > nowSize) {
       val used = page.length
       if (used >= (1 << 30)) {
         sys.error("Can not build a HashedRelation that is larger than 8G")
kiszk (Member) commented:

This is not related to this PR, but how about throwing UnsupportedOperationException instead of calling sys.error?

cxzl25 (Author) replied:

OK. I will replace sys.error with UnsupportedOperationException.

cloud-fan (Contributor) commented:

ok to test

val used = page.length
if (used >= (1 << 30)) {
sys.error("Can not build a HashedRelation that is larger than 8G")
}
ensureAcquireMemory(used * 8L * 2)
cloud-fan (Contributor) commented:

Doubling the size when growing is very typical; it seems what you want to address is the case where memory is enough for the requested size but not enough for doubling it. I'd suggest we still double the size most of the time, as long as there is enough memory.
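
(Editor's note: as a formula, the policy suggested here, with hypothetical names:)

// Grow geometrically when possible, but never below what the new row needs.
def newCapacity(oldSize: Int, neededSize: Int, maxSize: Int): Int =
  math.max(neededSize, math.min(oldSize * 2, maxSize))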

cxzl25 (Author) replied:

OK. I will double the size when growing.

…n growing;sys.error instead of UnsupportedOperationException
cxzl25 (Author) commented May 14, 2018

Thanks for your review. @maropu @kiszk @cloud-fan

I submitted a modification including the following:

  1. splitting the append func into two parts: grow/append
  2. doubling the size when growing
  3. replacing sys.error with UnsupportedOperationException


SparkQA commented May 14, 2018

Test build #90575 has finished for PR 21311 at commit d9d8e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 14, 2018

Test build #90574 has finished for PR 21311 at commit 22a2767.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JoshRosen (Contributor) commented:

@cxzl25:

> Some data is lost and the data read out is dirty

To clarify, is this a potential cause of a wrong-answer correctness bug? If so, we should be sure to backport the resulting fix to maintenance branches. /cc @cloud-fan @gatorsmile

cxzl25 (Author) commented May 22, 2018

@JoshRosen @cloud-fan @gatorsmile
Since SPARK-10399 was introduced, UnsafeRow#getUTF8String checks the size (via OnHeapMemoryBlock) and fails with:

    The sum of size 2097152 and offset 32 should not be larger than the size of the given memory space 2097168

[screenshot of the error omitted]

But when that check is not present (e.g. on Spark 2.2.0), there is no error at all; we just read back a wrong value.

[screenshot from Spark 2.2.0 omitted]

cloud-fan (Contributor) commented:

> Calculate the new size simply by multiplying by 2
> At this time, the size of the application may not be enough to store data
> Some data is lost and the data read out is dirty

Can you explain more about it? IIUC if we don't have enough memory for size * 2, we would just fail with OOM, instead of setting a wrong size.

cxzl25 (Author) commented May 22, 2018

@cloud-fan
In LongToUnsafeRowMap#append(key: Long, row: UnsafeRow), when row.getSizeInBytes > newPageSize (oldPage.length * 8L * 2), the newPageSize value is still used.
When the new page is too small to hold the entire row, Platform.copyMemory is still called and no error is raised; the bytes that do not fit are simply discarded.
When the data is later read from this buffer by offset and length, the trailing bytes are unpredictable.
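
(Editor's note: a minimal sketch of this failure mode, assuming Spark's org.apache.spark.unsafe.Platform is on the classpath; the arrays and sizes are hypothetical.)

import org.apache.spark.unsafe.Platform

// A "page" of 2 longs (16 bytes) receiving a 24-byte payload: copyMemory
// performs no bounds check, so the last 8 bytes land past the end of the
// array with no error raised, and are effectively lost.
val src = Array.fill[Long](3)(0x7878787878787878L)  // 24 bytes of 'x'
val page = new Array[Long](2)                       // only 16 bytes
Platform.copyMemory(src, Platform.LONG_ARRAY_OFFSET,
  page, Platform.LONG_ARRAY_OFFSET, 24)             // completes silently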

@@ -626,6 +618,32 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
}
}

private def grow(neededSize: Int): Unit = {
// There is 8 bytes for the pointer to next value
val totalNeededSize = cursor + 8 + neededSize
cloud-fan (Contributor) commented May 22, 2018:

The grow logic should be: we must grow to fit the new row, otherwise OOM should be thrown. If possible, grow to oldSize * 2

private def grow(inputRowSize: Int): Unit = {
  val neededNumWords = (cursor - Platform.LONG_ARRAY_OFFSET + 8 + inputRowSize + 7) / 8
  if (neededNumWords > page.length) {
    if (neededNumWords > (1 << 30)) fail...
    val newNumWords = math.max(neededNumWords, math.min(page.length * 2, 1 << 30))
    ensureAcquireMemory(newNumWords * 8L)
  ...
  }
}

cxzl25 (Author) replied:

@cloud-fan Thank you for your suggestion and code.

@@ -626,6 +618,29 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
}
}

private def grow(inputRowSize: Int): Unit = {
val neededNumWords = (cursor - Platform.LONG_ARRAY_OFFSET + 8 + inputRowSize + 7) / 8
cloud-fan (Contributor) commented:

don't forget the comment for the 8 bytes pointer

"Can not build a HashedRelation that is larger than 8G")
}
val newNumWords = math.max(neededNumWords, math.min(page.length * 2, 1 << 30))
if (newNumWords > ARRAY_MAX) {
cloud-fan (Contributor) commented:

we won't need this check now, newNumWords is guaranteed to be less than (1 << 30), which is much smaller than ARRAY_MAX
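
(Editor's note: pulling the diff fragments in this thread together, the grow() helper presumably ends up roughly as follows; this is a sketch reconstructed from the review comments, not a quote of the merged source.)

private def grow(inputRowSize: Int): Unit = {
  // There are 8 bytes for the pointer to the next value.
  val neededNumWords = (cursor - Platform.LONG_ARRAY_OFFSET + 8 + inputRowSize + 7) / 8
  if (neededNumWords > page.length) {
    if (neededNumWords > (1 << 30)) {
      throw new UnsupportedOperationException(
        "Can not build a HashedRelation that is larger than 8G")
    }
    // Double when there is room to, but always fit the incoming row.
    val newNumWords = math.max(neededNumWords, math.min(page.length * 2, 1 << 30))
    ensureAcquireMemory(newNumWords * 8L)
    val newPage = new Array[Long](newNumWords)
    Platform.copyMemory(page, Platform.LONG_ARRAY_OFFSET,
      newPage, Platform.LONG_ARRAY_OFFSET, cursor - Platform.LONG_ARRAY_OFFSET)
    val used = page.length
    page = newPage
    freeMemory(used * 8L)
  }
}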

val unsafeProj = UnsafeProjection.create(Seq(BoundReference(0, StringType, false)))
val keys = Seq(0L)
val map = new LongToUnsafeRowMap(taskMemoryManager, 1)
val bigStr = UTF8String.fromString("x" * 1024 * 1024 * 2)
cloud-fan (Contributor) commented May 22, 2018:

let's add a comment to say, the page array is initialized with length 1 << 17, so here we need a value larger than 1 << 18, to trigger the bug

val keys = Seq(0L)
val map = new LongToUnsafeRowMap(taskMemoryManager, 1)
val bigStr = UTF8String.fromString("x" * 1024 * 1024 * 2)
keys.foreach { k =>
cloud-fan (Contributor) commented:

we just have one key, why use loop?

@@ -30,6 +30,7 @@ import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.physical.BroadcastMode
import org.apache.spark.sql.types.LongType
import org.apache.spark.unsafe.Platform
import org.apache.spark.unsafe.array.ByteArrayMethods
cloud-fan (Contributor) commented:

not needed

Long.MaxValue,
1),
0)
val unsafeProj = UnsafeProjection.create(Seq(BoundReference(0, StringType, false)))
cloud-fan (Contributor) commented:

nit: UnsafeProjection.create(Array(StringType))

map.append(k, unsafeProj(InternalRow(bigStr)))
}
map.optimize()
val row = unsafeProj(InternalRow(bigStr)).copy()
cloud-fan (Contributor) commented:

val resultRow = new UnsafeRow(1)

val key = 0L
// the page array is initialized with length 1 << 17,
// so here we need a value larger than 1 << 18
val bigStr = UTF8String.fromString("x" * 1024 * 1024 * 2)
cloud-fan (Contributor) commented:

nit: can we just do "x" * (1 << 19) here?

cloud-fan (Contributor) commented:

LGTM, good catch!


SparkQA commented May 22, 2018

Test build #90966 has finished for PR 21311 at commit 6fe1dd0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val key = 0L
// the page array is initialized with length 1 << 17 (1M bytes),
// so here we need a value larger than 1 << 18 (2M bytes), to trigger the bug
val bigStr = UTF8String.fromString("x" * (1 << 22))
cloud-fan (Contributor) commented:

to double check, do we have to use 1 << 22 to trigger this bug?

cxzl25 (Author) replied:

Not necessarily. I just chose a larger value to make the data loss easier to observe.

cloud-fan (Contributor) commented:

do you mean this bug can't be reproduced consistently? e.g. if we pick 1 << 18 + 1, we may not expose this bug, so we have to use 1 << 22 to 100% reproduce this bug?

cxzl25 (Author) replied:

In LongToUnsafeRowMap#getRow:
resultRow = UnsafeRow#pointTo(page (1 << 18), baseOffset (16), sizeInBytes (1 << 21 + 16))

In UTF8String#getBytes:
copyMemory(base (page), offset, bytes, BYTE_ARRAY_OFFSET, numBytes (1 << 21 + 16))

When the sizes are close, the original value can sometimes still be read back.

Since SPARK-10399 was introduced, UnsafeRow#getUTF8String checks the size, so picking 1 << 18 + 1 reproduces this bug 100%.
But before that check was introduced, differences that are too small sometimes do not trigger a visible failure, so I chose a larger value.

My understanding may be wrong. Please advise. Thank you.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class CopyMemoryDemo {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe instance reflectively.
        Field unsafeField = Unsafe.class.getDeclaredField("theUnsafe");
        unsafeField.setAccessible(true);
        Unsafe unsafe = (Unsafe) unsafeField.get(null);

        String value = "xxxxx";
        byte[] src = value.getBytes();

        byte[] dst = new byte[3];     // too small: 2 bytes overflow past dst
        byte[] newDst = new byte[5];

        // 16 is the typical byte-array base offset (Unsafe.ARRAY_BYTE_BASE_OFFSET).
        // Copying 5 bytes into a 3-byte array writes past the array: undefined
        // behavior that often lands in the adjacent heap memory.
        unsafe.copyMemory(src, 16, dst, 16, src.length);
        // Reading 5 bytes back out of dst can still pick up the overflowed
        // bytes, which is why a too-small page sometimes returns the original
        // value instead of failing.
        unsafe.copyMemory(dst, 16, newDst, 16, src.length);

        System.out.println("dst:" + new String(dst));
        System.out.println("newDst:" + new String(newDst));
    }
}

output:

dst:xxx
newDst:xxxxx

cloud-fan (Contributor) commented:

then 1 << 19 should be good enough as it doubles the size?

cxzl25 (Author) replied:

Yes. I think so.
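
(Editor's note: assembling the review feedback above, the final regression test presumably looks roughly like this sketch. It lives in HashedRelationSuite, which already has the needed imports; the memory-manager setup follows the fragment quoted earlier, and the thread suggests 1 << 19 may suffice in place of 1 << 22.)

test("SPARK-24257: insert big values into LongToUnsafeRowMap") {
  val taskMemoryManager = new TaskMemoryManager(
    new UnifiedMemoryManager(
      new SparkConf().set("spark.memory.offHeap.enabled", "false"),
      Long.MaxValue,
      Long.MaxValue / 2,
      1),
    0)
  val unsafeProj = UnsafeProjection.create(Array[DataType](StringType))
  val map = new LongToUnsafeRowMap(taskMemoryManager, 1)

  val key = 0L
  // the page array is initialized with length 1 << 17 (1M bytes),
  // so the value must overflow even the doubled page to trigger the bug
  val bigStr = UTF8String.fromString("x" * (1 << 22))

  map.append(key, unsafeProj(InternalRow(bigStr)))
  map.optimize()

  val resultRow = new UnsafeRow(1)
  assert(map.getValue(key, resultRow).getUTF8String(0) === bigStr)
  map.free()
}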


SparkQA commented May 22, 2018

Test build #90967 has finished for PR 21311 at commit f3916e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 22, 2018

Test build #90970 has finished for PR 21311 at commit d7da8ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 22, 2018

Test build #90981 has finished for PR 21311 at commit b8b6324.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


maropu commented May 23, 2018

retest this please


SparkQA commented May 23, 2018

Test build #91009 has finished for PR 21311 at commit b8b6324.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


kiszk commented May 23, 2018

retest this please


SparkQA commented May 23, 2018

Test build #91018 has finished for PR 21311 at commit b8b6324.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


kiszk commented May 23, 2018

retest this please


SparkQA commented May 23, 2018

Test build #91041 has finished for PR 21311 at commit b8b6324.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented May 23, 2018

retest this please


SparkQA commented May 23, 2018

Test build #91052 has finished for PR 21311 at commit b8b6324.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

retest this please


SparkQA commented May 24, 2018

Test build #91066 has finished for PR 21311 at commit b8b6324.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 24, 2018
…rong

LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. Then we may have a malformed UnsafeRow when querying this map, whose actual data is smaller than its declared size, and the data is corrupted.

Author: sychen <[email protected]>

Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size.

(cherry picked from commit 8883401)
Signed-off-by: Wenchen Fan <[email protected]>

@asfgit asfgit closed this in 8883401 May 24, 2018
asfgit pushed the same cherry-pick to the maintenance branches three more times on May 24, 2018 (identical commit messages omitted).

cloud-fan commented May 24, 2018

thanks, merging to master/2.3/2.2/2.1/2.0! There is no conflict so I backported all the way to 2.0. I'll watch the jenkins build in the next few days.


cxzl25 commented May 24, 2018

@cloud-fan Thank you very much for your help.

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018; rdblue pushed one to rdblue/spark on May 19, 2019; otterc pushed one to linkedin/spark on Mar 22, 2023. All carry the same commit message for apache#21311 (identical texts omitted).