-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34039][SQL] ReplaceTable should invalidate cache #31081
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133782 has finished for PR 31081 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @sunchao .
According to the affected version of SPARK-34039, this is merged to master.
cc @cloud-fan , @aokolnychyi
@sunchao . I tested this case with Apache Spark 3.1.0 RC1 with Iceberg 0.10. This is a correctness issue. How do you think about that? If you agree, please make a backport. (Also, cc @HyukjinKwon )
|
Thanks @dongjoon-hyun . Did the above include this PR? There is another issue SPARK-34052 on handling temporary view in DSv2 which could cause the above as well. |
No~ As I wrote, I used Apache Spark 3.1.0 RC1 with Iceberg 0.10 to see the existing behavior. |
Got you. Yes I think we should backport this (I'll create one soon). In addition, we may need SPARK-34052 to completely address the issue. In the second: spark-sql> SELECT * FROM v;
4 5 IMO, the temporary view |
It would be great to see this fixed in 3.1.0. Thanks for the work! |
I agree with you, @sunchao . |
This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec. Similar to SPARK-33492 (apache#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results. Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. Added a new unit test. Closes apache#31081 from sunchao/SPARK-34039. Authored-by: Chao Sun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
Sure, I think RC1 is failed. Thanks for letting me know @dongjoon-hyun. +1 for porting it back. |
### What changes were proposed in this pull request? This is a backport of #31081 to branch-3.1. This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec. ### Why are the changes needed? Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results. ### Does this PR introduce _any_ user-facing change? Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. ### How was this patch tested? Added a new unit test. Closes #31100 from sunchao/SPARK-34039-branch-3.1. Authored-by: Chao Sun <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
BTW, do we have more issues to fix like this? I have seen similar cache/uncache fixes often lately. If we have some issues to fix already, it might be best to delay the next RC until we have the fixes. |
@HyukjinKwon I'm working on #31107 which also seems to be a bug on caching. Yes it seems there are many issues related to this area recently, and I feel we haven't identified all of them yet. |
BTW some of the cache fixes are not in branch-3.0, @HyukjinKwon @dongjoon-hyun let me know if it's necessary to make the backports. |
Thanks, @sunchao . Yes. It would be great if we have it if the UT fails there in the same way. |
@HyukjinKwon yes I've tried to put all these under https://issues.apache.org/jira/browse/SPARK-33507. @MaxGekk do you mind adding yours to here too? |
@MaxGekk |
I have converted my tickets to sub-tasks of SPARK-33507 |
What changes were proposed in this pull request?
This changes
ReplaceTableExec
/AtomicReplaceTableExec
, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving theuncacheTable
method toDataSourceV2Strategy
so that we don't need to pass a Spark session to the v2 exec.Why are the changes needed?
Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results.
Does this PR introduce any user-facing change?
Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.
How was this patch tested?
Added a new unit test.