[SPARK-32481][CORE][SQL] Support truncate table to move data to trash #29387
Conversation
cc @dongjoon-hyun please review
ce6f124 to bd4ccf5
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (two outdated review threads, resolved)
Please clean up the leftover of #29319.
Hi @dongjoon-hyun, I have cleaned up the PR; kindly review.
Gentle ping @dongjoon-hyun
Could you review this, @sunchao?
ok to test
This looks useful. One thing I'd recommend is to make this a boolean flag instead of an interval, and then rely on the Hadoop-side config `fs.trash.interval` to control the trash retention.
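To make the suggested division of labor concrete — Spark would only flip a boolean, while retention comes entirely from the Hadoop side — here is a minimal sketch of the Hadoop-side knob, assuming a programmatic `Configuration` (the value shown is illustrative; in practice it would normally live in core-site.xml):

```scala
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// fs.trash.interval is the number of minutes a trashed file is retained
// before its trash checkpoint is deleted; 0 (the default) disables trash.
hadoopConf.setInt("fs.trash.interval", 1440) // illustrative: keep trash one day
```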
Sure, I'll update it.
Test build #127704 has finished for PR 29387 at commit …
@sunchao @dongjoon-hyun, I have updated the PR; can you please review?
Test build #127757 has finished for PR 29387 at commit …
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated review thread, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (review thread, resolved)
Test build #127763 has finished for PR 29387 at commit …
Test build #127774 has finished for PR 29387 at commit …
Test build #127777 has finished for PR 29387 at commit …
Thanks @Udbhav30! This LGTM now. Can you check why the test is failing, though? I'll let others (cc @dongjoon-hyun) chime in and help finish this.
We may need to consider another flag in the future to control what happens when `spark.sql.truncate.trash.enabled` is on but the Hadoop-side trash is disabled: whether to switch to permanently deleting the data, or to skip the deletion altogether (the current behavior). Sometimes the latter may not be what users want; a sketch of this decision follows.
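A minimal sketch of that fallback decision — not this PR's code; `fallbackToPermanentDelete` is a hypothetical flag used only for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Hypothetical: what TRUNCATE could do when trash is requested but the
// Hadoop-side trash is disabled (fs.trash.interval <= 0).
def truncatePath(
    fs: FileSystem,
    path: Path,
    hadoopConf: Configuration,
    fallbackToPermanentDelete: Boolean): Unit = {
  // moveToAppropriateTrash returns false when nothing was moved,
  // e.g. because trash is disabled on the Hadoop side.
  val moved = Trash.moveToAppropriateTrash(fs, path, hadoopConf)
  if (!moved && fallbackToPermanentDelete) {
    fs.delete(path, true) // recursive, permanent delete
  }
  // if !moved && !fallbackToPermanentDelete: skip deletion (current behavior)
}
```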
Thank you for helping with this PR, @sunchao.
sql("CREATE TABLE tab1 (col INT) USING parquet") | ||
sql("INSERT INTO tab1 SELECT 1") | ||
// scalastyle:off hadoopconfiguration | ||
val hadoopConf = spark.sparkContext.hadoopConfiguration |
`spark.sessionState.newHadoopConf()`?
```scala
withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "false") {
  sql("CREATE TABLE tab1 (col INT) USING parquet")
  sql("INSERT INTO tab1 SELECT 1")
  val hadoopConf = spark.sessionState.newHadoopConf()
```
In the other tests, this PR is using `spark.sparkContext.hadoopConfiguration`. If that is required, this test case looks misleading. Is `withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "false")` required here? I'm wondering if this test case would pass with `withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "true")`, too.
Hi, in this test we did not update the `hadoopConf`, so using `spark.sessionState.newHadoopConf()` doesn't make any difference.
@dongjoon-hyun See here: if `fs.trash.interval` is non-positive, the `moveToAppropriateTrash` function returns false. So to test this I have to set a positive value for `fs.trash.interval`, but `spark.sessionState.newHadoopConf()` does not update the `hadoopConf` the command actually reads, and so the other test case fails. Here, this test case is a no-op, so updating the `hadoopConf` is not required, which is why I used `spark.sessionState.newHadoopConf()`.
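For readers unfamiliar with the Hadoop API under discussion, a minimal runnable sketch of `Trash.moveToAppropriateTrash`'s behavior (the table path is illustrative; running this moves a directory into the invoking user's local trash):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

val conf = new Configuration()
val tablePath = new Path("/tmp/spark-warehouse/tab1") // illustrative location
val fs = tablePath.getFileSystem(conf)
fs.mkdirs(tablePath) // something to move

// fs.trash.interval defaults to 0: trash is disabled, so the call is a
// no-op returning false -- the case being discussed above.
assert(!Trash.moveToAppropriateTrash(fs, tablePath, conf))

// With a positive interval, the path is moved under the user's trash root.
conf.setInt("fs.trash.interval", 5) // minutes
assert(Trash.moveToAppropriateTrash(fs, tablePath, conf))
```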
Thank you for pinging me again, @Udbhav30. I finished another round. I'll review after the PR is updated again.
Test build #127859 has finished for PR 29387 at commit …
Hi, thanks for the review @dongjoon-hyun; I have updated the PR as per your suggestions.
Thanks!
+1, LGTM. Thank you, @Udbhav30, @sunchao, @gatorsmile.
Merged to master for Apache Spark 3.1.0.
```diff
@@ -2722,6 +2722,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

   val TRUNCATE_TRASH_ENABLED =
     buildConf("spark.sql.truncate.trash.enabled")
```
Quick question: do we want to have a separate configuration for each operation? Looks like #29319 targets similar stuff. Maybe it'd make more sense to have a global configuration.
I will rework #29319 and make it a global configuration.
Yep. It's too early to make it a global configuration.
```scala
val fs = tablePath.getFileSystem(hadoopConf)
val trashRoot = fs.getTrashRoot(tablePath)
assert(!fs.exists(trashRoot))
```
@Udbhav30 This line of code is not macOS friendly: the `trashRoot` is `/Users/xxx/.Trash/`, which is the path to the macOS trash can. So normally, it exists...
Thanks @LuciferYang for pointing it out. I will raise a follow-up PR and assert on the particular folder, which is `trashRoot/pathToTable/tab1` in this case, instead of `trashRoot`.
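A sketch of what that follow-up assertion could look like. The `Current` subdirectory and the merged-path layout are how Hadoop's default trash policy arranges freshly trashed files, but the exact expected path is an assumption here, and the conf and table path are stand-ins for the test's own:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()                  // stand-in for the test's conf
val tablePath = new Path("/user/hive/warehouse/tab1") // illustrative
val fs = tablePath.getFileSystem(hadoopConf)

// Assert on the table's own path under the trash root rather than on the
// root itself, which may pre-exist (e.g. /Users/<user>/.Trash on macOS).
val trashRoot = fs.getTrashRoot(tablePath)
val trashedTable = Path.mergePaths(new Path(trashRoot, "Current"), tablePath)
assert(fs.exists(trashedTable))
```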
Ur, sorry, @Udbhav30. It seems that Hadoop 2.7 doesn't have `getTrashRoot`. I'll revert this first to recover Jenkins.
Hmm, this is a bummer; sorry I missed that. It used to be just … Perhaps we can use …
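For reference, `FileSystem.getTrashRoot` only appeared in Hadoop 2.8. Under the default `TrashPolicy`, one 2.7-compatible stand-in is the `.Trash` directory in the user's home — a sketch, assuming the default policy's layout:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop 2.7-compatible approximation of fs.getTrashRoot(path) for the
// default trash policy, which keeps trash under the user's home directory.
def trashRootCompat(fs: FileSystem): Path =
  new Path(fs.getHomeDirectory, ".Trash")
```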
Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will be deleted permanently from the trash. Instead of directly deleting the data, we can provide the flexibility to move data to the trash and then delete it permanently. Yes: after TRUNCATE TABLE, the data is no longer deleted permanently right away; it is first moved to the trash and then, after the given time, deleted permanently. New UTs added.

Closes apache#29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Hi @sunchao, I have raised a new PR in #29552; kindly review that.
@Udbhav30 Generally, one user can have multiple trash directories.
What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will be deleted permanently from the trash.

Why are the changes needed?
Instead of directly deleting the data, we can provide the flexibility to move data to the trash and then delete it permanently.

Does this PR introduce any user-facing change?
Yes. After TRUNCATE TABLE, the data is no longer deleted permanently right away; it is first moved to the trash and then, after the given time, deleted permanently.

How was this patch tested?
New UTs added.
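Putting the pieces together, a hedged end-to-end sketch of the behavior this PR adds. It assumes a running `SparkSession` named `spark` whose Hadoop configuration has `fs.trash.interval > 0`, so the filesystem trash is enabled:

```scala
// Opt in to the new behavior (defaults to off).
spark.conf.set("spark.sql.truncate.trash.enabled", "true")

spark.sql("CREATE TABLE tab1 (col INT) USING parquet")
spark.sql("INSERT INTO tab1 SELECT 1")

// With the flag on and Hadoop trash enabled, the table's data files are
// moved under the user's trash root instead of being deleted immediately.
spark.sql("TRUNCATE TABLE tab1")
```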