[SPARK-32481][CORE][SQL] Support truncate table to move data to trash #29387
Conversation
cc @dongjoon-hyun please review
ce6f124 to bd4ccf5
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (two outdated review threads, resolved)
Please clean up the leftover of #29319.
Hi @dongjoon-hyun, I have cleaned up the PR; kindly review.
Gentle ping @dongjoon-hyun
Could you review this, @sunchao?
ok to test
This looks useful. One thing I'd recommend is to make this a boolean flag instead of an interval, and then rely on the Hadoop-side config `fs.trash.interval` to control the trash retention.
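To make the suggested division of labor concrete — Spark would only flip a boolean, while retention comes entirely from the Hadoop side — here is a minimal sketch of the Hadoop-side knob, assuming a programmatic `Configuration` (the value shown is illustrative; in practice it would normally live in core-site.xml):

```scala
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// fs.trash.interval is the number of minutes a trashed file is retained
// before its trash checkpoint is deleted; 0 (the default) disables trash.
hadoopConf.setInt("fs.trash.interval", 1440) // illustrative: keep trash one day
```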
Sure, I'll update it.
Test build #127704 has finished for PR 29387 at commit …
@sunchao @dongjoon-hyun, I have updated the PR; can you please review?
Test build #127757 has finished for PR 29387 at commit …
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated review thread, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (review thread, resolved)
Test build #127763 has finished for PR 29387 at commit …
Test build #127774 has finished for PR 29387 at commit …
Test build #127777 has finished for PR 29387 at commit …
Thanks @Udbhav30! This LGTM now. Can you check why the test is failing, though? I'll let others (cc @dongjoon-hyun) chime in and help finish this.
We may need to consider another flag in the future to control what happens when `spark.sql.truncate.trash.enabled` is on but the Hadoop-side trash is disabled: whether to switch to permanently deleting the data, or to skip the deletion altogether (the current behavior). Sometimes the latter may not be what users want; a sketch of this decision follows.
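A minimal sketch of that fallback decision — not this PR's code; `fallbackToPermanentDelete` is a hypothetical flag used only for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Hypothetical: what TRUNCATE could do when trash is requested but the
// Hadoop-side trash is disabled (fs.trash.interval <= 0).
def truncatePath(
    fs: FileSystem,
    path: Path,
    hadoopConf: Configuration,
    fallbackToPermanentDelete: Boolean): Unit = {
  // moveToAppropriateTrash returns false when nothing was moved,
  // e.g. because trash is disabled on the Hadoop side.
  val moved = Trash.moveToAppropriateTrash(fs, path, hadoopConf)
  if (!moved && fallbackToPermanentDelete) {
    fs.delete(path, true) // recursive, permanent delete
  }
  // if !moved && !fallbackToPermanentDelete: skip deletion (current behavior)
}
```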
Thank you for helping with this PR, @sunchao.
sql("CREATE TABLE tab1 (col INT) USING parquet") | ||
sql("INSERT INTO tab1 SELECT 1") | ||
// scalastyle:off hadoopconfiguration | ||
val hadoopConf = spark.sparkContext.hadoopConfiguration |
`spark.sessionState.newHadoopConf()`?
```scala
withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "false") {
  sql("CREATE TABLE tab1 (col INT) USING parquet")
  sql("INSERT INTO tab1 SELECT 1")
  val hadoopConf = spark.sessionState.newHadoopConf()
```
In the other tests, this PR is using `spark.sparkContext.hadoopConfiguration`. If that is required, this test case looks misleading. Is `withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "false")` required here? I'm wondering if this test case would pass with `withSQLConf(SQLConf.TRUNCATE_TRASH_ENABLED.key -> "true")`, too.
Hi, in this test we did not update the `hadoopConf`, so using `spark.sessionState.newHadoopConf()` doesn't make any difference.
@dongjoon-hyun See here: if `fs.trash.interval` is non-positive, the `moveToAppropriateTrash` function returns false. So to test this I have to set a positive value for `fs.trash.interval`, but `spark.sessionState.newHadoopConf()` does not update the `hadoopConf` the command actually reads, and so the other test case fails. Here, this test case is a no-op, so updating the `hadoopConf` is not required, which is why I used `spark.sessionState.newHadoopConf()`.
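For readers unfamiliar with the Hadoop API under discussion, a minimal runnable sketch of `Trash.moveToAppropriateTrash`'s behavior (the table path is illustrative; running this moves a directory into the invoking user's local trash):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

val conf = new Configuration()
val tablePath = new Path("/tmp/spark-warehouse/tab1") // illustrative location
val fs = tablePath.getFileSystem(conf)
fs.mkdirs(tablePath) // something to move

// fs.trash.interval defaults to 0: trash is disabled, so the call is a
// no-op returning false -- the case being discussed above.
assert(!Trash.moveToAppropriateTrash(fs, tablePath, conf))

// With a positive interval, the path is moved under the user's trash root.
conf.setInt("fs.trash.interval", 5) // minutes
assert(Trash.moveToAppropriateTrash(fs, tablePath, conf))
```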
Thank you for pinging me again, @Udbhav30. I finished another round. I'll review after the PR is updated again.
Test build #127859 has finished for PR 29387 at commit …
Hi, thanks for the review @dongjoon-hyun; I have updated the PR as per your suggestions.
Thanks!
+1, LGTM. Thank you, @Udbhav30, @sunchao, @gatorsmile.
Merged to master for Apache Spark 3.1.0.
```diff
@@ -2722,6 +2722,17 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

   val TRUNCATE_TRASH_ENABLED =
     buildConf("spark.sql.truncate.trash.enabled")
```
Quick question: do we want to have a separate configuration for each operation? Looks like #29319 targets similar stuff. Maybe it'd make more sense to have a global configuration.
I will rework #29319 and make it a global configuration.
Yep. It's too early to make it a global configuration.
```scala
val fs = tablePath.getFileSystem(hadoopConf)
val trashRoot = fs.getTrashRoot(tablePath)
assert(!fs.exists(trashRoot))
```
@Udbhav30 This line of code is not macOS friendly: the `trashRoot` is `/Users/xxx/.Trash/`, which is the path to the macOS trash can. So normally, it exists...
Thanks @LuciferYang for pointing it out. I will raise a follow-up PR and assert on the particular folder, which is `trashRoot/pathToTable/tab1` in this case, instead of `trashRoot`.
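A sketch of what that follow-up assertion could look like. The `Current` subdirectory and the merged-path layout are how Hadoop's default trash policy arranges freshly trashed files, but the exact expected path is an assumption here, and the conf and table path are stand-ins for the test's own:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()                  // stand-in for the test's conf
val tablePath = new Path("/user/hive/warehouse/tab1") // illustrative
val fs = tablePath.getFileSystem(hadoopConf)

// Assert on the table's own path under the trash root rather than on the
// root itself, which may pre-exist (e.g. /Users/<user>/.Trash on macOS).
val trashRoot = fs.getTrashRoot(tablePath)
val trashedTable = Path.mergePaths(new Path(trashRoot, "Current"), tablePath)
assert(fs.exists(trashedTable))
```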
Ur, sorry, @Udbhav30. It seems that Hadoop 2.7 doesn't have `getTrashRoot`. I'll revert this first to recover Jenkins.
Hmm, this is a bummer; sorry I missed that. It used to be just … Perhaps we can use …
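For reference, `FileSystem.getTrashRoot` only appeared in Hadoop 2.8. Under the default `TrashPolicy`, one 2.7-compatible stand-in is the `.Trash` directory in the user's home — a sketch, assuming the default policy's layout:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop 2.7-compatible approximation of fs.getTrashRoot(path) for the
// default trash policy, which keeps trash under the user's home directory.
def trashRootCompat(fs: FileSystem): Path =
  new Path(fs.getHomeDirectory, ".Trash")
```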
Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will be deleted permanently from the trash. Instead of directly deleting the data, we can provide the flexibility to move data to the trash and then delete it permanently. Yes: after TRUNCATE TABLE, the data is no longer deleted permanently right away; it is first moved to the trash and then, after the given time, deleted permanently. New UTs added.

Closes apache#29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Hi @sunchao, I have raised a new PR in #29552; kindly review that.
@Udbhav30 Generally, one user can have multiple trash directories.
What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will be deleted permanently from the trash.

Why are the changes needed?
Instead of directly deleting the data, we can provide the flexibility to move data to the trash and then delete it permanently.

Does this PR introduce any user-facing change?
Yes. After TRUNCATE TABLE, the data is no longer deleted permanently right away; it is first moved to the trash and then, after the given time, deleted permanently.

How was this patch tested?
New UTs added.
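Putting the pieces together, a hedged end-to-end sketch of the behavior this PR adds. It assumes a running `SparkSession` named `spark` whose Hadoop configuration has `fs.trash.interval > 0`, so the filesystem trash is enabled:

```scala
// Opt in to the new behavior (defaults to off).
spark.conf.set("spark.sql.truncate.trash.enabled", "true")

spark.sql("CREATE TABLE tab1 (col INT) USING parquet")
spark.sql("INSERT INTO tab1 SELECT 1")

// With the flag on and Hadoop trash enabled, the table's data files are
// moved under the user's trash root instead of being deleted immediately.
spark.sql("TRUNCATE TABLE tab1")
```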