Make hive temporary staging directory to be configurable #12199

Merged 3 commits on Jan 25, 2019

kokosing (Contributor) commented Jan 9, 2019

No description provided.

findepi (Contributor) commented Jan 9, 2019

Looks good, but Travis is red.

The commit "Make hive temporary staging directory location to be configurable" could include some explanation, e.g. "In some HDFS deployments, /tmp cannot be used to write temporary files".

kokosing requested a review from electrum on January 9, 2019 08:52

kokosing (Contributor Author) commented Jan 9, 2019

@findepi Thanks. Updated.

{
// skip using temporary directory for S3
return !isS3FileSystem(context, hdfsEnvironment, path);
return isTemporaryStagingDirectoryEnabled(session)

Contributor

Maybe want to update the comment around WriteMode enum?

public enum WriteMode
{
    /**
     * common mode for new table or existing table (both new and existing partition)
     */
    STAGE_AND_MOVE_TO_TARGET_DIRECTORY(false),
    /**
     * for new table in S3
     */
    DIRECT_TO_TARGET_NEW_DIRECTORY(true),
    /**
     * for existing table in S3 (both new and existing partition)
     */
    DIRECT_TO_TARGET_EXISTING_DIRECTORY(true),
    /**/;

nezihyigitbasi (Contributor) left a comment

Make hive temporary staging directory to be disableable

  • I feel like we can update the commit title to "Add config flag to disable hive temp staging directory" or "Make use of hive temp staging directory configurable".

@@ -1190,4 +1192,17 @@ public HiveClientConfig setS3SelectPushdownMaxConnections(int s3SelectPushdownMa
this.s3SelectPushdownMaxConnections = s3SelectPushdownMaxConnections;
return this;
}

@Config("hive.temporary-staging-directory-enabled")

Contributor

How about we rename this to hive.use-temporary-staging-directory-for-writes?

Contributor

could we be using staging directory for reads? would there be use for this?

wenleix (Contributor) Jan 13, 2019

@findepi: Unlikely. When the WriteMode is STAGE_AND_MOVE_TO_TARGET_DIRECTORY, the data is first written to the staging directory and then renamed to the target directory when the write is committed.

But I don't feel we have to emphasize for-write? What about hive.use-temporary-staging-directory ?
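
The stage-then-rename behavior described for STAGE_AND_MOVE_TO_TARGET_DIRECTORY can be illustrated with a minimal sketch. This is not Presto's implementation: it uses the local file system through java.nio.file instead of the Hadoop FileSystem API, and the class and method names (StageAndMoveSketch, writeStagedThenCommit) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; names are hypothetical and this is not Presto code.
final class StageAndMoveSketch
{
    private StageAndMoveSketch() {}

    // Writes the data under stagingDir first, then moves the finished file
    // into targetDir as the "commit" step. Returns the final file location.
    static Path writeStagedThenCommit(Path stagingDir, Path targetDir, String fileName, String data)
            throws IOException
    {
        Files.createDirectories(stagingDir);
        Files.createDirectories(targetDir);

        // Step 1: write to the staging location; readers of targetDir see nothing yet.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, data.getBytes(StandardCharsets.UTF_8));

        // Step 2: commit by moving the staged file to the target location.
        return Files.move(staged, targetDir.resolve(fileName));
    }
}
```

Until the move happens, the target directory does not contain the partially written file, which is the main reason to stage writes on file systems where rename is cheap.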

Contributor

But I don't feel we have to emphasize for-write?

i thought the same

What about hive.use-temporary-staging-directory ?

would do
however, I think the original hive.temporary-staging-directory-enabled is slightly better. Consider the case when one uses S3, where staging cannot (currently) be used.
When a user sets hive.use-temporary-staging-directory = true, they could expect that staging is used (or that an error is reported). However, neither of these will happen.
At the same time, I don't want to report an error when the option is used with S3: a single cluster may be talking to both S3 and HDFS, and the user may want to set the option to change behavior on HDFS.

So... to me this should be either

  • hive.temporary-staging-directory-enabled (original) or
  • hive.hdfs.use-temporary-staging-directory

i prefer the original.
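
The semantics argued for above (the flag permits staging, but S3 targets skip staging without raising an error) can be sketched as a small decision function. This is an assumption-laden illustration: looksLikeS3 is a hypothetical stand-in for the isS3FileSystem check in the diff and inspects only the URI scheme, and the class name is invented.

```java
import java.net.URI;
import java.util.Locale;

// Illustrative sketch only; names and logic are assumptions, not Presto's code.
final class StagingDecisionSketch
{
    private StagingDecisionSketch() {}

    // Hypothetical stand-in for isS3FileSystem(...): looks at the URI scheme only.
    static boolean looksLikeS3(String targetPath)
    {
        String scheme = URI.create(targetPath).getScheme();
        if (scheme == null) {
            return false;
        }
        String lower = scheme.toLowerCase(Locale.ROOT);
        return lower.equals("s3") || lower.equals("s3a") || lower.equals("s3n");
    }

    // Staging is used only when the flag is enabled AND the target is not S3.
    // With the flag enabled and an S3 target, staging is skipped silently.
    static boolean shouldUseTemporaryDirectory(boolean stagingEnabled, String targetPath)
    {
        return stagingEnabled && !looksLikeS3(targetPath);
    }
}
```

Under this reading, "enabled" describes what the flag allows rather than what is guaranteed to happen, which is the distinction drawn between the "-enabled" and "use-" names.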

Contributor Author

Using hdfs in the name could make someone think that this only affects actual HDFS, and not any HDFS-compatible file system.

I would keep the original. @nezihyigitbasi What do you think?

Contributor

I feel like "using" a staging directory sounds better than "enabling" it (hive.use-temporary-staging-directory). But, if you all want to go with the original name I am fine with it.

@@ -76,6 +76,7 @@
private static final String COLLECT_COLUMN_STATISTICS_ON_WRITE = "collect_column_statistics_on_write";
private static final String OPTIMIZE_MISMATCHED_BUCKET_COUNT = "optimize_mismatched_bucket_count";
private static final String S3_SELECT_PUSHDOWN_ENABLED = "s3_select_pushdown_enabled";
private static final String TEMPORARY_STAGING_DIRECTORY_ENABLED = "temporary_staging_directory_enabled";

Contributor

Similarly, should we rename this to use_temporary_staging_directory_for_writes?

nezihyigitbasi (Contributor) left a comment

Make hive temporary staging directory location to be configurable

  • Make hive temporary staging directory location configurable

kokosing (Contributor Author) left a comment

@nezihyigitbasi Comments addressed, would you mind taking a look?

nezihyigitbasi (Contributor) left a comment

LGTM. (Please merge after the release is completed.)

wenleix (Contributor) left a comment

LGTM.

assertEquals(hiveInsertTableHandle.getLocationHandle().getWritePath(), hiveInsertTableHandle.getLocationHandle().getTargetPath());

session = Session.builder(getSession())
.setCatalogSessionProperty("hive", "temporary_staging_directory_path", "/tmp/custom/temporary-${USER}")

Contributor

nit: Maybe explicitly put "temporary_staging_directory_enabled": "true"?

I am asking this because I got confused when looking at it the first time; then I realized temporary_staging_directory_enabled is true by default.
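
The test value "/tmp/custom/temporary-${USER}" suggests the configured path is a template with a user placeholder. A minimal sketch of that substitution follows, assuming the placeholder is replaced with the session user's name; the class and method names are hypothetical and the actual resolution logic is not shown in this thread.

```java
// Illustrative sketch only; assumes ${USER} is replaced with the session user name.
final class StagingPathSketch
{
    private StagingPathSketch() {}

    // Replaces every literal "${USER}" occurrence in the template with the given user.
    static String resolveStagingPath(String template, String user)
    {
        return template.replace("${USER}", user);
    }
}
```

For example, resolveStagingPath("/tmp/custom/temporary-${USER}", "alice") yields "/tmp/custom/temporary-alice".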

For some file systems that wrap S3, Presto should behave in the same way
as it does for S3.
In some deployments, /tmp cannot be used to write temporary files.