DistCp Transformations
Schedoscope can use DistCp for view materialization. The DistCp transformation starts a DistCp job that copies files from various sources into a view's `fullPath`. This makes it particularly suitable for the staging areas of a data warehouse.
case class DistCpTransformation(v: View,
var sources: List[String],
var target: String,
deleteViewPath: Boolean = false,
config: Configuration = new Configuration())
The DistCp transformation copies source files and folders matching a GLOB pattern to a target path.
- `v`: The view using the transformation.
- `sources`: A list of source files / folders.
- `target`: Target folder.
- `deleteViewPath`: Deletes the `fullPath` of the view before copying.
- `config`: Configuration for the MapReduce job. Can be left at its default in most cases.
For detailed information on DistCp, see the official Hadoop DistCp documentation.
DistCp's handling of source and target paths is somewhat unconventional, so the DistCp transformation provides the following helpers:
- `copyToView(sourceView: View, targetView: View)`: Copies the content of the `fullPath` of the `sourceView` to the `fullPath` of the target view.
- `copyToDirToView(sourcePath: String, targetView: View)`: Copies the content of the `sourcePath` folder to the `fullPath` of the target view.
- `copyToFileToView(sourceFile: String, targetView: View)`: Copies the `sourceFile` to the `fullPath` of the target view.
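To illustrate why plain DistCp path handling can be surprising, here is a minimal, self-contained sketch. It does not call DistCp or Schedoscope; `DistCpPathSketch` and `landingPath` are hypothetical names, and the model rests on two assumptions: the target folder already exists, and the behavior matches what the Hadoop DistCp guide describes for copies with and without `-update` / `-overwrite`.

```scala
object DistCpPathSketch {
  // Simplified model (assumption): `target` already exists.
  // Without update/overwrite, the source directory itself is recreated
  // below the target; with update/overwrite, only its contents are copied.
  def landingPath(sourceDir: String,
                  fileName: String,
                  target: String,
                  updateOrOverwrite: Boolean): String = {
    val dirName = sourceDir.split("/").last // last path segment of the source
    if (updateOrOverwrite) s"$target/$fileName"
    else s"$target/$dirName/$fileName"
  }
}
```

Under this model, the same source and target arguments lead to different layouts depending on the options, which is the kind of detail the helpers are meant to hide.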
Example:
val product = dependsOn(() => Product(shopCode, year, month, day))
transformVia(() => DistCpTransformation.copyToView(product(), this))
The behavior of DistCp is highly configurable. To expose all the available options, Schedoscope includes the `DistCpConfiguration` class.
val conf = DistCpConfiguration()
conf.maxMaps = 20
conf.atomicCommit = true
transformVia(() => DistCpTransformation.copyToView(product(), this)
.configureWith(conf))
The class has the following options:
Option | Description |
---|---|
`sourcePaths` | List of source paths. Setting this overrides the `sources` parameter of `DistCpTransformation`. |
`targetPath` | Target path. Setting this overrides the `target` parameter of `DistCpTransformation`. |
`atomicCommit` | Enable atomic commit. Data will either be available at the final target in a complete and consistent form, or not at all. |
`update` | Whether source and target folder contents should be synchronized. |
`deleteMissing` | Delete files that exist in the target but not in the source. |
`ignoreFailures` | Whether failures during copy should be ignored. |
`overwrite` | Overwrite folders/files at the destination. |
`skipCRC` | Whether to skip CRC checks between source and target paths. |
`blocking` | Whether DistCp should run blocking or non-blocking. |
`useDiff` | Use the snapshot diff report between two given snapshots to identify the difference between source and target, and apply the diff to the target to bring it in sync with the source. This option is valid only with |
`useRDiff` | Use the snapshot diff report between two given snapshots to identify what has been changed on the target since the snapshot. This option is valid only with |
`numListstatusThreads` | Number of threads to use for listStatus. At most 40 threads are allowed. Setting this to zero means the value from the configuration properties is used. |
`maxMaps` | Maximum number of mappers to use for the copy. |
`mapBandwidth` | Bandwidth per map, in MB/second. |
`sslConfigurationFile` | SSL configuration file path to use with hftps:// (local path). |
`copyStrategy` | Copy strategy to use. Should map to a strategy implementation in distcp-default.xml. |
`preserveStatus` | A set of file attributes that need to be preserved. |
`preserveRawXattrs` | Indicates that raw.* xattrs should be preserved. |
`atomicWorkPath` | Temporary folder for atomic commit. |
`logPath` | Log path where DistCp output logs are stored. Uses JobStagingDir/_logs by default. |
`sourceFileListing` | File containing a list of source paths. This overrides `sourcePaths`. |
`filtersFile` | Path to a list of patterns to exclude from the copy. |
`append` | Whether to append new data to target files. Valid only with the syncFolder option and when CRC checks are not skipped. |
`fromSnapshot` | The old snapshot folder for `useDiff` / `useRDiff`. |
`toSnapshot` | The new snapshot folder for `useDiff` / `useRDiff`. |
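Some of these options depend on each other, so it can help to check a combination before submitting a job. The following self-contained sketch shows one way to do that. `CopyOptionsSketch` and `DistCpOptionCheck` are hypothetical stand-ins, not Schedoscope classes. The two `append` rules come from the table above, reading syncFolder as Hadoop's internal name for `-update`. The other two rules are assumptions based on Hadoop's own option validation and should be checked against the Hadoop version in use.

```scala
// Hypothetical stand-in holding only the flags needed for the checks below.
final case class CopyOptionsSketch(
  update: Boolean = false,
  overwrite: Boolean = false,
  deleteMissing: Boolean = false,
  skipCRC: Boolean = false,
  append: Boolean = false
)

object DistCpOptionCheck {
  // Returns one message per violated rule; an empty list means no conflict found.
  def problems(o: CopyOptionsSketch): List[String] = List(
    // From the table: append needs update (syncFolder) and CRC checks.
    (o.append && !o.update) -> "append requires update",
    (o.append && o.skipCRC) -> "append cannot be combined with skipCRC",
    // Assumptions based on Hadoop's option validation.
    (o.deleteMissing && !(o.update || o.overwrite)) ->
      "deleteMissing requires update or overwrite",
    (o.update && o.overwrite) -> "update and overwrite are mutually exclusive"
  ).collect { case (true, message) => message }
}
```

For example, `DistCpOptionCheck.problems(CopyOptionsSketch(update = true, deleteMissing = true))` yields an empty list, while setting only `append = true` reports that `update` is missing.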
An example of using DistCp to copy a file into a view's fullPath:
transformVia(() => DistCpTransformation.copyToFileToView("/hdp/prod/stage/input.csv", this))
An example of using DistCp to copy the underlying data of a dependency into a view's fullPath:
val product = dependsOn(() => Product(shopCode, year, month, day))
transformVia(() => DistCpTransformation.copyToView(product(), this))
DistCp is shipped within the Hadoop core; the DistCp transformation is part of the Schedoscope core as well.
Schedoscope tries to automatically detect changes to DistCp transformation-based views and to initiate rematerialization of views if the transformation logic has potentially changed. It does so by comparing checksums; for DistCp transformations, the checksum is based on the `sources` and `target` paths.
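The idea behind a path-based checksum can be sketched as follows. This is an illustration only, not Schedoscope's actual implementation: `DistCpChecksumSketch`, the choice of MD5, and the newline-joined input are all assumptions made for the example.

```scala
import java.security.MessageDigest

object DistCpChecksumSketch {
  // Hypothetical helper: derives a hex digest from the source paths and the
  // target path, so any change to either yields a different checksum.
  def checksum(sources: List[String], target: String): String = {
    val input = (sources :+ target).mkString("\n")
    MessageDigest
      .getInstance("MD5")
      .digest(input.getBytes("UTF-8"))
      .map(b => "%02x".format(b)) // two hex digits per byte
      .mkString
  }
}
```

With such a scheme, changing a source path or the target path changes the checksum and would trigger rematerialization, whereas changing other settings (for example `maxMaps`) would not.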