Data Staging
Stage-in is the ability to load external datasets into Hermes. Stage-out is the ability to export data out of Hermes.
Data staging is frequently used by data warehouses for the extract-transform-load (ETL) process. The data staging area is used for transforming and cleaning large datasets before moving them into the warehouse. We also apply data staging to workflows so that several transformations can run over the same data before it is finally persisted to the warehouse.
Currently, stage-in and stage-out can be applied to POSIX files. Stage-in/stage-out can be used to load directories, specific files, or fractions of files into Hermes to be processed by the application. Note: The ability to stage-in and stage-out HDF5 datasets (as opposed to the entire HDF5 file) is currently under development.
The stage-in / stage-out utility scripts provide the following API:
./stage_in [url] [offset] [size] [dpe]
./stage_out [url]
For now, the [url] parameter is just a POSIX path (e.g., "/home/user/hi.txt"). In the future, it could represent an HDF5 dataset using a different scheme (e.g., "hdf5::/[dataset-group1]/[dataset-name1]"). When [size] is 0, the size of the file is determined automatically. The [dpe] parameter selects the data placement engine (e.g., kRoundRobin).
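Since stage-in can load a fraction of a file, it may help to see how the [offset] and [size] arguments could be computed. The following is a minimal sketch, not part of Hermes: the helper names `file_size` and `second_half_cmd` are hypothetical, and it assumes that [offset] and [size] are expressed in bytes, which the text above does not state. It only prints the command line rather than running it.

```shell
#!/bin/sh
# Sketch: build a stage_in command line for the second half of a file.
# Assumption (not confirmed by the Hermes docs): [offset] and [size] are bytes.

# Print the size of file $1 in bytes (wc -c is portable across Linux and macOS).
file_size() {
  wc -c < "$1" | tr -d ' '
}

# Print the stage_in command that covers the second half of file $1.
# $2 is the placement engine name passed as [dpe], e.g. kRoundRobin.
second_half_cmd() {
  total=$(file_size "$1")
  offset=$((total / 2))
  size=$((total - offset))
  echo "stage_in $1 $offset $size $2"
}
```

Calling `second_half_cmd /tmp/hi.txt kRoundRobin` prints the arguments for review; the printed line can then be run from the directory that holds the stage_in binary. Note that for an empty file the computed size is 0, which the utility treats as "determine the size automatically".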
An example of a typical stage-in / stage-out workflow is as follows:
# Start the Hermes daemon
mpirun -n 1 ${HERMES_INSTALL}/hermes_daemon
# Create a 4 MiB file (1 MiB per process; -b is the block size in bytes)
mpirun -n 4 ior -w -k -b 1048576 -o /tmp/hi.txt
# Stage in the entire file (offset 0; size 0 means detect the size automatically)
mpirun -n 4 ${HERMES_INSTALL}/stage_in /tmp/hi.txt 0 0 kRoundRobin
# Read the file back in IOR
mpirun -n 4 -genv HERMES_CONF=${HERMES_CONF} ior -r -b 1048576 -o /tmp/hi.txt
# Stage the file back out
mpirun -n 4 ${HERMES_INSTALL}/stage_out /tmp/hi.txt
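The workflow above relies on the HERMES_INSTALL and HERMES_CONF environment variables. If either is unset, the commands expand to paths such as "/stage_in" and fail in confusing ways. A minimal guard, assuming a POSIX shell, might look like the sketch below. The helper name `require_vars` is hypothetical and is not provided by Hermes.

```shell
#!/bin/sh
# Sketch: fail early when a variable the staging workflow relies on is missing.

# Return 0 if every named variable is set and non-empty, 1 otherwise.
# Arguments are variable names (without the leading $).
require_vars() {
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      return 1
    fi
  done
  return 0
}
```

Placing `require_vars HERMES_INSTALL HERMES_CONF || exit 1` at the top of a script that wraps the workflow stops it before any mpirun command is launched.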
Stage-in / stage-out can also be applied in a native Hermes program.
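#include <string>
#include <hermes/staging.h>
int main(int argc, char **argv) {
  // Path of the file to stage; assumes it is passed as the first argument
  std::string url = argv[1];
  // Get the stager that handles this kind of URL
  auto stager = DataStagerFactory::Get(url);
  // Load the file into Hermes using round-robin placement
  stager->StageIn(url, PlacementEngine::kRoundRobin);
  // Export the staged data back out of Hermes
  stager->StageOut(url);
}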