Skip to content

Data Staging

lukemartinlogan edited this page Nov 28, 2022 · 11 revisions

Stage-in is the ability to load external datasets into Hermes. Stage-out is the ability export data out of Hermes.

Use Case

Data staging is frequently used by Data Warehouses for the extract-transform-load process. The data staging area is used for transforming and cleaning large datasets before moving them into the warehouse. We also apply data staging to workflows to enable various transformations over the same data before finally persisting to the warehouse.

Current Status

Currently, stage-in and stage-out can be applied to POSIX files. Stage-in/stage-out can be used to load directories, specific files, or fractions of files into Hermes to be processed by the application. Note: The ability to stage-in and stage-out HDF5 datasets (as opposed to the entire HDF5 file) is currently under development.

Utility Script

The stage-in / stage-out utility scripts provide the following API:

./stage-in [url] [offset] [size] [dpe]
./stage-out [url]

The [url] parameter for now is just a POSIX path (e.g., "/home/user/hi.txt"). When [size] is 0, the size of the file will be determined automatically. In the future, this parameter could represent an HDF5 dataset using a different schema (e.g., "hdf5::/[dataset-group1]/[dataset-name1]").

An example of a typical stage-in / stage-out workflow is as follows:

mpirun -n 1 ${HERMES_INSTALL}/hermes_daemon
# Create a 4GB file (1GB / proc)
mpirun -n 4 ior -w -k -b 1048576 -o /tmp/hi.txt
# Stage-in the entire 4GB file
mpirun -n 4 ${HERMES_INSTALL}/stage_in /tmp/hi.txt 0 0 kRoundRobin
# Read the 4GB file in IOR
mpirun -n 4 
-genv HERMES_CONF=${HERMES_CONF} 
-genv LD_PRELOAD "${HERMES_INSTALL}/libhermes_posix.so" \
-genv HERMES_STOP_DAEMON 0 \
-genv ADAPTER_MODE WORKFLOW \
-genv HERMES_CLIENT 1 \
-genv HERMES_PAGE_SIZE 1048576 \
ior -r -b 1048576 -o /tmp/hi.txt
# Stage the file back out
mpirun -n 4 ${HERMES_INSTALL/stage_out /tmp/hi.txt

Native API

Stage-in / stage-out can also be applied in a native Hermes program.

#include <hermes/staging.h>

int main(int argc, char **argv) {
  auto stager = DataStagerFactory::Get(url);
  stager->StageIn(url, PlacementEngine::kRoundRobin);
  stager->StageOut(url);
}
Clone this wiki locally