Skip to content

1. Getting Started

lukemartinlogan edited this page Sep 12, 2022 · 27 revisions

Building Hermes

There are several ways to obtain a working Hermes installation. Information on dependencies can be found in the README.

  1. Docker Image
  1. CMake
  • Instructions can be found in the README
  1. Spack
  • Instructions can be found in the README

If you get stuck, the root of the repository contains a ci folder where we keep the scripts we use to build and test Hermes in a Github Actions workflow. The workflow file itself is here.

Deploying Resources

Hermes is an application extension. Storage resources are deployed under Hermes control by

  1. Configuring Hermes for your system and application
  2. Making your application "Hermes-aware"

An application can be made aware of Hermes in at least three different ways:

  • Through Hermes adapters, LD_PRELOAD-able shared libraries which intercept common I/O middleware calls such as UNIX STDIO, POSIX, and MPI-IO
  • Through an HDF5 virtual file driver (VFD)
  • By directly targeting the Hermes native API

These options represent different use cases and trade-offs, for example, with respect to expected performance gains and required code change.

Adapters

When using the STDIO adapter (intercepting fopen, fwrite, etc.) and the POSIX adapter (intercepting open, write, etc.), there are multiple ways to deploy Hermes with an existing application.

NOTE: The MPI-IO adapter is still experimental, and only supports MPICH at this time.

Hermes services running in same process as the application

If your application is meant to be run as a single process, this is the recommended approach. It doesn't require spawning any daemons. You simply LD_PRELOAD the appropriate adapter and provide a path to a Hermes configuration file through the HERMES_CONFIG environment variable.

# POSIX adapter
LD_PRELOAD=${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  HERMES_CONF=/path/to/hermes.yaml \
  ./my_app
  
# STDIO adapter
LD_PRELOAD=${HERMES_INSTALL_DIR}/lib/libhermes_stdio.so \
  HERMES_CONF=/path/to/hermes.yaml \
  ./my_app

IMPORTANT: Currently, even a single-process application is required to call MPI_Init and MPI_Finalize. This will be fixed soon. See #148.

Hermes services running in separate process as a daemon.

If your app is an MPI application that runs with 2 or more ranks, then you must spawn a Hermes daemon before launching your app. Here's an example of running an app with four ranks on two nodes, two ranks per node:

# We need to start one and only one Hermes daemon on each node. I start this job
# in the background so I can launch the application in the same terminal.
mpirun -n 2 -ppn 1 \
  -genv HERMES_CONF /path/to/hermes.yaml \
  ${HERMES_INSTALL_DIR}/bin/hermes_daemon &

# Now we can start our application

mpirun -n 4 -ppn 2 \
  -genv LD_PRELOAD ${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  -genv HERMES_CONF /path/to/hermes.yaml \
  ./my_app

By default, when the application finishes it will also shutdown the Hermes daemon. However, it is sometimes desirable to keep the daemon alive so your data remains buffered and available for consumption by a second application. We can achieve this via the HERMES_STOP_DAEMON environment variable. Here is an example of a checkpoint/restart workflow.

# Start a daemon
HERMES_CONF /path/to/hermes.yaml \
  ${HERMES_INSTALL_DIR}/bin/hermes_daemon &

# Run an app that writes a checkpoint file
LD_PRELOAD=${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  HERMES_CONF=/path/to/hermes.yaml \
  # Keep the daemon alive
  HERMES_STOP_DAEMON=0 \
  # Don't persist buffered data to the final destination
  ADAPTER_MODE=SCRATCH \
  ./my_producer ${PFS}/checkpoint.txt

# Run an app that reads the buffered data and performs some computation
LD_PRELOAD=${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  HERMES_CONF=/path/to/hermes.yaml \
  ./my_consumer ${PFS}/checkpoint.txt

Normally, the producer would write a checkpoint to a parallel file system, and the consumer would read it back. But when we use Hermes we can buffer the data in faster, local media. The key lines here are to set HERMES_STOP_DAEMON=0 and ADAPTER_MODE=SCRATCH. The default ADAPTER_MODE is to persist the buffered data to the "final destination," (${PFS}/checkpoint.txt in this case), but in SCRATCH mode we keep it buffered.

IMPORTANT: Currently the adapters keep buffered data by reference counting open files. Once all open handles are closed, Hermes deletes the buffered data. In the future we will add ways to retain buffered data even after all open handles are closed (see #258). As a temporary workaround for this limitation, just allow your program to exit without explicitly calling close or fclose. The OS will clean up these handles after the app disconnects from the daemon.

The role of MPI

MPI is used in Hermes for launching processes and doing some synchronization in startup and finalization. Beyond that, no MPI calls are made and no data is transferred over MPI. Data transfer and communication are done via remote procedure calls. That said, MPI is fully available to applications using the native Hermes API. We have wrappers for the most common functionality:

namespace hermes::api {
void Hermes::AppBarrier(); // collective
bool Hermes::IsFirstRankOnNode();
int Hermes::GetProcessRank();
int Hermes::GetNodeId();
int Hermes::GetNumProcesses();
}

If you require more complex MPI usage, you can get access to the MPI_Comm like this:

namespace hapi = hermes::api;
std::shared_ptr<hapi::Hermes> hermes = hapi::InitHermes();
MPI_Comm *app_comm = (MPI_Comm *)hermes->GetAppCommunicator();

// Now you can pass the dereferenced app_comm to any MPI function that accepts
// an MPI_Comm. For example:
int size;
MPI_Comm_size(*app_comm, &size);

Note that what we call the "App communicator" is distinct from the "Hermes communicator." This document assumes all user code using the native Hermes API is contained within a block such as:

if (hermes->IsApplicationCore()) {
  // User code.
} else {
  // Hermes core. No user code here.
}

See the end to end test for an example.

Hermes Tutorial

Here we will walk through an entire example of using Hermes with IOR. IOR supports several I/O APIs (-a option), including POSIX, MPI-IO, and HDF5. Hermes has adapters for POSIX, MPI-IO, and HDF5. For serial (single process) HDF5, the Hermes VFD can be enabled via environment variable as described here. Parallel HDF5 can use the MPI-IO adapter. For this tutorial, we'll focus on POSIX. We assume you already have working Hermes and IOR installations. Note that we currently require a slightly modified IOR which can be found on the chogan/hermes branch of the IOR fork here. See the README for Hermes installation details.

Workload

We will simulate a checkpoint/restart workload in which a group of processes each write a checkpoint file, and then another group of processes on different nodes reads the checkpoint files. In the default case, the checkpoint files will be written to and then read from the parallel file system. When running with Hermes, the data will be buffered in fast, local media resulting in a nice speedup with no code changes required.

Target system

Clients

I'm running on a cluster with 8 client nodes, each with the following characteristics:

Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
40 cores (Hyperthreading enabled)
46 GiB DRAM
40 Gbps ethernet with RoCE capability

Storage Tiers

Name Description Measured Write Bandwidth
PFS OrangeFS running on 8 server nodes, backed by HDDs 536 MiB/s
NVMe Node-local NVMe attached SSDs. 1918 MiB/s
RAM Node-local DRAM. 79,061 MiB/s

Hermes configuration

Here we describe the non-default options we provide in a Hermes configuration file. See Configuration for more details.

num_devices: [1]

Each storage tier we wish to use as buffering is known as a Device in Hermes. The PFS tier is what we call the final destination, or the place where your written files would end up if you ran your application without Hermes. For this example, the only device we're using for buffering is RAM.

capacities_gb: [2]

The first entry represents the RAM device. We allot 2 GiB for buffering.

block_sizes_mb: [1]

Since I know all of our writes will be 1 MiB in size, I set the block size appropriately.

num_slabs: {1};
slab_unit_sizes: [
  [1]
]
desired_slab_percentages: [
  [1.0]
]

Knowing the I/O characteristics of our workload, we only need to support a single 1 MiB block as our buffer size, meaning each device has only 1 slab, and each slab is 1 block.

bandwidths_mbps = [80000]

This is the bandwidth I measured with IOR.

latencies_us: [15]

This can be obtained from the hardware manufacturer.

buffer_pool_arena_percentage: 0.85
metadata_arena_percentage: 0.11
transient_arena_percentage: 0.04

The amount of RAM we allocate to Hermes in the first entry in capacities_gb above is split into actual buffering space and metadata. Here we're saying that 85% of the space is for buffering and the rest is for internal Hermes usage. We plan to simplify this in the future, but for now you may have to play with it a little. Hermes will inform you if it runs out of metadata space.

max_buckets_per_node: 15
max_vbuckets_per_node: 8

We want max_buckets_per_node to be greater than the number of files we plan on opening on each node. Our app will run with 12 ranks on each node, so we'll say 15 buckets just to be safe. We can leave max_vbuckets_per_node as the default, since VBuckets are only used when the adapter will persist files to permanent storage. We'll be using the adapter in scratch mode (i.e., ADAPTER_MODE=SCRATCH) since we're only interested in buffering the files.

mount_points: ""
swap_mount: "/PFS/USER/hermes_swap"

The RAM mount point is always an empty string.

rpc_server_base_name: "ares-comp-"
rpc_server_suffix: "-40g";
rpc_host_number_range: [25-28]

This is a compact way of specify a hosts file that would normally look like this:

ares-comp-25-40g
ares-comp-26-40g
ares-comp-27-40g
ares-comp-28-40g
rpc_protocol: "ofi+verbs"
rpc_domain: "mlx5_0"

I'm using verbs rather than tcp for RoCE support. Internally, Hermes will use RDMA for RPC data transfers over 4 KiB.

default_placement_policy = "RoundRobin";

Since I know my data will all fit into Hermes, and I will be writing and reading all of it, I use the RoundRobin data placement strategy because it's a bit faster than MinimizeIoTime for this simple workload.

Running

IOR Baseline

mpirun -n 48 -ppn 12 \
  ior -w -r -o /PFS/USER/ior.out -t 1m -b 128m -F -e -Y -C -O summaryFormat=CSV

Here we launch 48 IOR processes across 4 nodes. The IOR options are explained in the following table.

Flag Description
-w Perform write
-r Perform read
-o Output/Input file
-t Size per write
-b Total I/O size per rank
-F Create one file for each process
-e Call fsync on file close.
-Y Call fsync after each write.
-C Shuffle ranks so that they read from different nodes than they wrote to
-O summaryFormat Show the output in a compact, CSV format

Some of these options require justification.

  • -Y: We do direct (non-buffered) I/O in order to simulate a situation with high RAM pressure. If the application is using most of the RAM, then the OS page cache will have less RAM available for buffering.
  • -C: This option simulates a situation where different nodes read the checkpoint than the ones that wrote it, resulting in a situation where the checkpoint cannot be read from the page cache, and forcing the app to go to the PFS.

Here are the results:

access,bw(MiB/s),IOPS,Latency,block(KiB),xfer(KiB),open(s),wr/rd(s),close(s),total(s),numTasks,iter
write,30.3120,30.3795,1.5543,131072.0000,1024.0000,15.0489,202.2413,35.0600,202.6921,48,0
read,2012.0224,2026.8185,0.0158,131072.0000,1024.0000,0.4558,3.0314,1.0307,3.0536,48,0

Our write bandwidth is 30 MiB/s and our read bandwidth is 2012 MiB/s.

IOR with Hermes

To enable Hermes with an IOR checkpoint/restart workload, we must start a daemon, LD_PRELOAD a Hermes adapter and set some environment variables.

NOTE: As a temporary workaround to issue #258 we must comment out the line backend->close(fd, params->backend_options); in ior.c:TestIoSys before compiling IOR. This change is implemented in the chogan/hermes branch of the IOR fork here.

We spawn a daemon on each node, then run our app with the appropriate environment variables, similar to the process described above.

HERMES_CONF_PATH=/absolute/path/to/hermes.yaml

# Start one daemon on each node
mpirun -n 4 -ppn 1 \
  -genv HERMES_CONF ${HERMES_CONF_PATH} \
  ${HERMES_INSTALL_DIR}/bin/hermes_daemon &

# Give the daemons a chance to initialize
sleep 3

# Start "checkpoint" app
mpirun -n 48 -ppn 12 \
  -genv LD_PRELOAD ${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  -genv HERMES_CONF ${HERMES_CONF_PATH} \
  -genv HERMES_CLIENT 1 \
  -genv ADAPTER_MODE SCRATCH \
  -genv HERMES_STOP_DAEMON 0 \
  ior -w -k -o ${CHECKPOINT_FILE} -t 1m -b 128m -F -e -Y -O summaryFormat=CSV

# Start the "restart" app
mpirun -n 48 -ppn 8 \
  -genv LD_PRELOAD ${HERMES_INSTALL_DIR}/lib/libhermes_posix.so \
  -genv HERMES_CONF ${HERMES_CONF_PATH} \
  -genv HERMES_CLIENT 1 \
  -genv ADAPTER_MODE SCRATCH \
  ior -r -o ${CHECKPOINT_FILE} -t 1m -b 128m -F -e -O summaryFormat=CSV

Results:

access,bw(MiB/s),IOPS,Latency,block(KiB),xfer(KiB),open(s),wr/rd(s),close(s),total(s),numTasks,iter
write,1748.0946,1928.3122,0.0122,131072,1024,1.9385,3.1862,0.8167,3.5147
read,2580.2756,2861.2635,0.0074,131072,1024,0.1923,2.1473,1.7458,2.3811

We get a nice boost in write bandwidth, and a modest speedup in read bandwidth, all with no code changes.

We haven't done any performance optimization yet, so I expect to bridge the gap significantly between the 2.5 GiB read speed of Hermes and the baseline speed of reading from RAM (tmpfs on /dev/shm) with IOR of 60 GiB.

Clone this wiki locally