Installation
Here is a quick sketch of getting the toolchain up and running. One of the first decisions you need to make is where you want the $MPISTAT_HOME folder to be. It's recommended that this goes on a parallel filesystem - the same one that you are most interested in scanning. The MPI workers will be creating files in this directory, so it needs to be available on every compute node, have good IO performance, and have enough space to store the protocol buffer data files.
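For example (the Lustre path below is purely illustrative - substitute whatever parallel filesystem you intend to scan):

```shell
# Hypothetical location on a Lustre scratch area; the same path must resolve
# on every compute node that will run the MPI workers.
export MPISTAT_HOME=/lustre/scratch/mpistat

# Confirm the variable is set before proceeding
echo "MPISTAT_HOME=$MPISTAT_HOME"
```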
We are keeping the protocol buffer files indefinitely at the moment. One scan typically produces 47GB and we do 2 scans a week, so we accumulate roughly 5TB of mpistat data a year (47GB × 2 scans × 52 weeks ≈ 4.9TB) in our multi-petabyte system. This data is effectively replicated in our ClickHouse database, which is on SSD. Due to the limited SSD space we periodically cull older databases. Since we keep the protocol buffer files, we can easily recreate a historic database if needed. It can be interesting to compare databases of the same system at different time points to see how usage changes over time.
- Set up the ClickHouse server
- Get the dependencies set up
- MPI
- Boost C++ Library
- Google Protocol Buffers Library
- libcircle
- Clone the mpistat repo and set the MPISTAT_HOME environment variable to this location
- Make a local copy of bin/activate.sh.example called bin/activate.sh and edit appropriately
- Build the mpistat binary by running make in the src folder
- Edit the QoS / queue settings in the Slurm / UGE templates (we will shortly make this a config option, templated via the Jinja templates, so this step will not be needed in future)
- Give it a try as a non-root user on a folder that the test user has permission to traverse
- Once you are happy that things are working correctly, you can let it loose on a full filesystem as root.
(Obviously, running anything as root is done at your own risk. Take note of items 15 and 16 in the License associated with this work.)
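The steps above can be sketched as follows. The dependency command names are common defaults (an MPI compiler wrapper, the protobuf compiler, make); sites that provide these via environment modules will need `module load` lines instead, and the clone/build lines are shown as comments because the repository URL and paths depend on your site.

```shell
# 1. Check that the build prerequisites are visible on PATH
missing=""
for tool in mpicc protoc make git; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
echo "missing tools:${missing:- none}"

# 2. Clone the repo and point MPISTAT_HOME at it (URL/path illustrative)
#    git clone <repo-url> mpistat && cd mpistat
#    export MPISTAT_HOME="$PWD"

# 3. Make a local copy of the activation script, edit it for your site,
#    then source it so the build can find MPI, Boost, protobuf and libcircle
#    cp bin/activate.sh.example bin/activate.sh
#    source bin/activate.sh

# 4. Build the mpistat binary
#    cd src && make
```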