KrakenMare is a containerized software stack providing a versatile data pipeline to monitor HPC clusters. The source code is provided under the Apache License version 2.0 (see the LICENSE file at the root of the tree). The data is provided under the Creative Commons 0 License version 1.0 (see the LICENSE.data file at the root of the tree).
This work has been supported in part by the U.S. Department of Energy under LLNS Subcontract B621301.
Configuration:
- Minimum RAM = 16 GB, Minimum Cores = 4
- Recommended RAM = 32 GB, Recommended Cores = 8
Install docker (you may use instructions at https://github.com/bcornec/Labs/tree/master/Docker#docker-installation)
Follow the instructions at https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user to enable your userid to run docker commands without sudo.
Check it works:
#
docker run hello-world
ONLY for proxy-challenged IT departments
you may need to perform the following tasks:
#
mkdir -p /etc/systemd/system/docker.service.d/
#
cat > /etc/systemd/system/docker.service.d/http-proxy.conf << EOF
[Service]
Environment="HTTP_PROXY=http://web-proxy.domain.net:8080" "HTTPS_PROXY=http://web-proxy.domain.net:8080" "NO_PROXY=<insert-your-hostname-here>"
EOF
Note: change web-proxy.domain.net to the name of your proxy machine, and adjust the port (8080) to the one your proxy uses.
You may want to read https://docs.docker.com/engine/admin/systemd/#http-proxy
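The drop-in file above can also be generated from variables instead of editing the heredoc by hand. This is a minimal sketch: the function name `render_proxy_dropin` and its arguments are hypothetical and not part of KrakenMare.

```shell
# Hypothetical helper (not part of KrakenMare): print a systemd drop-in
# that sets the proxy environment variables for the Docker daemon.
# Usage: render_proxy_dropin <proxy-host> <proxy-port> <no-proxy-list>
render_proxy_dropin() {
    printf '[Service]\n'
    printf 'Environment="HTTP_PROXY=http://%s:%s" "HTTPS_PROXY=http://%s:%s" "NO_PROXY=%s"\n' \
        "$1" "$2" "$1" "$2" "$3"
}
```

As root, `render_proxy_dropin web-proxy.domain.net 8080 "$(hostname)" > /etc/systemd/system/docker.service.d/http-proxy.conf` writes the same file as the heredoc. Once systemd has been reloaded and docker restarted, `systemctl show --property=Environment docker` should list the proxy values.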
For all IT departments
Create daemon.json to tell Docker about the registry and mirror that will be created and to enable their use.
#
mkdir -p /etc/docker
#
echo '{"registry-mirrors": ["http://myregistry:5000"], "insecure-registries": ["http://myregistry:5000", "http://myregistry:5001"], "dns": ["8.8.8.8", "4.4.4.4"]}' > /etc/docker/daemon.json
Note: adjust the 8.8.8.8 and 4.4.4.4 IP addresses to match your DNS IP addresses and myregistry to the name of your registry machine. For single machine setup the registry is the current hostname.
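Since the registry name and DNS addresses each appear in several places, the file can be rendered from arguments to avoid a partial edit. This is a sketch only: `render_daemon_json` is a hypothetical helper, and it reproduces the ports 5000 and 5001 used in the command above.

```shell
# Hypothetical helper (not part of KrakenMare): print a daemon.json with the
# same keys as the echo command above.
# Usage: render_daemon_json <registry-host> <dns-1> <dns-2>
render_daemon_json() {
    printf '{"registry-mirrors": ["http://%s:5000"], ' "$1"
    printf '"insecure-registries": ["http://%s:5000", "http://%s:5001"], ' "$1" "$1"
    printf '"dns": ["%s", "%s"]}\n' "$2" "$3"
}
```

For a single machine setup, run as root: `render_daemon_json "$(hostname)" 8.8.8.8 4.4.4.4 > /etc/docker/daemon.json`, with the two DNS addresses replaced by your own.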
#
systemctl daemon-reload
#
systemctl restart docker
#
curl -L https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
#
chmod +x /usr/local/bin/docker-compose
The next steps must be performed as a user able to run docker commands (for example, a user belonging to the docker group).
$
cp playbooks/km.conf ~/km.conf
Note: edit it to match your setup. The default configuration is for the minimum size system.
$
docker swarm init --advertise-addr x.y.z.t
Note: x.y.z.t is the local IP address the swarm attaches to. It must be an address configured on the system you are running on.
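To find a candidate for x.y.z.t, the addresses configured on the system can be listed. This is a minimal sketch, assuming the iproute2 `ip` command is installed; `list_ipv4` is a hypothetical helper that only parses its output.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# each non-loopback IPv4 address without its prefix length.
list_ipv4() {
    awk '$2 != "lo" { split($4, parts, "/"); print parts[1] }'
}
```

Run it as `ip -o -4 addr show | list_ipv4`. The list may include addresses of Docker bridge interfaces, so pick the one that other machines use to reach this system.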
Instructions from this point require no modification and are to be run as shown.
$
for label in broker-1 broker-2 broker-3 fanin registry framework injectors test-tools supervisory_cloud; do docker node update --label-add $label=true $(hostname); done
$
playbooks/setup.sh -r
Note: this will start two registries: a mirror registry and a proxy registry.
$
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
df2b2c0bd90f registry "/entrypoint.sh /etc..." 5 seconds ago Up 5 seconds 0.0.0.0:5000->5000/tcp docker-registry_registry-mirror_1
75a0537d3758 registry "/entrypoint.sh /etc..." 57 minutes ago Up 5 seconds 0.0.0.0:5001->5000/tcp docker-registry_registry-private_1
Check that docker ps shows two registries. If you restart docker, you will need to run setup.sh -r again to start the registries.
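The registry check can be scripted by counting containers started from the registry image. This is a sketch under the assumption that both containers use an image named `registry`, as in the listing above; `count_registries` is a hypothetical helper.

```shell
# Hypothetical helper: read image names (one per line) on stdin and print
# how many of them are the "registry" image, with or without a tag.
count_registries() {
    grep -c -E '^registry(:.*)?$'
}
```

`docker ps --format '{{.Image}}' | count_registries` is expected to print 2 when both registries are up.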
$
playbooks/setup.sh -b
This will build the stack, downloading what it needs from the Internet. It can take a long time, depending on your Internet connection.
$
playbooks/setup.sh -d
This will deploy all containers and start the stack.
Then, after giving the stack time to start (it may take up to 5 minutes), run a sanity check:
$
playbooks/setup.sh -t
running:timeout 10 mosquitto_sub -h mosquitto -p ok ... see logs </tmp/mosquitto>
running:kafkacat -b broker-1 -L ok ... see logs </tmp/broker-1>
running:kafkacat -b broker-2:9093 -L ok ... see logs </tmp/broker-2>
running:kafkacat -b broker-3:9094 -L ok ... see logs </tmp/broker-3>
running:redis-cli -h redis ping ok ... see logs </tmp/redis>
running:curl -s framework:8080/agents ok ... see logs </tmp/framework>
running:curl -s http://druid:8081/status/health ok ... see logs </tmp/druid_coord>
running:curl -s http://druid:8082/status/health ok ... see logs </tmp/druid_broker>
running:curl -s http://druid:8083/status/health ok ... see logs </tmp/druid_histo>
running:curl -s http://druid:8090/status/health ok ... see logs </tmp/druid_overlord>
running:curl -s http://druid:8091/status/health ok ... see logs </tmp/druid_middle>
running:kafkacat -L -X ssl.ca.location=/run/secr ok ... see logs </tmp/ssl-broker-1>
running:kafkacat -L -X ssl.ca.location=/run/secr ok ... see logs </tmp/ssl-broker-2>
running:kafkacat -L -X ssl.ca.location=/run/secr ok ... see logs </tmp/ssl-broker-3>
running:curl -s http://prometheus:9090/api/v1/ta ok ... see logs </tmp/prometheus>
running:curl -s --cacert /run/secrets/km-ca-1.cr ok ... see logs </tmp/schemaregistry>
running:kafkacat -b broker-1:29092 -L -X securit ok ... see logs </tmp/sasl-broker-1>
running:kafkacat -b broker-2:29093 -L -X securit ok ... see logs </tmp/sasl-broker-2>
running:kafkacat -b broker-3:29094 -L -X securit ok ... see logs </tmp/sasl-broker-3>
running:curl -s --cacert /run/secrets/km-ca-1.cr ok ... see logs </tmp/schemaregistry>
number of messages in KAFKA on the fabric topic... @time = 1587455812 : 320242
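Because the stack may take up to 5 minutes to start, the sanity check can be retried instead of run once. This is a minimal sketch: `retry_until` is a hypothetical helper, and using it with setup.sh assumes (the source does not confirm this) that `playbooks/setup.sh -t` exits with a non-zero status while checks are still failing.

```shell
# Hypothetical helper: rerun a command until it succeeds or the time budget
# is used up. Only the sleep time is counted, not the command's own runtime.
# Usage: retry_until <timeout-seconds> <interval-seconds> <command> [args...]
retry_until() {
    ru_timeout=$1
    ru_interval=$2
    shift 2
    ru_elapsed=0
    while ! "$@"; do
        ru_elapsed=$((ru_elapsed + ru_interval))
        if [ "$ru_elapsed" -ge "$ru_timeout" ]; then
            return 1    # budget exhausted, command never succeeded
        fi
        sleep "$ru_interval"
    done
    return 0
}
```

For example, `retry_until 300 30 playbooks/setup.sh -t` would rerun the check every 30 seconds for up to 5 minutes.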
A single zookeeper using confluent cp-zookeeper image as base is started with security features enabled.
Three brokers, each on its own set of ports: open ones (9XXX) and SASL-authenticated/TLS-encrypted ones (2XXX). Each broker has its own label, enabling multi-node deployment in swarm. Java memory sizing is controlled by the km.conf file created above.
Eclipse mosquitto broker with security enabled. It is co-located with fanin service.
Multipurpose test container. Used by setup.sh to check pipeline.
Reads messages from MQTT and sends them to Kafka.
Apache druid in single server configuration. Size is controlled by km.conf.
Grafana 6.6.2 (needed for support of Druid plugin) configured to read from Druid database and to display Prometheus Kafka metrics.
Confluent schemaregistry as central repository of Avro schema
Agent, device and sensor registration server. Uses redis for backend
Backend for framework
Kafka performance metrics exporter for Grafana
Adds users for services
Creates Avro schema files from Avro definition files and pushes to schemaregistry
Pushes ingestion spec into druid
Creates and configures topics
Sample agent that automatically runs to show all features
Elasticsearch for persisting agent, device, sensor registration data
Configures elasticsearch sink connectors
Kafka connect for sink to elasticsearch.
Sample redfish agent is configured (via RedfishAgent.cfg) to contact HPE's externally available iLO simulator at https://ilorestfulapiexplorer.ext.hpe.com/redfish/v1 . Further documentation on iLO REST API can be found at https://www.hpe.com/us/en/servers/restful-api.html
Because this network path may not be reachable, the agent is not started at stack startup. Please use
$
docker service logs -f krakenmare_redfish
until it reports
use: /redfish/start.sh to actually start the container payload.
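Waiting for that message can be automated by testing the service logs for it. This is a minimal sketch: `has_line` is a hypothetical helper that does a fixed-string search on its standard input.

```shell
# Hypothetical helper: succeed (exit 0) if any line read from stdin contains
# the given fixed string, fail (exit 1) otherwise.
# Usage: <log-producing command> | has_line <text>
has_line() {
    awk -v pat="$1" 'index($0, pat) { found = 1; exit } END { exit !found }'
}
```

`docker service logs krakenmare_redfish 2>&1 | has_line '/redfish/start.sh'` returns success once the message has been logged, so it can be polled from a loop.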
Please enter the container with
$
docker exec -ti $(docker ps | grep redfish | awk '{print $1}') bash
and run /redfish/start.sh there, or use
$
docker exec $(docker ps | grep redfish | awk '{print $1}') /redfish/start.sh
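The `grep redfish` lookup matches any column of the `docker ps` output. As a possible alternative, docker's own name filter can select the container. This is a sketch: `redfish_container_id` is a hypothetical helper, and it assumes the container name contains "redfish".

```shell
# Hypothetical helper: print the ID of the first running container whose
# name contains "redfish", using docker's name filter instead of grep.
redfish_container_id() {
    docker ps --quiet --filter name=redfish | head -n 1
}
```

The agent can then be started with `docker exec "$(redfish_container_id)" /redfish/start.sh`.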
The agent requests the list of chassis and, from each chassis, the temperatures and fans, building up a map of sensors. This process takes several seconds.
The agent registers itself and its map of sensors, then queries the iLO once a minute.