cstackpole edited this page Apr 14, 2013 · 43 revisions

Welcome to the Cluster Builder Manual!

This manual is licensed under the Creative Commons Attribution-ShareAlike CC BY-SA license

This guide is still under construction. If you run into a problem, please post an issue report and I will fix it as soon as I can. Thanks!

Why build a cluster?

Clusters are a broad subject. Some people need computational power for research or business, while others might use this guide to learn parallel coding, to learn administration, or just for fun. I enjoy building clusters for fun, for research, and for profit. This guide walks you through building a cluster from the ground up, including the basics of administration and management. I am in the process of expanding the notes to cover a variety of possibilities and configurations.

Pre-build process

Cluster design.

How you build a cluster can vary wildly depending on how much money you are willing to invest. This guide assumes a smaller cluster on a standard Gigabit network.

The frontend should have at least an 80GB hard drive, but the more room you have for your /home partition the better. Because your users will be logging into the frontend to compile, launch jobs, and do work, it really should have several GB of memory and multiple cores. A section on creating a Login Node for users will come later. The frontend should also have two network cards. The first connects to the public network or the internet, while the second connects to the private network that is reserved specifically for the nodes.

The compute nodes should be as beefy as possible. The type of nodes depends entirely on the type of work. Jobs with large data sets may require more memory than processing power, while rendering jobs may require more GPUs than anything else. If you are just doing this for the education, then whatever you have works. My first personal cluster had ancient hardware because that is all I had access to. My current personal playground setup consists of clearance-sale refurb boxes from Newegg; they cost me very little and are surprisingly powerful. When building a cluster for a specific job, tailoring the compute nodes to that job is very important, but the right choice depends on the job type.

In the 'simple' setup, this guide will assume that you will be exporting your /home directory from the frontend via NFS. This is not the only option that you have. It is not uncommon to have a SAN or NAS on which /home is stored. Some build their own while others buy one. This is a more advanced topic that is outside the scope of this guide at this time.

Guide example cluster

This guide is assuming the following setup.

Internet <-> Frontend01 <-> nodes

The frontend is known as frontend01.cluster.domain. It has a public side address of 192.168.1.201 and a private side address of 10.10.10.10.

The nodes are known as node01, node02, node03, and node04.
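
As a rough sketch of how this layout maps to name resolution, the following Python snippet prints /etc/hosts-style lines for the example cluster. The frontend name, its private address, and the node names come from this guide. The node addresses (counting up from 10.10.10.11) are an assumption made purely for illustration, since this page does not assign them.

```python
import ipaddress

FRONTEND_NAME = "frontend01.cluster.domain"               # from this guide
FRONTEND_PRIVATE = ipaddress.ip_address("10.10.10.10")    # from this guide
NODE_NAMES = ["node01", "node02", "node03", "node04"]     # from this guide


def hosts_lines(first_node_offset=1):
    """Return /etc/hosts-style lines for the frontend and the nodes.

    ASSUMPTION: nodes are numbered consecutively, starting
    `first_node_offset` addresses after the frontend's private address.
    """
    lines = [f"{FRONTEND_PRIVATE}\t{FRONTEND_NAME} frontend01"]
    for i, name in enumerate(NODE_NAMES):
        addr = FRONTEND_PRIVATE + first_node_offset + i
        lines.append(f"{addr}\t{name}")
    return lines


if __name__ == "__main__":
    print("\n".join(hosts_lines()))
```

If your private network uses a different numbering scheme, adjust the offset or replace the loop with an explicit mapping.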

The example cluster that this guide uses is based on a 64-bit system, but the guide will attempt to write commands so that they also work on 32-bit systems. Any time the variable $ARCH is shown, substitute it with either i386 or x86_64.
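
The $ARCH substitution can be sketched in Python as follows. This is only an illustration: the helper names `guide_arch` and `expand` are hypothetical, and the sample path in the usage line is a placeholder, not a path from this guide.

```python
import platform
from string import Template


def guide_arch(machine=None):
    """Map a machine string to one of the guide's two $ARCH values."""
    machine = (machine or platform.machine()).lower()
    if machine in ("x86_64", "amd64"):
        return "x86_64"
    if machine in ("i386", "i486", "i586", "i686"):
        return "i386"
    raise ValueError(f"unsupported architecture for this guide: {machine}")


def expand(text, arch=None):
    """Replace $ARCH in `text`; any other $-names are left untouched."""
    return Template(text).safe_substitute(ARCH=arch or guide_arch())


if __name__ == "__main__":
    # Placeholder path, used only to show the substitution.
    print(expand("/path/to/repo/$ARCH/", "x86_64"))
```

Calling `expand(text)` without an explicit architecture uses whatever the local machine reports, which is usually what you want when running a command on the frontend itself.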

Build a repo.

If you are building a cluster of any significant size, you will be grabbing the same packages many times. This can be very time consuming over a slow internet connection and puts a heavy load on a community repository. In these situations it is often very useful to have a local repository from which you can pull your packages. Here is one way of building a local repository.
CreateRepo

Building a cluster.

Operating system.

Start with the installation of the frontend. The operating system you choose is very important. Many prefer using Red Hat, CentOS, or Scientific Linux, but there are many good reasons for choosing a Debian-based system as well.
Scientific Linux
Debian (On the Todo list!)
Ubuntu (On the Todo list!)

Configuring software on the frontend

First login to configure the new installation: On Scientific Linux.

Verify network settings.

Configure NFS for /home.

Configure a DHCP/TFTP server.

Kickstarting the nodes

Kickstarting the nodes

Resource Management

Hardware resource manager: Torque

User resource manager: Maui '(coming soon)'

-or-

Open Grid Scheduler '(coming soon)'

Parallel Computing

OpenMPI

Testing and benchmarking the cluster

Cbench

Administration of the cluster

Puppet '(coming soon)'

Modules '(coming soon)'

Configuring users

Add users and push their logins to the nodes

Add a development user for creating packages for the cluster

Troubleshooting.

Things didn't go as planned, huh? I am truly sorry. Unfortunately, my fingers don't always type what my brain tells them to do and I typo something I shouldn't have. Chances are, that is what happened. The best place to start is at the beginning. Once we find a place where things are going wrong, we can narrow down the potential problems.

  • Does the DHCP server start?
    ** Verify your /etc/dnsmasq.conf file is typo free.
  • When booting the node, does it get an IP address and reach the TFTP server?
    ** Is DNSMasq running properly?
    ** Watch the /var/log/messages file on the server.
    ** Verify your firewall settings. Try temporarily disabling the firewall to see if that helps. If so, fix your firewall rules and turn it back on.
    ** Verify your SELinux settings. Try temporarily disabling SELinux to see if that helps. If so, fix the SELinux permissions and turn it back on.
  • When installing the node, does it get a kickstart file?
    ** Check permissions on the http.cluster.domain server.
  • When installing the node, does it fail during install?
    ** Verify your kickstart file is correctly configured. Also, knowing where in the install process it fails can help.
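
Two of the checks above can be partly automated. The sketch below, in Python, filters the system log for dnsmasq entries and asks the web server for the kickstart file. It assumes Python 3 is available on the frontend and that you can read /var/log/messages (usually as root). The kickstart path `ks.cfg` is a hypothetical placeholder; use the path from your own setup.

```python
import urllib.error
import urllib.request


def dnsmasq_lines(log_lines):
    """Keep only the log lines that mention dnsmasq."""
    return [line.rstrip("\n") for line in log_lines if "dnsmasq" in line]


def kickstart_status(url, timeout=5):
    """Return the HTTP status code for `url`, or an error string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered with an error; 403 suggests a permissions problem.
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return f"unreachable: {err}"


if __name__ == "__main__":
    # Show the most recent dnsmasq entries from the system log.
    with open("/var/log/messages", errors="replace") as log:
        for line in dnsmasq_lines(log)[-20:]:
            print(line)

    # HYPOTHETICAL file name; substitute your real kickstart URL.
    print(kickstart_status("http://http.cluster.domain/ks.cfg"))
```

A status of 200 means the node should be able to fetch the file, so the problem is more likely in the kickstart contents than in the server permissions.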

Helpful Links

Helpful Links

