sprocket i/o

thomas stromberg on technology, nature, and motorcycles


Rocks: Cluster in a Can

June 14th, 2005

A few months ago, I needed to roll out a cluster test-bed for people to test out clustering technologies at work before they pay and play on the real thing. I saw that Rocks 4.0.0 came out last week and decided to pull this article from my drafts queue and get it posted.

Having to deploy a new test cluster was my chance to take a fresh look at a Linux distribution I was previously not very impressed with: Rocks Cluster Distribution. I had looked at Rocks last year for a similar problem, and decided to use ClusterKnoppix instead. Using ClusterKnoppix ended up being a complete waste of time. It didn’t do half of what we wanted, and caused much more headache than it was worth. Still, the idea of not wasting disk space on the operating system was interesting to us.

Read on if you’re a systems geek…

So, what is Rocks anyways? It’s like a cluster in a can. In 3 hours or less, you can have a fully functional 24-node Linux cluster, decked out with all the best clustering technologies such as Torque, Condor, SGE, Ganglia, Maui, and PVFS. Best of all, anyone can do it. To get 24 nodes set up, I only ever had to run a single command once to GitR’Done: insert-ethers.

Rocks 3.3.0 is based on Red Hat Enterprise Linux 3, though the installation is entirely customized so you would never recognize it. Being based on a Linux 2.4 platform has its disadvantages when it comes to performance and support, but they are not bad enough to overshadow just how awesome Rocks is.

Since the time of this writing, Rocks 4.0.0 has come out, which I believe may use Linux 2.6.

Installation

Installation for Rocks Cluster Linux is very straightforward, though not without its possibility for error. The first concept to get under your belt is that of ‘rolls’. Rolls are essentially meta-packages, each containing a set of RPMs. The difference is that rolls are expected to be completely self-configuring. If you install the Condor roll, for instance, you will boot up to an automagically configured Condor grid without ever having to open a configuration file. If you’ve dealt with Condor as much as I have, this is a good thing. Rolls can often be installed after the head node has been installed, but not all of them can. My advice is to install every roll that you think you may use in the future up front.

First, you head on over to the 3.3.0 Download page, and grab the ISOs for the base, hpc, and kernel rolls. My crappy old Dell GX110s could never read the hpc+kernel CD (read past disk errors), so instead I went directly to their FTP site to grab the hpc and kernel disks separately. I also grabbed the SGE, PBS, and Condor rolls for good measure. It’s honestly pretty annoying that you have to burn each of these onto a CD for the installation, rather than just booting a small ISO and pointing it back to the FTP site, but in theory the pain is only suffered once. One thing to keep in mind is that Rocks will silently fail if there are any read errors on the CDs, so save yourself an hour or two by having your burning software verify the CD integrity after the fact. In my case, even CDs that were verified ended up unreadable in the CD-ROM drive I was using for the install, so it took a few attempts.

Once you insert the ‘base’ disk into your head node, a quick boot screen will flash, at which you need to type “frontend” if you are installing a new head node. It will then boot, and ask you to insert the CD for every roll that you plan to deploy. Once it knows what kind of cluster you are configuring, it will then ask for some basic information, such as the hostname for the cluster, the cluster “name” (Rocky!), a contact address, and where it’s located. Then, it asks for all the CDs, yet again, to copy and install them. It took about an hour to install all of the rolls onto the 667MHz GX110s I had for it, which was a bit slower than I had hoped for, but given the hardware, understandable. Once that was all through, Rocks prompted me for a root password, and network information for both the internal (eth0) and external (eth1) interfaces. Once that was done, we were ready to reboot and rock ’n’ roll.

The next step is to log into the head node, and run insert-ethers, which sets up kickstarting for your compute nodes. The way it actually works is that it’s just a little Python script that sniffs out new MAC addresses on the internal network, hands out a new name such as compute-0-1 to each of them, and adds them to dhcpd.conf, /etc/hosts, and the cluster information database. Once this is running you can start powering on your compute nodes in order, and they will automatically join your cluster. The process of joining the cluster takes about 20 minutes per compute node on the old hardware that I have, but you can run several simultaneously (up to 7, on my hardware). As each node powers up, you will see the insert-ethers script saying that it found a new node and is loading the operating system onto it.
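To make the bookkeeping described above a little more concrete, here is a rough Python sketch of the idea: take the MAC addresses seen on the internal network, give each unseen one the next compute-0-N name, and render the matching dhcpd.conf and /etc/hosts entries. This is not the actual Rocks script. The function names, the 10.1.0.0/16 network, the choice to reserve the first address for the head node, and numbering from 0 are all assumptions made for illustration.

```python
import ipaddress


def register_nodes(seen_macs, known=None, rack=0, network="10.1.0.0/16"):
    """Give every MAC not already in `known` the next compute-<rack>-<rank> name."""
    known = dict(known or {})                      # MAC -> (hostname, ip)
    used = {ip for _, ip in known.values()}
    free = (str(h) for h in ipaddress.ip_network(network).hosts()
            if str(h) not in used)
    next(free)                                     # assumption: first address is the head node
    for mac in seen_macs:
        mac = mac.lower()
        if mac in known:                           # DHCP requests repeat; register once
            continue
        hostname = f"compute-{rack}-{len(known)}"  # assumption: ranks start at 0
        known[mac] = (hostname, next(free))
    return known


def dhcpd_entry(hostname, mac, ip):
    """Render a static host stanza in ISC dhcpd.conf syntax."""
    return (f"host {hostname} {{\n"
            f"    hardware ethernet {mac};\n"
            f"    fixed-address {ip};\n"
            f"}}")


def hosts_line(hostname, ip):
    """Render the matching /etc/hosts line."""
    return f"{ip}\t{hostname}"


if __name__ == "__main__":
    nodes = register_nodes(["00:11:22:33:44:55", "00:11:22:33:44:66"])
    for mac, (hostname, ip) in nodes.items():
        print(dhcpd_entry(hostname, mac, ip))
        print(hosts_line(hostname, ip))
```

Because names are handed out in the order the MAC addresses show up, powering the nodes on in physical order is what keeps the hostnames matching their position in the rack.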

One of the worst things I did the first time I installed Rocks was leave the nodes set to boot off of PXE before the CD-ROM or hard disk. This was a holdover from the ClusterKnoppix infrastructure we previously had on the machines, and it certainly did not apply to Rocks, which installs the operating system locally. I ended up waiting an hour before I realized that each machine was continuously reinstalling itself. In the end, I set each machine to boot from CD-ROM, hard disk, and then PXE, and hit F12 while booting to force a network boot for installation on the nodes that previously had Windows XP installed.

As soon as your compute nodes reboot, you can start scheduling jobs to them and see their vital statistics in the Ganglia monitoring software. Initially, I thought of Ganglia as a duplication of effort, since we were already running Zabbix, but Ganglia is so well integrated with Rocks, and so informative, that it makes no sense to run anything else for basic statistics generation. I have not yet tried to add any custom Ganglia plug-ins, but it is probably fairly trivial to do so. You can see the Ganglia interface for my test cluster, Rocky. It’s all very nifty.

Other than adding users, you are essentially done with your configuration. Congratulations!

Maintenance

Part of why I did not take Rocks seriously in the past was the maintenance overhead behind its structure and philosophy. This is not to say that you cannot treat a Rocks cluster like a collection of independent Red Hat machines, but to do so would be doing a great disservice to yourself.

Because Rocks defaults to only having the head node available on the public network, authentication has to be handled a little differently. I don’t know about Kerberos behind a NAT, but at least NIS does not work too well. Rocks instead has its own NIS work-alike named 411, which uses multicast and takes some getting used to. You can also use it to distribute arbitrary files, such as /etc/sudoers. It’s a pretty neat system, but if you are looking to integrate Rocks into your existing directory and authentication infrastructure, it can be a little bit of work.

Normally I handle adding patches and packages to Linux clusters in a bit of a crude (but mostly effective) manner. I install the package onto a node, and have a script which rsyncs out all the changes to the compute nodes. Naturally, you run into the possibility of version/package skew, where you are not updating all the files you should be on all the machines you should be, which can be a bit frustrating. Even worse, if you use this method to distribute patches, you have to watch out for any patches that touch grub or lilo, as you will have to reinstall the boot sector to keep everything in sync. To complicate things, if some nodes boot off of SCSI and some off of IDE, you need to treat their differing /boot directories appropriately.
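For readers who have not seen this style of maintenance, a push script along these lines might look like the Python sketch below. It only builds the rsync command lines, with --dry-run on by default, and never executes them. The function name, node names, paths, and exclude pattern are hypothetical, and the real script will differ. The rsync options used (-a, --delete, --dry-run, --exclude) are standard ones.

```python
import shlex


def rsync_commands(nodes, paths, excludes=(), dry_run=True):
    """Build one rsync command line per (node, path) pair without running it."""
    commands = []
    for node in nodes:
        for path in paths:
            cmd = ["rsync", "-a", "--delete"]
            if dry_run:
                cmd.append("--dry-run")            # report changes, touch nothing
            cmd += [f"--exclude={pattern}" for pattern in excludes]
            src = path.rstrip("/") + "/"           # trailing slash: sync directory contents
            cmd += [src, f"{node}:{src}"]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical node names and paths, for illustration only.
    for cmd in rsync_commands(["compute-0-0", "compute-0-1"],
                              ["/usr/local"], excludes=["*.pid"]):
        print(shlex.join(cmd))
```

Anything that differs per node, such as /boot on a mixed SCSI and IDE cluster, has to stay out of the path list or be covered by an exclude, which is exactly where the skew problems creep in.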

Rocks has a different approach which, while it has more overhead for smaller clusters, is certainly smarter if your local compute node storage is disposable. If you want to add software, you simply copy the rpm packages to a directory and add them to an XML file which defines which rpm packages get installed on each type of machine you have defined your nodes to be. When you want to reload a node, you just type shoot-node [hostname]. 10 minutes later, you’ve got a fresh new compute box that has rejoined your cluster with the latest goods. Pretty slick, eh?
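As a back-of-the-envelope illustration of what reloading a whole cluster costs, the Python sketch below splits the nodes into batches and estimates the wall-clock time. It assumes the 10 minutes per reload quoted above, and it assumes that the limit of 7 simultaneous installs seen during the initial rollout also applies to reloads, which may not hold on other hardware. The function name and node names are hypothetical, and the code only prints the shoot-node commands rather than running them.

```python
def reload_plan(nodes, batch_size=7, minutes_per_reload=10):
    """Split nodes into batches and estimate the total reload time in minutes."""
    batches = [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
    commands = [[f"shoot-node {node}" for node in batch] for batch in batches]
    # Assumption: each batch reloads in parallel and batches run back to back.
    return commands, len(batches) * minutes_per_reload


if __name__ == "__main__":
    nodes = [f"compute-0-{i}" for i in range(24)]
    commands, total_minutes = reload_plan(nodes)
    for number, batch in enumerate(commands, start=1):
        print(f"batch {number}: {len(batch)} nodes")
        for command in batch:
            print("   ", command)
    print(f"estimated total: {total_minutes} minutes")
```

Under those assumptions a 24-node cluster reloads in four batches (7, 7, 7, and 3 nodes), or roughly 40 minutes, during which the nodes being shot are unavailable for jobs.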

What? Reinstall nodes to add software?

Now, nothing stops you from doing things the old way: running rpm across all of the nodes, or doing an rsync file update. But, if you have the spare cycles, why not just shoot the nodes to ensure the greatest amount of consistency? I can see several cluster administrators rolling their eyes at the idea of reinstalling nodes when software gets installed, but think of it this way.

At best, the software you need to install has intelligent RPMs that not only install the required software, but run all of the post-installation configuration steps, so that they join the cluster and are configured to communicate with the head node. If they require any users or directories to be created, they do so. In the case of simple software, these are pretty safe assumptions. So, you just copy them to a directory, and cluster-fork rpm -ivh them to each node.

The other tactic is file synchronization, which is what I use. This is full of potential pitfalls, however. You must make sure that each file edited by the rpm is part of your file synchronization configuration. You can’t synchronize every single file, since some must be configured on a per-node basis. You must also pick a node to dedicate as your ‘golden image’, which can be relied upon to be a testing dummy as well. This node will also be used for patching.
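One way to catch the first pitfall is to compare a package’s file list against the synchronization configuration before pushing anything. The Python sketch below assumes you already have the file list in hand (for example from rpm -ql on the golden node) and sorts each path into synced, per-node, or missed. The function name, the package name mytool, and all of the paths are hypothetical.

```python
from fnmatch import fnmatch


def classify_package_files(package_files, synced_prefixes, per_node_patterns=()):
    """Sort a package's files into synced, per-node, and missed buckets."""
    report = {"synced": [], "per_node": [], "missed": []}
    prefixes = [p.rstrip("/") for p in synced_prefixes]
    for path in package_files:
        if any(fnmatch(path, pattern) for pattern in per_node_patterns):
            report["per_node"].append(path)        # must be configured on each node
        elif any(path == p or path.startswith(p + "/") for p in prefixes):
            report["synced"].append(path)          # covered by the sync configuration
        else:
            report["missed"].append(path)          # would silently skew between nodes
    return report


if __name__ == "__main__":
    # Hypothetical file list, as if copied from `rpm -ql mytool`.
    files = ["/usr/bin/mytool", "/etc/mytool/mytool.conf",
             "/etc/init.d/mytool", "/var/lib/mytool/state"]
    report = classify_package_files(files, ["/usr", "/etc/mytool"],
                                    per_node_patterns=["/etc/mytool/*.conf"])
    for bucket, paths in report.items():
        print(bucket, paths)
```

Anything that lands in the missed bucket is a file the rpm installed on the golden node that the compute nodes will never receive.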

Final Thoughts

After deploying 3 more SUSE 9.2-based clusters since the first time I evaluated Rocks, I have to say, the Rocks idea has grown on me significantly. Using Rocks versus a general distribution is a bit like picking which car to buy to bring to a race track. Some people will buy a Porsche 911 GT, while others will buy a Honda Civic and spend months souping it up to get the same performance as the Porsche.
