2009-03-20

Red Hat Enterprise Linux Cluster Suite 5.2 DHCP Problems

Well, I learned another valuable clustering lesson today: manually assign all of your IP addresses on the nodes themselves and do not rely on DHCP. I'd still suggest reserving the same addresses for the nodes on the DHCP server, but don't depend on it for anything.
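
For anyone who hasn't set a static address on RHEL before, here is a rough sketch of what the interface file might look like. The interface name and every address in it are placeholders (192.0.2.0/24 is a documentation range), not values from my cluster, so substitute your own.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch; all values are placeholders)
DEVICE=eth0
# Use a fixed address instead of asking DHCP for a lease
BOOTPROTO=static
ONBOOT=yes
# Hypothetical node address, netmask and gateway
IPADDR=192.0.2.11
NETMASK=255.255.255.0
GATEWAY=192.0.2.1
```

Restart networking on one node at a time after changing this, not on the whole cluster at once.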

Yesterday I rebooted all of the machines in the cluster simultaneously after making some extensive modifications to the iSCSI configuration on them. (I'm using cluster-ssh to manage them all at once.) All of the machines went down at exactly the same time and came up at exactly the same time. (They are identical hardware with identical software, after all.)

Today, for whatever reason, when all of the leases expired at roughly the same time, they didn't renew quickly enough. I walked out of a meeting this afternoon to find all of the cluster nodes reporting "Quorum dissolved" and every service on the cluster failed.


2009-03-11

Red Hat Enterprise Linux Cluster Suite 5.2 Relocate Problems

Gist of it all
So, I've been fighting mind-bending failover problems on our new compute cluster at work for the past week. I can summarize the solution to service failover problems in one sentence:
Don't use Red Hat's cluster service scripts, evar.

How it actually works
Apache's script (apache.sh) in RHEL 5.2 sends a TERM signal to Apache, waits N seconds (0 by default for Luci/Ricci, 20 for system-config-cluster) and then proclaims that Apache has failed to shut down if it's still running. This type of failure is fatal for the cluster service.

How it should work
The standard Apache init script (/etc/rc.d/init.d/httpd) does the right thing, more or less. It sends the parent process the TERM signal, waits 10 seconds and, if the process is still running, sends the KILL signal. That may seem harsh, but this is a web server, for crying out loud. There's no real chance of data corruption if you KILL the web server. If something goes horribly wrong and Apache needs to move to another machine, I'm quite happy with Billy Bob having to retype a form or resume his download. That affects only Billy Bob. Bringing down the service for the entire world, in comparison, seems like a very bad solution.
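
The TERM, wait, then KILL sequence described above can be sketched as a small shell function. This is a minimal sketch and not the code from the stock httpd init script; the function name stop_with_timeout and its arguments are hypothetical, and it assumes the daemon records its parent pid in a pid file.

```shell
#!/bin/bash
# Sketch: stop a daemon by pid file, TERM first, KILL after a timeout.
# stop_with_timeout is a hypothetical helper, not part of any Red Hat script.
stop_with_timeout() {
    local pidfile=$1 timeout=${2:-10} pid waited=0

    # No readable pid file: treat the service as already stopped.
    [ -r "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 0

    # Ask politely. If the process is already gone, clean up the stale file.
    kill -TERM "$pid" 2>/dev/null || { rm -f "$pidfile"; return 0; }

    # Poll once a second until the process exits or the timeout runs out.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still alive after the timeout: stop asking.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
        sleep 1
    fi

    rm -f "$pidfile"
    return 0
}
```

Something like `stop_with_timeout /var/run/httpd.pid 10` would mirror the 10 second wait; the pid file path there is only an example.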

Rant
Red Hat has spent years developing this clustering software. It's fairly good and seems reasonably well written. It's a shame they only spent 5 minutes writing the service scripts. The stock init scripts are infinitely better written but won't work in a clustered environment.

Solution
So, my advice: spend 20 minutes and write your own user script for each service. If a service is important enough to be on a cluster, you can afford to spend the time to write your own script to start and stop it. If you don't know how to write shell scripts and you're managing a cluster, then do everyone a favor and seek employment at McDonald's.

Howto
User scripts for the cluster are simply LSB-compliant init scripts. I advise you not to waste much time trying to hack the /etc/rc.d/init.d scripts to work within the cluster. You can do so, but you'll end up stripping out most of each one. You may wish to source the /etc/rc.d/init.d/functions file, though, in which case you will need to pass the pid filename to every single function you call. (Read the functions script to see which options to pass to each function. The daemon, killproc, and status functions are the most useful, for starting, stopping and checking status respectively.)
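
To make that concrete, here is a rough skeleton of such a user script for Apache. Treat it as a sketch under assumptions: the pid file and config paths are hypothetical, and the option spellings (--pidfile for daemon, -p and -d for killproc, -p for status) are how I recall them from the functions file, so check them against the copy on your own system.

```shell
#!/bin/bash
# Sketch of a cluster user script for Apache. Paths and option names are
# assumptions; verify them against /etc/rc.d/init.d/functions before use.

PROG=/usr/sbin/httpd                 # daemon binary (assumed location)
PIDFILE=/var/run/httpd-cluster.pid   # hypothetical; must match PidFile in CONF
CONF=/etc/httpd/conf/cluster.conf    # hypothetical per-service config

# Pull in daemon, killproc and status if the Red Hat functions file exists.
[ -r /etc/rc.d/init.d/functions ] && . /etc/rc.d/init.d/functions

start() {
    daemon --pidfile="$PIDFILE" "$PROG" -f "$CONF"
}

stop() {
    # -d 10: give the process up to 10 seconds after TERM (as I read killproc)
    killproc -p "$PIDFILE" -d 10 "$PROG"
}

# Named check so it does not shadow the status function sourced above.
check() {
    status -p "$PIDFILE" "$PROG"
}

dispatch() {
    case "$1" in
        start)   start ;;
        stop)    stop ;;
        status)  check ;;
        restart) stop; start ;;
        *)       echo "Usage: $0 {start|stop|status|restart}" >&2; return 2 ;;
    esac
}

# Only dispatch when an action was given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    dispatch "$1"
    exit $?
fi
```

The cluster calls the script with start, stop and status, so those three exit codes are the ones worth testing by hand before you hand the service over to it.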
