When it comes to reliability, we take nothing for granted.  Equipment is much better today than it was a decade ago, but every disk, every server, every circuit has some chance of failure at any time.

All of our hosting centers employ redundancy at every level:
  • Server disk drives are arranged in RAID arrays, so that failure of a disk won't take down the server.
  • All servers have dual power supplies, connected to different power busses, so a power supply failure or blown fuse won't stop the show.
  • All servers are connected to two routers, on different networks, with automatic failover if a switch or router fails.
  • All of our facilities have full backup power (diesel or natural gas generators, depending on the location) and can operate for weeks if necessary during an extended power outage.  In addition, all facilities operate on battery 'float', with enough runtime for the generators to start and warm up without any servers going down.
  • All of our facilities are fed via highly reliable optical fiber, and all have redundant fiber connections to multiple backbone network providers.
Even with our belt-and-suspenders system design, things can go bump in the night (or the day), so our Network Operations staff monitors all parts of our operations 24/7, watching for errors or odd changes in the charts that might signal a developing problem.

Our human operators are aided by a very thorough and redundantly hosted monitoring system that, among other things:
  • Maintains graphs of bandwidth use, latency, packet loss and errors for all circuits, and triggers alarms if any values deviate from normal parameters.
  • Maintains graphs of email throughput, the backlog of undelivered mail to remote systems having problems, and spam and virus percentages, and triggers alarms when anything seems amiss.
  • Checks for proper HTTP, SMTP, POP, IMAP, FTP and other responses from all servers 60 times each hour.  The monitoring system checks not only that the services are responding, but that the right responses are received.  Key web sites are monitored to make sure that they're returning the right text, to verify that nothing has been broken during an upload and that no database has started to malfunction.
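
To illustrate the difference between "the service is responding" and "the right response is received", here is a minimal Python sketch of a content-verifying web check.  It is not our production monitoring code; the function names, the result labels, and the 10-second timeout are assumptions chosen for the example.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def evaluate(status: int, body: str, expected_text: str) -> str:
    """Classify a response as 'ok', 'wrong_content', or 'bad_status'."""
    if status != 200:
        return "bad_status"
    # A 200 alone is not enough: the page must contain the expected text.
    if expected_text not in body:
        return "wrong_content"
    return "ok"


def check_site(url: str, expected_text: str, timeout: float = 10.0) -> str:
    """Fetch a URL and verify both the status code and the page content."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return evaluate(exc.code, "", expected_text)
    except (URLError, OSError):
        # No usable answer at all: DNS failure, refused connection, timeout.
        return "unreachable"
    return evaluate(status, body, expected_text)
```

A scheduler would call something like `check_site("https://www.example.com/", "Welcome")` once a minute (the URL and text here are placeholders).  A site whose database has failed often still returns a page, just not the right one, and that is the case the `wrong_content` result is meant to catch.
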
Our monitoring system not only displays all status information onscreen for Operations personnel; when alarms are triggered, it also sends pages, so that everybody knows about a problem as soon as it develops, even if they're busy looking at something else.
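
As a rough illustration of how an alarm can fire when a graphed value "deviates from normal parameters", the Python sketch below compares each new sample against a rolling baseline.  The class name, the window of 60 samples, the minimum of 10 samples, and the three-standard-deviation tolerance are all assumptions made for the example, not a description of the thresholds we actually use.

```python
from collections import deque
from statistics import mean, pstdev


class DeviationAlarm:
    """Flags samples that fall far outside the recent rolling baseline."""

    def __init__(self, window: int = 60, tolerance: float = 3.0,
                 min_samples: int = 10) -> None:
        self.history = deque(maxlen=window)  # most recent samples only
        self.tolerance = tolerance           # allowed deviation, in std devs
        self.min_samples = min_samples       # no alarms until baseline exists

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it should trigger an alarm."""
        alarm = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            # Floor the spread so a perfectly flat history cannot
            # produce a zero-width "normal" band.
            spread = max(pstdev(self.history), 1e-9)
            alarm = abs(value - baseline) > self.tolerance * spread
        self.history.append(value)
        return alarm
```

In this sketch, a latency series hovering between 10 and 12 ms would not alarm on a reading of 11.5 ms but would alarm on a sudden 50 ms reading.  A real system would typically hand that result to a notifier that updates the onscreen display and sends the page.
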

Sometimes servers need to be moved from one hosting center to another, or server hardware needs to be refreshed or upgraded.  We never take sites down to do this; instead, we first bring up all of the to-be-moved sites on new hardware in the new location, and only then shut down the old server, avoiding downtime.

No system, including ours, is perfect, but most of the time we're able to detect and correct developing problems before they have operational impact.  Some of our bigger web sites have had zero downtime for years at a time (some have had no downtime, ever, going back as far as a decade).