Reduce Downtime: Design Your System Around Business Availability

Posted by Charley Lingerfelt

10/30/14 2:16 PM

 

When your business depends on real-time data streams such as satellite tracking, the cost to recover and rebuild that data is high. A one-hour system outage can require several hours to restore operations and recreate lost data. In the meantime, the rest of the business starts falling behind. One of our TMW clients, with a fleet of under 200 trucks, quantified the cost of a system outage at $50,000 per hour.

 

With this in mind, we've put together a few questions to help you calculate the cost of a system outage to your business.

  • How important is the system to the business?
  • How much would downtime cost the business if the system were offline or degraded?
    • Revenue Lost
    • Staffing Cost
    • Opportunity Cost
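
As a rough illustration, the three cost categories above can be combined into a simple estimate. This is a minimal sketch, not a formal costing model: the function name, the dollar figures, and the assumption that only staffing cost continues during the recovery window are all hypothetical placeholders to replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class HourlyCosts:
    """Hypothetical hourly cost inputs, in dollars."""
    revenue_lost: float
    staffing: float
    opportunity: float


def outage_cost(costs: HourlyCosts, outage_hours: float, recovery_hours: float) -> float:
    """Estimate the total cost of one outage.

    Assumption: all three costs accrue while the system is down, and
    only staffing cost accrues while staff rebuild and re-enter data.
    """
    if outage_hours < 0 or recovery_hours < 0:
        raise ValueError("hours must be non-negative")
    during_outage = outage_hours * (costs.revenue_lost + costs.staffing + costs.opportunity)
    during_recovery = recovery_hours * costs.staffing
    return during_outage + during_recovery


# Placeholder figures for illustration only.
example = HourlyCosts(revenue_lost=10_000.0, staffing=2_000.0, opportunity=3_000.0)
estimate = outage_cost(example, outage_hours=1.0, recovery_hours=3.0)
print(f"Estimated cost of the outage: ${estimate:,.0f}")
```

Running the estimate for a few scenarios, such as a short outage at peak hours versus a long outage overnight, gives a baseline to compare against the cost of prevention.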

So what is the difference between High-Availability and Disaster Recovery?

Simply stated, High-Availability is the ability to recover quickly from a failure and reduce the impact to users. Disaster Recovery is the ability to recover a system in a timely manner when a disaster has affected the primary site, requiring user transactions to fail over to a secondary system at an alternate site.

Design goals include eliminating single points of failure where possible and providing multiple redundant systems, network paths, and components.

Basic Hardware Architecture Considerations

For starters, servers should be internally redundant and system design elements should include:

  • Multiple power supplies to ensure the server stays online if a power source or power supply fails.
  • Redundant disk drive arrays that can tolerate one or more failed drives while staying online and servicing requests. These arrays also support additional features such as snapshots, which enable point-in-time recovery, and remote replication, which enables disaster recovery.
  • Multiple network paths to ensure the system stays online if a cable or network switch goes offline.
  • Hardware "phone home" integration, configured and tested so that a failing server reports to both the IT operations staff and the vendor maintenance team, allowing timely repair of hardware failures.
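
To see why redundancy matters, consider a back-of-the-envelope availability calculation. This is a sketch under a strong assumption: component failures are independent, which is rarely fully true in practice (for example, two power supplies fed by the same circuit). The function names and the sample availability figures are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant components where any one
    working copy is enough, assuming independent failures."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group fails only if every copy fails at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every part must be working."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


# Placeholder figures: each part is assumed to be 99% available.
single_path = series_availability(0.99, 0.99, 0.99)  # power, disk, network
redundant_path = series_availability(
    parallel_availability(0.99, 2),  # two power supplies
    parallel_availability(0.99, 2),  # mirrored disks
    parallel_availability(0.99, 2),  # two network paths
)
print(f"Single components:    {single_path:.4%}")
print(f"Redundant components: {redundant_path:.4%}")
```

The point of the comparison is that a chain of single components is less available than any one of its parts, while duplicating each part pushes the combined figure back up.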

What does this look like from an x86 or VMware system perspective?

VMware adds risk protection, optimizes performance, and enables proactive maintenance.

  • VMware provides auto-recovery of application servers by restarting virtual servers on alternative physical servers in the case of a major hardware failure.
  • vMotion enables administrators to move applications from one server to another so that individual servers can be taken down for maintenance.
  • Distributed Resource Scheduler (DRS) uses vMotion to move applications from one server to another based on a set of prioritization rules, giving your mission-critical applications the best available performance.
  • Hardware integration with VMware allows failing servers to notify vCenter, which uses vMotion to move workloads off the failing hardware and further reduce the chance of an outage.
  • Site Recovery Manager (SRM) allows administrators to script the failover and migration of user transactions to a remote site. Failing over these transactions can be a complicated and time-intensive administrative task that SRM can help automate. SRM is useful both for performing a real disaster recovery failover and for testing a Disaster Recovery plan before you need it.

What is IBM doing to enhance DR on Power Systems?

IBM Power Systems (IBM i) support more than one resiliency solution. Traditionally, IBM i shops used logical replication software to mirror their servers. Today, IBM provides an integrated solution as an extension to the operating system called PowerHA SystemMirror for i.

  • PowerHA SystemMirror for i provides the ability to mirror the system at the disk level, rather than the operating system level.  This significantly reduces complexity and administrative overhead at the expense of a little more complexity during initial setup.
  • Combining PowerHA SystemMirror for i with a supported IBM SAN further simplifies the DR environment by allowing the SAN to do all the "heavy lifting" of remote mirroring. The mirroring process becomes transparent to the Power server, reducing overhead.

The bottom line: Any downtime in your IT infrastructure will impact business operations. The time of day the outage occurs, the duration of the outage, and the systems affected will determine the cost to the business and the time needed to restore normal operations. The time and investment needed to prevent this situation are typically far less than the cost of recovery.

If you have any questions about how to make your business more resilient, please give us a call at (415) 455-5770 or send us a note at info@tamgroup.com.