TrafficCluster Fault Tolerance

This technical note discusses ZXTM’s Fault Tolerance mechanism ('TrafficCluster'). It describes the concepts of clustering and traffic IP groups. It then outlines different fault-tolerance strategies, including active-active, active-standby and multiply-redundant configurations.

Overview

Zeus Extensible Traffic Manager (ZXTM) manages network services in a fault-tolerant manner.

A ZXTM cluster is a group of ZXTM machines. Together, they receive requests from the network and forward them on to the most suitable and available back-end node.

The machines in a ZXTM cluster monitor each other. If one or more machines fail for any reason, the remaining machines can be configured to take over automatically the services that the failed machines were hosting. This makes the hosted services resilient to a range of hardware and software failures.


Clustering

To take advantage of ZXTM’s fault tolerance, you must deploy two or more ZXTM machines in a cluster.

A cluster of ZXTM machines shares the same configuration. Configuration updates on one machine are automatically distributed to all other machines in the cluster. Each machine monitors the other machines in the cluster to detect failures.

Traffic IP Groups

A traffic IP group is a set of IP addresses. A traffic IP group spans some or all of the machines in a cluster.

All of the machines spanned by a traffic IP group cooperate to ensure that each traffic IP address is available. Each traffic IP address is managed by one of the machines. If one or more machines fail, the IP addresses that would have been lost are instead redistributed across the remaining machines.

Provided that at least one spanned machine is active, all of the traffic IP addresses are always available. It is normal to publish your hosted services on an IP address that is contained in a traffic IP group, to ensure that the service is always available and fault-tolerant.

IP Address Transfer

All of the machines in a ZXTM cluster monitor each other to detect failures or recoveries from failures. ZXTM uses a robust, fully deterministic algorithm to monitor machines and distribute IP addresses. Failover is a simple and reliable process because the machines never need to negotiate or share state.

Failure Detection

A ZXTM machine frequently checks its connectivity by pinging the back-end nodes and its configured gateway. You can tune this behaviour in the Traffic Manager Fault Tolerance section of the System/Global Settings page.

  • If the machine is healthy, it broadcasts signed ‘health notification’ messages at frequent intervals to a configured multicast address that all ZXTMs listen on.
  • If the machine is not healthy, it drops any traffic IP addresses and broadcasts signed messages indicating that there has been a failure.
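The note does not say how the notification messages are signed. As a minimal sketch only, assuming an HMAC-SHA256 tag over a JSON payload with a shared cluster secret (the secret, the message layout and the function names below are all hypothetical, not ZXTM's wire format), signing and verifying a notification could look like this:

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; ZXTM's real signing scheme is not described here.
CLUSTER_SECRET = b"example-cluster-secret"


def sign_message(sender, healthy, secret=CLUSTER_SECRET):
    """Serialize a health or failure notification and append an HMAC-SHA256 tag."""
    payload = json.dumps({"sender": sender, "healthy": healthy}, sort_keys=True).encode()
    tag = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"|" + tag


def verify_message(message, secret=CLUSTER_SECRET):
    """Return the decoded payload if the tag is valid, otherwise None."""
    payload, separator, tag = message.rpartition(b"|")
    if not separator:
        return None  # no tag present at all
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(payload)
```

A receiver would discard any message for which verify_message returns None, so that a machine outside the cluster cannot fake a health or failure notification.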

Each ZXTM machine listens for health notification messages from other machines in the cluster. If the machine fails to hear from another machine within a defined time period, or it receives a ‘failure’ notification from the other machine, it concludes that that machine has failed:

  • The failed machine will either have voluntarily dropped its traffic IP addresses, or completely lost connectivity (so its traffic IP addresses are unroutable).

The ZXTM machines then initiate the IP address redistribution operation.
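The listening side described above can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not ZXTM's implementation: the class name, the 5-second timeout and the injectable clock are all hypothetical.

```python
import time


class PeerMonitor:
    """Tracks when each peer was last heard from and which peers count as failed."""

    def __init__(self, peers, timeout_seconds=5.0, clock=time.monotonic):
        self.timeout = timeout_seconds  # hypothetical value, not a ZXTM default
        self.clock = clock
        now = clock()
        self.last_heard = {peer: now for peer in peers}
        self.reported_failed = set()

    def on_message(self, peer, healthy):
        """Record a notification (assumed already verified) from a known peer."""
        if peer not in self.last_heard:
            return  # ignore machines that are not part of this cluster
        if healthy:
            self.last_heard[peer] = self.clock()
            self.reported_failed.discard(peer)
        else:
            self.reported_failed.add(peer)

    def failed_peers(self):
        """Peers that have been silent too long or have announced a failure."""
        now = self.clock()
        silent = {p for p, t in self.last_heard.items() if now - t > self.timeout}
        return silent | self.reported_failed
```

Each machine would run its own monitor and feed the result of failed_peers() into its address calculation. Because every machine applies the same rule to the same messages, they reach the same view of the cluster without exchanging any further state.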

You can debug this entire process by enabling the flipper!verbose setting. This causes each ZXTM to log every multicast message sent and received, and every state change and IP transfer.

Address Redistribution

A ZXTM machine uses its knowledge of which machines are active to calculate which traffic IP addresses it should be hosting. The distribution algorithm is:

  • Fully deterministic, so ZXTM machines do not need to negotiate or pass tokens between themselves;
  • Optimized to spread the allocation of traffic IP addresses evenly across the remaining ZXTM machines.
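ZXTM's actual distribution algorithm is not given in this note. The sketch below is one simple deterministic scheme with similar properties, offered as an assumption for illustration: each traffic IP has a fixed preference order over the machines (the sorted machine list rotated by the IP's position), and it is hosted by the first active machine in that order. The function name and the example addresses are hypothetical.

```python
def assign_traffic_ips(traffic_ips, machines, active):
    """Map each traffic IP in one group to the machine that should host it.

    traffic_ips and machines are the full configured lists for the group;
    active is the set of machines currently believed to be healthy.
    """
    ordered_ips = sorted(traffic_ips)
    ordered_machines = sorted(machines)
    count = len(ordered_machines)
    assignment = {}
    for index, ip in enumerate(ordered_ips):
        # Preference order for this IP: the machine list rotated by its index.
        for step in range(count):
            candidate = ordered_machines[(index + step) % count]
            if candidate in active:
                assignment[ip] = candidate
                break
    return assignment


ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
group = ["zxtm1", "zxtm2", "zxtm3"]
print(assign_traffic_ips(ips, group, {"zxtm1", "zxtm2", "zxtm3"}))
print(assign_traffic_ips(ips, group, {"zxtm1", "zxtm2"}))
```

Every machine that agrees on the set of active machines computes the same mapping independently, so no negotiation is needed. Only the addresses of a failed machine move, and they return to their preferred machine when it recovers. This simple rotation does not try to guarantee the most even spread in every failure pattern, which the real algorithm is described as optimizing.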

All state changes are logged to each ZXTM machine’s error log file, and can be reviewed using the Admin Server. Additionally, ZXTM can be configured easily to email information about each state change to an administrator when it happens, or to run a custom command or script.

Example

Suppose you have four ZXTM traffic managers. You create a traffic IP group (Group 1) containing three IP addresses, which spans three of the traffic managers, and a second group (Group 2) that contains two IP addresses and spans two of the traffic managers.


  • When all traffic managers are functioning and healthy, each traffic manager will receive one IP address.
  • Suppose ZXTM 3 fails. The IP addresses assigned to it will be transferred to other traffic managers. No other traffic IPs are moved.
  • Now suppose ZXTM 2 also fails, so only ZXTM 1 is left in Group 1. All IPs in this group will now reside on ZXTM 1.

Fault-Tolerant Configurations

ZXTM’s traffic IP groups provide a flexible way to implement fault tolerance. This flexibility can be applied to a range of different problems.

Active-Standby

In an active-standby configuration, one ZXTM machine manages traffic, and the second sits in a ‘standby’ state, ready to take over should the first machine fail.

This can be achieved with a single traffic IP group containing a single traffic IP. This traffic IP group should span both ZXTM machines.

One ZXTM machine (active) will acquire the traffic IP, and the other (the standby) will not. The active machine will handle incoming traffic directed to that IP address.

If the active ZXTM should fail, the standby ZXTM will acquire the traffic IP address and handle further incoming traffic. When the original active ZXTM recovers, the standby will automatically relinquish the IP address.

Active-Active

In an active-active configuration, incoming traffic is split across two ZXTM machines. Both machines are active, and if one fails, the remaining machine handles all incoming traffic.

One way to achieve this configuration is to use a traffic IP group containing two externally visible traffic IP addresses. Incoming traffic can be split across the two addresses using a technique such as round-robin DNS.

When both ZXTM machines are active, each will acquire one of the traffic IP addresses. If one fails, the other will acquire both addresses.

Active-Active with Loopback (single external IP address)

The Active-Active technique described above requires that you expose two traffic IP addresses for your service, and relies on round-robin DNS to split the traffic. An alternative solution is to use a ‘loopback’ configuration.

First, create your primary virtual server that processes traffic and load-balances it across the back-end servers. Configure this virtual server so that it listens on internal IP addresses and ports on each ZXTM, rather than externally accessible ones. For example, the virtual server could listen on 10.100.1.1:80 and 10.100.1.2:80, where the IP addresses are internally visible only.

Then you need to create a second 'loopback' virtual server that listens on a single external IP address and immediately distributes traffic to the primary virtual server across the various ZXTM machines in your cluster:

  • As in the active-standby example, you should set up one traffic IP address, which will be raised by one traffic manager. Any traffic coming in to this address should be processed by the simple 'loopback' virtual server listening on that traffic IP address.
  • The loopback virtual server should immediately select a 'loopback' pool that contains the internal IP addresses of the ZXTM machines in the cluster; the loopback pool should use a simple load-balancing method such as round-robin or least connections to evenly distribute traffic across the ZXTM machines in the cluster.

The loopback virtual server will use little processing power. Ensure that all of the CPU-intensive processing is performed by the primary virtual server – tasks such as SSL decryption, rules, content compression etc.

This method splits the load of the more intensive tasks between the two traffic managers. If either traffic manager fails, the service will continue to run (perhaps with a momentary interruption to some traffic). For example, if the ZXTM that is hosting the external traffic IP address were to fail, the traffic IP would be transferred to the remaining ZXTM. The loopback pool would detect that one of its nodes is unavailable and direct all traffic to the primary virtual server running on the remaining ZXTM.
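The behaviour of the loopback pool can be sketched as a round-robin selector that skips unavailable nodes. This is a simplified model rather than ZXTM's pool implementation; the class and method names are hypothetical, and the node addresses are the internal ones used in the example above.

```python
class LoopbackPool:
    """Round-robin over the internal addresses of the primary virtual server."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.available = set(self.nodes)
        self._next = 0  # index of the node to try first on the next pick

    def mark_failed(self, node):
        self.available.discard(node)

    def mark_recovered(self, node):
        if node in self.nodes:
            self.available.add(node)

    def pick(self):
        """Return the next available node, or None if every node is down."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.available:
                return node
        return None


pool = LoopbackPool(["10.100.1.1:80", "10.100.1.2:80"])
print([pool.pick() for _ in range(4)])  # alternates between the two nodes
pool.mark_failed("10.100.1.1:80")
print([pool.pick() for _ in range(2)])  # all traffic goes to 10.100.1.2:80
```

While both nodes are up, the CPU-intensive work is shared between the two traffic managers. When one is marked as failed, every request is sent to the survivor.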

Multiply-Redundant TrafficCluster (N+M)

A multiply-redundant configuration can suffer several failures without any loss in capacity. It is a common requirement for mission-critical services, where the system should remain fault-tolerant even after one or more failures.

This configuration is commonly referred to as an N+M configuration, where N is the number of active machines, and M is the number of redundant standby machines. It can be achieved by creating a traffic IP group containing N traffic IP addresses, spanning all N+M machines.

Often, N is sized according to the capacity required for the cluster, and M is chosen to be 2. In this case, a single hardware failure does not leave the cluster vulnerable to a second hardware failure.
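The arithmetic behind this choice can be checked with a few lines of code. The sketch assumes that full capacity means at least N healthy machines, one per traffic IP address; the function name and the returned field names are hypothetical.

```python
def cluster_status(n_active, m_standby, failures):
    """Summarize an N+M cluster after a given number of machine failures."""
    surviving = max(n_active + m_standby - failures, 0)
    return {
        "surviving": surviving,
        # Full capacity: every one of the N traffic IPs can have its own machine.
        "full_capacity": surviving >= n_active,
        # Still redundant: one more failure would not reduce capacity.
        "tolerates_another_failure": surviving - 1 >= n_active,
    }


print(cluster_status(3, 2, 1))
print(cluster_status(3, 1, 1))
```

With N = 3 and M = 2, one failure leaves four machines, so the cluster is at full capacity and can still absorb a second failure. With M = 1, the same single failure leaves the cluster at full capacity but with no remaining redundancy.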

If N is greater than 1 and the organization wishes to expose only a single external IP address, this can be achieved by using the ‘loopback’ technique. Traffic on the external IP address is immediately load-balanced onto the N traffic IPs. To make the external IP address fault-tolerant, it could either reside in its own traffic IP group, or it could be one of the existing N addresses. In the second case, traffic should be ‘looped back’ onto a different port on the N traffic IP addresses, so that ZXTM can distinguish between new and ‘looped-back’ traffic.

Configuring a ZXTM Cluster

The following is a step-by-step guide to configuring a simple ZXTM cluster. We will create a cluster of three devices, two active and one standby, using two IP addresses in one traffic IP group.

Joining a cluster

The first step is for our ZXTMs to form a cluster. To do this, run the configure command on each ZXTM that needs to join the cluster. The command first discovers all available clusters, then asks which one to join. The cluster password is then required to complete the operation.

The traffic manager that was already present in this cluster then replicates its configuration to the new ZXTM. From this point on, any configuration change made on any member of the cluster is replicated to all devices.

Creating a Traffic IP Group

This is done from within the UI in the Traffic IP Groups tab of the Services menu. It is simply a matter of providing a name and an IP, then selecting the traffic managers to include in the group.

Managing Traffic IP Groups

From the same area of the UI in which you add a Traffic IP Group, you can manage existing groups, including the one we just added.

In the basic settings window, you can specify whether all the traffic IPs in the group are raised on one traffic manager or spread across multiple devices.

In the table below this, you can also select whether each traffic manager is passive (i.e. in standby). Together, these two settings control how the traffic IPs are shared across the available traffic managers.

Owen Garrett [Zeus Dev Team] 01 July 2005

Comments:

Comment from: Nick Bond [Zeus Systems Engineering]
I have added a couple of diagrams and expanded a couple of the sections to aid in understanding this important aspect of ZXTM...
30 March 2006 @ 17:13