TrafficCluster Fault Tolerance

This technical note discusses ZXTM’s fault tolerance mechanism ('TrafficCluster'). It describes the concepts of clustering and traffic IP groups. It then outlines different fault-tolerance strategies, including active-active, active-standby and multiply-redundant configurations.

Overview

Zeus Extensible Traffic Manager (ZXTM) manages network services in a fault-tolerant manner. A ZXTM cluster is a group of ZXTM machines. Together, they receive requests from the network and forward them on to the most suitable available back-end node.

The machines in a ZXTM cluster monitor each other. If one or more machines fail for any reason, the remaining machines can be configured to automatically take over the services that the failed machines were hosting. This makes the hosted services resilient to a range of hardware and software failures.

Clustering

To take advantage of ZXTM’s fault tolerance, you must deploy two or more ZXTM machines in a cluster. A cluster of ZXTM machines shares the same configuration. Configuration updates on one machine are automatically distributed to all other machines in the cluster. Each machine monitors the other machines in the cluster to detect failures.

Traffic IP Groups

A traffic IP group is a set of IP addresses. A traffic IP group spans some or all of the machines in a cluster. All of the machines spanned by a traffic IP group cooperate to ensure that each traffic IP address is available.

Each traffic IP address is managed by one of the machines. If one or more machines fail, the IP addresses that would have been lost are instead redistributed across the remaining machines. Provided that at least one spanned machine is active, all of the traffic IP addresses remain available.

It is normal to publish your hosted services on an IP address that is contained in a traffic IP group, so that the service stays available if a machine fails.
IP Address Transfer

All of the machines in a ZXTM cluster monitor each other to detect failures or recoveries from failures. ZXTM uses a robust, fully deterministic algorithm to monitor machines and distribute IP addresses. Failover is a simple and reliable process because the machines never need to negotiate or share state.

Failure Detection

A ZXTM machine frequently checks its connectivity by pinging the back-end nodes and its configured gateway. You can tune this behaviour in the Traffic Manager Fault Tolerance section of the System/Global Settings page.
Each ZXTM machine listens for health notification messages from the other machines in the cluster. If the machine fails to hear from another machine within a defined time period, or it receives a ‘failure’ notification from that machine, it concludes that the other machine has failed.
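The detection rule described above can be sketched in a few lines. This is an illustration only, not ZXTM's implementation: the `PeerMonitor` class, the machine names and the 5-second timeout are all assumptions made for the example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Hypothetical tracker for health notifications from cluster peers."""

    timeout: float  # seconds of silence before a peer is treated as failed
    last_seen: dict = field(default_factory=dict)
    reported_failed: set = field(default_factory=set)

    def on_message(self, peer: str, healthy: bool, now: float) -> None:
        """Record a health notification received from a peer at time `now`."""
        self.last_seen[peer] = now
        if healthy:
            self.reported_failed.discard(peer)
        else:
            self.reported_failed.add(peer)

    def failed_peers(self, now: float) -> set:
        """Peers that have gone silent or have announced a failure."""
        silent = {p for p, t in self.last_seen.items() if now - t > self.timeout}
        return silent | self.reported_failed


monitor = PeerMonitor(timeout=5.0)  # assumed value, not a ZXTM default
monitor.on_message("zxtm2", healthy=True, now=0.0)
monitor.on_message("zxtm3", healthy=False, now=0.0)

print(sorted(monitor.failed_peers(now=1.0)))   # only zxtm3, which reported a failure
print(sorted(monitor.failed_peers(now=10.0)))  # zxtm2 has now been silent too long
```

Both conditions from the text appear separately in `failed_peers`: silence beyond the timeout, and an explicit failure notification.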
The ZXTM machines then initiate the IP address redistribution operation. You can debug this entire process by enabling the flipper!verbose setting. This causes each ZXTM to log every multicast message sent and received, and every state change and IP transfer.

Address Redistribution

A ZXTM machine uses its knowledge of which machines are active to calculate which traffic IP addresses it should be hosting. The distribution algorithm is:
All state changes are logged to each ZXTM machine’s error log file, and can be reviewed using the Admin Server. Additionally, ZXTM can be configured to email information about each state change to an administrator when it happens, or to run a custom command or script.

Example

Suppose you have 4 ZXTM traffic managers. You create a traffic IP group containing 3 IPs, which spans 3 of the traffic managers, and a second group that contains 2 IPs and spans two of the traffic managers.
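To make the example concrete, here is a minimal sketch of one deterministic way to place the addresses. It is not ZXTM's actual distribution algorithm: the round-robin rule, the machine names `tm1` to `tm4`, the documentation-range IP addresses, and the choice of which two machines the second group spans are all assumptions made for illustration.

```python
def distribute(traffic_ips, spanned, active):
    """Assign each traffic IP to one active machine spanned by the group.

    Hypothetical rule: sort the inputs, then deal the IPs out round-robin.
    The result depends only on the inputs, so every machine that agrees on
    the active set computes the same answer without negotiating.
    """
    candidates = sorted(m for m in spanned if m in active)
    if not candidates:
        return {}  # no spanned machine is active, so the IPs are unavailable
    return {ip: candidates[i % len(candidates)]
            for i, ip in enumerate(sorted(traffic_ips))}


# Two groups as in the example; addresses and machine names are made up.
GROUPS = {
    "group1": (["192.0.2.1", "192.0.2.2", "192.0.2.3"], ["tm1", "tm2", "tm3"]),
    "group2": (["192.0.2.11", "192.0.2.12"], ["tm3", "tm4"]),
}


def cluster_view(active):
    """Compute the placement of every group for a given set of active machines."""
    return {name: distribute(ips, spanned, active)
            for name, (ips, spanned) in GROUPS.items()}


print(cluster_view({"tm1", "tm2", "tm3", "tm4"}))  # every machine is active
print(cluster_view({"tm1", "tm2", "tm4"}))         # tm3 has failed
```

With all four machines active, each address in the first group lands on a different machine. When `tm3` fails, its address in the first group moves to `tm1` and both addresses in the second group end up on `tm4`. A plain round-robin rule like this can move more addresses than strictly necessary after some failures, which a production algorithm would likely avoid.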
Fault-Tolerant Configurations

ZXTM’s traffic IP groups provide a flexible way to implement fault tolerance. This flexibility can be applied to a range of different problems.

Active-Standby

In an active-standby configuration, one ZXTM machine manages traffic, and the second sits in a ‘standby’ state, ready to take over should the first machine fail.

This can be achieved with a single traffic IP group containing a single traffic IP. This traffic IP group should span both ZXTM machines. One ZXTM machine (the active) will acquire the traffic IP, and the other (the standby) will not. The active machine will handle incoming traffic directed to that IP address.

If the active ZXTM should fail, the standby ZXTM will acquire the traffic IP address and handle further incoming traffic. When the original active ZXTM recovers, the standby will automatically relinquish the IP address.

Active-Active

In an active-active configuration, incoming traffic is split across two ZXTM machines. Both machines are active, and if one fails, the remaining machine handles all incoming traffic.

One way to achieve this configuration is to use a traffic IP group containing two externally visible traffic IP addresses. Incoming traffic can be split across the two addresses using a technique such as round-robin DNS. When both ZXTM machines are active, each will acquire one of the traffic IP addresses. If one fails, the other will acquire both addresses.

Active-Active with Loopback (single external IP address)

The active-active technique described above requires that you expose two traffic IP addresses for your service, and relies on round-robin DNS to split the traffic. An alternative solution is to use a ‘loopback’ configuration.

First, create your primary virtual server that processes traffic and load-balances it across the back-end servers. Configure this virtual server so that it listens on internal IP addresses and ports on each ZXTM, rather than externally accessible ones.
For example, the virtual server could listen on 10.100.1.1:80 and 10.100.1.2:80, where the IP addresses are internally visible only. Then create a second 'loopback' virtual server that listens on a single external IP address and immediately distributes traffic to the primary virtual server across the various ZXTM machines in your cluster.
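As a rough model of this arrangement, the loopback virtual server can be thought of as a small pool whose nodes are the internal addresses of the primary virtual server. The sketch below is illustrative only: the `LoopbackPool` class is hypothetical, and simple round-robin selection is an assumption rather than a statement of how ZXTM balances the traffic.

```python
class LoopbackPool:
    """Hypothetical pool that spreads traffic over the primary virtual servers."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.unavailable = set()
        self._next = 0  # index of the next node to try

    def mark(self, node, available):
        """Record the outcome of a health check for one node."""
        if available:
            self.unavailable.discard(node)
        else:
            self.unavailable.add(node)

    def pick(self):
        """Return the next available node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node not in self.unavailable:
                return node
        raise RuntimeError("no primary virtual server is available")


# Internal addresses taken from the example in the text.
pool = LoopbackPool(["10.100.1.1:80", "10.100.1.2:80"])
print([pool.pick() for _ in range(4)])  # alternates between the two ZXTMs

pool.mark("10.100.1.1:80", available=False)  # the first ZXTM has failed
print([pool.pick() for _ in range(3)])  # everything goes to 10.100.1.2:80
```

The external address itself is kept available separately, by the traffic IP group that hosts it.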
The loopback virtual server will use little processing power. Ensure that all of the CPU-intensive processing (tasks such as SSL decryption, rules and content compression) is performed by the primary virtual server.

This method splits the load of the more intensive tasks between the two traffic managers. If either traffic manager fails, the service will continue to run (perhaps with a momentary interruption to some traffic). For example, if the ZXTM that is hosting the external traffic IP address were to fail, the traffic IP would be transferred to the remaining ZXTM. The loopback pool would detect that one of its nodes was unavailable and direct all traffic to the primary virtual server running on the remaining ZXTM.

Multiply-Redundant TrafficCluster (N+M)

A multiply-redundant configuration can suffer several failures without any loss in capacity. It is a common requirement for mission-critical services, where the system should remain fault-tolerant even after one or more failures.

This configuration is commonly referred to as an N+M configuration, where N is the number of active machines, and M is the number of redundant standby machines. It can be achieved by creating a traffic IP group containing N traffic IP addresses, spanning all N+M machines.

Often, N is sized according to the capacity required for the cluster, and M is chosen to be 2. In this case, a single hardware failure does not leave the cluster vulnerable to a second hardware failure.

If N is greater than 1 and the organization wishes to expose only a single external IP address, this can be achieved by using the ‘loopback’ technique. Traffic on the external IP address is immediately load-balanced onto the N traffic IPs. To make the external IP address fault-tolerant, it could either reside in its own traffic IP group, or it could be one of the existing N addresses.
In the second case, traffic should be ‘looped back’ onto a different port on the N traffic IP addresses, so that ZXTM can distinguish between new and ‘looped-back’ traffic.

Configuring a ZXTM Cluster

The following is a step-by-step guide to configuring a simple ZXTM cluster. We will create a cluster consisting of three devices, two active and one standby, using two IP addresses in one traffic IP group.

Joining a Cluster

The first step is for our ZXTMs to form a cluster. To do this, run the configure command on each of the ZXTMs that needs to join the cluster. When the command runs, it first discovers all available clusters, then asks which one to join. The cluster password is then required to complete the operation. The traffic manager that was already present in the cluster then replicates its configuration across to the new ZXTM. From this point, any configuration changes made on any member of the cluster are replicated across all devices.

Creating a Traffic IP Group

This is done from within the UI, in the Traffic IP Groups tab of the Services menu. It is simply a matter of providing a name and an IP, then selecting the traffic managers to include in the group.

Managing Traffic IP Groups

From the same area of the UI, you can manage existing groups, in this case the one we just added. In the basic settings window, you can specify whether all the traffic IPs in the group are raised on one traffic manager or spread across multiple devices. In the table below this, you can also select whether a traffic manager is passive (i.e. in standby) or not. Together, these two windows tune the way the traffic IPs are shared across the available traffic managers.
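The intended behaviour of the cluster built in this walkthrough (two active machines, one passive, two traffic IPs) can be modelled with a short sketch. This is a simplified illustration under stated assumptions, not the logic ZXTM runs: the `place_ips` function, the machine names and the addresses are hypothetical, and the rule that a passive machine takes over the place of a failed active machine is inferred from the description of standby machines above.

```python
def place_ips(traffic_ips, machines, passive, healthy):
    """Hypothetical placement for a traffic IP group with passive members.

    Each non-passive machine has one place. A place whose machine has failed
    is handed to the next healthy passive machine, if there is one.
    """
    standbys = iter(sorted(m for m in machines if m in passive and m in healthy))
    hosts = []
    for machine in sorted(m for m in machines if m not in passive):
        host = machine if machine in healthy else next(standbys, None)
        if host is not None:
            hosts.append(host)
    if not hosts:
        return {}  # nothing left to raise the addresses on
    return {ip: hosts[i % len(hosts)]
            for i, ip in enumerate(sorted(traffic_ips))}


MACHINES = ["tm1", "tm2", "tm3"]    # made-up names for the three devices
PASSIVE = {"tm3"}                   # the standby device
IPS = ["192.0.2.1", "192.0.2.2"]    # documentation-range addresses

print(place_ips(IPS, MACHINES, PASSIVE, healthy={"tm1", "tm2", "tm3"}))
print(place_ips(IPS, MACHINES, PASSIVE, healthy={"tm2", "tm3"}))  # tm1 failed
print(place_ips(IPS, MACHINES, PASSIVE, healthy={"tm3"}))         # tm1, tm2 failed
```

While both active machines are healthy, the passive machine hosts nothing. It picks up one address when `tm1` fails, and both addresses when it is the only machine left.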
Owen Garrett
[Zeus Dev Team] 01 July 2005
Comments:
Comment from:
Nick Bond [Zeus Systems Engineering]
I have added a couple of diagrams and expanded a couple of the sections to aid in understanding this important aspect of ZXTM...





