Hiding Application Errors

What can you do if an isolated problem causes one or more of your application servers to fail? How can you prevent visitors to your website from seeing the error, and instead send them a valid response?

This article shows how to use TrafficScript™ to inspect responses from your application servers and retry the requests against several different machines if a failure is detected.

The Scenario

Consider the following scenario. You're running a web-based service on a cluster of four application servers, running .NET, Java, PHP, or some other application environment. An occasional error on one of the machines means that one particular application sometimes fails on that machine. It might be caused by a runaway process, a race condition when you update configuration, or failing system memory.

With ZXTM, you can check the responses coming back from your application servers. For example, application errors may be identified by a '503 Service Unavailable' or '502 Bad Gateway' status (refer to the HTTP specification for the full list of status codes).

You can then write a Response rule that retries the request a certain number of times against different servers to see if it gets a better response before sending it back to the remote user.

if( http.getResponseCode() >= 500 ) {
   if( request.getRetries() < 3 ) {
      request.avoidNode( connection.getNode() );
      log.warn( "Request " . http.getPath() . 
                " to site " . http.getHostHeader() . 
                " from " . request.getRemoteAddr() . 
                " caused error " . http.getResponseCode() .
                " on node " . connection.getNode() );
      request.retry();
   }
}

How does the rule work?

The rule does a few checks before telling ZXTM to retry the request:

1. Did an error occur?

First of all, the rule checks to see if the response code indicates that an error occurred:

if( http.getResponseCode() >= 500 ) {
   ...
}

If your service is prone to other types of error (for example, a Java backtrace appearing in the middle of a response page), you could write a TrafficScript test for those errors instead.
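As a rough sketch of such a test, the rule below also treats a response as failed when the page body contains an error marker. The marker string "java.lang." and the $marker and $failed variables are assumptions chosen for illustration; pick a string that your own application's error pages reliably contain and that normal pages do not. Note that http.getResponseBody() reads the whole response into memory, and the sketch assumes the request can still be retried once the body has been read.

```
# Sketch only: the marker string is an assumption, not part of the original rule.
$marker = "java.lang.";

$failed = 0;
if( http.getResponseCode() >= 500 ) {
   # The status code already tells us the request failed
   $failed = 1;
} else {
   # Only inspect the body when the status code looks healthy
   if( string.contains( http.getResponseBody(), $marker ) ) {
      $failed = 1;
   }
}

if( $failed && request.getRetries() < 3 ) {
   request.avoidNode( connection.getNode() );
   log.warn( "Request " . http.getPath() .
             " failed on node " . connection.getNode() );
   request.retry();
}
```

Checking the status code first means the body is only scanned for responses that would otherwise be passed straight to the user.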

2. Have we retried this request before?

Some requests may always generate an error response. We don't want to keep retrying a request in this case - we've got to stop at some point:

   if( request.getRetries() < 3 ) {
      ...
   }

request.getRetries() returns the number of times that this request has been resent to a back-end node. It's initially 0; each time you call request.retry(), it is incremented.

This code will retry a request up to 3 times, in addition to the first time that it was processed.
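To make the count concrete, the sketch below pulls the limit into a variable and traces what request.getRetries() returns on each pass through the rule for a request that fails everywhere. $maxRetries is a name introduced here for illustration; the logic is otherwise the same as in the rule above.

```
$maxRetries = 3;   # illustrative name: retries allowed after the first attempt

# Trace for a request that fails on every attempt:
#   attempt 1 (original): getRetries() == 0, 0 < 3, so retry
#   attempt 2:            getRetries() == 1, 1 < 3, so retry
#   attempt 3:            getRetries() == 2, 2 < 3, so retry
#   attempt 4:            getRetries() == 3, 3 < 3 is false, so the
#                         error response is passed back to the user
if( http.getResponseCode() >= 500 ) {
   if( request.getRetries() < $maxRetries ) {
      request.avoidNode( connection.getNode() );
      request.retry();
   }
}
```

A persistently failing request therefore reaches the back end at most 1 + 3 = 4 times.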

3. Don't use the same node again!

When you retry a request, the load-balancing decision is recalculated to select the target node. However, you will probably want to avoid the node that generated the error before, as it is likely to generate the error again.

      request.avoidNode( connection.getNode() );

connection.getNode() returns the name of the node that was last used to process the request. request.avoidNode() gives the load balancing algorithm a hint that it should avoid that node. The hint is just advisory - if there are no other available nodes in the pool, that node will be used anyway.

4. Log what we're about to do.

This rule conceals problems with the service so that the end user does not see them. If it works well, these problems may never be found!

      log.warn( "Request " . http.getPath() . 
                " to site " . http.getHostHeader() . 
                " from " . request.getRemoteAddr() . 
                " caused error " . http.getResponseCode() .
                " on node " . connection.getNode() );

It's a sensible idea to log the fact that a request caused an unexpected error so that the problem can be investigated later.

5. Retry the request

Finally, tell ZXTM to resubmit the request, in the hope that this time we'll get a better response:

      request.retry();

And that's it.

Notes

If a malicious user finds an HTTP request that always causes an error, perhaps because of an application bug, then this rule will replay the malicious request against up to 3 additional machines in your cluster. Each such request can therefore generate 4 back-end requests (the original plus 3 retries), which makes it easier to mount a DoS-style attack against your site: the attacker only needs to send 1/4 of the number of requests.

However, the rule explicitly logs that a failure occurred, and logs both the request that caused the failure and the source of the request. This information is vital when performing triage, i.e., rapid fault fixing. Once you have noticed that the problem exists, you can very quickly add a request rule to drop the bad request before it is ever processed:

if( http.getPath() == "/known/bad/request" ) connection.discard();
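A slightly fuller version of that request rule might also record each dropped request, so that you can see how often the bad request is still arriving. This is only a sketch: the path is the same placeholder as above, the $path variable is introduced for illustration, and the rule uses only functions that already appear in this article.

```
# Request rule sketch: drop a known bad request, but leave a trace in the log.
$path = http.getPath();
if( $path == "/known/bad/request" ) {
   log.warn( "Dropping known bad request " . $path .
             " from " . request.getRemoteAddr() );
   connection.discard();
}
```

Once the underlying application bug is fixed, the rule can simply be removed.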

Other techniques

The technique in this rule is similar to the one described in the 'No more 404 Not Found' article. In that article, a rule detects a '404 Not Found' error and redirects the user to an alternative location rather than retrying the request against a different node.

Owen Garrett [Zeus Dev Team] 24 May 2007