Hiding Application Errors
This article shows how to use TrafficScript™ to inspect responses from your application servers and retry a request against several different machines if a failure is detected.

The Scenario
Consider the following scenario. You're running a web-based service on a cluster of four application servers, running .NET, Java, PHP, or some other application environment. An occasional fault on one of the machines means that one particular application sometimes fails on that machine. It might be caused by a runaway process, a race condition when you update configuration, or failing system memory.
With ZXTM, you can check the responses coming back from your application servers. For example, application errors may be identified by a '503 Service Unavailable' or '502 Bad Gateway' status (refer to the HTTP spec for the full list of status codes). You can then write a Response rule that retries the request a limited number of times against different servers to see if it gets a better response before sending it back to the remote user:
# Response rule: retry server errors (5xx) on a different node
if( http.getResponseCode() >= 500 ) {
   # Give up after 3 retries
   if( request.getRetries() < 3 ) {
      # Don't send the retry to the node that just failed
      request.avoidNode( connection.getNode() );
      log.warn( "Request " . http.getPath() .
                " to site " . http.getHostHeader() .
                " from " . request.getRemoteAddr() .
                " caused error " . http.getResponseCode() .
                " on node " . connection.getNode() );
      request.retry();
   }
}
How does the rule work?
The rule does a few checks before telling ZXTM to retry the request.
1. Did an error occur?
First of all, the rule checks whether the response code indicates that an error occurred:
if( http.getResponseCode() >= 500 ) {
   ...
}
If your service is prone to other types of error (for example, Java backtraces might appear in the middle of a response page), you could write a TrafficScript test for those errors instead.
2. Have we retried this request before?
Some requests may always generate an error response. In that case we don't want to keep retrying the request forever; we have to stop at some point:
if( request.getRetries() < 3 ) {
   ...
}
This code will retry a request up to 3 times, in addition to the first time that it was processed, so a request is attempted at most four times in total.
3. Don't use the same node again!
When you retry a request, the load-balancing decision is recalculated to select the target node. However, you will probably want to avoid the node that generated the error, as it is likely to generate the same error again:
request.avoidNode( connection.getNode() );
4. Log what we're about to do.
This rule conceals problems with the service so that the end user does not see them. If it works well, these problems may never be found!
log.warn( "Request " . http.getPath() .
          " to site " . http.getHostHeader() .
          " from " . request.getRemoteAddr() .
          " caused error " . http.getResponseCode() .
          " on node " . connection.getNode() );
It's a sensible idea to log the fact that a request caused an unexpected error so that the problem can be investigated later.
5. Retry the request
Finally, tell ZXTM to resubmit the request, in the hope that this time we'll get a better response:
request.retry();
And that's it.

Notes
If a malicious user finds an HTTP request that always causes an error, perhaps because of an application bug, then this rule will replay the malicious request against up to 3 additional machines in your cluster. This makes it easier for the user to mount a DoS-style attack against your site: each request they send can turn into four requests to your application servers, so they need to send only 1/4 of the number of requests.
However, the rule explicitly logs that a failure occurred, and logs both the request that caused the failure and the source of the request. This information is vital when performing triage, i.e., rapid fault fixing. Once you have noticed that the problem exists, you can very quickly add a request rule to drop the bad request before it is ever processed:
   if( http.getPath() == "/known/bad/request" ) connection.discard();

Other techniques
The technique in this rule is similar to the one described in the 'No more 404 Not Found' article. In that article, a rule detects a '404 Not Found' error and redirects the user to an alternative location rather than retrying the request against a different node.
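
Step 1 above mentions that some failures show up inside the response page rather than in the status code. As a rough sketch of that variant, the response rule below looks for a marker string in the body and reuses the same retry logic. The marker "java.lang." and the restriction to text/html responses are assumptions made for illustration, not part of the original rule; pick a marker that matches your own application's error pages, and check the function names (http.getResponseHeader, http.getResponseBody, string.startsWith, string.contains) against the TrafficScript reference for your ZXTM version before relying on them.

```
# Sketch: retry when the response body looks like a Java backtrace.
# The marker "java.lang." and the text/html filter are assumptions.
$type = http.getResponseHeader( "Content-Type" );
if( string.startsWith( $type, "text/html" ) ) {
   # Reading the body means the whole response is read in first
   $body = http.getResponseBody();
   if( string.contains( $body, "java.lang." )
       && request.getRetries() < 3 ) {
      request.avoidNode( connection.getNode() );
      log.warn( "Request " . http.getPath() .
                " returned a backtrace on node " .
                connection.getNode() );
      request.retry();
   }
}
```

Because the body has to be read before the test can run, this variant is likely to cost more than the status-code check, so it is probably best limited to the pages where backtraces have actually been seen.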
Owen Garrett
[Zeus Dev Team] 24 May 2007

What can you do if an isolated problem causes one or more of your application servers to fail? How can you prevent visitors to your website from seeing the error, and instead send them a valid response?
