Integrating Google Search

What can you do if you need to replace one component of a complex, integrated web site? This demonstration shows how you can use ZXTM to completely outsource the search application of a website to Google, without requiring any modifications to the website content or URLs.

It showcases several ZXTM features at once: TrafficScript rules, XML processing, protocol translation, and request and response rewriting.

Overview

Complex web sites are built from a number of different components, and lots of work goes into ensuring that a single, integrated site is presented to the customer. A web site may contain several dynamic content sources, a search application, agents to query databases and other discrete applications.

Imagine a scenario where, due to a security vulnerability in one component, or because of a resourcing, acquisition or outsourcing decision, it was necessary to replace one tightly integrated component in a website.

In this worked example, we're going to replace the search application running on www.zeus.com, and outsource our searches to Google instead. The demonstration will illustrate request and response inspection and rewriting, request routing, XML processing and protocol translation.

Network diagram

We will load-balance traffic to www.zeus.com, intercepting requests to the online search system and replacing them with SOAP requests to Google's search system (restricted to site:www.zeus.com). The SOAP XML responses are transformed with XSLT and queried with XPath, then inserted into a template to give branded search results.

Before you begin...

This example assumes that you have access to a ZXTM machine, which has access to the internet. If you want to set this up quickly, you can skip down to the 'Fast deployment' section.

You'll find it useful to upload the files that the rules read (doGoogleSearch.xml, zeus.xsl and search.html) to the $ZEUSHOME/zxtm/conf/extra directory on your ZXTM machine.

You should also create a new pool, called 'google api pool', containing the one node 'api.google.com:80'.

Step 1 - Create the virtual server

Create an HTTP virtual server load-balancing traffic onto www.zeus.com, port 80.

Create virtual server

Traffic to this virtual server returns a '400 Bad Request' error because the machines running the zeus.com website host several other websites as well, and they do not recognize the Host header (which indicates which web site the client wants).

400 Bad Request

This is because of our unusual 'forward proxy' configuration.

Step 2 - Sort out the 'forward proxy' configuration

ZXTM is typically deployed as a 'reverse proxy' (i.e., right in front of the servers), and listens for traffic on behalf of the servers. In this case, we can't update the DNS for www.zeus.com to point to our test ZXTM machine, so it's necessary to rewrite requests and responses as follows:

  • Add the Set Hostheader rule as a request rule:

    # Present the host name that the back-end servers expect
    http.setHeader( "Host", "www.zeus.com" );

  • Add the Rewrite response links rule as a response rule:

    # Only HTML responses contain links that need rewriting
    if( http.getResponseHeader( "Content-Type" ) != "text/html" ) break;
    $response = http.getResponseBody();
    # Turn absolute links to www.zeus.com into site-relative links
    $response = string.regexsub( $response, "\"http://www.zeus.com/", "\"/", "g" );
    http.setResponseBody( $response );
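
To see what the Rewrite response links rule does to a page, the substitution can be tried offline. The following is a minimal Python sketch, not TrafficScript; the function name rewrite_links and the sample HTML are assumptions made for illustration. It uses the same pattern as the rule, except that the dots are escaped so that they match only literal dots.

```python
import re

# An opening double quote followed by the absolute site prefix.
ABSOLUTE_PREFIX = re.compile(r'"http://www\.zeus\.com/')


def rewrite_links(html: str) -> str:
    """Turn absolute links to www.zeus.com into site-relative links."""
    return ABSOLUTE_PREFIX.sub('"/', html)


sample = (
    '<a href="http://www.zeus.com/products/">Products</a> '
    '<a href="http://example.org/">Other</a>'
)
print(rewrite_links(sample))
```

Links to other hosts are left alone, so only links that would bypass the test ZXTM machine are changed.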

You can now browse the site through ZXTM. ZXTM will load-balance requests across the two back-end servers, and you can even use the search application through ZXTM:

searches on www.zeus.com

Step 3 - Send searches to Google

Now, add the Intercept search request rule as a request rule.

if( http.getPath() != "/htdig-bin/htsearch" ) break;
$search = http.getFormParam( "words" );
$search = $search . " site:www.zeus.com";
# now construct the google search document
$doGoogleSearch = resource.get( "doGoogleSearch.xml" );
$doGoogleSearch = string.regexsub( $doGoogleSearch, "!!SEARCH!!", $search );
http.setBody( $doGoogleSearch );
http.setPath( "/search/beta2" );
http.setHeader( "Host", "api.google.com" );
http.setHeader( "Content-Type", "text/xml" );
pool.use( "google api pool" );

This rule:

  1. Intercepts requests to /htdig-bin/htsearch;
  2. Extracts the query from the HTTP POST body;
  3. Appends 'site:www.zeus.com' to the query;
  4. Reads a template doGoogleSearch SOAP request from disk and substitutes in the query;
  5. Modifies the request parameters to construct an HTTP-based SOAP request;
  6. Selects the google api pool as the destination rather than the www.zeus.com servers.
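
The template substitution in step 4 can be tried outside ZXTM. This is a minimal Python sketch, not the rule itself: the TEMPLATE string is a hypothetical stand-in for doGoogleSearch.xml (its real contents are not shown in this article), and build_soap_body is an assumed helper name. The sketch also XML-escapes the query before substituting it, which the TrafficScript rule as written does not do; without escaping, a query containing '<' or '&' would give a malformed document.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for the doGoogleSearch.xml template.
TEMPLATE = "<doGoogleSearch><q>!!SEARCH!!</q></doGoogleSearch>"


def build_soap_body(words: str, site: str = "www.zeus.com") -> str:
    """Restrict the query to one site and substitute it into the template."""
    query = f"{words} site:{site}"
    # Escape &, < and > so that the query cannot break the XML document.
    return TEMPLATE.replace("!!SEARCH!!", escape(query))


print(build_soap_body("load balancing & failover"))
```

A plain string replacement is used rather than a regular expression substitution, so characters in the query are never interpreted as replacement syntax.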

Try a search - you'll see the XML response from the Google search servers in your browser.

Response from Google's SOAP API service

Step 4 - Reformat the XML response

We've converted an HTTP request to a SOAP request; we now need to convert the response back again.

Add the Format Google response rule as a response rule, and move it so that it runs before the Rewrite response links rule.

if( connection.getPool() != "google api pool" ) break;
$response = http.getResponseBody();
$xsl = resource.get( "zeus.xsl" );
$results = string.htmldecode( xml.xslt.transform( $response, $xsl ) );
$searchterm = xml.xpath.matchNodeSet( $response, "", "//searchQuery/text()" );
# strip site:www.zeus.com
$searchterm = string.regexsub( $searchterm, "site:www.zeus.com", "" );
$template = resource.get( "search.html" );
$html = string.regexsub( $template, "!!RESULTS!!", $results );
$html = string.regexsub( $html, "!!SEARCHTERM!!", $searchterm, "g" );
http.setResponseHeader( "Content-Type", "text/html" );
http.setResponseBody( $html );

This rule:

  1. Checks that it's processing a response from the Google servers;
  2. Reads an XSLT file from disk and runs the transform against the response to get an HTML version of the search response;
  3. Runs an XPath query against the response to get the stored search query;
  4. Reads an HTML template file off disk and substitutes in the search query and response;
  5. Returns the result to the browser as an HTML response.
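
Steps 3 and 4 can be sketched offline as follows. This is a Python illustration under stated assumptions, not the rule itself: RESPONSE is a heavily simplified, hypothetical stand-in for Google's SOAP response, PAGE is a hypothetical stand-in for search.html, and render is an assumed helper name. The XSLT step is left out (its output is passed in as results_html), because Python's standard library has no XSLT processor.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified response; a real SOAP envelope has more structure.
RESPONSE = "<return><searchQuery>traffic manager site:www.zeus.com</searchQuery></return>"
# Hypothetical stand-in for the search.html template.
PAGE = "<h1>Results for !!SEARCHTERM!!</h1><div>!!RESULTS!!</div>"


def render(response_xml: str, results_html: str) -> str:
    """Fill the page template with the search term and the formatted results."""
    root = ET.fromstring(response_xml)
    # Counterpart of the XPath query //searchQuery/text()
    term = root.findtext(".//searchQuery", default="")
    # Remove the site restriction that the request rule appended.
    term = term.replace("site:www.zeus.com", "").strip()
    page = PAGE.replace("!!RESULTS!!", results_html)
    # The term is user input echoed into HTML, so encode it first.
    return page.replace("!!SEARCHTERM!!", html.escape(term))


print(render(RESPONSE, "<ul><li>ZXTM overview</li></ul>"))
```

Encoding the search term before it goes into the page is a choice made in this sketch; whether the real template needs it depends on where !!SEARCHTERM!! appears in search.html.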

Reload the search - you'll see a correctly formatted HTML response, in the style of www.zeus.com:

Reformatted response from Google's SOAP API service

Summary

In this example, we've illustrated a great many capabilities of ZXTM:

  • Request inspection - detect and act on individual requests:
    if( http.getPath() == "/htdig-bin/htsearch" ) {
    # ...
    }
  • Request routing - send a request to a specific pool of server nodes:
    pool.use( "google api pool" );
  • Request rewriting and protocol translation - in this example, we transformed an HTTP request into a SOAP request.
  • Response inspection (honoring HTTP keepalives, decoding chunk transfers, etc.):
    $response = http.getResponseBody();
  • Using external resources (files on disk):
    $xsl = resource.get( "zeus.xsl" );
    $template = resource.get( "search.html" );
  • XML operations: XPath searches and XSLT translations:
    $results = xml.xslt.transform( $response, $xsl );
    $searchterm = xml.xpath.matchNodeSet( $response, "", "//searchQuery/text()" );

All made possible by the power and flexibility of ZXTM's TrafficScript language.

Fast deployment

If you want to run through this demonstration quickly, download this tar file and issue the following command:

# tar -C $ZEUSHOME/zxtm/conf -xvf doGoogleSearch.tar

This will create a virtual server listening on port 8000, preconfigured with the necessary rules. Verify that the virtual server is running in your Admin interface, then try browsing via port 8000 (for example, http://localhost:8000/).

Use the Admin interface to disable the 'Intercept search request' and 'Format Google response' rules, and observe the effects on the search system.

Owen Garrett [Zeus Dev Team] 01 July 2005

Comments:

This public messageboard is not a forum for technical support. To report technical support problems, please contact our dedicated Support team using the instructions at the bottom of this page.

Comment from: Ben Pinnick [Visitor]
Is it possible to insert the transformed content (e.g. the Google results) into a dynamic page returned from one of the "normal" application/web servers in the cluster the ZXTM is managing? I would like to be able to insert branded, XML-transformed content from a distributed system into the middle of my page as shown in the example, but with the page itself being dynamic, with placeholders to show where the remote content should go. Many thanks
Permalink 24 November 2005 @ 14:54
Comment from: Owen Garrett [Zeus Dev Team]
Yes - ZXTM gives you everything you need to do something like this! For example, get your webservers to return an html page with some sort of 'insert here' placeholder tag where you want the additional content to go. Then, use a TrafficScript response rule to process this response page:

# We only want to process html responses
if( http.getResponseHeader( "Content-Type" ) != "text/html" ) break;

$body = http.getResponseBody();

if( ! string.contains( $body, "<!--INSERT HERE-->" ) ) break;

Note that if we call 'break', this just breaks out of the current rule. ZXTM will then send the existing response back to the client (or will process the next response rule).

Let's suppose that you can retrieve your additional XML content using an HTTP GET request to http://10.100.1.1/content.aspx, and that you need to provide the original cookie data in a Cookie header:

# Get the original request cookie
$cookie = http.getHeader( "Cookie" );

# Send the special request to get the XML content
$xml = http.request.get( "http://10.100.1.1/content.aspx",
   "Cookie: ".$cookie );

# Check that the remote server sent a '200 OK' response
if( $1 != 200 ) break;

Do whatever transformation is appropriate to the XML data, then insert the new text into the original response body:

# $xsl is the XSLT stylesheet
$replacement = xml.xslt.transform( $xml, $xsl );

$body = string.regexsub( $body, "<!--INSERT HERE-->", $replacement );
http.setResponseBody( $body );

A couple of things to think about:
  1. With a little more work, you could cache the $xml response for a couple of seconds, or round-robin a few cached versions - use the data.set() and data.get() functions. This would reduce the load on the '/content.aspx' server and would improve the response times for the main page.
  2. You could use ZXTM to load-balance the '/content.aspx' requests by creating a new virtual server and directing the http.request.get() calls to the virtual server.
  3. The main content page could include additional tags in the placeholder, which you could extract and send to the XML server. For example, it could contain a URL or query string.
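
The caching idea in point 1 can be sketched as follows. This is a generic Python illustration of a short-lived cache, not the behaviour of data.set() and data.get(); the class name ShortLivedCache, the fetch_content helper and the two-second lifetime are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class ShortLivedCache:
    """Keep fetched content for a few seconds to spare the content server."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, fetch: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: reuse the cached copy
        value = fetch()      # missing or stale: fetch again and remember it
        self._entries[key] = (now, value)
        return value


calls = []


def fetch_content() -> str:
    """Hypothetical stand-in for the request to /content.aspx."""
    calls.append(1)
    return "<content/>"


cache = ShortLivedCache(ttl_seconds=2.0)
cache.get("/content.aspx", fetch_content)
cache.get("/content.aspx", fetch_content)  # served from the cache
print(len(calls))
```

If the content depends on the visitor's cookie, the cache key would need to include it; otherwise one visitor's content could be served to another.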
Ben - if you do manage to build something like this, please let me know (knowledgehub@zeus.com) as I'd be very happy to showcase it on our KnowledgeHub.
Permalink 28 November 2005 @ 10:12
Comment from: Mike [Visitor] · http://www.cheaperholidays.com/
Hello. We need to set up rewrite rules in .htaccess so that http://www.cheaperholidays.com/holiday_guides.php?id=9 is made search engine friendly. Can you advise how to convert the Apache code below to the Zeus equivalent?

Options +FollowSymLinks
RewriteEngine on
RewriteRule holiday_guides-id-(.*)\.htm$ holiday_guides.php?id=$1

Thank you
Permalink 11 February 2006 @ 14:44
Comment from: Owen Garrett [Zeus Dev Team]

Hi Mike,

The following TrafficScript rule will do the trick:

$url = http.getPath();
if( string.regexMatch( $url, "^/holiday_guides-id-(.*)\\.htm$" ) ) {
   $newurl = "/holiday_guides.php?id=".string.escape( $1 );
   http.setPath( $newurl );
}

Here, we use a regular expression match to test if the incoming request looks like '/holiday_guides-id-foo.htm'. If it does, we extract the 'foo' bit into the $1 variable - this is done as a side effect of the regular expression match.

We then URL-escape the foo bit and set the URL to the new value (with 'id=foo').

So, external links (seen by a search engine) look like '/holiday_guides-id-foo.htm', but internal links (seen by your web application) look like '/holiday_guides.php?id=foo'.
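
The mapping can be checked offline before deploying the rule. This is a minimal Python sketch of the same regular expression, not TrafficScript; the function name to_internal is an assumption, and urllib.parse.quote stands in for string.escape() (the two may not escape exactly the same set of characters).

```python
import re
from urllib.parse import quote

# Same pattern as the TrafficScript rule above.
PRETTY = re.compile(r"^/holiday_guides-id-(.*)\.htm$")


def to_internal(path: str) -> str:
    """Map a search-engine-friendly path to the internal PHP URL."""
    match = PRETTY.match(path)
    if match is None:
        return path  # not a guide page: leave the path unchanged
    return "/holiday_guides.php?id=" + quote(match.group(1), safe="")


print(to_internal("/holiday_guides-id-9.htm"))
print(to_internal("/index.html"))
```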

Note - if you're using Zeus Web Server (not ZXTM), then you'll want to look at ZWS's Request Rewriting language. Head over to the ZWS support site and take a look at the Request Rewriting examples.

Permalink 14 February 2006 @ 11:25
Comment from: Owen Garrett [Zeus Dev Team]

We've just updated www.zeus.com and replaced the search system (with one rather similar to the one in this article), so this demo does not work so well any more - sorry!

Hang about a bit and we'll see if we can sort out a replacement target site...

Permalink 22 March 2006 @ 13:54