Merging RSS feeds using Java Extensions

One of ZXTM's most powerful features is the ability to run Java on your traffic manager, allowing you to use a wide variety of existing libraries. For example, using Java's XML APIs, you can manipulate data on the fly more intelligently than with TrafficScript alone. As a simple demonstration, this article includes a code walkthrough to fetch RSS feeds from several locations and produce one merged, sorted feed, which is more convenient to subscribe to and can be manipulated in other ways at the same time.

Why use ZXTM for this?

The Servlet API lets you write Java code for this sort of task, but setting up and maintaining a Java application server can be a pain, especially if you have to set up a new host and OS for it. In many situations, this is overkill. Fortunately, ZXTM includes a Java application server, which is a good place to develop this sort of functionality (you can attach a remote debugger to your servlet), deploy it quickly, and manage it side by side with your other services. For more information on ZXTM's Java capabilities, see the documentation.

Anatomy of an RSS feed

Before we walk through the source, let's take a look at the structure of an RSS feed. We're only considering version 2.0 here, to keep the code simple. Wikipedia has a complete example feed, but the important elements for our purposes are as follows:

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>My RSS Feed</title>
    ...
    <item>
      <title>First item</title>
      <link>http://www.example.com/item1</link>
      <pubDate>Tue, 09 Dec 2008 17:15:06 +0000</pubDate>
      ...
    </item>
    <item>
      ...
    </item>
    ...
  </channel>
</rss>

We're going to read in several XML documents like this one, and produce a single, similar document with all of the item nodes.

See the servlet in action

There are several different libraries for handling XML in Java. We're using JAXP, the Java API for XML Processing, which is included in the JDK on ZXTM appliances, so this example will work out of the box. To see it in action, download MergeFeeds.class and add it to your ZXTM (upload it under Catalogs/Java, then add the resulting rule to a virtual server as a request rule). To try it on different feeds, find the extension under Catalogs/Java and put a space-separated list of RSS 2.0 feed URLs in a parameter called feeds. You can also set a title parameter, and a dateformat parameter if your feeds use a different date format.

Code walkthrough

The code below is slightly abridged; you can download the full source. We begin with the usual servlet skeleton and a couple of factories we'll use later, one for building DOMs, the other for transforming them back to XML:

public class MergeFeeds extends HttpServlet
{
    static final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    static final TransformerFactory tf = TransformerFactory.newInstance();

    public void doGet( HttpServletRequest req, HttpServletResponse res )
        throws ServletException, IOException
    {

The first thing we need to do is look at the configuration we mentioned earlier. You can retrieve parameters set in the ZXTM UI using getInitParameter(), which will return either a String or null.

        // Default feeds, used when no "feeds" parameter is configured
        String[] urls = {
            "http://knowledgehub.zeus.com/xmlsrv/rss2.php",
            "http://knowledgehub.zeus.com/xmlsrv/rss2.comments.php"
        };
        String urlList = getInitParameter( "feeds" );
        if( urlList != null ) urls = urlList.split(" ");

We handle the other parameters similarly, and then set our output content type. Many sites still serve RSS as text/html, which might be accepted by most readers, but is obviously incorrect.

        res.setContentType( "application/rss+xml" );
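
The title and dateformat parameters can be handled with the same null check. The following is only a sketch of one way to do it, not the code from the full source: the class and helper names, the fallback title and the default pattern are all assumptions made for illustration. The pattern is chosen to match dates such as the pubDate in the example feed above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class FeedParams {
    // Assumed default: matches dates like "Tue, 09 Dec 2008 17:15:06 +0000"
    static final String DEFAULT_DATE_PATTERN = "EEE, dd MMM yyyy HH:mm:ss Z";

    // Returns the configured value, or the fallback if the parameter is unset
    static String orDefault(String configured, String fallback) {
        return configured != null ? configured : fallback;
    }

    // SimpleDateFormat is not thread-safe, so build a fresh one per request
    static SimpleDateFormat dateFormat(String configuredPattern) {
        return new SimpleDateFormat(
            orDefault(configuredPattern, DEFAULT_DATE_PATTERN), Locale.ENGLISH);
    }

    public static void main(String[] args) throws ParseException {
        // In the servlet these values would come from getInitParameter();
        // null stands in for "parameter not set". The fallback title is hypothetical.
        String title = orDefault(null, "Merged feed");
        SimpleDateFormat sdf = dateFormat(null);
        Date d = sdf.parse("Tue, 09 Dec 2008 17:15:06 +0000");
        System.out.println(title + ": " + d.getTime());
    }
}
```

Fixing the locale to English keeps day and month names parsing correctly even if the server's default locale is something else.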

Next we create our output document, the channel element (but not the root element yet) and a TreeMap, which will keep the entries in order. The Java libraries already know how to compare Date objects and can trivially be told to reverse the comparison. Note that an element like <title>My Feed</title>, also created here, is really two nodes: the title node and a separate text node inside it.

        DocumentBuilder db;
        try {
            synchronized( dbf ) {
                db = dbf.newDocumentBuilder();
            }
        } catch( ParserConfigurationException e ) { throw new ServletException(e); }
        Document xml = db.newDocument();
        Element channel = xml.createElement( "channel" );
        if( title != null ) {
            Element titleNode = xml.createElement( "title" );
            channel.appendChild( titleNode );
            titleNode.appendChild( xml.createTextNode( title ) );
        }
        // Store our items in reverse date order
        TreeMap<Date,Node> items = new TreeMap<Date,Node>( Collections.reverseOrder() );
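
One thing to watch with a map keyed on Date: two items with exactly the same timestamp share a key, so a later put() replaces the earlier entry. The sketch below shows the reverse ordering and one possible way to keep such items, by storing a list per date. It uses strings in place of DOM nodes, and the class and method names are illustrative rather than part of MergeFeeds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;
import java.util.TreeMap;

class NewestFirst {
    // Newest date first; each date maps to every item published at that instant
    private final TreeMap<Date, List<String>> byDate =
        new TreeMap<Date, List<String>>(Collections.<Date>reverseOrder());

    void add(Date date, String item) {
        List<String> sameInstant = byDate.get(date);
        if (sameInstant == null) {
            sameInstant = new ArrayList<String>();
            byDate.put(date, sameInstant);
        }
        sameInstant.add(item);
    }

    // Flatten to a single list, newest first; ties keep insertion order
    List<String> inOrder() {
        List<String> all = new ArrayList<String>();
        for (List<String> group : byDate.values()) all.addAll(group);
        return all;
    }

    public static void main(String[] args) {
        NewestFirst items = new NewestFirst();
        items.add(new Date(1000L), "older");
        items.add(new Date(2000L), "newer");
        items.add(new Date(1000L), "older, same timestamp");
        System.out.println(items.inOrder());
    }
}
```

Whether this matters depends on your feeds: timestamps with one-second resolution make collisions between busy feeds quite plausible.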

Fetching an XML document and building a DOM from it is relatively easy using the DocumentBuilder we obtained from our factory earlier: its parse() method accepts a URL directly. We just need to handle some exceptions:

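        for( String url : urls ) {
            Document d;
            // Fetch and parse a feed. An IOException (for example, a failed
            // fetch) is not caught here and propagates out of doGet().
            try { d = db.parse( url ); }
            catch( SAXException e ) { continue; } // Malformed XML: just skip this feed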

We now have the entire structure of a feed in d. As well as pulling out all the items, we're going to use a slight hack here to get all the correct attributes on the root element, which will mostly be XML namespaces, such as xmlns:dc="http://purl.org/dc/elements/1.1/". We'll simply copy the root element and its attributes (but not its children) from the first feed we process. You could easily construct the root element manually and use setAttribute() if you prefer. We then connect that to our channel element from earlier.

            // Copy the root element from the first feed, for xmlns attributes
            if( xml.getFirstChild() == null ) {
                Node rss = xml.importNode( d.getElementsByTagName("rss").item(0), false );
                xml.appendChild( rss );
                rss.appendChild( channel );
            }
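
If you would rather not depend on the first feed's root element, the manual approach mentioned above might look something like the following sketch. The namespace declared here is just the dc example from the text; which namespaces your feeds actually need is something you would have to check. The class and method names are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ManualRoot {
    // Builds <rss version="2.0" xmlns:dc="..."><channel/></rss> by hand
    static Document newFeedDocument() throws ParserConfigurationException {
        Document xml = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element rss = xml.createElement("rss");
        rss.setAttribute("version", "2.0");
        // Example namespace only; declare whichever prefixes your items use
        rss.setAttribute("xmlns:dc", "http://purl.org/dc/elements/1.1/");
        xml.appendChild(rss);
        rss.appendChild(xml.createElement("channel"));
        return xml;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Element root = newFeedDocument().getDocumentElement();
        System.out.println(root.getNodeName() + " " + root.getAttribute("version"));
    }
}
```

The trade-off is that any prefix used by an imported item but not declared here would leave the output document with an undeclared namespace, which is what copying the root element from a real feed avoids for that feed.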

Now we just need to pull out the item elements, parse their dates using a SimpleDateFormat (sdf, set up along with the other parameters) and put them into our sorted map. We use importNode() again to import the nodes into our document, as we did with the root element, but this time we copy the children too.

            // For each item in the feed...
            NodeList feedItems = d.getElementsByTagName( "item" );
            for( int i = 0; i < feedItems.getLength(); i++ ) {
                // Get the date
                NodeList nl = feedItems.item(i).getChildNodes();
                Date date = new Date(); // default to now if there is no usable pubDate
                for( int j = 0; j < nl.getLength(); j++ ) {
                    if( ! nl.item(j).getNodeName().equalsIgnoreCase( "pubDate" ) ) continue;
                    // getTextContent() is safe on an empty element, unlike getFirstChild()
                    try { date = sdf.parse( nl.item(j).getTextContent().trim() ); }
                    catch( ParseException ignored ) {} // use current time
                }
                // Store the item (in reverse date order). Note that an item with
                // an identical date replaces any earlier item stored under it.
                items.put( date, xml.importNode( feedItems.item(i), true ) );
            }
        }

Finally, we just check that we have a valid document and transform it back into XML.

        if( xml.getFirstChild() == null ) throw new ServletException( "No valid feeds!" );
        // Append all the items (sorted), and output the resulting document
        for( Node n : items.values() ) channel.appendChild( n );
        PrintWriter out = res.getWriter();
        try {
            tf.newTransformer().transform( new DOMSource( xml ), new StreamResult( out ) );
            out.flush();
        } catch( TransformerConfigurationException e ) { throw new ServletException(e); }
        catch( TransformerException e ) {} // Probably the client went away
    }
}
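
To experiment with the parse, import and transform steps without deploying anything, the same JAXP calls work on in-memory strings. The sketch below is a cut-down stand-in for the servlet, not the servlet itself: it skips date sorting and configuration, builds the root element by hand, and the class name and sample feeds are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MergeOffline {
    // Parses each feed string and copies its <item> elements into one channel.
    // Items are appended in feed order; sorting by date is left out for brevity.
    static String merge(String... feeds) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document out = db.newDocument();
        Element rss = out.createElement("rss");
        rss.setAttribute("version", "2.0");
        out.appendChild(rss);
        Element channel = out.createElement("channel");
        rss.appendChild(channel);

        for (String feed : feeds) {
            Document d = db.parse(new InputSource(new StringReader(feed)));
            NodeList items = d.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                // Deep copy into the output document, as in the servlet
                channel.appendChild(out.importNode(items.item(i), true));
            }
        }

        // Serialize to a String instead of the servlet response writer
        StringWriter w = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
            .transform(new DOMSource(out), new StreamResult(w));
        return w.toString();
    }

    public static void main(String[] args) throws Exception {
        String a = "<rss version=\"2.0\"><channel><item><title>A1</title></item></channel></rss>";
        String b = "<rss version=\"2.0\"><channel><item><title>B1</title></item></channel></rss>";
        System.out.println(merge(a, b));
    }
}
```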

Exercises for the reader

Depending on the nature of your feeds, you might want to include support for:

  • Atom
  • older RSS formats
  • other date formats (some sites use non-RFC822 formats)
  • duplicate removal using the guid or the link address (perhaps several feeds post the same link and you only want to see it once)
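
As a starting point for the last of these, here is a rough sketch of duplicate removal that keys each item on its guid and falls back to its link. It is one possible approach under simple assumptions (exact string matches, first occurrence wins), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class Dedupe {
    // Text of the first direct child called `name`, or null if there is none
    static String childText(Element item, String name) {
        NodeList kids = item.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeName().equals(name)) return n.getTextContent().trim();
        }
        return null;
    }

    // Prefer <guid>, fall back to <link>; null means the item has neither
    static String key(Element item) {
        String guid = childText(item, "guid");
        return guid != null ? guid : childText(item, "link");
    }

    // Keeps the first item seen for each key; items without a key are all kept
    static List<Element> unique(List<Element> items) {
        Set<String> seen = new HashSet<String>();
        List<Element> kept = new ArrayList<Element>();
        for (Element item : items) {
            String k = key(item);
            if (k == null || seen.add(k)) kept.add(item);
        }
        return kept;
    }

    // Test helper: builds <item><link>...</link></item>
    static Element item(Document xml, String link) {
        Element item = xml.createElement("item");
        Element l = xml.createElement("link");
        l.appendChild(xml.createTextNode(link));
        item.appendChild(l);
        return item;
    }

    public static void main(String[] args) throws Exception {
        Document xml = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        List<Element> items = new ArrayList<Element>();
        items.add(item(xml, "http://www.example.com/item1"));
        items.add(item(xml, "http://www.example.com/item2"));
        items.add(item(xml, "http://www.example.com/item1")); // duplicate link
        System.out.println(unique(items).size() + " unique items");
    }
}
```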

Java Extensions don't just allow you to do arbitrary XML processing; you can also choose which vendor's XML implementation you want to use. As Michael noted in his article on XML validation, you can install the Intel XML Suite on your ZXTM, and since it provides the same JAXP API, the code we've used here will start using it, with no source changes or recompilation required.
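
This works because JAXP selects its factory implementations at run time rather than at compile time. If you want to confirm which implementation is actually in use, a small check along these lines should do; the class name is made up, and the names it prints depend entirely on your JDK and on what you have installed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;

class WhichJaxp {
    // Class name of the DOM parser factory that JAXP selected
    static String parserImpl() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the transformer factory that JAXP selected
    static String transformerImpl() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DocumentBuilderFactory: " + parserImpl());
        System.out.println("TransformerFactory:     " + transformerImpl());
    }
}
```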

Chris Boyle [Zeus Dev Team] 17 December 2008