Boost performance with LibXML

We’re working on a project at the moment that has a lot of XML flying about. For example, we wrap data coming out of Amazon SimpleDB in XML and then consume that data in the rest of the program.

I’ve been using XML::XPath to extract the data from the XML, so I can write this sort of thing:

use XML::XPath;

my $xp = XML::XPath->new( xml => $xml );
foreach my $walk ($xp->findnodes('/walks/walk'))
{
    # Read the itemname attribute of each <walk> element
    my $walkid = $walk->findvalue('./@itemname');
    # etc ...
}


It’s easy to write, easy to read and works well. However, recently I’ve begun noticing that the project has become a bit, well, sluggish. I was kind of hoping that XML::XPath would be using the C (and hence very fast) LibXML under the hood, since I had recently installed that parser on the system, but the lack of speed led me to think this might not be the case.

Reading around, I discovered that XPath support is already built in to LibXML, so I was able to rewrite my code as follows:

use XML::LibXML;

# Parse the XML once, then run XPath queries against the parsed document
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);
my $xp = XML::LibXML::XPathContext->new($doc->documentElement());

foreach my $walk ($xp->findnodes('/walks/walk'))
{
    my $walkid = $walk->findvalue('./@itemname');
    # etc ...
}


Note how it is just the setup that has changed; the actual data processing stays the same (in most cases).

This makes things *MUCH* speedier as you would expect. My perception is perhaps as much as 10 times faster for large XML files, but I haven’t done any quantitative analysis.
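For anyone who wants numbers rather than my impression, a rough timing sketch along these lines should do. The sample document, the itemname values and the helper names (count_with_libxml, count_with_xpath, elapsed) are all made up for illustration, and XML::XPath is loaded optionally so the script still runs where only XML::LibXML is installed. It times a single pass only, so treat the figures as indicative at best.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use XML::LibXML;

# Hypothetical sample data: 1000 <walk> elements with an itemname attribute
my $xml = '<walks>'
        . join('', map { qq{<walk itemname="w$_"/>} } 1 .. 1000)
        . '</walks>';

# XML::XPath is optional here, so the script still runs without it
my $have_xpath = eval { require XML::XPath; 1 };

# Parse and walk the document with XML::LibXML, counting non-empty itemnames
sub count_with_libxml {
    my ($xml) = @_;
    my $doc = XML::LibXML->new()->parse_string($xml);
    my $xp  = XML::LibXML::XPathContext->new($doc->documentElement());
    my $n   = 0;
    foreach my $walk ($xp->findnodes('/walks/walk')) {
        $n++ if length $walk->findvalue('./@itemname');
    }
    return $n;
}

# Same processing loop, but with the XML::XPath setup
sub count_with_xpath {
    my ($xml) = @_;
    my $xp = XML::XPath->new( xml => $xml );
    my $n  = 0;
    foreach my $walk ($xp->findnodes('/walks/walk')) {
        $n++ if length $walk->findvalue('./@itemname');
    }
    return $n;
}

# Wall-clock seconds taken by one call of the given code reference
sub elapsed {
    my ($code) = @_;
    my $start = time;
    $code->();
    return time - $start;
}

my %seconds = ( libxml => elapsed(sub { count_with_libxml($xml) }) );
$seconds{xpath} = elapsed(sub { count_with_xpath($xml) }) if $have_xpath;

printf "%-8s %.3fs\n", $_, $seconds{$_} for sort keys %seconds;
```

Bumping the element count up, or feeding in a real file, gives a better feel for how the two behave on large documents.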

BEWARE though: it’s not a completely transparent drop-in, as LibXML has some quirks. For example, if there is a namespace declared in the XML file, then you will get no data returned unless you register that namespace with the XPath context.

For example, when writing an Atom parser, note the registerNs line:

use XML::LibXML;

$PARSER = XML::LibXML->new();
$DOC = $PARSER->parse_string($xml);
$XP = XML::LibXML::XPathContext->new($DOC->documentElement());
# Bind the xatom prefix to the Atom namespace URI for use in XPath expressions
$XP->registerNs( xatom => "http://www.w3.org/2005/Atom" );

foreach my $data ($XP->findnodes('//xatom:entry/xatom:content[@type="text/xml"]'))
{
    # etc ...
}

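This is needed despite the fact that inside the Atom feed, NO prefix is explicitly used on the elements. The Atom file contains <entry> and NOT <xatom:entry>, but the feed declares the Atom namespace as its default (xmlns="http://www.w3.org/2005/Atom"), so every unprefixed element belongs to that namespace, while an unprefixed name in an XPath 1.0 expression only matches elements in no namespace at all. You MUST therefore register a prefix for the namespace URI to be able to read the data. The prefix itself is arbitrary: I picked xatom, but it could just as well have been fred.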
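To see the effect in isolation, here is a small sketch. The two-entry feed is a made-up sample, not real data; only the namespace URI and the registerNs call come from the code above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical minimal Atom feed: the namespace is declared once, as the default
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><content type="text/xml">first</content></entry>
  <entry><content type="text/xml">second</content></entry>
</feed>
XML

my $doc = XML::LibXML->new()->parse_string($xml);
my $xp  = XML::LibXML::XPathContext->new($doc->documentElement());

# Unprefixed names in XPath 1.0 only match elements in no namespace
my @plain = $xp->findnodes('//entry');

# Bind two prefixes to the same Atom namespace URI; the prefix names are arbitrary
$xp->registerNs( xatom => 'http://www.w3.org/2005/Atom' );
$xp->registerNs( fred  => 'http://www.w3.org/2005/Atom' );

my @xatom = $xp->findnodes('//xatom:entry/xatom:content[@type="text/xml"]');
my @fred  = $xp->findnodes('//fred:entry');

# Prints: plain=0 xatom=2 fred=2
printf "plain=%d xatom=%d fred=%d\n", scalar @plain, scalar @xatom, scalar @fred;
```

What matters is the namespace URI, which has to match the one declared in the feed exactly; the prefix is just a local name for it inside the XPath expression.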
