XML Processing Advanced

As mentioned in an earlier article, I wrote an XML extractor that takes a SOAP response and renders its data as a PDF with iText. That extractor was big, clumsy and obtrusive. I recently discovered XMLBeam, which does the same job as my extractor but is more lightweight and reusable. So I switched from my own implementation to XMLBeam, removing around 8 classes (about 100 LoC each on average) and replacing the whole functionality with 136 LoC.

Let’s see how it happened.

XML Extractor — my version of solving the problem (brute force)

Here is how I parsed the XML and extracted the data I needed. Looking back, some parts of the code could have been made more generic and reusable, but fortunately I came across XMLBeam before I started building my own framework. Once again a good example of DRtW (Don’t Reinvent the Wheel).

To extract the data I created a Java class whose methods were called from other classes. The methods were not static: the constructor parsed the input and built a DOM Document from the XML, and the methods then queried that instance. The response was provided as a String; in offline mode the test data came as an InputStream.

// javax.xml.parsers.DocumentBuilder
// javax.xml.parsers.DocumentBuilderFactory
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// org.w3c.dom.Document
// org.xml.sax.InputSource
// java.io.StringReader
Document xmlDocument = builder.parse(new InputSource(new StringReader(xmlInput)));
// the InputStream parsing would look like: Document xmlDocument = builder.parse(xmlInputStream);

As you can see above, creating a DOM Document is not a big deal: you need a DocumentBuilder to parse the input. The only slightly tricky part is turning a String into an InputSource, which is done by wrapping the String in a StringReader.
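The two input cases can be wrapped in a small helper. This is only a minimal sketch: the class name DomLoader and its method names are made up for illustration, and checked exceptions are rethrown as unchecked ones to keep the calling code short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original extractor.
public class DomLoader {

    // Online mode: the SOAP response arrives as a String.
    public static Document fromString(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML string", e);
        }
    }

    // Offline mode: the test data arrives as an InputStream.
    public static Document fromStream(InputStream in) {
        try {
            return newBuilder().parse(in);
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML stream", e);
        }
    }

    private static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser available", e);
        }
    }
}
```

Calling DomLoader.fromString(xmlInput).getDocumentElement() then gives you the root element of the response.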

The core of the extraction is evaluating an XPath expression and iterating over the node set it returns. The evaluation itself is a one-liner:

// javax.xml.xpath.XPath, XPathFactory, XPathConstants
// org.w3c.dom.NodeList
String xpathExpression = "/root/skills/tab/category";
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList) xPath.compile(xpathExpression).evaluate(xmlDocument, XPathConstants.NODESET);
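Iterating over the returned node set could look roughly like the sketch below. To keep it self-contained it evaluates the expression directly against an InputSource instead of a prepared Document, and the class name CategoryNames is an assumption for illustration. NodeList is not an Iterable, so a classic index loop is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original extractor.
public class CategoryNames {

    public static List<String> categoryNames(String xml) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        List<String> names = new ArrayList<>();
        try {
            // Select every category element in the document.
            NodeList categories = (NodeList) xPath.evaluate(
                    "/root/skills/tab/category",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            // Walk the node set by index.
            for (int i = 0; i < categories.getLength(); i++) {
                Node category = categories.item(i);
                // A relative expression is evaluated against the current node.
                names.add(xPath.evaluate("catname", category));
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
        return names;
    }
}
```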

If you only need the text content of the node the expression selects, you can write:

String xpathExpression = "/root/personal_data/resultSet/core_areas";
XPath xPath = XPathFactory.newInstance().newXPath();
String coreAreas = xPath.evaluate(xpathExpression, xmlDocument);

The XML extractor worked its way through the document, evaluating the given XPath expressions and creating an object for each sub-tree of the structure, with its fields filled from the contents of the selected children.
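To give an impression of that style, here is a reduced sketch of mapping one kind of sub-tree to a class by hand. It is not the original extractor code: the class SkillSetExtractor and its nested SkillSet type are invented for illustration, and the element names are taken from the XPath expressions used in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical, reduced version of a hand-written extractor.
public class SkillSetExtractor {

    // One value class per sub-tree; field names are assumptions.
    public static final class SkillSet {
        public final String genericCategory;
        public final String category;
        public final List<String> skillNames;

        SkillSet(String genericCategory, String category, List<String> skillNames) {
            this.genericCategory = genericCategory;
            this.category = category;
            this.skillNames = skillNames;
        }
    }

    public static List<SkillSet> extract(String xml) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        List<SkillSet> result = new ArrayList<>();
        try {
            NodeList categories = (NodeList) xPath.evaluate(
                    "/root/skills/tab/category",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            for (int i = 0; i < categories.getLength(); i++) {
                Node category = categories.item(i);
                // Collect the names of all skills below this category.
                NodeList skills = (NodeList) xPath.evaluate(
                        "skill/skillname", category, XPathConstants.NODESET);
                List<String> skillNames = new ArrayList<>();
                for (int j = 0; j < skills.getLength(); j++) {
                    skillNames.add(skills.item(j).getTextContent());
                }
                // "../gencatname" steps up to the parent element of the category.
                result.add(new SkillSet(
                        xPath.evaluate("../gencatname", category),
                        xPath.evaluate("catname", category),
                        skillNames));
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
        return result;
    }
}
```

Even this reduced version needs a value class, two nested loops and explicit error handling for a single sub-tree type, which is where the line count of such an extractor comes from.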

It was a lot of code, with too many classes and too many places to touch whenever something needed fixing. Extending it was a pain in the neck too. That is what roughly 800 LoC of hand-written extraction brings with it.

Beam me up, Sven

Sven Ewald is the creator of XMLBeam. The project website is a good source of documentation, so I will not go into detail about how to use the framework. However, I’ll show some parts of the code to illustrate how I shrank several hundred lines of code to fewer than one hundred and fifty.

As mentioned above, my extractor loaded the whole XML document into memory before parsing. XMLBeam does the same: it loads XML documents into memory to work on them, so it fit this use case. It also means that large files are out of the question; for those you have to look for alternatives.
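For documents that are too large to hold in memory, one commonly used alternative is the streaming StAX API that ships with the JDK. It was not part of this project, so the following is only a minimal sketch under that assumption; the class name StreamingSkillNames is made up, and the element name skillname is borrowed from the XML discussed here.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: reads skill names without building a DOM tree.
public class StreamingSkillNames {

    public static List<String> skillNames(Reader xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                // Pull one parsing event at a time; only the current event is kept.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "skillname".equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag.
                        names.add(reader.getElementText());
                    }
                }
            } finally {
                // XMLStreamReader is not AutoCloseable, so close it manually.
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not read XML stream", e);
        }
        return names;
    }
}
```

The price is that you lose XPath: the code has to track by itself where in the document it currently is.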

Because my XML extractor already held the response in memory, XMLBeam was a natural choice. I did not expect to be able to remove around 700 lines of code, but in the end I was happy with the result.

The key idea of XMLBeam extraction is to create an interface with a getter method for every piece of data you want to read from the XML file. For complex data structures you can define nested interfaces that group the parts belonging together.

After this you place the XBRead annotation with the XPath expression on each method and you’re done. Note that for sub-interfaces the getter in the parent interface carries the XPath that selects the sub-trees, while the methods of the sub-interface only use paths relative to that sub-tree. The example below should clarify this.

import java.util.List;
import org.xmlbeam.annotation.XBRead;

public interface ProfileData {
    @XBRead("/root/personal_data/resultSet/firstname")
    String getFirstName();
    @XBRead("/root/personal_data/resultSet/surname")
    String getSurname();

    @XBRead("/root/personal_data/resultSet/core_areas")
    String getCoreAreas();

    interface Skill {
        @XBRead("skillname")
        String getName();
        @XBRead("skillval")
        String getExperience();
    }
    interface SkillSet {
        @XBRead("../gencatname")
        String getGenericCategory();
        @XBRead("catname")
        String getCategory();
        @XBRead("skill")
        List<Skill> getSkills();
    }

    @XBRead("/root/skills/tab/category")
    List<SkillSet> getSkillSets();
}

Once that is set up, you only need to hand your resource to XMLBeam, and you can use the resulting implementation of the interface as you like.

InputStream xmlStream = offlineMode ? offlineXML : new ByteArrayInputStream(xmlResponse.getBytes());
ProfileData profileData = new XBProjector().io().stream(xmlStream).read(ProfileData.class);

The offlineXML variable above is of type InputStream, in case that was not obvious. The XMLBeam method names are very descriptive: stream() means you want to read from a stream.
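If you wonder how an interface without any implementing class can return data, the toy below shows one way this can be done with plain JDK means: a dynamic proxy that looks up the annotation on the called getter and evaluates its XPath. It makes no claim about how XMLBeam works internally. ToyProjector, XPathRead and Person are invented names, and only String getters are supported.

```java
import java.io.StringReader;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Toy illustration only; this is not XMLBeam code.
public class ToyProjector {

    // Hypothetical stand-in for an annotation like XBRead.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface XPathRead {
        String value();
    }

    // Hypothetical projection interface with two annotated getters.
    public interface Person {
        @XPathRead("/root/personal_data/resultSet/firstname")
        String getFirstName();

        @XPathRead("/root/personal_data/resultSet/surname")
        String getSurname();
    }

    public static <T> T project(String xml, Class<T> type) {
        final Document document;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        final XPath xPath = XPathFactory.newInstance().newXPath();
        // Every call on the proxy ends up in this handler.
        Object proxy = Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
                (instance, method, args) -> {
                    XPathRead read = method.getAnnotation(XPathRead.class);
                    if (read == null) {
                        // Methods without the annotation (even toString) are not handled here.
                        throw new UnsupportedOperationException(method.getName());
                    }
                    // Evaluate the annotated expression against the loaded document.
                    return xPath.evaluate(read.value(), document);
                });
        return type.cast(proxy);
    }
}
```

With this in place, ToyProjector.project(xml, ToyProjector.Person.class).getFirstName() returns the text of the firstname element. A real library has to handle lists, nested interfaces, type conversion and much more, which is exactly the work you avoid redoing by using it.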

Conclusion

XMLBeam is a good tool. I will use it in my daily work whenever I need to extract XML data in Java.

For extracting HTML data, however, I’ll probably look at another tool such as JSoup, where you do not need to spell out the whole XPath (although you still can). That will be the topic of another article, in which I compare the same application written with JSoup and with XMLBeam.
