Website scraping with JSoup and XMLBeam — Part 3

In the last two articles I introduced website scraping with XMLBeam and JSoup respectively.

In this article I’ll do a final comparison of the two libraries. Do not expect anything professional, however: it is just a simple run-and-measure-time analysis of the two libraries on my dataset, on two machines.


Website scraping with JSoup and XMLBeam — Part 2

In the last article I covered XMLBeam for scraping a not-so-well-formed HTML site, which caused me a lot of pain. Now I’ll look at the same task implemented with JSoup. I might as well mention up front that the ill-formedness caused no pain with this tool.


Website scraping with JSoup and XMLBeam — Part 1

Last time I wrote about XMLBeam, a new tool for parsing XML documents based on XPath. At the end of that article I mentioned that I would write about parsing an HTML website with XMLBeam and JSoup to see which of the two is better to use.

This article is the first part. It introduces the task and covers the XMLBeam implementation. The next article will cover JSoup and compare the two tools.


XML Processing Advanced

As mentioned in an earlier article, I wrote an XML extractor that takes a SOAP response XML and renders the data as a PDF with iText. The XML extractor was big, bogus and obtrusive. Recently I got to know XMLBeam, which does the same thing as my extractor but is more lightweight and reusable. So I switched from my implementation to XMLBeam, removing around 8 classes (averaging 100 LoC each) and replacing the whole functionality with 136 LoC.

Let’s see how it happened.
