In the last article I covered XMLBeam for scraping a not-so-well-formed HTML site, which gave me a lot of pain. Now I’ll look at the same task implemented with JSoup. I can say up front that the ill-formedness caused no pain with this tool.
Last time I wrote about XMLBeam, a new tool for parsing XML documents based on XPath. At the end of that article I mentioned that I’d write about parsing an HTML website with both XMLBeam and JSoup to see which one is better to use.
This article is the first part: it introduces the task and covers the XMLBeam implementation. The next article will say more about JSoup and compare the two tools.
As mentioned in an earlier article, I wrote an XML extractor that takes a SOAP response XML and renders the data as a PDF with iText. The extractor was big, bogus and obtrusive. Recently I got to know XMLBeam, which does the same thing I created but is more lightweight and reusable. So I switched from my implementation to XMLBeam, removing around 8 classes (about 100 LoC each on average) and replacing the whole functionality with 136 LoC.
Let’s see how it happened.
Twitter has a feature that lets you ask for the trends among the tweets by location. Twitter is proud to have over 160 locations where you can browse the trends. Unfortunately, only 62 countries are covered as of writing this article, and Hungary is not among these 62…
If you have ever wondered how to clear your screen quickly in Cygwin (under Windows), for example to hide from your boss what you are doing during working hours (playing with your pet project instead of your tasks), here is the solution:
- Install the ncurses package and then you can execute tput clear. This gives you the clear command known from Unix-like systems.
- Press CTRL+L. Same result, but faster: you do not need to install anything, and it always comes in handy.
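If you want something to type in a script, or you are on a box without ncurses, the two options can be rolled into one small shell function. This is only a minimal sketch: the name `cls` is made up for illustration, and the fallback assumes your terminal understands ANSI escape sequences.

```shell
# cls: hypothetical helper name, not something Cygwin ships.
# Try tput first (available once ncurses is installed); if tput is
# missing or fails, fall back to raw ANSI escapes:
#   ESC[H  moves the cursor to the top-left corner
#   ESC[2J erases the visible screen
cls() {
    tput clear 2>/dev/null || printf '\033[H\033[2J'
}
```

Put it in your `~/.bashrc` and call `cls` whenever the boss walks by.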
Happy (hidden) coding.
The problem with some articles is that I get the idea for them while working on a specific topic, but I end up writing the article itself weeks or months later. This article has the same issue: I thought of it in the middle of May, and by now I have forgotten why I wanted the Google App Engine (GAE) to be part of the topic.
Let’s see what I end up with.