Website scraping with JSoup and XMLBeam — Part 1

Last time I wrote about XMLBeam, a new tool for parsing XML documents based on XPath. At the end of that article I mentioned that I would write about parsing an HTML website with XMLBeam and JSoup to compare which one is better suited for the task.

This article is the first part: it introduces the task and covers the XMLBeam implementation. The next article will cover JSoup and a comparison of the two tools.

Introduction of the task

The need for website scraping came along with my job, where one of our clients needed a specific application that gathers data from a given website’s search and stores the resulting documents in a CSV file.

I’ll give a brief description of the application’s flow, which I implemented with both XMLBeam and JSoup: the same steps with a different tool. Naturally the programming language was Java (with another language I’d have used other tools). Because this was a project for a client I do not have permission to share the source code with you, but I can give you some bits without revealing sensitive information.

The website was built with TYPO3 and I could use the URLs of the pages directly because there is no URL rewriting involved.

So the steps of the application (a rough code sketch follows the list):

  1. Parse the provided arguments. In this case there are 5 arguments for the application, 2 mandatory and 3 optional. Mandatory: the path of the resulting CSV file and the name of the resulting CSV file. Optional: the timeout of the search in seconds (for JSoup), the URL of the proxy and the port of the proxy. You can provide either the 2 mandatory parameters, the 2 mandatory ones plus the timeout, or all five, in the order listed above. Currently there is no sophisticated way to pass something like --timeout:25 as an argument at any position. Maybe if I have time I’ll introduce it in the next version.
  2. Call the search without a search term, with 10 results per page, to get the count of all results (the result navigator displays how many results you got). Extract the number of results and validate it (is there a result at all, or did something go wrong?).
  3. Call the search without a search term, with 50 results per page (the 100-results-per-page option has a bug: it does not provide navigation, and you cannot navigate with the custom URLs either). Extract all the links to the subpages of the results. (I could possibly combine steps 2 and 3 into one and get all results from the first 50-results query, however that makes validating and terminating the application difficult, and you end up with an awkward if inside your for loop. I will revisit this in the performance monitoring session and rewrite the code to measure the time difference.)
  4. Extract the roughly 40 fields required for the CSV from each of the subpages and write them line by line into the file. (Again I could have combined this with the steps above, but I think it is better to have all the needed URLs first and then work through them and export them into the file; you also have one case less to handle if an exception is thrown somewhere.)
  5. Close the CSV file (this should be obvious, but sometimes everybody misses it; let’s do it Java 7 style with try-with-resources).
  6. Add a proxy. The client accesses the Internet from behind a proxy, so I had to add a mechanism to enable this. You have to think about such things too.
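To make the flow easier to follow, here is the rough sketch mentioned above. Everything in it (class and method names, the handling of the page size) is hypothetical; the real implementation depends on the client’s site and is described in the rest of this article.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline of the steps above; the site-specific parsing is
// hidden behind placeholder methods. Step 6 (the proxy) is shown at the
// end of the article.
public class ScraperFlow {

    public static void main(String[] args) throws IOException {
        Path csvFile = Paths.get(args[0], args[1]);              // step 1 (simplified)

        int total = fetchResultCount();                          // step 2
        if (total <= 0) {
            throw new IllegalStateException("search returned no results");
        }

        List<String> detailUrls = new ArrayList<>();             // step 3
        for (int page = 0; page * 50 < total; page++) {
            detailUrls.addAll(fetchDetailLinks(page));
        }

        try (BufferedWriter csv = Files.newBufferedWriter(       // steps 4 and 5
                csvFile, StandardCharsets.UTF_8)) {
            for (String url : detailUrls) {
                csv.write(toCsvLine(url));
                csv.newLine();
            }
        }
    }

    // Placeholders for the site-specific parsing described in the rest of the article.
    private static int fetchResultCount() { throw new UnsupportedOperationException(); }
    private static List<String> fetchDetailLinks(int page) { throw new UnsupportedOperationException(); }
    private static String toCsvLine(String detailUrl) { throw new UnsupportedOperationException(); }
}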

I won’t go into the simple part of parsing and validating the program arguments.

XMLBeam

I wrote about XMLBeam in a previous article, so I won’t go into detail about what this tool does. In one sentence: it parses XML with XPath expressions.

Some minor problems when parsing

I have to tell you that at first I wanted to write an XSLT for the clients instead of a Java application, because I had worked on other XSLTs for them and they know how to handle them. However, I noticed that the website is ill-formed (see below), so I had to skip the XSL transformation (Saxon could not parse the site). This became a problem with XMLBeam too, but there I had a workaround because the application was written in Java.

The ill-formedness was not the only problem, but let’s not get ahead of ourselves.

Ill-formed HTML

The first problem I encountered was in the website itself: there are some ill-formed XHTML tags. For example, in the search form a radio button has a “checked” attribute without any “=true” or “=false” appended, which leads to an XML parsing error. Another example is in the navigation: it contains a URL with an & that has no ; at the end, which throws another exception. The third problem was in the results: some bare &s appear in the resulting texts, again a parsing problem. I managed to handle these problems with a String.replaceAll call, because the affected bits are not needed in the CSV or in the later processing. This also means I had to give up XMLBeam’s feature of annotating the projection interface with the URL to make the parsing easier. But this should not cause any problems later when it comes to performance.
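For illustration, a minimal sketch of this kind of cleanup, assuming the page content has already been read into a String called rawHtml (the patterns below are made up for this example; the real replacements matched the site’s actual markup):

// Hypothetical examples of the String-based cleanup that was later replaced by JTidy.
String cleaned = rawHtml
        .replaceAll("&(?!amp;|lt;|gt;|quot;|#\\d+;)", "&amp;")    // escape bare ampersands
        .replaceAll("checked(?=[\\s>])", "checked=\"checked\"");  // give the value-less attribute a value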

Replacing the ill-formed elements solved the problem until I got to the detail pages. There I hit a bigger problem: a paragraph was opened (<p>) but a span was closed (</span>), and that is not something you can fix with a String replace. For this I looked for another tool and found JTidy. It is old too, last released in 2009, and it has some bugs, but for my purposes it is good enough. I only needed it for this one problem, but in the end it also let me delete my replaceAll workaround.

Configuring Tidy was easy too: just add the JAR as a Maven dependency and you can use it:

Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");   // encoding of the fetched HTML
tidy.setOutputEncoding("UTF-8");  // encoding of the tidied output
tidy.setXmlOut(true);             // emit well-formed XML instead of HTML
tidy.setShowErrors(0);            // print no errors
tidy.setShowWarnings(false);      // print no warnings
tidy.setQuiet(true);              // suppress the per-page summary

Quiet mode is recommended because Tidy is otherwise quite chatty: it prints a report after tidying every single page, which gets annoying.

And how do you use Tidy with XMLBeam? This was a bit more involved, because I needed an InputStream of my tidied data but Tidy only writes to an OutputStream; in Java you can convert between the two easily. The second problem was the input for Tidy: it needs a Reader or an InputStream. I first tried to open an InputStream to the site with Apache IOUtils, but in the end Tidy did not read the site’s content that way. So I switched to IOUtils.toString(), which reads the whole URL and returns the contents as a String.
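Roughly, the wiring looks like the sketch below. The variable searchUrl and the projection interface SearchResultPage are placeholders for the client-specific parts, and exception handling is omitted; the classes come from java.io, java.net, Apache Commons IO, JTidy and XMLBeam.

// Fetch the page as a String, tidy it into a byte buffer, then hand the
// cleaned stream to XMLBeam. Names are placeholders, exception handling omitted.
String rawHtml = IOUtils.toString(new URL(searchUrl), StandardCharsets.UTF_8);

ByteArrayOutputStream tidied = new ByteArrayOutputStream();
tidy.parse(new ByteArrayInputStream(rawHtml.getBytes(StandardCharsets.UTF_8)), tidied);

InputStream cleanedStream = new ByteArrayInputStream(tidied.toByteArray());
SearchResultPage results = new XBProjector().io().stream(cleanedStream).read(SearchResultPage.class);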

Slow parsing

The second problem was a bit more difficult. I started the scraping and it needed over 60 seconds to get the result count from the site (with JSoup it only took 3 seconds). This gave me a bit of a headache. I debugged the code and narrowed it down to one point: the XML parser used by XMLBeam (i.e. the default JDK one) is really slow when it comes to namespace awareness and loading external DTDs. It took me some hours to figure out the solution. Fortunately XMLBeam lets you change the XML parser and the XPath engine. So I modified the XML parser configuration as follows:

XMLFactoriesConfig validationDisabled = new DefaultXMLFactoriesConfig() {
    @Override
    public DocumentBuilderFactory createDocumentBuilderFactory() {
        DocumentBuilderFactory factory = super.createDocumentBuilderFactory();
        // The site uses no namespaces, so skip the expensive namespace handling.
        factory.setNamespaceAware(false);
        try {
            // Do not fetch and validate against the external DTD on every parse.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
        return factory;
    }
};

// some code omitted

// set my factory configuration
IOBuilder ioBuilder = new XBProjector(validationDisabled).io();

As you can see it is not a big thing: I just let XMLBeam create its own DocumentBuilderFactory and then disable the two settings that make it slow, namespace awareness and external DTD loading. With this I got down to 3 seconds, the same as with JSoup. Nice catch.
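From there, reading a tidied page works the same way as in the earlier sketch, only with the configured projector behind it (again assuming the hypothetical SearchResultPage projection):

// Same call as before, just with the tuned parser configuration behind it.
SearchResultPage results = ioBuilder.stream(cleanedStream).read(SearchResultPage.class);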

The solution

After the problems at the start (and during development) I came to the point where only the XPath structure matters. First I did the version with the least effort on my side: I used relative paths where possible. Fortunately the site has a well-identified set of HTML tags (classes or ids on most of them) and they rarely overlap, so I could put them to use and have XMLBeam return plain String values.

However, the default Java XPath engine only supports XPath 1.0, so some functions (replace or substring-after-last, for example) are not available. A possible solution would be to use another XPath engine (Saxon), but when I switched the engine I got no results back after parsing the site. I suspect it runs into the same issue that made the default parser slow, even though Saxon itself evaluates the expressions very fast. Nevertheless, I skipped changing the parser and XPath engine and lived without these convenience functions. This is why I stayed with nested substring expressions like this one:

@XBRead("concat(substring-before(substring-after(//span[@class='register-number'],'Registration number '),' - '),'-',substring-after(substring-after(//span[@class='register-number'],'Registration number '),' - '))")

This XPath extracts a registration number (it always contains a dash). On the website the number is displayed with spaces around the dash; the expression reassembles the two parts without them, which is the number’s original format.

As for the projections, I created 2 interfaces: one holding the result count and the links to the results, the other holding all the data needed for one line of the CSV export. I won’t go into detail because it works the same as in my previous article about XMLBeam; only the XPath expressions changed.
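As a rough illustration, the first projection could look something like the sketch below; the XPath expressions and names are made up, the real ones are specific to the client’s site.

import java.util.List;

import org.xmlbeam.annotation.XBRead;

// Illustrative sketch of the first projection; the XPath expressions are placeholders.
public interface SearchResultPage {

    // The number of hits as displayed in the result navigator.
    @XBRead("//span[@class='search-result-count']")
    String getResultCount();

    // The links to the detail pages of the results.
    @XBRead("//div[@class='search-results']//a/@href")
    List<String> getDetailLinks();
}

The second interface looks the same, only with roughly 40 getters, one per CSV column.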

Exporting the results to a CSV file was easy too; however, I did not use any third-party library for this purpose. I wrote my own export function, which assembles each line with a StringBuilder and writes it to the file. The whole solution (including the projection interfaces and the calling methods) is around 300 well-formatted lines of code.
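A minimal sketch of such an export function, assuming the roughly 40 field values per result have already been extracted into a list and using a semicolon as separator:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Each row is the list of already extracted field values for one result;
// the fields are joined with semicolons and written line by line.
void writeCsv(Path csvFile, List<List<String>> rows) throws IOException {
    try (BufferedWriter writer = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
        for (List<String> row : rows) {
            StringBuilder line = new StringBuilder();
            for (String field : row) {
                if (line.length() > 0) {
                    line.append(';');
                }
                line.append(field);
            }
            writer.write(line.toString());
            writer.newLine();
        }
    }
}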

Although the XMLBeam documentation mentions overriding the toString method of a projection, that is not the same as printing the results of the projection as a CSV line. Overriding toString gives you access to the DOM structure of the projection (i.e. the page on which the XBRead expressions are executed) and lets you print it as you like. If there is a way to access the getter methods annotated with @XBRead and execute them from there, I did not find it, but I’m happy to hear about it and try it out myself.

Adding a proxy

As mentioned previously, the clients sit behind a proxy, so I needed to configure some settings for them to be able to create the CSV.

Fortunately there is a plain Java solution for working with proxies: you set the right system properties (you can do it at runtime) and you are good to go.

// These properties are read by Java's default URL handlers.
System.getProperties().put("proxySet", "true");
System.setProperty("http.proxyHost", "host name of the proxy server");
System.setProperty("http.proxyPort", "port of the proxy server");

Next time…

I’ll take a look at the same task, this time with JSoup, because this article has again grown long with all the unforeseen obstacles I had to overcome.

Stay tuned.
