Website scraping with JSoup and XMLBeam — Part 3

In the last two articles I introduced website scraping with XMLBeam and JSoup respectively.

In this article I’ll do a final comparison of the two libraries. Do not expect anything professional, though: it is just a simple run-and-measure-time analysis of the two libraries on my dataset, on two machines.

Comparison of runtime

Let me state it first: do not expect anything complicated in this section. I did not set up a rigorous benchmark with a prepared workload or anything similar. I just let the code run on my work machine and on my machine at home.

I prepared 4 scenarios for the run:

  1. Execute the application only with JSoup extraction (so read all the results and write the CSV file).
  2. Execute the application only with XMLBeam extraction (the same as above).
  3. Execute the application first with JSoup, then with XMLBeam (in the same run, calling both from the main method, the second one after the first has finished).
  4. Execute the application first with XMLBeam then with JSoup (the same as above).
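A minimal sketch of how scenarios 3 and 4 could be timed from a single main method. The class name `ScenarioTimer`, the method `runInOrder` and the two empty `Runnable` placeholders are assumptions for illustration; they are not the code from the earlier articles.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Times named tasks one after another; the tasks stand in for the two extractors. */
final class ScenarioTimer {

    /** Runs each task once, in list order, and returns the elapsed time per task name. */
    static Map<String, Duration> runInOrder(List<Map.Entry<String, Runnable>> tasks) {
        Map<String, Duration> elapsed = new LinkedHashMap<>(); // keeps the execution order
        for (Map.Entry<String, Runnable> task : tasks) {
            long start = System.nanoTime();
            task.getValue().run();
            elapsed.put(task.getKey(), Duration.ofNanos(System.nanoTime() - start));
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical placeholders: replace with the real JSoup and XMLBeam extraction calls.
        Runnable jsoup = () -> { };
        Runnable xmlBeam = () -> { };

        // Scenario 3: JSoup first, then XMLBeam. Swap the two entries for scenario 4.
        Map<String, Duration> result = runInOrder(List.of(
                Map.entry("JSoup", jsoup),
                Map.entry("XMLBeam", xmlBeam)));
        result.forEach((name, time) -> System.out.println(name + ": " + time.getSeconds() + " s"));
    }
}
```

Scenarios 1 and 2 are the same call with a single entry in the list.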

If I have time I’ll repeat each run several times, because the runtime can vary a lot. I also tried not to do anything else on the computers, because that would affect the performance. I’m currently working on a migration project where the processor runs at 100% for 30-50 hours (file comparison and other nasty things with the data in the files), so if I started the extraction in parallel, the normal running time of 10-15 minutes per extractor could rise to 1 hour.

With more time I could run scenarios 1 and 2 in a loop, but that takes too long: 500 seconds is almost ten minutes, and 5 loops of a scenario take almost one hour. That is too much time to spend not working, whether at the office or at home.

The machines

Before I dig into the test results, I should tell you the configuration of the machines I ran my tests on. Here it comes:

  • Home: MacBookPro @ 2.30 GHz (x 8), 16 GB 1600 MHz DDR3
  • Work: MacBookPro @ 2.66 GHz (x 4), 8 GB 1067 MHz DDR3

The results

So here are the results.

First the last two scenarios compared per extractor on my notebook at home:

Runtime in seconds:

Extractor   Scenario 3   Scenario 4
JSoup              454          493
XMLBeam            534          452

As you can see above, whichever extractor runs first finishes in less time. This may be caused by Java garbage collection kicking in while the second extractor is working.
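One way to check the garbage-collection guess would be to ask the JVM how much collection time accumulated while each extractor ran. The sketch below uses the standard `java.lang.management` beans; the class name `GcProbe` and the empty task are assumptions for illustration, and I did not run this as part of the measurements in this article.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reports how much garbage-collection time accumulated while a task was running. */
final class GcProbe {

    /** Sums the accumulated collection time (milliseconds) of all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Runs the task and returns the GC milliseconds that were added during the run. */
    static long gcMillisDuring(Runnable task) {
        long before = totalGcMillis();
        task.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder: replace with the call to the second extractor.
        Runnable secondExtractor = () -> { };
        System.out.println("GC time during run: " + gcMillisDuring(secondExtractor) + " ms");
    }
}
```

If the second extractor shows clearly more GC time than the first, that would support the explanation; if not, something else is slowing it down.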

I got some free CPU time at work (at night), so I added a loop to the execution. Below you can see the results of the tests:

Scenario 3 (runtime in seconds)

Extractor   Loop 1   Loop 2   Loop 3   Loop 4   Loop 5   Average
JSoup          503      434      466      491      491     477
XMLBeam        441      454      461      508      511     475

Scenario 4 (runtime in seconds)

Extractor   Loop 1   Loop 2   Loop 3   Loop 4   Loop 5   Average
JSoup          575      433      434      463      486     478.2
XMLBeam        732      448      446      475      481     524.4

As you can see, in the first iteration the extractor that runs first takes longer than in the later iterations (remember: in scenario 4 XMLBeam runs first, followed by JSoup).

In the meantime I altered the first two scenarios: I ran the whole extraction for each tool in a loop of 5 iterations, to measure performance under workload and with garbage collection running (if needed; that always depends on the VM). This gives a better picture than running an extractor only once.
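The looped measurement can be sketched roughly like this. The names `LoopBenchmark`, `timeLoops` and `average` are assumptions for illustration, and the empty task is a placeholder for one extractor; the numbers in the example are the JSoup runtimes of scenario 3 from the table above.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

/** Runs one task several times in the same JVM and averages the runtimes. */
final class LoopBenchmark {

    /** Runs the task the given number of times and returns the whole seconds each run took. */
    static long[] timeLoops(Runnable task, int loops) {
        long[] seconds = new long[loops];
        for (int i = 0; i < loops; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
        }
        return seconds;
    }

    /** Arithmetic mean of the measured runtimes; NaN for an empty array. */
    static double average(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical placeholder: replace with the real extraction call.
        Runnable extractor = () -> { };
        long[] runs = timeLoops(extractor, 5);
        System.out.println(Arrays.toString(runs) + " average: " + average(runs));

        // The JSoup runtimes of scenario 3, to show the averaging step.
        System.out.println(average(new long[] {503, 434, 466, 491, 491})); // prints 477.0
    }
}
```

Because all iterations share one JVM, the first iteration also pays for class loading and warm-up, which the later ones do not.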

Scenarios 1 and 2 at work (runtime in seconds)

Extractor   Loop 1   Loop 2   Loop 3   Loop 4   Loop 5   Average
JSoup          447      616      872      446      430     562
XMLBeam        899      505      500      476      500     576

Scenarios 1 and 2 at home (runtime in seconds)

Extractor   Loop 1   Loop 2   Loop 3   Loop 4   Loop 5   Average
JSoup          482      484      480      482      491     483.8
XMLBeam        499      514      517      514      489     506.6

As you can see, there is one peak at work with XMLBeam of about 900 seconds for the more than 900 records to fetch and write. I guess something else ran on my machine in the meantime, or the network connection was bad, as it sometimes is.

Maybe I could implement such a looping variation for scenarios 3 and 4, although each would take about 2 hours to finish. And at work I do not have 4 hours of free time in which I can monitor the application and the runtimes (4 hours because there are two scenarios).

Eventually I’ll do this one evening: start the application with a loop for each scenario and collect the results the next day. That would be it. If I manage to finish it before this article is published, you’ll find out.

Another option would be to set up my old Raspberry Pi with a new image and Java and let the processes run there. I guess it would take somewhat longer, because it is much less powerful than my two Macs. But it would give me another benchmark. If I do this I’ll tell you the results. But first I need to find my RasPi somewhere in the vault 🙂

Conclusion

There is no golden rule I can give you about which tool to use from here on. I only have some advice:

  • If you have ill-formed HTML: use JSoup. It handles this problem for you and you do not have to do any extra chores (tidying the HTML, disabling validation and so on).
  • If you have well-formed HTML: use JSoup. You have more freedom to play with cssquery than with XPath 1.0. Naturally, sometimes you need a trick to get all the text you want, but it is not so bad. And JSoup is a little bit faster.
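To illustrate why ill-formed HTML is a chore for XML-based tooling, here is a small sketch that uses only the JDK's own strict XML parser and XPath 1.0 engine. It does not use XMLBeam or JSoup; the class name `StrictParsingDemo` and the two markup snippets are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Shows that a strict XML parser accepts well-formed markup and rejects an unclosed tag. */
final class StrictParsingDemo {

    /** Returns the XPath 1.0 string result, or null if the markup is not well-formed XML. */
    static String evaluate(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (SAXException e) {
            return null; // the parser refused the markup, for example because of an unclosed tag
        } catch (ParserConfigurationException | IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: every tag is closed, so the XPath query can run.
        System.out.println(evaluate("<html><body><p>Hello</p></body></html>", "/html/body/p"));
        // Ill-formed: <br> is never closed, which is common in real pages but fatal here.
        System.out.println(evaluate("<html><body><p>Hello<br></body></html>", "/html/body/p"));
    }
}
```

The second call returns null because parsing already fails, so such a page would have to be tidied before any XPath expression can be evaluated on it.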

Personally I prefer XMLBeam because it needs less code: you can get lost in XPath, but you do not need to loop through child nodes, although sometimes looping is the best solution.
