Custom printing for HTML with JSoup

I recently wrote about HTML parsing with JSoup. In another project I then ran into a problem with pretty printing (or rather, prettier printing) of HTML documents with JSoup, and I thought it would be worth sharing in an article.

Prettier printing

I could generalise the term and call it custom printing. My HTML input had only a few line breaks, but I needed a new line after each tag. So I looked for a way to accomplish this.

JSoup can print an HTML document, or just a part of it, as a string: nodes as tags, with indentation. However, if you want to use your own printing mechanism, you have to tinker a bit and read a lot of source code.

Fortunately it is not as hard as it sounds.

The solution is to traverse the HTML document, or the part of it you want to print, and apply your formatting to each node as it is visited.

To traverse a node tree you need an org.jsoup.select.NodeTraversor, which takes an org.jsoup.select.NodeVisitor as a parameter. NodeVisitor is an interface you have to implement, and your visitor handles each node that is visited. The visitor is the place in the code where you apply your formatting.

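// Walks the element's subtree and returns the text collected by the visitor.
public String myFormattedHTML(Element element) {
    FormattingVisitor visitor = new FormattingVisitor();
    // Note: newer JSoup releases offer a static call instead of the constructor:
    // NodeTraversor.traverse(visitor, element);
    NodeTraversor traversor = new NodeTraversor(visitor);
    traversor.traverse(element);
    return visitor.toString();
}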

In the method above the input Element is a JSoup HTML element. If you want to print the whole document with a custom format, you can change the parameter type to Document and pass in the parsed document, or to Node for a single node. This works because NodeTraversor.traverse takes an org.jsoup.nodes.Node as its input parameter, and both Document and Element extend Node. If you want to print an Elements collection, you need to loop through the Element objects and traverse them individually.

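// These imports belong at the top of the enclosing source file.
// StringUtils is assumed to come from Apache Commons Lang 3.
import org.apache.commons.lang3.StringUtils;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

private class FormattingVisitor implements NodeVisitor {
    private final StringBuilder accumulatedText = new StringBuilder();

    // Called when a node is entered: print text content or the opening tag.
    @Override
    public void head(Node node, int depth) {
        String name = node.nodeName();
        if (node instanceof TextNode) {
            append(((TextNode) node).text(), depth);
        }
        else if (!"body".equals(name)) {
            appendNode(node, depth);
        }
    }

    // Called when a node is left: print the closing tag.
    // Text nodes have no tag, body is skipped, and <br> was already closed in appendNode.
    @Override
    public void tail(Node node, int depth) {
        String name = node.nodeName();
        if (!(node instanceof TextNode) && !"body".equals(name) && !"br".equals(name)) {
            append("</" + name + ">", depth);
        }
    }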

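    // Writes the opening tag of a node, including its attributes, on its own line.
    private void appendNode(Node node, int depth) {
        // Four spaces per level; depths 0 and 1 get no indentation.
        accumulatedText.append(StringUtils.repeat(" ", (depth - 1) * 4));
        accumulatedText.append("<").append(node.nodeName());
        for (Attribute attr : node.attributes()) {
            accumulatedText.append(" ").append(attr.html());
        }
        // <br> has no content, so it is written as a self-closing one-liner.
        if ("br".equals(node.nodeName())) {
            accumulatedText.append("/");
        }
        accumulatedText.append(">\n");
    }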

    // Writes one indented line of text; whitespace-only text is skipped.
    private void append(String text, int depth) {
        text = StringUtils.normalizeSpace(text);
        if (StringUtils.isNotEmpty(text)) {
            accumulatedText.append(StringUtils.repeat(" ", (depth - 1) * 4));
            accumulatedText.append(text);
            accumulatedText.append("\n");
        }
    }

    @Override
    public String toString() {
        return accumulatedText.toString();
    }
}

As you can see, this formatting is very straightforward and prints a new line after each node.

The head method is called when a node starts, and the tail method when the end of the node is reached. For empty nodes (like <br> in my case) you have to be careful, because you can end up with the opening and closing tags on separate lines when you need a one-liner.
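To see the call order in isolation, here is a minimal sketch that does not use JSoup at all. SimpleNode, Visitor and traverse are hypothetical stand-ins I made up for illustration; they only mimic the head/tail shape of NodeVisitor and are not JSoup's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

class HeadTailDemo {

    /** Hypothetical stand-in for a parsed node; this is NOT a JSoup type. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Same shape as the head/tail pair described in the text. */
    interface Visitor {
        void head(SimpleNode node, int depth);
        void tail(SimpleNode node, int depth);
    }

    /** Depth-first walk: head before the children, tail after them. */
    static void traverse(Visitor visitor, SimpleNode node, int depth) {
        visitor.head(node, depth);
        for (SimpleNode child : node.children) {
            traverse(visitor, child, depth + 1);
        }
        visitor.tail(node, depth);
    }

    /** Records the order in which head and tail are called for a tree. */
    static List<String> callOrder(SimpleNode root) {
        final List<String> calls = new ArrayList<>();
        traverse(new Visitor() {
            @Override
            public void head(SimpleNode node, int depth) {
                calls.add("head " + node.name + " " + depth);
            }

            @Override
            public void tail(SimpleNode node, int depth) {
                calls.add("tail " + node.name + " " + depth);
            }
        }, root, 0);
        return calls;
    }

    public static void main(String[] args) {
        // A paragraph containing an empty <br> and a <b> element.
        SimpleNode root = new SimpleNode("p", new SimpleNode("br"), new SimpleNode("b"));
        for (String call : callOrder(root)) {
            System.out.println(call);
        }
    }
}
```

For the br node, head and tail are called directly after each other with nothing in between. That is why the formatter writes the self-closing tag in head and skips br in tail.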

The for loop in appendNode shows how to handle element attributes: you read them from the node, loop through them and add them as you wish. If you do not need them, you can drop the loop completely.

As you can see, this version of the formatter produces an awkward, XML-like structure if the input contains formatting tags such as strong, bold, italic, underline and so on. List elements may suffer from the same problem. However, as I said, this is a sample solution I needed in a hurry to make one project look better.
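One possible way to soften this, which is not part of my solution, would be to treat a set of inline tags differently and not break the line after them. The sketch below only shows the decision helper. The class name InlineTags, its methods and the list of tag names are assumptions for illustration, and the visitor would still have to be adapted to use them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class InlineTags {

    // Assumed list of inline tags; extend it to match the markup you actually have.
    private static final Set<String> INLINE = new HashSet<>(Arrays.asList(
            "b", "strong", "i", "em", "u", "span", "a", "code"));

    /** Returns true if a tag with this name should stay on the current line. */
    static boolean isInline(String tagName) {
        return tagName != null && INLINE.contains(tagName.toLowerCase(Locale.ROOT));
    }

    /** Text to emit after a tag: nothing for inline tags, a line break otherwise. */
    static String separatorAfter(String tagName) {
        return isInline(tagName) ? "" : "\n";
    }

    public static void main(String[] args) {
        System.out.println(isInline("strong")); // inline tag, stays on the line
        System.out.println(isInline("div"));    // block tag, gets its own line
    }
}
```

In the visitor, the result of separatorAfter(node.nodeName()) could replace the hard-coded "\n", and the indentation would have to be skipped for inline nodes as well.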

Source code

Because this code is independent of the project, I can share it with you. You’ll find it on GitHub, as usual.
