Pretty Print HTML using PHP 8.4's new HTML DOM


Those whom the gods would send mad, they first teach recursion.

PHP 8.4 introduces a new Dom\HTMLDocument class. It is a modern HTML5 replacement for the ageing XHTML-based DOMDocument. You can read more about how it works; the short version is that it reads HTML, correctly sanitises it, and turns it into a nested object. Hurrah!

The one thing it doesn't do is pretty-printing. When you call $dom->saveHTML() it will output something like:

<html lang="en-GB"><head><title>Test</title></head><body><h1>Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png"></p><ol><li>List</li><li>Another list</li></ol></main></body></html>

Perfect for a computer to read, but slightly tricky for humans.

As was written by the sages:

A computer language is not just a way of getting a computer to perform operations but rather … it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.

HTML is a programming language. Making markup easy to read for humans is a fine and noble goal. The aim is to turn the single line above into something like:

<html lang="en-GB">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Testing</h1>
        <main>
            <p>Some <em>HTML</em> and an <img src="example.png"></p>
            <ol>
                <li>List</li>
                <li>Another list</li>
            </ol>
        </main>
    </body>
</html>

Cor! That's much better!

I've cobbled together a script which is broadly accurate. There are a million-and-one edge cases and about twice as many personal preferences. This aims to be quick, simple, and basically fine. I am indebted to this random Chinese script and to html-pretty-min.

Step By Step

I'm going to walk through how everything works. This is as much for my benefit as for yours! This is beta code. It sorta-kinda-works for me. Think of it as a first pass at an attempt to prove that something can be done. Please don't use it in production!

Setting up the DOM

The new HTMLDocument should be broadly familiar to anyone who has used the previous one.

$html = '<html lang="en-GB"><head><title>Test</title></head><body><h1>Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png"></p><ol><li>List<li>Another list</body></html>';
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );
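A quick way to see the sanitising in action is to count the list items after parsing. This is only a minimal sketch, assuming PHP 8.4 with the DOM extension enabled; the `$fragment` string is made up for illustration.

```php
<?php
// Hypothetical input: neither <li> is closed.
$fragment = '<ol><li>List<li>Another list</ol>';
$doc = Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR, "UTF-8" );

// The HTML5 parser closes each <li> implicitly, so both should be found as separate elements.
$items = $doc->querySelectorAll( "li" );
echo $items->length . " list items\n";

foreach ( $items as $item ) {
    echo "- " . $item->textContent . "\n";
}
```

The unclosed tags in the input are repaired in the tree, so the serialiser never sees them.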

This automatically adds <head> and <body> elements. If you don't want that, use the LIBXML_HTML_NOIMPLIED flag:

$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
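To compare the two behaviours side by side, something like the following should do. It is a rough sketch, assuming PHP 8.4; `$snippet`, `$full`, and `$bare` are names invented for this example.

```php
<?php
// Illustrative input: a fragment with no <html>, <head>, or <body>.
$snippet = '<p>Hello</p>';

// Default behaviour: the parser adds the implied wrapper elements.
$full = Dom\HTMLDocument::createFromString( $snippet, LIBXML_NOERROR, "UTF-8" );

// With LIBXML_HTML_NOIMPLIED: the fragment is kept without the wrappers.
$bare = Dom\HTMLDocument::createFromString( $snippet, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

echo $full->saveHtml() . "\n";
echo $bare->saveHtml() . "\n";
```

The flag is handy when pretty-printing a fragment which will be pasted into a larger page.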

Where not to indent

There are certain elements whose contents shouldn't be pretty-printed because the extra whitespace might change the meaning or layout of the text. For example, a paragraph should not end up looking like this:

<p>
    Some
    <em>
        HT
        <strong>M</strong>
        L
    </em>
</p>

I've picked these elements from text-level semantics and a few others which I consider sensible. Feel free to edit this list if you want.

$preserve_internal_whitespace = [
    "a",
    "em", "strong", "small",
    "s", "cite", "q",
    "dfn", "abbr",
    "ruby", "rt", "rp",
    "data", "time",
    "pre", "code", "var", "samp", "kbd",
    "sub", "sup",
    "b", "i", "mark", "u",
    "bdi", "bdo",
    "span",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "p",
    "li",
    "button", "form", "input", "label", "select", "textarea",
];

The function takes a $forceWhitespace option, which forces indenting every time it encounters an element, including the ones in this list.
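The check against the list can be sketched on its own, separate from any DOM code. The helper name `shouldLeaveAlone()` is hypothetical and the list is shortened for illustration; the real function does the same test inline.

```php
<?php
// A shortened copy of the list, for illustration only.
$preserve_internal_whitespace = [ "em", "strong", "p", "li", "pre" ];

// Hypothetical helper: true when an element's contents should be left untouched.
function shouldLeaveAlone( string $localName, array $preserve, bool $forceWhitespace = false ): bool
{
    // Forcing whitespace overrides the list entirely.
    return !$forceWhitespace && in_array( $localName, $preserve, true );
}

var_dump( shouldLeaveAlone( "p",   $preserve_internal_whitespace ) );        // bool(true)
var_dump( shouldLeaveAlone( "div", $preserve_internal_whitespace ) );        // bool(false)
var_dump( shouldLeaveAlone( "p",   $preserve_internal_whitespace, true ) );  // bool(false)
```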

Tabs 🆚 Spaces

Tabs, obviously. Users can set their tab width to their personal preference and it won't get confused with semantically significant whitespace.

$indent_character = "\t";
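Each indent is simply a newline followed by the indent character repeated once per level of nesting. Here is a small sketch of that idea; `indentFor()` is a hypothetical helper, not part of the script.

```php
<?php
$indent_character = "\t";

// Hypothetical helper: the newline-plus-indent string for a given depth.
function indentFor( int $depth, string $indentCharacter ): string
{
    // max() guards against a negative depth, which str_repeat() would reject.
    return "\n" . str_repeat( $indentCharacter, max( 0, $depth ) );
}

// json_encode() makes the invisible characters visible.
echo json_encode( indentFor( 2, $indent_character ) ) . "\n";
// The same depth with four spaces per level, for those who insist.
echo json_encode( indentFor( 2, "    " ) ) . "\n";
```

Swapping the character for spaces is a one-line change, which is about as much of a concession as the spaces camp is going to get.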

Recursive Function

This function reads through each node in the HTML tree. If the node should be indented, the function inserts a new node with the requisite number of tabs before the existing node. It also adds a suffix node to indent the next line appropriately. It then goes through the node's children and recursively repeats the process.

This modifies the existing Document.
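The core trick, inserting a whitespace text node in front of an existing node, can be tried on a tiny document first. This is a minimal sketch assuming PHP 8.4; the markup and variable names are invented for the example.

```php
<?php
$doc  = Dom\HTMLDocument::createFromString( '<ul><li>One</li><li>Two</li></ul>', LIBXML_NOERROR, "UTF-8" );
$list = $doc->querySelector( "ul" );

// Count the children before touching anything.
$before = $list->childNodes->length;

// Insert a newline and one tab in front of the second <li>.
$second = $list->childNodes->item( 1 );
$list->insertBefore( $doc->createTextNode( "\n\t" ), $second );

// The list of children is live, so the new text node is counted straight away.
$after = $list->childNodes->length;

echo "Children before: {$before}, after: {$after}\n";
echo $list->innerHTML . "\n";
```

Because the child list grows as whitespace is inserted, any loop over the children has to account for the nodes it has just added.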

function prettyPrintHTML( $node, $treeIndex = 0, $forceWhitespace = false )
{
    global $indent_character, $preserve_internal_whitespace;

    //  If this node contains content which shouldn't be separately indented
    //  And if whitespace is not forced
    if ( property_exists( $node, "localName" ) && in_array( $node->localName, $preserve_internal_whitespace ) && !$forceWhitespace ) {
        return;
    }

    //  Does this node have children?
    if ( property_exists( $node, "childElementCount" ) && $node->childElementCount > 0 ) {
        //  Move in a step
        $treeIndex++;
        $tabStart = "\n" . str_repeat( $indent_character, $treeIndex );
        $tabEnd   = "\n" . str_repeat( $indent_character, $treeIndex - 1 );

        //  Remove any existing whitespace at the start and end of this node's contents
        $node->innerHTML = trim( $node->innerHTML );

        //  Loop through the children by index, because the loop adds new nodes as it goes
        $i = 0;

        while ( $childNode = $node->childNodes->item( $i ) ) {
            //  Is this child a text node?
            //  If so, leave it alone and don't put a newline before the sibling which follows it
            if ( $childNode->nodeType == XML_TEXT_NODE && !$forceWhitespace ) {
                //  Step over the text node and the sibling after it
                $i += 2;
                continue;
            }

            //  Prefix with a node which has "\n" and a suitable number of "\t"
            $node->insertBefore( $node->ownerDocument->createTextNode( $tabStart ), $childNode );

            //  Step over the newly inserted text node and this child
            $i += 2;

            //  Recursively indent all children
            prettyPrintHTML( $childNode, $treeIndex, $forceWhitespace );
        }

        //  Suffix with a node which has "\n" and a suitable number of "\t"
        $node->appendChild( $node->ownerDocument->createTextNode( $tabEnd ) );
    }
}

Printing it out

First, call the function. This modifies the existing Document.

prettyPrintHTML( $dom->documentElement );

Then call the normal saveHtml() serialiser:

echo $dom->saveHTML();

Note - this does not print a <!doctype html> - you'll need to include that manually if you're intending to use the entire document.
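One way to add the doctype by hand is to prepend it to the serialised string. A minimal sketch, working on plain strings so it doesn't depend on the DOM at all; `withDoctype()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: prepend a doctype unless the markup already starts with one.
function withDoctype( string $html ): string
{
    // Case-insensitive check, ignoring any leading whitespace.
    if ( stripos( ltrim( $html ), "<!doctype" ) === 0 ) {
        return $html;
    }
    return "<!doctype html>\n" . $html;
}

// In the script above, this would wrap the result of $dom->saveHTML().
echo withDoctype( '<html lang="en-GB"><head><title>Test</title></head><body></body></html>' );
```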

Licence

I consider the above too trivial to licence - but you may treat it as MIT if that makes you happy.

Thoughts? Comments? Next steps?

I've not written any formal tests, nor have I measured its speed. There may be subtle bugs and catastrophic errors. I know it doesn't work well if the HTML is already indented. It also prints double newlines in places, for reasons I haven't yet fathomed.

I'd love to know if you find this useful. Please get involved on GitLab or drop a comment here.



One thought on “Pretty Print HTML using PHP 8.4's new HTML DOM”

  1. says:

    I LOVE PHP 8.4's new DOM processing stuff. I recently had to write some DOM processing code in PHP and was really hoping to make use of it, but it turns out my code had to be compatible with PHP 8.2 (for complicated reasons) and I couldn't make the most of the new functionality. Boo!



    I pretty-print all the HTML output by DanQ.me, but my approach is different (and works with older versions of PHP, as a fringe bonus). I put an ob_start in my header, with a callback function that (among other things) prettifies everything that's been written. It does this using an instance of tidy for PHP, with 'indent' => true and 'output-html' => true parameters to its `parseString()` before calling `cleanRepair()`. I've been doing this for a few years and it seems to work pretty well. It seems my approach is probably more-performant than doing it all in PHP, but so long as you're caching it probably doesn't make a significant difference.
