Pretty Print HTML using PHP 8.4's new HTML DOM
Those whom the gods would send mad, they first teach recursion.
PHP 8.4 introduces a new Dom\HTMLDocument class. It is a modern HTML5 replacement for the ageing XHTML-based DOMDocument. You can read more about how it works - the short version is that it reads HTML, tidies up malformed markup, and turns it into a nested object. Hurrah!
The one thing it doesn't do is pretty-printing. When you call $dom->saveHTML() it will output something like:
HTML
<html lang="en-GB"><head><title>Test</title></head><body><h1>Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png"></p><ol><li>List</li><li>Another list</li></ol></main></body></html>
Perfect for a computer to read, but slightly tricky for humans.
As was written by the sages:
A computer language is not just a way of getting a computer to perform operations but rather … it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
HTML is a programming language. Making markup easy to read for humans is a fine and noble goal. The aim is to turn the single line above into something like:
HTML
<html lang="en-GB">
	<head>
		<title>Test</title>
	</head>
	<body>
		<h1>Testing</h1>
		<main>
			<p>Some <em>HTML</em> and an <img src="example.png"></p>
			<ol>
				<li>List</li>
				<li>Another list</li>
			</ol>
		</main>
	</body>
</html>
Cor! That's much better!
I've cobbled together a script which is broadly accurate. There are a million-and-one edge cases and about twice as many personal preferences. This aims to be quick, simple, and basically fine. I am indebted to this random Chinese script and to html-pretty-min.
Step By Step
I'm going to walk through how everything works. This is as much for my benefit as for yours! This is beta code. It sorta-kinda-works for me. Think of it as a first pass at an attempt to prove that something can be done. Please don't use it in production!
Setting up the DOM
The new HTMLDocument should be broadly familiar to anyone who has used the previous one.
PHP
$html = '<html lang="en-GB"><head><title>Test</title></head><body><h1>Testing</h1><main><p>Some <em>HTML</em> and an <img src="example.png"></p><ol><li>List<li>Another list</body></html>';
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );
This automatically adds <head> and <body> elements. If you don't want that, use the LIBXML_HTML_NOIMPLIED flag:
PHP
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
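To get a feel for what the parser does with sloppy markup, here's a minimal sketch. It needs PHP 8.4 or later, and the shortened input and variable names are my own for illustration. The unclosed <li> elements should end up as two proper list items, and a <head> appears even though the input never mentions one.

```php
<?php
// Sketch only: requires PHP 8.4+ for Dom\HTMLDocument
$html = '<body><h1>Testing</h1><ol><li>List<li>Another list</body>';
$dom  = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// The parser closes each <li> for us, so there are two list items
$listItems = $dom->querySelectorAll( "li" );
echo $listItems->length, "\n";

// <head> was never in the input, but the parser has added it
$head = $dom->querySelector( "head" );
var_dump( $head !== null );
```

Because the tree is already well-formed by this point, the pretty-printer never has to worry about unbalanced tags.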
Where not to indent
There are certain elements whose contents shouldn't be pretty-printed because it might change the meaning or layout of the text. For example, indenting everything inside a paragraph would produce:
HTML
<p>
	Some
	<em>
		HT
		<strong>M</strong>
		L
	</em>
</p>
I've picked these elements from text-level semantics and a few others which I consider sensible. Feel free to edit this list if you want.
PHP
$preserve_internal_whitespace = [
	"a",
	"em", "strong", "small",
	"s", "cite", "q",
	"dfn", "abbr",
	"ruby", "rt", "rp",
	"data", "time",
	"pre", "code", "var", "samp", "kbd",
	"sub", "sup",
	"b", "i", "mark", "u",
	"bdi", "bdo",
	"span",
	"h1", "h2", "h3", "h4", "h5", "h6",
	"p",
	"li",
	"button", "form", "input", "label", "select", "textarea",
];
The function has an option to force indenting every time it encounters an element.
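Roughly speaking, the list and the force option combine into a single yes-or-no decision per element. Here's a small sketch of that decision on its own. The helper name mayIndentContents() is hypothetical and isn't part of the script, and the list is cut down to keep things short.

```php
<?php
// Shortened copy of the list, for illustration only
$preserve_internal_whitespace = [ "em", "strong", "p", "li", "pre" ];

// Hypothetical helper: true when an element's contents may be re-indented
function mayIndentContents( string $localName, array $preserve, bool $forceWhitespace = false ): bool
{
	return $forceWhitespace || !in_array( $localName, $preserve, true );
}

var_dump( mayIndentContents( "main", $preserve_internal_whitespace ) );     // bool(true)
var_dump( mayIndentContents( "p", $preserve_internal_whitespace ) );        // bool(false)
var_dump( mayIndentContents( "p", $preserve_internal_whitespace, true ) );  // bool(true)
```

Forcing whitespace overrides the list entirely, which is handy for debugging but will change how inline text renders.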
Tabs 🆚 Spaces
Tabs, obviously. Users can set their tab width to their personal preference and it won't get confused with semantically significant whitespace.
PHP
$indent_character = "\t";
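If you really must use spaces, the only thing which changes is the string that gets repeated. As a quick sketch, here's how the prefix for a given depth can be built. The indentPrefix() helper is hypothetical, purely to show the idea.

```php
<?php
$indent_character = "\t";

// Hypothetical helper: a newline plus one indent character per level of depth
function indentPrefix( string $indent_character, int $depth ): string
{
	return "\n" . str_repeat( $indent_character, $depth );
}

// json_encode() makes the invisible characters visible
echo json_encode( indentPrefix( $indent_character, 2 ) ), "\n";  // "\n\t\t"
echo json_encode( indentPrefix( "  ", 2 ) ), "\n";               // "\n    "
```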
Recursive Function
This function reads through each node in the HTML tree. If the node should be indented, the function inserts a new node with the requisite number of tabs before the existing node. It also adds a suffix node to indent the next line appropriately. It then goes through the node's children and recursively repeats the process.
This modifies the existing Document.
PHP
function prettyPrintHTML( $node, $treeIndex = 0, $forceWhitespace = false )
{
	global $indent_character, $preserve_internal_whitespace;

	// If this node contains content which shouldn't be separately indented
	// and whitespace is not forced, leave it (and its children) untouched
	if ( property_exists( $node, "localName" ) && in_array( $node->localName, $preserve_internal_whitespace ) && !$forceWhitespace ) {
		return;
	}

	// Does this node have child elements?
	if ( property_exists( $node, "childElementCount" ) && $node->childElementCount > 0 ) {
		// Move in a step
		$treeIndex++;
		$tabStart = "\n" . str_repeat( $indent_character, $treeIndex );
		$tabEnd   = "\n" . str_repeat( $indent_character, $treeIndex - 1 );

		// Remove any existing whitespace at the start and end of the contents
		$node->innerHTML = trim( $node->innerHTML );

		// Loop through the children. childNodes is a live list,
		// so $i has to account for the text nodes inserted below.
		$i = 0;
		while ( $childNode = $node->childNodes->item( $i++ ) ) {
			// Is this child a text-only node?
			// If so, don't add a newline: skip it and the node which follows it
			if ( $childNode->nodeType == XML_TEXT_NODE && !$forceWhitespace ) {
				$i++;
				continue;
			}
			// Prefix the child with "\n" and a suitable number of "\t"
			$node->insertBefore( $node->ownerDocument->createTextNode( $tabStart ), $childNode );
			// The insert pushed the child along by one, so step over it
			$i++;
			// Recursively indent all children
			prettyPrintHTML( $childNode, $treeIndex, $forceWhitespace );
		}

		// Suffix with a node which has "\n" and a suitable number of "\t"
		$node->appendChild( $node->ownerDocument->createTextNode( $tabEnd ) );
	}
}
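The core trick in the function above is nothing more than adding whitespace text nodes around existing elements. Here's that trick in isolation, as a sketch with the depth hard-coded to one level. It needs PHP 8.4 or later, and the tiny <ul> input is just my example.

```php
<?php
// Sketch only: requires PHP 8.4+ for Dom\HTMLDocument
$dom = Dom\HTMLDocument::createFromString( '<ul><li>One</li></ul>', LIBXML_NOERROR, "UTF-8" );
$ul  = $dom->querySelector( "ul" );

// Prefix the child with a newline and one tab...
$ul->insertBefore( $dom->createTextNode( "\n\t" ), $ul->firstElementChild );
// ...and suffix the parent's contents with a newline so </ul> sits on its own line
$ul->appendChild( $dom->createTextNode( "\n" ) );

// Serialise just the <ul> rather than the whole document
$result = $dom->saveHTML( $ul );
echo $result;
```

The recursive function does the same thing at every level, with the number of tabs taken from the depth in the tree.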
Printing it out
First, call the function. This modifies the existing Document.
PHP
prettyPrintHTML( $dom->documentElement );
Then call the normal saveHTML() serialiser:
PHP
echo $dom->saveHTML();
Note - this does not print a <!doctype html> - you'll need to include that manually if you're intending to use the entire document.
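One way to add the doctype by hand is to check whether the parsed document already has one and only prepend it if not. This is a sketch under the assumption that you want the standard HTML5 doctype. It needs PHP 8.4 or later, and the one-paragraph input is mine.

```php
<?php
// Sketch only: requires PHP 8.4+ for Dom\HTMLDocument
$dom = Dom\HTMLDocument::createFromString( '<p>Hello</p>', LIBXML_NOERROR, "UTF-8" );

$output = $dom->saveHTML();

// Only prepend a doctype if the parsed document doesn't already carry one
if ( $dom->doctype === null ) {
	$output = "<!doctype html>\n" . $output;
}

echo $output;
```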
Licence
I consider the above too trivial to licence - but you may treat it as MIT if that makes you happy.
Thoughts? Comments? Next steps?
I've not written any formal tests, nor have I measured its speed. There may be subtle bugs and catastrophic errors. I know it doesn't work well if the HTML is already indented. It mysteriously prints double newlines for some unfathomable reason.
I'd love to know if you find this useful. Please get involved on GitLab or drop a comment here.
I pretty-print all the HTML output by DanQ.me, but my approach is different (and works with older versions of PHP, as a fringe bonus). I put an ob_start in my header, with a callback function that (among other things) prettifies everything that's been written. It does this using an instance of tidy for PHP, with 'indent' => true and 'output-html' => true parameters to its `parseString()` before calling `cleanRepair()`. I've been doing this for a few years and it seems to work pretty well. It seems my approach is probably more performant than doing it all in PHP, but so long as you're caching it probably doesn't make a significant difference.
More comments on Mastodon.