Create a Table of Contents based on HTML Heading Elements


Some of my blog posts are long[1]. They have lots of HTML headings like <h2> and <h3>. Say, wouldn't it be super-awesome to have something magically generate a Table of Contents? I've built a utility which runs server-side using PHP. Give it some HTML and it will construct a Table of Contents.

Let's dive in!

Background

HTML has six levels of headings[2]: <h1> is the main heading for content, <h2> is a sub-heading, <h3> is a sub-sub-heading, and so on.

Together, they form a hierarchy.

Heading Example

HTML headings are expected to be used a bit like this (I've nested this example so you can see the hierarchy):

<h1>The Theory of Everything</h1>
   <h2>Experiments</h2>
      <h3>First attempt</h3>
      <h3>Second attempt</h3>
   <h2>Equipment</h2>
      <h3>Broken equipment</h3>
         <h4>Repaired equipment</h4>
      <h3>Working Equipment</h3>

What is the purpose of a table of contents?

Wayfinding. On a long document, it is useful to be able to see an overview of the contents and then immediately navigate to the desired location.

The ToC has to provide a hierarchical view of all the headings and then link to them.

Code

I'm running this as part of a WordPress plugin. You may need to adapt it for your own use.

Load the HTML

This uses PHP's DOMDocument. I've manually prepended a <head> containing a UTF-8 <meta charset> so that Unicode is preserved. If your HTML already has that, you can remove the addition from the code.

//  Load it into a DOM for manipulation
$dom = new DOMDocument();
//  Suppress warnings about HTML errors
libxml_use_internal_errors( true );
//  Force UTF-8 support
$dom->loadHTML( "<!DOCTYPE html><html><head><meta charset=UTF-8></head><body>" . $content, LIBXML_NOERROR | LIBXML_NOWARNING );
libxml_clear_errors();

Using PHP 8.4

PHP 8.4 added a better, HTML5-aware DOM. It can be used like this:

$dom = Dom\HTMLDocument::createFromString( $content, LIBXML_NOERROR, "UTF-8" );

Parse the HTML

It is not a good idea to use Regular Expressions to parse HTML - no matter how well-formed you think it is. Instead, use XPath to extract data from the DOM.

//  Parse with XPath
$xpath = new DOMXPath( $dom );

//  Look for all h* elements
$headings = $xpath->query( "//h1 | //h2 | //h3 | //h4 | //h5 | //h6" );

This produces a DOMNodeList (not a plain array, but you can loop over it with foreach) containing all the heading elements in the order they appear in the document.
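As a quick sanity check, here is a minimal, self-contained sketch of the load-and-query steps run against a made-up snippet. The sample $content string and the $summary variable are illustrative assumptions, not part of the plugin.

```php
<?php
//  Made-up sample content, standing in for a real blog post
$content = '<h2 id="experiments">Experiments</h2><p>Text</p><h3 id="first-attempt">First attempt</h3>';

//  Load it into a DOM, suppressing warnings about HTML errors
$dom = new DOMDocument();
libxml_use_internal_errors( true );
$dom->loadHTML( "<!DOCTYPE html><html><head><meta charset=UTF-8></head><body>" . $content, LIBXML_NOERROR | LIBXML_NOWARNING );
libxml_clear_errors();

//  Query for every heading element
$xpath    = new DOMXPath( $dom );
$headings = $xpath->query( "//h1 | //h2 | //h3 | //h4 | //h5 | //h6" );

//  Walk the DOMNodeList and note each element name and id
$summary = [];
foreach ( $headings as $heading ) {
    $summary[] = $heading->nodeName . "#" . $heading->getAttribute( "id" );
}

print_r( $summary );
```

If you really do need a plain array (for array_map() and friends), iterator_to_array( $headings ) should convert the list.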

PHP 8.4 querySelectorAll

Rather than using XPath, the new Dom\HTMLDocument in PHP 8.4 can use querySelectorAll:

$headings = $dom->querySelectorAll( "h1, h2, h3, h4, h5, h6" );

Recursive looping

This is a bit knotty. It produces a nested array of the elements, their id attributes, and text. The end result should be something like:

array (
  array (
    'text' => '<h2>Table of Contents</h2>',
    'raw' => true,
  ),
  array (
    'text' => 'The Theory of Everything',
    'id' => 'the-theory-of-everything',
    'children' =>
    array (
      array (
        'text' => 'Experiments',
        'id' => 'experiments',
        'children' =>
        array (
          array (
            'text' => 'First attempt',
            'id' => 'first-attempt',
          ),
          array (
            'text' => 'Second attempt',
            'id' => 'second-attempt',
          ),
          …

The code is moderately complex, but I've commented it as best as I can.

//  Start an array to hold all the headings in a hierarchy
$root = [];
//  Add an h2 with the title
$root[] = [
    "text"     => "<h2>Table of Contents</h2>",
    "raw"      => true,
    "children" => []
];

// Stack to track current hierarchy level
$stack = [&$root];

//  Loop through the headings
foreach ($headings as $heading) {

    //  Get the information
    //  Expecting <h2 id="something">Text</h2>
    $element = $heading->nodeName;  //  e.g. h2, h3, h4, etc
    $text    = trim( $heading->textContent );  
    $id      = $heading->getAttribute( "id" );

    //  h2 becomes 2, h3 becomes 3 etc
    $level = (int) substr($element, 1);

    //  Get data from element
    $node = array(
        "text"     => $text,
        "id"       => $id ,
        "children" => []
    );

    //  Step back up the stack until it is no deeper than this heading's level
    while ( count( $stack ) > $level ) {
        array_pop( $stack );
    }

    //  If a gap exists (e.g., h4 without an immediately preceding h3), create placeholders
    while ( count( $stack ) < $level ) {
        //  What's the last element in the stack?
        $stackSize = count( $stack );
        $lastIndex = count( $stack[ $stackSize - 1] ) - 1;
        if ($lastIndex < 0) {
            //  If there is no previous sibling, create a placeholder parent
            $stack[$stackSize - 1][] = [
                "text"     => "",   //  This could have some placeholder text to warn the user?
                "children" => []
            ];
            $stack[] = &$stack[count($stack) - 1][0]['children'];
        } else {
            $stack[] = &$stack[count($stack) - 1][$lastIndex]['children'];
        }
    }

    //  Add the node to the current level
    $stack[count($stack) - 1][] = $node;
    $stack[] = &$stack[count($stack) - 1][count($stack[count($stack) - 1]) - 1]['children'];
}

Missing content

The trickiest part of the above is dealing with missing elements in the hierarchy. If you're sure you don't ever skip from an <h3> to an <h6>, you can get rid of some of the code dealing with that edge case.
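To see the gap handling in action, here's a cut-down sketch of the same stack technique. It is not the plugin code: buildHeadingTree() is a hypothetical helper which takes plain [level, text, id] triples rather than DOM nodes, and it leaves out the "Table of Contents" title entry. The sample headings jump straight from an <h2> to an <h4>.

```php
<?php
//  Hypothetical helper: build the nested array from [level, text, id] triples
function buildHeadingTree( array $headings ): array
{
    $root  = [];
    $stack = [ &$root ];

    foreach ( $headings as [ $level, $text, $id ] ) {
        //  Step back up until the stack is no deeper than this heading's level
        while ( count( $stack ) > $level ) {
            array_pop( $stack );
        }

        //  Fill in any skipped levels
        while ( count( $stack ) < $level ) {
            $top       = count( $stack ) - 1;
            $lastIndex = count( $stack[ $top ] ) - 1;
            if ( $lastIndex < 0 ) {
                //  Nothing to hang this heading from, so add an empty placeholder
                $stack[ $top ][] = [ "text" => "", "children" => [] ];
                $lastIndex = 0;
            }
            $stack[] = &$stack[ $top ][ $lastIndex ]["children"];
        }

        //  Add the heading at the current level, then descend into its children
        $top = count( $stack ) - 1;
        $stack[ $top ][] = [ "text" => $text, "id" => $id, "children" => [] ];
        $stack[] = &$stack[ $top ][ count( $stack[ $top ] ) - 1 ]["children"];
    }

    return $root;
}

//  An <h4> directly under an <h2>, with no <h3> in between
$tree = buildHeadingTree( [
    [ 1, "The Theory of Everything", "the-theory-of-everything" ],
    [ 2, "Experiments",              "experiments" ],
    [ 4, "Repaired equipment",       "repaired-equipment" ],
    [ 2, "Equipment",                "equipment" ],
] );

print_r( $tree );
```

The <h4> ends up inside an empty placeholder entry under "Experiments", and the following <h2> pops back up to sit alongside "Experiments" as normal.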

Converting to HTML

OK, there's a hierarchical array, how does it become HTML?

Again, a little bit of recursion:

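function arrayToHTMLList( $array, $style = "ul" )
{
    $html = "";

    //  Loop through the array
    foreach( $array as $element ) {
        //  Get the data of this element
        //  Raw entries and placeholders have no id, so fall back to defaults
        $text     = $element["text"];
        $id       = $element["id"] ?? "";
        $children = $element["children"] ?? [];
        $raw      = $element["raw"] ?? false;

        //  Raw text is trusted HTML. Everything else came from textContent and needs escaping
        $label = $raw ? $text : htmlspecialchars( $text, ENT_QUOTES, "UTF-8" );

        if ( $raw || $id === "" ) {
            //  Add it to the HTML without adding an internal link
            $html .= "<li>{$label}";
        } else {
            //  Add it to the HTML as a link to the heading
            $id    = htmlspecialchars( $id, ENT_QUOTES, "UTF-8" );
            $html .= "<li><a href=\"#{$id}\">{$label}</a>";
        }

        //  If the element has children
        if ( count( $children ) > 0 ) {
            //  Recursively add it to the HTML
            $html .= "<{$style}>" . arrayToHTMLList( $children, $style ) . "</{$style}>";
        }

        //  Close the list item
        $html .= "</li>";
    }

    return $html;
}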
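One thing to watch: textContent hands back decoded text, so a heading containing an ampersand or an angle bracket arrives as literal characters. Here is a minimal sketch of escaping that text before it goes back into markup. The sample heading text and the id in the link are made up.

```php
<?php
//  Made-up heading text, as textContent would return it (entities already decoded)
$text = "Salt & Vinegar <crisps>";

//  Escape it before placing it back into HTML
$safe = htmlspecialchars( $text, ENT_QUOTES, "UTF-8" );

echo "<li><a href=\"#salt-vinegar-crisps\">{$safe}</a></li>";
```

The raw "Table of Contents" entry is the exception. It is deliberately HTML, so it has to be left unescaped.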

Semantic Correctness

Finally, what should a table of contents look like in HTML? There is no <toc> element, so what is most appropriate?

ePub Example

Modern eBooks use the ePub standard which is based on HTML. Here's how an ePub creates a ToC.

<nav role="doc-toc" epub:type="toc" id="toc">
<h2>Table of Contents</h2>
<ol>
  <li>
    <a href="s01.xhtml">A simple link</a>
  </li>
  …
</ol>
</nav>

The modern(ish) <nav> element!

"The nav element represents a section of a page that links to other pages or to parts within the page: a section with navigation links." (HTML Specification)

But there's a slight wrinkle. The ePub example above uses <ol>, an ordered list. The HTML example in the spec uses <ul>, an unordered list.

Which is right? Well, that depends on whether you think the contents on your page should be referred to in order or not. There is, however, a secret third way.

Split the difference with a menu

I decided to use the <menu> element for my navigation. It is semantically the same as <ul> but just feels a bit closer to what I expect from navigation. Feel free to argue with me in the comments.

Where should the heading go?

I've put the title of the list into the list itself. That's valid HTML and, if my understanding is correct, should announce itself as the title of the navigation element to screen-readers and the like.
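Putting those choices together, the final wrapper might look something like this. It's only a sketch: wrapTableOfContents() is a hypothetical name, and $listItems stands in for whatever string of <li> elements arrayToHTMLList() returned.

```php
<?php
//  Hypothetical wrapper: put the list items inside a <menu> inside a <nav>
function wrapTableOfContents( string $listItems, string $style = "menu" ): string
{
    return "<nav><{$style}>{$listItems}</{$style}></nav>";
}

//  The title lives inside the list, as its first item
echo wrapTableOfContents( "<li><h2>Table of Contents</h2></li>" );
```

Passing "ol" or "ul" as the second argument swaps the list type if you'd rather not use <menu>.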

Conclusion

I've used slightly more headings in this post than I usually would, but hopefully the Table of Contents at the top demonstrates how this works.

If you want to reuse this code, I consider it too trivial to licence. But, if it makes you happy, you can treat it as MIT.

Thoughts? Comments? Feedback? Drop a note in the box.


  1. Too long really, but who can be bothered to edit?

  2. Although Paul McCartney disagrees.

