How bad is link-rot on my blog?


I read this brilliant blog post by Wouter Groeneveld looking at how many dead links there were on his blog. I thought I'd try something similar.

Every day, I look at the On This Day page of my blog and look at that day's historic posts. I click on every link to see if it is still working. If it isn't, I have a few options.

  1. If the site is working, but the content has moved, I change the link to point at the content.
  2. If the site is dead, or the content isn't there, I stick the URl into the WayBack Machine and pick the archive closest to the publication date.
  3. If the archive doesn't have it, I try to find a link from around the same date and point it to that.
  4. If not, I remove the link.

Why don't I just cURL the URl and see if it responds with a 2xx HTTP code?

I used to use an automated checker to test my links. But spam sites lie. Lots of the dead sites have been taken over by scammers, spammers, and AI grifters. They return an HTTP OK and then serve up advertising. Yeuch! So now I spend each morning Tending To My Digital Garden and manually checking the links.

So, let's scan through every link within my blog posts and see which ones have been archived.

Results

Here's a stacked chart showing blog posts from 2008 until today:

Stacked Bar Chart.

And, if you prefer percentages, here's another chart:

Line chart showing percentages.

Rather unsurprisingly, the older posts have the most link-rot.

Here's the data if you want to play with it yourself:

CSV CSVYear,Total,Archives
2008,159,27
2009,1335,156
2010,1166,91
2011,1130,85
2012,1392,98
2013,1889,84
2014,1858,35
2015,3016,43
2016,2274,37
2017,1309,17
2018,1561,25
2019,1580,18
2020,3362,33
2021,3621,40
2022,2904,22
2023,2358,15
2024,2136,19

Deficiencies

There will be some links that I haven't spotted. There will be dead links in review metadata and in comments. My code might not have accounted for some weird edge cases in the HTML.

But, generally, it should be a roughly accurate assessment.

Generating the data

Using the WP CLI, this can be run with wp eval-file archive.php:

PHP PHP<?php
//  Load the WordPress environment
require_once( "wp-load.php" );

//  Loop through the desired years
for ( $year = 2008; $year <= 2024; $year++ ) {
        //  Initialise the counters
        $total_links   = 0;
        $archive_links = 0;

        //  Get all the published posts for that year
        $posts = get_posts( [
            "date_query" => [
                [ "year" => $year, ],
            ],
            "post_type"   => "post",
            "post_status" => "publish",
            "numberposts" => -1,
        ] );

        //  Loop through each post
        foreach ( $posts as $post ) {
            $ID = $post->ID;

            //  Render the content of the post
            $content = apply_filters( "the_content", get_the_content( null, false, $ID ) );

            //  Count total occurrences of "<a href="
            preg_match_all( '/<a href=/i', $content, $matches );
            $total_links += count( $matches[0] );

            //  Count total occurrences of "<a href=" which contain "archive.org"
            preg_match_all( '/<a href=["\'][^"\']*archive.org/i', $content, $archive_matches );
            $archive_links += count( $archive_matches[0] );
        }

    //  Display the results as CSV
    echo "{$year},";
    echo "{$total_links},";
    echo "{$archive_links}\n";
}

Share this post on…

  • Mastodon
  • Facebook
  • LinkedIn
  • BlueSky
  • Threads
  • Reddit
  • HackerNews
  • Lobsters
  • WhatsApp
  • Telegram

6 thoughts on “How bad is link-rot on my blog?”

  1. said on mastodon.social:

    @prijks for some things, I can see that a broken link might have semantic value. E.g. example.com/recipies/soufle.html

    But most of the time a broken link is just a frustrating user experience. I'll sometimes put a <del> through it, or leave an <ins> with a note.

    Of course, I'm not your boss. Do what's right for your site 😄

    Reply | Reply to original comment on mastodon.social
  2. says:

    I care about link rot on my blog as well, but rather than fixing every post I just fix the ones that still get enough views. "Enough" depends on how I feel on any given day, I admit. If an old post suddenly gets a surge of views, I'll review the links on it and fix or remove the broken ones. I never thought about linking to the Wayback Machine; thanks for that tip.

    Reply
  3. says:

    I would love to be able to, when adding links, to be able to add specific keywords to it. Then have a spider (like Xenu's Link Sleuth or the old ODP's Robozilla) occasionally scan all links on my site and if a linked page doesn't have those keywords on it, it will then check archive.org around that date for a matching page and auto-link it or alert me that the link is stale.

    Reply

What are your reckons?

All comments are moderated and may not be published immediately. Your email address will not be published.

Allowed HTML: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <p> <pre> <br> <img src="" alt="" title="" srcset="">