Strategies for linking to obsolete websites
I've been blogging for a long time. Over the years, I've linked to tens of thousands of websites. Inevitably, some of those sites have gone. Even when sites still exist, webmasters seem to have forgotten that Cool URLs Don't Change.
I use the WordPress Broken Link Checker plugin. It periodically monitors the links on my site and lets me know which ones are dead. It also offers to link to Wayback Machine snapshots of the page.

It doesn't always work, of course. Sometimes the page will have been taken over by spammers, and the snapshot reflects that.
This isn't some SEO gambit. I believe that the web works best when users can seamlessly surf between sites. Forcing them to search for information is user-hostile.
What I'm trying to achieve
When a visitor clicks on a link, they should get (in order of preference):
- The original page
- An archive.org view of the page:
  - Ideally the most recent snapshot
  - If the most recent snapshot doesn't contain the correct content, a snapshot of the page from around the time the link was made (see the sketch after this list)
- A snapshot of the site's homepage around the time the link was made
- A replacement page. For example, Topsy used to show who had Tweeted about your page. Apple killed Topsy - so now I point to Twitter's search results for the URL.
- If there is no archive, and no replacement, but the link contains useful semantic information - leave it broken.
- Otherwise, remove the link.
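Here's a rough sketch of how that preference order could be checked automatically, using the Wayback Machine's public availability API (https://archive.org/wayback/available). The best_link helper and the linked_on timestamp are purely illustrative - they aren't part of the Broken Link Checker plugin - and a successful request is only a weak signal that the original content is still there.

```python
import requests

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"

def best_link(url, linked_on="20150101", timeout=10):
    """Return the best target for `url`, roughly following the
    preference order above. `linked_on` is a YYYYMMDD timestamp for
    when the link was originally made."""
    # 1. Prefer the original page if it still responds successfully.
    try:
        live = requests.head(url, allow_redirects=True, timeout=timeout)
        if live.ok:
            return url
    except requests.RequestException:
        pass

    # 2. Ask the Wayback Machine for the snapshot closest to `linked_on`.
    #    (Leave out `timestamp` to get the most recent snapshot instead.)
    resp = requests.get(WAYBACK_AVAILABLE,
                        params={"url": url, "timestamp": linked_on},
                        timeout=timeout)
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    if closest.get("available"):
        return closest["url"]

    # 3. No live page and no snapshot: leave it to a human to keep the
    #    broken link, point to a replacement, or remove it entirely.
    return None
```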
Some links are from people leaving comments and setting a homepage URL. Is it useful for future web historians to know that Blogger Profile 1234 commented on my blog and your blog?
Some links are only temporarily dead (for tax reasons?) - so I tend to leave them broken.
The Internet Archive say that "If you see something, save something". So, going forward, I'll submit every link out from my blog to the Archive. I'm hoping to find a plugin to automate that - any ideas?
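Until a plugin turns up, something like this would do the job - it pushes each outbound URL through the Wayback Machine's "Save Page Now" endpoint. It assumes the simple unauthenticated GET form (https://web.archive.org/save/<url>) is still accepted; the outbound_links list is just a stand-in for however the links get pulled out of a post.

```python
import time

import requests

SAVE_ENDPOINT = "https://web.archive.org/save/"

def submit_to_wayback(urls, pause=5):
    """Ask the Wayback Machine to archive each outbound URL.
    `pause` spaces the requests out so the Archive isn't hammered."""
    for url in urls:
        try:
            resp = requests.get(SAVE_ENDPOINT + url, timeout=30)
            print(f"{url} -> {resp.status_code}")
        except requests.RequestException as exc:
            print(f"{url} -> failed ({exc})")
        time.sleep(pause)

# Stand-in: in practice these would be scraped from each new post.
outbound_links = ["https://example.com/some-page"]
submit_to_wayback(outbound_links)
```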
Reply to original comment on alisonw.uk
I do like the idea of grabbing the content of what you originally referenced, though, especially if context is lost or irrelevant through a missing page or revamped domain.
You might struggle to match up old IA content where crawling was sparse or a page doesn't exist, especially if a spammer has control of the domain - and of course not all new domain owners are spammers either. How are you going to decide whether a page returning a 200 (if it's a link to root) is the original domain content or new content where no IA snapshot exists?
Would you have a list of bad words? Pr0n, etc. Would you check for a Google cache? Would you parse the page title and see if the content ranks for the string?
All of those would potentially help indicate whether the page is at least worthy in some regard, and based upon the response you could decide to retain the link, push visitors to your custom gone page, or point to another page created from the IA scrape.
Sounds fun either way.
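One of those heuristics is easy to prototype: compare what the page says now with what the Archive saw around the time the link was made. The sketch below only compares page titles - looks_like_original, linked_on, and the 0.6 threshold are made up for illustration, and a low score is a prompt for a human to look, not proof of a takeover.

```python
import difflib
import re

import requests

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def page_title(url, timeout=10):
    """Fetch a page and pull out its <title>, or None on failure."""
    try:
        html = requests.get(url, timeout=timeout).text
    except requests.RequestException:
        return None
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

def looks_like_original(url, linked_on="20150101", threshold=0.6):
    """Compare the live page title with the title of the snapshot
    closest to when the link was made. Low similarity suggests the
    domain may have changed hands."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url, "timestamp": linked_on},
                        timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    if not closest.get("available"):
        return None  # nothing to compare against - needs a human decision

    live_title = page_title(url)
    archived_title = page_title(closest["url"])
    if not live_title or not archived_title:
        return None

    ratio = difflib.SequenceMatcher(None, live_title.lower(),
                                    archived_title.lower()).ratio()
    return ratio >= threshold
```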
Reply to original comment on www.jvt.me
Reply to original comment on twitter.com
So yeah, some "link hygiene" on occasion is probably a good idea.
Reply to original comment on beko.famkos.net
It's not just little fly-by-night companies that go away, either - I used to have links to quite a few clips on Google Video - remember that? I had to go away and try and find them on YouTube and re-link...
Lots of nice ideas there, though, thanks - gonna try that plugin...