WebMentions, Privacy, and DDoS - Oh My!
Mastodon - the distributed social network - has two interesting challenges when it comes to how users share links. I'd like to discuss those issues and suggest a possible way forward.
When you click on a link on my website which takes you to another website, your browser sends a Referer header. This says to the other site "Hey, I came here using a link on shkspr.mobi". This is useful because it lets a site owner know who is linking to them. I love seeing which weird and wonderful sites have linked to my content.
It is also something of a privacy nightmare as it lets sites see who is clicking and from where they're clicking. So Mastodon sets a rel="noreferrer" attribute on all links. This tells the browser not to send the Referer.
This means sites no longer know who is sending them traffic.
That's either a good thing from a privacy perspective or a disaster from a marketing perspective. Or a little bit of both.
Here's a related issue. When a user posts a link to your website on Mastodon, the server checks your page to see if there are any oEmbed tags for a rich link preview. But, at the moment, it doesn't check your website's robots.txt file - which lets it know whether it is allowed to scrape your content.
In the case of something like Twitter or Facebook, this is fine. If a million users post a link, the centralised social network checks the link once and caches the result.
With - potentially - thousands of distributed Mastodon sites, this presents a problem. If a popular account posts a link, their instance fetches a rich preview. Then every instance which has users following them also requests that URL. Essentially, this is a DDoS attack.
I can fix you
So here are my thoughts on how to fix this.
When a user posts a link to Mastodon, their instance should send a WebMention to the site hosting the link. This informs the website that someone has shared their content. Perhaps a user could adjust their privacy settings to allow or deny this.
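To make that concrete, here's a minimal sketch of what sending a WebMention involves: fetch the target, discover its advertised WebMention endpoint, then POST the source and target URLs to it. This is illustrative Python using the requests library rather than anything Mastodon actually ships; a fuller implementation would also look for a <link rel="webmention"> element in the HTML body, not just the HTTP Link header.

```python
# Illustrative sketch only - not Mastodon's code. Uses the third-party
# `requests` library; the function name is made up for this example.
from urllib.parse import urljoin

import requests

def send_webmention(source: str, target: str) -> bool:
    """Tell `target` that it has been linked to from `source` (the post's URL)."""
    # Step 1: discover the target's WebMention endpoint from its Link headers.
    # (A real implementation would also parse rel="webmention" out of the HTML.)
    response = requests.get(target, timeout=10)
    link = response.links.get("webmention")
    if link is None:
        return False  # The site doesn't advertise a WebMention endpoint.
    endpoint = urljoin(target, link["url"])  # The endpoint may be a relative URL.

    # Step 2: POST the mention as form-encoded source/target parameters.
    result = requests.post(
        endpoint,
        data={"source": source, "target": target},
        timeout=10,
    )
    return result.status_code in (200, 201, 202)
```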
The instance would check the site's robots.txt and, if allowed, scrape the site to see if there were any Open Graph Protocol metadata elements on it.
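In code, that check-then-scrape step might look something like the sketch below. Again, this is a hedged illustration rather than Mastodon's actual fetcher: the user agent string and fetch_preview function are invented for this example, and it uses Python's standard robotparser plus the third-party requests and BeautifulSoup libraries.

```python
# Illustrative sketch only. The user agent and function name are made up.
from urllib import robotparser
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # third-party HTML parser

USER_AGENT = "ExampleInstancePreviewBot/1.0"

def fetch_preview(url: str) -> dict | None:
    # Step 1: does robots.txt allow us to fetch this page?
    robots = robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # Respect the site's wishes - no preview card at all.

    # Step 2: fetch the page and collect any Open Graph (og:*) metadata.
    html = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    ogp = {}
    for tag in soup.find_all("meta"):
        prop = tag.get("property", "")
        if prop.startswith("og:"):
            ogp[prop] = tag.get("content", "")
    return ogp
```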
That metadata should be included in the post as it is shared across the network.
For example, a status could look like this:
```json
{
  "id": "123",
  "created_at": "2022-03-16T14:44:31.580Z",
  "in_reply_to_id": null,
  "in_reply_to_account_id": null,
  "visibility": "public",
  "language": "en",
  "uri": "https://mastodon.social/users/Edent/statuses/123",
  "content": "<p>Check out https://example.com/</p>",
  "ogp_allowed": true,
  "ogp": {
    "og:title": "My amazing site",
    "og:image:url": "https://cdn.mastodon.social/cache/example.com/preview.jpg",
    "og:description": "A long description. Perhaps the first paragraph of the text."
    ...
  }
  ...
}
```
When a post is boosted across the network, the instances can see that there is rich metadata associated with the link. If there is an image associated with the post, that will be loaded from the cache on the original Mastodon instance - avoiding overloading the website.
Now, there is a flaw in this idea. A malicious Mastodon server could serve up a fake OGP image and description. So a link to McDonald's might display a fake image promoting Burger King.
To protect against this, a receiving instance could randomly or periodically check the OGP metadata that it receives. If it has been changed, it can update its copy.
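As a sketch, that spot-check could be as simple as re-fetching the canonical page for a small random sample of previews and comparing the fields you actually display. Everything here is assumed for illustration: verify_ogp, the 5% sample rate, and the reuse of the hypothetical fetch_preview helper from the earlier sketch.

```python
# Illustrative sketch only; relies on the hypothetical fetch_preview() above.
import random

def verify_ogp(url: str, received_ogp: dict, sample_rate: float = 0.05) -> dict:
    # Only spot-check a small fraction of previews to keep load on the site low.
    if random.random() > sample_rate:
        return received_ogp

    canonical = fetch_preview(url)
    if canonical is None:
        return received_ogp  # Site disallows scraping, so nothing to compare.

    # Compare the text fields shown on the card. The image is re-hosted on the
    # originating instance's cache, so it would need a separate content check.
    for key in ("og:title", "og:description"):
        if received_ogp.get(key) != canonical.get(key):
            return canonical  # Mismatch: replace the received copy with the real data.
    return received_ogp
```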
Perhaps a diagram would help?
What other people say about the problem
Feedback?
Is this a problem? Does this present a viable solution? Have I missed something obvious? Please leave a comment and let me know 😃
Ralf said on noc.social:
@Edent I can see your point that it is a "DDoS" attack -- but a typical modern website would have media served via CDN (either anycast or geodns). This should have the effect of making it ... not a DDoS, just a high traffic volume distributed across a number of CDN endpoints. I guess it depends on the number of instances and how/where the media is being cached. The privacy aspect of not referring is a design feature -- IMHO intentionally done to thwart marketing / corporate interests.
Adam Dalliance said on boing.world:
@Edent That's a good summary of the situation and I certainly agree scrapers should be checking robots.txt - including Big Centralized Social should be doing it too. I don't think randomly checking the validity is likely to work well, but perhaps just user-reports would be fine. Or perhaps a standard where a website can publish a public key with which they sign all their OpenGraph cards so their validity can be checked? Though guess you're still ddosing with fetching the robots.txt and or the public key. Suppose they can both be static at least.
keef said on mastodon.online:
@Edent @ncweaver @davidgerard Interesting, but.. having a site implement WebMentions feels much like saying a site has to implement a CDN - how many sites have this enabled by default? I know mine doesn't... I'd be concerned about the '0-day' impact of a bad actor sharing a link with a spoofed, defamatory card - imagine if this happened to a politician, for example. OK, that Mastodon instance would get defederated quite quickly but the damage might already be done.
keef said on mastodon.online:
@davidgerard @Edent @ncweaver In the longer term, I suspect we need some kind of 'signed' OG data card that can be validated as belonging to the originating website, without requiring a website fetch. Much like a JWT or similar can be verified without a callback. But that would, presumably need a new W3C or similar standard of some kind, and web server support.
keef said on mastodon.online:
@davidgerard @Edent @ncweaver I suppose the issue is, if it's "just" a post, then it can be traced back to the Mastodon user and instance quite easily. However, if it's a link to e.g. a public figure's website with "added" defamatory content, they may not even see it to report it at first. Perhaps this is no different from the user just writing the defamatory content but it will look different. Still, this may remain the only sensible option if we get an order-of-magnitude growth in Mastodon.
Manton Reece said on micro.blog:
@Edent@mastodon.social Good post and thoughts. My 2 cents, I don't like including more data sent over ActivityPub because I feel like the current post data has already become quite bloated with various Mastodon fields, making it harder for implementers to know what is required.
David Gerard said on circumstances.run:
@Edent you may have trouble getting this past the project leader - here's the bug, open five years, last comment is a highly tech-enabled person annoyed at the obnoxious software https://github.com/mastodon/mastodon/issues/4486
Ralf said on noc.social:
@davidgerard @Edent I agree it should check robots.txt -- any site auto-pulling any data without a human at the controls should do that.
Brian Hawthorne said on mastodon.lol:
@Edent Unless you are running a web server on an old Palm Pilot connected to the net with paper cups and string, I have a hard time considering 1000 (or even 10,000) hits to be a DDoS attack. This seems to be trying to fix something that isn’t really broken.
Richard Bairwell said on mastodon.org.uk:
@KevinMarks Problem is it is "strongly encouraged" to use discovery to find the oEmbed (instead of just download the only 288 listed providers). Discovery means fetching the page anyway to parse its headers (HEAD and http headers if you are lucky, GET and html head if not).
James said on mastodon.online:
@Edent Interesting. One issue is that you are effectively caching site metadata - you comment on some aspects of this. Another aspect: what happens if I as website owner want to change my site meta data? Do i just have to put up with stale cache content? I have already seen this problem on FB.
Phil Ashby: :marmite:, NHS 💙 said on mastodon.me.uk:
@Edent Would this be a reasonable measure of success - if we reduced the impact to the same level as being posted to a popular news site? If so, then proposals I've seen to delay page preview generation /until a user views a post/ may help? Of course that now depends on what 'viewing a post' means - it would have to be /clicking/ on a post, not an automated feed, and caching the preview per instance will reduce impact compared to hackernews or similar... maybe?
Ryan Barrett said on snarfed.org:
Interesting ideas! And I definitely love me some webmention. Sadly though, Mastodon 4 went js;dr and put all posts and profiles behind JS, which means webmentions wouldn’t work because server side fetches of the source pages wouldn’t contain the target URL.
Osma A said on mas.to:
@Edent Not necessarily to either. A shared cache doesn't need to be centralized, it can just as well be federated (shared between co-operating servers, similar to relays, but replaceable by any admin) or a DHT - and content in a cache could be randomly refreshed from the canonical source. That said, the OGP tags themselves could contain malicious/misrepresented content at the source itself. Again, a federated, shared cache could police that by comparing OGP tags to page content. @neil @russss
Dan Q says:
Hell, I'm not busy, I'll take it on.
😂
Matt Godden said on mastodon.social:
@Edent Honestly, I’m having trouble believing that this is a thing… like that this wasn’t a “waitaminute” the very moment the protocol was invented…
More comments on Mastodon.