Is this a bug in every Markdown (Extra) parser?


Markdown is, I think it is fair to say, a frustrating "specification". Its origins are a back-of-a-fag-packet document and a buggy Perl script - and we've been dealing with the consequences ever since.

There are now multiple Markdown parsers, each with their own idiosyncrasies. To make matters worse, there's a set of extensions popularly known as "Markdown Extra".

Extra has support for things like tables, footnotes, and - in some dialects - autolinks.

Most of the time, when an author writes the text "Visit https://example.com", they want the URL to be automatically turned into a hyperlink. Most Markdown parsers support that. Hurrah!

But there's a rather nasty little edge case.

Markdown is explicitly designed so that authors can mix and match HTML and Markdown in the same document. This is perfectly valid:

I <em>love</em> the delicious taste of **fresh** oranges!

Which becomes:

I <em>love</em> the delicious taste of <strong>fresh</strong> oranges!

This is also valid:

<a href="https://example.com/">Visit my *favourite* site https://example.com/</a>!

The parser is smart enough to ignore the link inside the href="" but will process all the Markdown contents of the <a> element.

The text *favourite* is correctly converted to <em>favourite</em>.

But what about the link? Should that be autolinked?

Here's how a few dozen different Markdown parsers fare.

Nearly all of the ones which support autolinking end up producing broken HTML. They nest an anchor within an anchor, which is explicitly forbidden by the HTML specification.

<a href="https://example.com/">Visit my
   <em>favourite</em> site
   <a href="https://example.com/">https://example.com/</a>
</a>!
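
If you want to check a parser's output for this particular breakage, something like the rough sketch below might do. It uses Python's standard html.parser module. The names NestedAnchorFinder and count_nested_anchors are hypothetical, invented for this illustration, and it only counts anchors opened inside other anchors rather than validating the HTML in general.

```python
from html.parser import HTMLParser


class NestedAnchorFinder(HTMLParser):
    """Hypothetical helper: counts <a> elements opened inside another <a>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <a> elements currently open
        self.nested = 0  # anchors that were opened while another was open

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "a":
            if self.depth > 0:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1


def count_nested_anchors(html: str) -> int:
    """Return how many anchors in `html` sit inside another anchor."""
    finder = NestedAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.nested
```

Fed the broken output above, count_nested_anchors reports one nested anchor. For a plain, un-nested link it reports zero.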

Others break in weird and unexpected ways.

Is this a bug?

Markdown is an excellent example of "do what I mean, not what I say" software. To the human reading the text, it might seem obvious which parts need to be transformed and which don't.

There are various specifications for how autolinking should work - but I couldn't find any documents which explicitly discuss where it shouldn't work.
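
One possible rule, purely as an assumption on my part, is "don't autolink a bare URL which already sits inside an <a> element". The Python sketch below illustrates that rule. It is not how any real parser or specification behaves. The function autolink_outside_anchors and its regular expressions are hypothetical and deliberately crude: they don't handle comments, scripts, or trailing punctuation after a URL.

```python
import re

# Crude patterns, assumed for illustration only.
TAG = re.compile(r"(<[^>]+>)")             # splits text from anything tag-like
URL = re.compile(r"https?://[^\s<>\"]+")   # bare http(s) URLs
OPEN_A = re.compile(r"<a[\s>]", re.IGNORECASE)
CLOSE_A = re.compile(r"</a\s*>", re.IGNORECASE)


def autolink_outside_anchors(text: str) -> str:
    """Wrap bare URLs in <a>, except where an <a> element is already open."""
    out = []
    depth = 0  # number of <a> elements currently open
    # With a capturing group, re.split puts tags at the odd indices.
    for i, part in enumerate(TAG.split(text)):
        if i % 2 == 1:
            # A tag: track anchor depth and pass it through untouched,
            # so URLs inside href="" are never rewritten.
            if OPEN_A.match(part):
                depth += 1
            elif CLOSE_A.match(part) and depth > 0:
                depth -= 1
            out.append(part)
        elif depth == 0:
            # Plain text outside any anchor: safe to autolink.
            out.append(
                URL.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', part)
            )
        else:
            # Plain text inside an anchor: leave the URL alone.
            out.append(part)
    return "".join(out)
```

Under this rule, the earlier example with the URL inside the <a> element comes back unchanged, while "Visit https://example.com" on its own still gets its link. Whether that is what every author means is, of course, exactly the question.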

At this point, you're probably going to leave a comment saying that it is the users who are wrong. They should wrap links in brackets, or stick to pure Markdown, or some other tosh.

Markdown was supposed to simplify the process of writing HTML. Anything which forces the user to write in an unnatural or confusing way is a bug.

Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).

Markdown Introduction

I don't think any of the authors of Markdown parsers have been naughty here. They mostly just follow the spec. But Markdown was designed without ever being tested with real users. And real users break things in all sorts of unexpected and delightful ways.

That's where the real bug is. When we don't test with users and fail to meet their expectations, we produce faulty software.



3 thoughts on “Is this a bug in every Markdown (Extra) parser?”

  1. says:

    It’s interesting that CommonMark was an attempt to write a specification to provide a unified basis on which parsers could be written… but their definition of ‘autolink’ doesn’t match the way that I’ve seen any people do it!

    I think Markdown is deceptively simple, in that it appears very simple (and for a lot of use cases is) but when you get into the details there are all kinds of complexity waiting to bite people!
