What programming language is in this <code> block?


I'm a little bit obsessed with the idea of semantic markup. I want the words that I write to be understood by humans and machines.

Imagine this piece of code: print( "Hello, world!" )

Is that code example written in Python? C++? Basic? Go? Perhaps you're familiar enough with every programming language to tell - but most people aren't. Wouldn't it be nice to give an indication of what programming language is used in an example?

Here's how we might represent it in HTML:

<pre>
    <code>
        print( "Hello, world!" )
    </code>
</pre>

How do we let the browser, search engines, and humans know what language that's written in? It might seem obvious to use the lang attribute, right? We're writing in a programming language, so just use <code lang="python">. Sadly, the HTML specification disagrees.

The lang attribute specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string.
HTML Specification 3.2.6.2 The lang and xml:lang attributes (emphasis added)

That means it must be a human language like en or en-GB. No Klingon or Elvish - and certainly no computer languages!

Does the specification give any clues about the <code> element?

There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.
HTML Specification 4.5.15 The code element (emphasis added)
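
As a minimal sketch of what the specification is suggesting, the class approach might look like this. The name after the `language-` prefix is my own choice; the specification doesn't define a list of allowed values.

```html
<pre>
    <!-- "language-" prefix as suggested by the spec; "python" is an assumed name -->
    <code class="language-python">
        print( "Hello, world!" )
    </code>
</pre>
```

That's enough for a syntax highlighting script to pick the right rules, but it is only a convention. A class name carries no defined meaning for a browser or a search engine.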

So we have to turn to our old friend Schema.org! There is a SoftwareSourceCode type which is used for exactly this case. Sadly, there is no example documentation because Google likes to start up a project but never quite finish it.

Here's how to write a code snippet in HTML and have it semantically expose the programming language used:

<pre itemscope itemtype="https://schema.org/SoftwareSourceCode">
    <span itemprop="programmingLanguage">Python</span>
    <meta itemprop="codeSampleType" content="Example">
    <code itemprop="text">
        print( "Hello, world!" )
    </code>
</pre>

If you run that through the validator you'll see what a computer sees:

[Image: Semantic representation of the code.]

The programmingLanguage is a string - so you can write anything you like in there. You can optionally add a codeSampleType which, again, is a free-text field.

The <meta> items are only viewable to machines. You could also show them to the user if you wanted, using a <span> or other suitable element.
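
For instance, here's a rough sketch where both properties are visible to the reader. Wrapping everything in a `<figure>` with a `<figcaption>` is purely my assumption about what a "suitable element" might be; the Schema.org properties are the same ones used earlier.

```html
<figure itemscope itemtype="https://schema.org/SoftwareSourceCode">
    <figcaption>
        <!-- Both values are now rendered text rather than hidden metadata -->
        <span itemprop="programmingLanguage">Python</span>
        (<span itemprop="codeSampleType">Example</span>)
    </figcaption>
    <pre><code itemprop="text">print( "Hello, world!" )</code></pre>
</figure>
```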

Alternatives

It is possible to define private subtags of languages, for example en-x-python, which could mean "Comments written in English, using the private Python extension". Or even just x-python. That then leads on to how you describe a language. COBOL has a MIME type, but not all languages do. There are some unofficial ones like text/x-python though.
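
As a sketch of the private subtag idea, the markup might look like the following. The `python` subtag is something I've made up for illustration. Because it is private, no browser or search engine will know what it means.

```html
<pre>
    <!-- "en" is the human language of the comment; "x-python" is a hypothetical private-use subtag -->
    <code lang="en-x-python">
        # Say hello
        print( "Hello, world!" )
    </code>
</pre>
```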

But, of course, programming languages aren't really languages - so using lang probably isn't suitable.

A data- attribute might also work. Adding data-code="python" would allow CSS to style specific code blocks. But data attributes are private to a page, and generally aren't standardised.
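
Here's a small sketch of how that could work, assuming the data-code attribute name from above. The label styling is just an illustration.

```html
<style>
    /* Show a visible label on every block marked as Python */
    code[data-code="python"]::before {
        content: "Python";
        display: block;
        font-weight: bold;
    }
</style>

<pre>
    <code data-code="python">
        print( "Hello, world!" )
    </code>
</pre>
```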

I think this is a gap in the specification. I think there ought to be a code-lang attribute or similar. Perhaps something like:

<code code-lang="python;3.6">
    print( "Hello, world!" )
</code>

Which could allow authors to semantically give the name - and possibly version - of the programming language they are writing in.

Thoughts?



7 thoughts on “What programming language is in this <code> block?”

  1. said on urbanists.social:

    @Edent because this information is useful to people as well as to machines, I think any solution really ought to involve the programming language name being visible in a default rendering, rather than hidden in metadata? The same argument could be made about human languages, but there are fewer examples of authors wanting to show that explicitly, and lots of cases where people label up which programming language is used for a particular example block.

  2. Dave Cridland says:

    Small correction - and a dollop of trivia - but Klingon has a language tag of tlh (and the long-deprecated i-klingon), and Elvish has two, depending on which Elvish language you mean, Sindarin (sjn) or Quenya (qya).

  3. said on hachyderm.io:

    @georgelund @Edent I think the problem is that it depends so much on context.

    Some blog posts compare and contrast 10 languages, and it's helpful to have every code snippet labelled clearly with its language.

    Other posts are part 5 of a series of 37 entirely about Rust, or belong to an entire blog with "Rust" in the title, and prominently writing "Rust" above every tiny code snippet in the whole thing just gets tedious.

