What programming language is in this <code> block?
I'm a little bit obsessed with the idea of Semantic markup. I want the words that I write to be understood by humans and machines.
Imagine this piece of code: print( "Hello, world!" )
Is that code example written in Python? C++? Basic? Go? Perhaps you're familiar enough with every programming language to tell - but most people aren't. Wouldn't it be nice to give an indication of what programming language is used in an example?
Here's how we might represent it in HTML:
<pre>
<code>
print( "Hello, world!" )
</code>
</pre>
How do we let the browser, search engines, and humans know what language that's written in? It might seem obvious to use the lang attribute, right? We're writing in a programming language, so just use <code lang="python">. Sadly, the HTML specification disagrees.
The lang attribute specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string.
HTML Specification 3.2.6.2 The lang and xml:lang attributes (emphasis added)
That means it must be a human language like en or en-GB. No Klingon or Elvish - and certainly no computer languages!
Does the specification give any clues about the <code> element?
There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.
HTML Specification 4.5.15 The code element (emphasis added)
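As a rough sketch of what the specification suggests, the earlier example could carry a class with the "language-" prefix. The value python after the prefix is only a convention that highlighting scripts happen to agree on, not something the specification defines:

```html
<pre>
<code class="language-python">
print( "Hello, world!" )
</code>
</pre>
```

That is enough for a syntax highlighter, but by the specification's own admission it is not a formal way to declare the language.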
So we have to turn to our old friend Schema.org! There is a SoftwareSourceCode type which is used for exactly this case. Sadly, there is no example documentation because Google likes to start up a project but never quite finish it.
Here's how to write a code snippet in HTML and have it semantically expose the programming language used:
<pre itemscope itemtype="https://schema.org/SoftwareSourceCode">
<span itemprop="programmingLanguage">Python</span>
<meta itemprop="codeSampleType" content="Example">
<code itemprop="text">
print( "Hello, world!" )
</code>
</pre>
If you run that through the validator you'll see what a computer sees:
The programmingLanguage is a string - so you can write anything you like in there. You can optionally add a codeSampleType which, again, is a free-text field.
The <meta> items are only viewable to machines. You could also show them to the user if you wanted, using a <span> or other suitable element.
Alternatives
It is possible to define private subtags of languages, for example en-x-python - which could mean "Comments written in English, using the private Python extension". Or even just x-python. That then leads on to how you describe a language - but while COBOL has a MIME type, not all languages do. There are some unofficial ones like text/x-python though.
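For illustration only, the private-use idea above might be written like this. Both tags are syntactically valid BCP 47, but the meaning of the python subtag is an assumption private to the author, so nothing else reading the page can be expected to understand it:

```html
<!-- Hypothetical: comments in English, with a private-use "python" subtag -->
<code lang="en-x-python">
# Say hello
print( "Hello, world!" )
</code>

<!-- Hypothetical: a purely private-use tag -->
<code lang="x-python">
print( "Hello, world!" )
</code>
```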
But, of course, programming languages aren't really languages - so using lang probably isn't suitable.
A data- attribute might also work. Adding data-code="python" would allow CSS to style specific code blocks. But data attributes are private to a page, and generally aren't standardised.
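A minimal sketch of that approach, using the data-code attribute name from above. The selectors, colour, and label styling are purely illustrative choices:

```html
<style>
	/* Style only the code blocks marked as Python */
	code[data-code="python"] {
		display: block;
		border-left: 4px solid steelblue;
	}
	/* Optionally show the language name to the reader */
	code[data-code]::before {
		content: attr(data-code);
		display: block;
		font-weight: bold;
	}
</style>

<pre>
<code data-code="python">
print( "Hello, world!" )
</code>
</pre>
```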
I think this is a gap in the specification. I think there ought to be a code-lang attribute or similar. Perhaps something like:
<code code-lang="python;3.6">
print( "Hello, world!" )
</code>
Which could allow authors to semantically give the name - and possibly version - of the programming language they are writing in.
Thoughts?
devPanda said on fosstodon.org:
@Edent The code-lang would certainly make it clearer.
George Lund said on urbanists.social:
@Edent because this information is useful to people as well as to machines, I think any solution really ought to involve the programming language name being visible in a default rendering, rather than hidden in metadata? The same argument could be made about human languages, but there are fewer examples of authors wanting to show that explicitly, and lots of cases where people label up which programming language is used for a particular example block.
GitHub said on github.com:
What problem are you trying to solve? In @edent's recent blog post about the code element he mentions that there could be a gap in the HTML spec for defining the programming language of code in a c...
Dave Cridland says:
Small correction - and a dollop of trivia - but Klingon has a language tag of tlh (and the long-deprecated i-klingon), and Elvish has two, depending on which Elvish language you mean, Sindarin (sjn) or Quenya (qya).
@edent says:
vISov. jIQoSbej.
Simon Tatham said on hachyderm.io:
@georgelund @Edent I think the problem is that it depends so much on context.
Some blog posts compare and contrast 10 languages, and it's helpful to have every code snippet labelled clearly with its language.
Other posts are part 5 of a series of 37 entirely about Rust, or belong to an entire blog with "Rust" in the title, and prominently writing "Rust" above every tiny code snippet in the whole thing just gets tedious.
Andy Mabbett says:
"The programmingLanguage is a string - so you can write anything you like in there."
Presumably you can use sameAs, somewhere, to indicate that your plaintext "Python" is really the thing described at, say, https://wikidata.org/wiki/Q28865 ?