Convert DOI to a HTML5 / Schema citation
This is a quick and dirty way to turn a DOI (Digital Object Identifier, used for academic papers) into an HTML & Microdata citation. I use this to power my Citations page.
Schema.org is a shared vocabulary which, written here as Microdata, allows machines to read your HTML and create semantic relations between documents.
Here's a minimum viable citation:
HTML
<blockquote itemprop="citation" itemscope itemtype="http://schema.org/ScholarlyArticle">
   <span itemprop="author" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">
         Terence Eden
      </span>
   </span>
   <cite itemprop="headline">
      Proof that P ≠ NP
   </cite>
   (<span itemprop="datePublished">2025</span>)
   <a itemprop="url" href="https://ex.doi.org/99.9999/1234">
      https://ex.doi.org/99.9999/1234
   </a>
   <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
      <span itemprop="name">
         Journal of Impossible Research
      </span>
   </span>
</blockquote>
That says: This citation is a scholarly article which has a headline, a date, and a URL. It has an author who is a person with a name. It has a publisher which is an organisation with a name.
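If you'd rather build that block from a handful of values than type it by hand, here's a minimal sketch. The function name `minimal_citation` and its arguments are my own invention for illustration, and it assumes you want every value HTML-escaped before it goes into the markup.

```python
from html import escape


def minimal_citation(author, title, year, url, publisher):
    """Build the minimum viable citation as an HTML string with Microdata.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Escape each value so characters like & or < cannot break the markup
    author, title, publisher = escape(author), escape(title), escape(publisher)
    year, url = escape(str(year)), escape(url)
    return (
        '<blockquote itemprop="citation" itemscope itemtype="http://schema.org/ScholarlyArticle">'
        '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'
        f'<span itemprop="name">{author}</span></span> '
        f'<cite itemprop="headline">{title}</cite> '
        f'(<span itemprop="datePublished">{year}</span>) '
        f'<a itemprop="url" href="{url}">{url}</a> '
        '<span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">'
        f'<span itemprop="name">{publisher}</span></span>'
        '</blockquote>'
    )


print(minimal_citation(
    "Terence Eden",
    "Proof that P ≠ NP",
    2025,
    "https://ex.doi.org/99.9999/1234",
    "Journal of Impossible Research",
))
```

The output is one long line rather than the indented block above, but the items and properties are the same.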
Here's a fuller example, including ORCID, page numbers, etc.:
HTML
<blockquote itemscope itemtype="http://schema.org/ScholarlyArticle">
   <span itemprop="citation">
      <span itemprop="author" itemscope itemtype="http://schema.org/Person">
         <link itemprop="url" href="http://orcid.org/0000-0003-4542-8599"/>
         <span itemprop="name">
            <span itemprop="familyName">Losavio</span>, <span itemprop="givenName">Michael M.</span>
         </span>
      </span>,
      <cite itemprop="headline">The Internet of Things and the Smart City: Legal challenges with digital forensics, privacy, and security</cite>
      (<time itemprop="datePublished" datetime="2018-04-27T03:13:15Z">2018</time>)
      Page: <span itemprop="pagination">e23</span>.
      <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
         <span itemprop="name">Wiley</span>
      </span>
      <span itemprop="publication">Security and Privacy</span>
      <a itemprop="url" href="https://doi.org/10.1002/spy2.23">https://doi.org/10.1002/spy2.23</a>
   </span>
</blockquote>
APIs
Sadly, CrossRef doesn't do Schema.org, so we have to use their API to get some JSON and convert it to HTML with Microdata ourselves.
Here's how to install their official Python library:
BASH
pip3 install --user crossref-commons
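The library hands the record back as a Python dict. To give a rough idea of its shape without hitting the network, here's a hand-written sample (not a real API response) filled in with the values from the fuller HTML example. Some fields are lists, some are plain strings, and dates are nested under `date-parts`. The helper names `first` and `publication_year` are hypothetical; the date keys are the ones the code in this post reads.

```python
# Hand-written sample shaped like a Crossref work record (not a real API response).
# The values come from the fuller HTML example in this post.
sample_work = {
    "DOI": "10.1002/spy2.23",
    "title": ["The Internet of Things and the Smart City: Legal challenges with digital forensics, privacy, and security"],
    "author": [
        {"family": "Losavio", "given": "Michael M.", "ORCID": "http://orcid.org/0000-0003-4542-8599"},
    ],
    "issued": {"date-parts": [[2018]]},
    "page": "e23",
    "publisher": "Wiley",
    "container-title": ["Security and Privacy"],
}


def first(values, default=""):
    """Return the first item of a list-valued field, or a default if it is empty or missing."""
    return values[0] if values else default


def publication_year(work):
    """Return the first year found under the date keys, or None if there is none."""
    for key in ("published-print", "issued", "created", "deposited"):
        date_parts = work.get(key, {}).get("date-parts")
        if date_parts and date_parts[0] and date_parts[0][0]:
            return date_parts[0][0]
    return None


print(first(sample_work.get("title")))
print(first(sample_work.get("container-title")))
print(publication_year(sample_work))
```

Reading fields with `.get()` and a default means a record with a missing key doesn't crash the script with a `KeyError`.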
The Code
Python 3
import html
import os
import re

import crossref_commons.retrieval

# Set a friendly header
os.environ['CR_API_MAILTO'] = 'yourEmail@example.com'

reference = input("Enter a DOI: ").strip()

# Check DOI is valid
# https://www.crossref.org/blog/dois-and-matching-regular-expressions/
# The dot after "10" is escaped so that it only matches a literal "."
regex = r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$"
matches = re.search(regex, reference, re.IGNORECASE)

if matches:
    # Call the API
    data = crossref_commons.retrieval.get_publication_as_json(reference)

    # Start the citation
    citation_html = '<span itemscope itemtype="http://schema.org/ScholarlyArticle"><span itemprop="citation">'

    # Get all the authors
    authors = []
    for a in data.get("author", []):
        author_html = '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'
        if "ORCID" in a:
            author_html += '<link itemprop="url" href="' + html.escape(a["ORCID"]) + '"/>'
        # Not every author record has both names, so fall back to an empty string
        family = html.escape(a.get("family", ""))
        given = html.escape(a.get("given", ""))
        author_html += '<span itemprop="name"><span itemprop="familyName">' + family + '</span>, <span itemprop="givenName">' + given + '</span></span></span>'
        authors.append(author_html)

    # Add the authors to the citation, separated by an ampersand
    citation_html += " &amp; ".join(authors)

    # Title and any language information
    if data.get("title"):
        headline = html.escape(data["title"][0])
        if "language" in data:
            lang = html.escape(data["language"])
            citation_html += ' <q><cite itemprop="headline" lang="' + lang + '"><span itemprop="inLanguage" content="' + lang + '">' + headline + '</span></cite></q> '
        else:
            citation_html += ' <q><cite itemprop="headline">' + headline + '</cite></q> '

    # Add the date, preferring the print date and falling back to less specific ones
    if "published-print" in data:
        year = data["published-print"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + str(year) + '">' + str(year) + '</time>) '
    elif "issued" in data:
        year = data["issued"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + str(year) + '">' + str(year) + '</time>) '
    elif "created" in data:
        timestamp = data["created"]["date-time"]
        year = data["created"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + timestamp + '">' + str(year) + '</time>) '
    elif "deposited" in data:
        timestamp = data["deposited"]["date-time"]
        year = data["deposited"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + timestamp + '">' + str(year) + '</time>) '

    # Page number information
    if "page" in data:
        citation_html += ' page: <span itemprop="pagination">' + html.escape(data["page"]) + '</span>. '

    # Publisher
    if "publisher" in data:
        citation_html += '<span itemprop="publisher" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">' + html.escape(data["publisher"]) + '</span></span>. '

    # Publication
    if data.get("container-title"):
        citation_html += '<span itemprop="publication">' + html.escape(data["container-title"][0]) + '</span>. '

    # DOI link
    if "DOI" in data:
        doi_url = html.escape("https://doi.org/" + data["DOI"])
        citation_html += '<a itemprop="url" href="' + doi_url + '">' + doi_url + '</a>'

    # End the citation
    citation_html += '</span></span>'

    print(citation_html)
else:
    print("That doesn't look like a valid DOI.")
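If you want to see what the DOI check accepts and rejects before calling the API, here's a small standalone sketch. It uses the pattern from the Crossref blog post linked in the code, with one assumption of mine: the dot after `10` is escaped so that it only matches a literal full stop. The function name `looks_like_doi` is hypothetical.

```python
import re

# Pattern from the Crossref blog post, with the dot after "10" escaped (my tightening)
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(text):
    """Return True if the text, ignoring surrounding whitespace, is shaped like a DOI."""
    return DOI_PATTERN.search(text.strip()) is not None


candidates = [
    "10.1002/spy2.23",                  # a bare DOI
    "https://doi.org/10.1002/spy2.23",  # a DOI URL, not a bare DOI
    "10.12/short",                      # registrant prefix has fewer than four digits
    "10x1002/spy2.23",                  # no literal dot after the 10
]
for candidate in candidates:
    print(candidate, looks_like_doi(candidate))
```

Only the first candidate passes. A full `https://doi.org/...` URL is rejected, so strip that prefix first if people are likely to paste links.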
Opinions about citations
There are too many citation styles. And most of them suck. I used to hate seeing "Smith (1991)" as the only reference. Theoretically, a DOI is the only citation you need, but I discovered that some were missing from major resolvers. It is also probably helpful to have some human-readable information to aid discoverability.
I've tried to keep to author name(s), title, year, publisher, publication, page, DOI. That's more-or-less MLA. But, because of the microdata, a machine can understand the citation and you can convert to your preferred style.
Or, if you wish, you can adapt this code to pump out a different citation style.
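As one possible adaptation, here's a rough sketch which turns the same kind of record into a plain-text line instead of HTML. The function name `plain_text_citation` and the output layout are my own assumptions rather than any official style, and the sample record is hand-written from the fuller example earlier in this post.

```python
def plain_text_citation(work):
    """Format a Crossref-style record as 'Authors (Year). Title. Publication. DOI URL'.

    Hypothetical helper: the layout is illustrative, not an official citation style.
    """
    names = []
    for person in work.get("author", []):
        parts = [person.get("family", ""), person.get("given", "")]
        name = ", ".join(p for p in parts if p)
        if name:
            names.append(name)

    pieces = []
    if names:
        pieces.append(" & ".join(names))

    # Dates are nested: {"date-parts": [[year, month, day]]}
    date_parts = work.get("issued", {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    if year:
        pieces.append(f"({year}).")

    if work.get("title"):
        pieces.append(work["title"][0] + ".")
    if work.get("container-title"):
        pieces.append(work["container-title"][0] + ".")
    if work.get("DOI"):
        pieces.append("https://doi.org/" + work["DOI"])
    return " ".join(pieces)


# Hand-written sample record, not a real API response
sample = {
    "author": [{"family": "Losavio", "given": "Michael M."}],
    "issued": {"date-parts": [[2018]]},
    "title": ["The Internet of Things and the Smart City: Legal challenges with digital forensics, privacy, and security"],
    "container-title": ["Security and Privacy"],
    "DOI": "10.1002/spy2.23",
}
print(plain_text_citation(sample))
```

Each piece is optional, so a record with missing fields still produces a shorter but usable line.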