Convert DOI to a HTML5 / Schema citation
This is a quick and dirty way to turn a DOI (a Digital Object Identifier, the persistent ID given to an academic paper) into an HTML & Microdata citation. I use this to power my Citations page.
Schema.org is a shared vocabulary which, when written into your HTML as Microdata, allows machines to read the page and create semantic relations between documents.
Here's a minimum viable citation:
HTML
<blockquote itemprop="citation" itemscope itemtype="http://schema.org/ScholarlyArticle">
   <span itemprop="author" itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Terence Eden</span>
   </span>
   <cite itemprop="headline">Proof that P ≠ NP</cite>
   (<span itemprop="datePublished">2025</span>)
   <a itemprop="url" href="https://ex.doi.org/99.9999/1234">https://ex.doi.org/99.9999/1234</a>
   <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
      <span itemprop="name">Journal of Impossible Research</span>
   </span>
</blockquote>
That says: This citation is a scholarly article which has a headline, date, and URL. It has an author who is a person with a name. It has a publisher which is an organisation with a name.
Here's a fuller example, including ORCID, page numbers, etc:
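To see what a machine gets out of that markup, here is a minimal sketch using only Python's standard library. It is not a conforming Microdata parser: it assumes well-formed HTML, ignores things like `itemref` and `content` attributes, and overwrites repeated properties. The class name `ItempropReader` and its path format are made up for this illustration.

```python
from html.parser import HTMLParser

MINIMAL = """
<blockquote itemprop="citation" itemscope itemtype="http://schema.org/ScholarlyArticle">
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Terence Eden</span>
  </span>
  <cite itemprop="headline">Proof that P ≠ NP</cite>
  (<span itemprop="datePublished">2025</span>)
  <a itemprop="url" href="https://ex.doi.org/99.9999/1234">https://ex.doi.org/99.9999/1234</a>
  <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Journal of Impossible Research</span>
  </span>
</blockquote>
"""


class ItempropReader(HTMLParser):
    """Collect the text of each leaf itemprop, keyed by its nested path."""

    # Elements that never have a closing tag, so they are not tracked
    VOID = {"link", "meta", "br", "img"}

    def __init__(self):
        super().__init__()
        self.stack = []   # one (itemprop name or None, has itemscope) per open element
        self.found = {}   # e.g. {"citation/author/name": "Terence Eden"}

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attributes = dict(attrs)
        self.stack.append((attributes.get("itemprop"), "itemscope" in attributes))

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        name, is_scope = self.stack[-1]
        # Only keep text that sits directly inside a property which is not itself a nested item
        if name and not is_scope:
            path = "/".join(n for n, _ in self.stack if n)
            self.found[path] = text


reader = ItempropReader()
reader.feed(MINIMAL)
reader.close()

for path, value in reader.found.items():
    print(path, "=", value)
```

The nested paths are what separate the author's name from the publisher's name, even though both use the same `name` property.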
HTML
<blockquote itemscope itemtype="http://schema.org/ScholarlyArticle">
   <span itemprop="citation">
      <span itemprop="author" itemscope itemtype="http://schema.org/Person">
         <link itemprop="url" href="http://orcid.org/0000-0003-4542-8599"/>
         <span itemprop="name">
            <span itemprop="familyName">Losavio</span>,
            <span itemprop="givenName">Michael M.</span>
         </span>
      </span>,
      <cite itemprop="headline">The Internet of Things and the Smart City: Legal challenges with digital forensics, privacy, and security</cite>
      (<time itemprop="datePublished" datetime="2018-04-27T03:13:15Z">2018</time>)
      Page: <span itemprop="pagination">e23</span>.
      <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
         <span itemprop="name">Wiley</span>
      </span>
      <span itemprop="publication">Security and Privacy</span>
      <a itemprop="url" href="https://doi.org/10.1002/spy2.23">https://doi.org/10.1002/spy2.23</a>
   </span>
</blockquote>
APIs
Sadly, CrossRef doesn't do Schema.org, so we have to use their API to get some JSON and convert it to HTML with Microdata ourselves.
Here's how to install their official Python library:
Bash
pip3 install --user crossref-commons
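For orientation, this is roughly the shape of the JSON that the script below reads. The dictionary is hand-written, not a captured API response: it only contains the keys the script looks at, filled in with the values from the fuller HTML example above. The helper name `first_year` is made up for this sketch.

```python
# Hand-written sample, trimmed to the keys the script reads.
# This is NOT a real API response; a real record has many more fields.
sample = {
    "DOI": "10.1002/spy2.23",
    "title": [
        "The Internet of Things and the Smart City: "
        "Legal challenges with digital forensics, privacy, and security"
    ],
    "author": [
        {
            "given": "Michael M.",
            "family": "Losavio",
            "ORCID": "http://orcid.org/0000-0003-4542-8599",
        }
    ],
    "created": {
        "date-parts": [[2018, 4, 27]],
        "date-time": "2018-04-27T03:13:15Z",
    },
    "page": "e23",
    "publisher": "Wiley",
    "container-title": ["Security and Privacy"],
}


def first_year(record):
    """Return the year from the first date field present, or None."""
    for key in ("published-print", "issued", "created", "deposited"):
        if key in record:
            # "date-parts" is a list of [year, month, day] lists
            return record[key]["date-parts"][0][0]
    return None


print(sample["title"][0])
print(sample["author"][0]["family"])
print(first_year(sample))
```

Titles and container titles arrive as lists, which is why the script takes element `[0]` of each.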
The Code
Python 3
import crossref_commons.retrieval
import os
import re

# Set a friendly header
os.environ['CR_API_MAILTO'] = 'yourEmail@example.com'

reference = input("Enter a DOI: ").strip()

# Check DOI is valid
# https://www.crossref.org/blog/dois-and-matching-regular-expressions/
# The dot after "10" is escaped so that it only matches a literal dot
regex = r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$"
matches = re.search(regex, reference, re.IGNORECASE)

if matches:
    # Call the API
    data = crossref_commons.retrieval.get_publication_as_json(reference)

    # Start the citation
    citation_html = '<span itemscope itemtype="http://schema.org/ScholarlyArticle"><span itemprop="citation">'

    # Get all the authors (some records have none)
    authors = []
    for a in data.get("author", []):
        author_html = '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'
        if "ORCID" in a:
            author_html += '<link itemprop="url" href="' + a["ORCID"] + '"/>'
        # Not every author record has both a family and a given name
        author_html += '<span itemprop="name"><span itemprop="familyName">' + a.get("family", "") + '</span>, <span itemprop="givenName">' + a.get("given", "") + '</span></span></span>'
        authors.append(author_html)

    # Add the authors to the citation, separated by " & "
    citation_html += " & ".join(authors)

    # Title and any language information
    if "title" in data:
        headline = data["title"][0]
        if "language" in data:
            lang = data["language"]
            citation_html += ' <q><cite itemprop="headline" lang="' + lang + '"><span itemprop="inLanguage" content="' + lang + '">' + headline + '</span></cite></q> '
        else:
            citation_html += ' <q><cite itemprop="headline">' + headline + '</cite></q> '

    # Add the date, using the first date field that is present
    if "published-print" in data:
        year = data["published-print"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + str(year) + '">' + str(year) + '</time>) '
    elif "issued" in data:
        year = data["issued"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + str(year) + '">' + str(year) + '</time>) '
    elif "created" in data:
        timestamp = data["created"]["date-time"]
        year = data["created"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + timestamp + '">' + str(year) + '</time>) '
    elif "deposited" in data:
        timestamp = data["deposited"]["date-time"]
        year = data["deposited"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="' + timestamp + '">' + str(year) + '</time>) '

    # Page number information
    if "page" in data:
        citation_html += ' page: <span itemprop="pagination">' + data["page"] + '</span>. '

    # Publisher
    if "publisher" in data:
        citation_html += '<span itemprop="publisher" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">' + data["publisher"] + '</span></span>. '

    # Publication
    if "container-title" in data:
        citation_html += '<span itemprop="publication">' + data["container-title"][0] + '</span>. '

    # DOI link
    if "DOI" in data:
        doi = data["DOI"]
        doi_url = "https://doi.org/" + doi
        citation_html += '<a itemprop="url" href="' + doi_url + '">' + doi_url + '</a>'

    # End the citation
    citation_html += '</span></span>'

    print(citation_html)
else:
    print("That doesn't look like a valid DOI")
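If you want to test the DOI check without touching the network, it can be pulled out into a small function. This is a sketch rather than a full validator: it uses the pattern from the script with the dot after `10` escaped so that it only matches a literal dot, and the name `is_valid_doi` is made up for this example.

```python
import re

# The pattern from the script, with the dot after "10" escaped
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def is_valid_doi(reference):
    """Return True if the string looks like a bare DOI such as 10.1002/spy2.23."""
    return DOI_PATTERN.search(reference.strip()) is not None


candidates = [
    "10.1002/spy2.23",                  # a bare DOI
    "10x1002/spy2.23",                  # wrong separator after "10"
    "https://doi.org/10.1002/spy2.23",  # a full URL, not a bare DOI
]

for candidate in candidates:
    print(candidate, is_valid_doi(candidate))
```

Only the first candidate passes. The pattern expects a bare DOI, so a full `https://doi.org/` URL would need its prefix stripping first.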
Opinions about citations
There are too many citation styles. And most of them suck. I used to hate seeing "Smith (1991)" as the only reference. Theoretically, a DOI is the only citation you need - but I discovered that some were missing from major resolvers. And, it is probably helpful to have some human readable information to aid discoverability.
I've tried to keep to author name(s), title, year, publisher, publication, page, DOI. That's more-or-less MLA. But, because of the microdata, a machine can understand the citation and you can convert to your preferred style.
Or, if you wish, you can adapt this code to pump out a different citation style.
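As a rough illustration of that, here is a sketch which turns the same kind of record into a plain text, author-date citation. It is only loosely modelled on author-date styles and doesn't follow any style guide faithfully. The function name `plain_citation` is made up, and the record is a hand-written sample using the values from the fuller example earlier, not a captured API response.

```python
def plain_citation(record):
    """Format a record as 'Family, G. (Year). Title. Publication. DOI URL'."""
    names = []
    for a in record.get("author", []):
        # "Michael M." becomes "M. M."
        initials = " ".join(word[0] + "." for word in a.get("given", "").split())
        names.append(", ".join(p for p in (a.get("family", ""), initials) if p))

    # Use the first date field that is present, as the script does
    year = None
    for key in ("published-print", "issued", "created", "deposited"):
        if key in record:
            year = record[key]["date-parts"][0][0]
            break

    parts = []
    if names:
        parts.append(" & ".join(names))
    if year:
        parts.append("(" + str(year) + ").")
    if record.get("title"):
        parts.append(record["title"][0] + ".")
    if record.get("container-title"):
        parts.append(record["container-title"][0] + ".")
    if "DOI" in record:
        parts.append("https://doi.org/" + record["DOI"])
    return " ".join(parts)


# Hand-written sample record, NOT a real API response
sample = {
    "DOI": "10.1002/spy2.23",
    "title": [
        "The Internet of Things and the Smart City: "
        "Legal challenges with digital forensics, privacy, and security"
    ],
    "author": [{"given": "Michael M.", "family": "Losavio"}],
    "created": {"date-parts": [[2018, 4, 27]], "date-time": "2018-04-27T03:13:15Z"},
    "container-title": ["Security and Privacy"],
}

print(plain_citation(sample))
```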