Convert DOI to a HTML5 / Schema citation

by @edent

This is a quick and dirty way to turn a DOI (Digital Object Identifier, a persistent identifier for academic papers) into an HTML & Microdata citation. I use this to power my Citations page.

Schema.org is a vocabulary, embedded here using HTML Microdata, which allows machines to read your HTML and create semantic relations between documents.

Here's a minimum viable citation:

<blockquote itemprop="citation" itemscope itemtype="http://schema.org/ScholarlyArticle">
    <span itemprop="author" itemscope itemtype="http://schema.org/Person">
        <span itemprop="name">
            Terence Eden
        </span>
    </span> 
    <cite itemprop="headline">
        Proof that P ≠ NP
    </cite>
    (<span itemprop="datePublished">2025</span>)
    <a itemprop="url" href="https://ex.doi.org/99.9999/1234">
        https://ex.doi.org/99.9999/1234
    </a>
    <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
        <span itemprop="name">
            Journal of Impossible Research
        </span>
    </span>
</blockquote>

That says: This citation is a scholarly article which has a headline, date, and URL. It has an author who is a person with a name. It has a publisher which is an organisation with a name.
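The same structure can be generated from a handful of strings. This is a minimal sketch rather than the code that powers the Citations page: the function name `minimal_citation` and its parameters are made up for illustration, and `html.escape` is used so that characters like `&` or `<` in a title can't break the markup.

```python
from html import escape


def minimal_citation(author, title, year, url, publisher):
    """Build the minimum viable Microdata citation shown above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Escape every value so stray <, >, & or quotes can't break the HTML
    author, title, year, url, publisher = (
        escape(str(v)) for v in (author, title, year, url, publisher)
    )
    return (
        '<blockquote itemprop="citation" itemscope '
        'itemtype="http://schema.org/ScholarlyArticle">'
        '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'
        f'<span itemprop="name">{author}</span></span> '
        f'<cite itemprop="headline">{title}</cite> '
        f'(<span itemprop="datePublished">{year}</span>) '
        f'<a itemprop="url" href="{url}">{url}</a> '
        '<span itemprop="publisher" itemscope '
        'itemtype="http://schema.org/Organization">'
        f'<span itemprop="name">{publisher}</span></span>'
        '</blockquote>'
    )


print(minimal_citation(
    "Terence Eden",
    "Proof that P ≠ NP",
    2025,
    "https://ex.doi.org/99.9999/1234",
    "Journal of Impossible Research",
))
```

Escaping matters more once the values come from an API rather than from your own keyboard.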

Here's a fuller example, including ORCID, page numbers, etc.:

<blockquote itemscope itemtype="http://schema.org/ScholarlyArticle">
    <span itemprop="citation">
        <span itemprop="author" itemscope itemtype="http://schema.org/Person">
            <link itemprop="url" href="http://orcid.org/0000-0003-4542-8599"/>
            <span itemprop="name">
                <span itemprop="familyName">Losavio</span>, <span itemprop="givenName">Michael M.</span>
            </span>
        </span>,
        <cite itemprop="headline">The Internet of Things and the Smart City: Legal challenges with digital forensics, privacy, and security</cite>
        (<time itemprop="datePublished" datetime="2018-04-27T03:13:15Z">2018</time>)
        Page: <span itemprop="pagination">e23</span>.
        <span itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
            <span itemprop="name">Wiley</span>
        </span>
        <span itemprop="publication">Security and Privacy</span>
        <a itemprop="url" href="https://doi.org/10.1002/spy2.23">https://doi.org/10.1002/spy2.23</a>
    </span>
</blockquote>
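To see what a machine gets out of markup like this, here is a rough sketch that pulls the `itemprop` values back out using only Python's standard `html.parser`. The class name `ItempropCollector` is invented for this example, and it is deliberately simplified: a real Microdata parser builds nested items, whereas this one flattens everything into a list of pairs.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs. Simplified: nesting is flattened."""

    VOID_TAGS = {"link", "meta", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.pairs = []   # finished (property, value) pairs
        self._open = []   # one entry per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        # Some elements carry their value in an attribute rather than their text
        attr_value = attrs.get("href") or attrs.get("datetime") or attrs.get("content")
        if tag in self.VOID_TAGS:
            if prop:
                self.pairs.append((prop, attr_value or ""))
            return
        if prop is None or "itemscope" in attrs:
            # Not a property, or a property whose value is a nested item
            self._open.append(None)
        else:
            self._open.append({"prop": prop, "attr": attr_value, "text": []})

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self._open:
            return
        entry = self._open.pop()
        if entry is not None:
            text = " ".join("".join(entry["text"]).split())
            self.pairs.append((entry["prop"], entry["attr"] or text))

    def handle_data(self, data):
        # Text belongs to every property element that is currently open
        for entry in self._open:
            if entry is not None:
                entry["text"].append(data)


snippet = """
<span itemscope itemtype="http://schema.org/ScholarlyArticle">
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <link itemprop="url" href="http://orcid.org/0000-0003-4542-8599"/>
    <span itemprop="name"><span itemprop="familyName">Losavio</span>, <span itemprop="givenName">Michael M.</span></span>
  </span>
  (<time itemprop="datePublished" datetime="2018-04-27T03:13:15Z">2018</time>)
  <a itemprop="url" href="https://doi.org/10.1002/spy2.23">https://doi.org/10.1002/spy2.23</a>
</span>
"""

collector = ItempropCollector()
collector.feed(snippet)
for prop, value in collector.pairs:
    print(f"{prop}: {value}")
```

Because the pairs are recorded when each element closes, inner properties such as `familyName` appear before the `name` that contains them.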

APIs

Sadly, CrossRef doesn't do Schema.org, so we have to use their API to get some JSON and convert it to HTML with Microdata ourselves.

Here's how to install their official Python library:

pip3 install --user crossref-commons
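For orientation, these are the fields of the returned record that the script relies on. The dictionary is a hand-written, heavily trimmed illustration assembled from the Losavio example above, not a captured API response. Real records contain many more keys, and any of these can be absent. Placing the timestamp under `created` is an assumption.

```python
# Hand-written illustration of the record's shape, not real API output.
record = {
    "DOI": "10.1002/spy2.23",
    "title": [
        "The Internet of Things and the Smart City: Legal challenges "
        "with digital forensics, privacy, and security"
    ],
    "author": [
        {
            "family": "Losavio",
            "given": "Michael M.",
            "ORCID": "http://orcid.org/0000-0003-4542-8599",
        }
    ],
    # Assumed to be where the timestamp in the example above comes from
    "created": {
        "date-parts": [[2018, 4, 27]],
        "date-time": "2018-04-27T03:13:15Z",
    },
    "page": "e23",
    "publisher": "Wiley",
    "container-title": ["Security and Privacy"],
}

# Titles and container titles are lists; dates are nested [year, month, day] lists
print(record["title"][0])
print(record["container-title"][0])
print(record["created"]["date-parts"][0][0])
print(record["author"][0]["family"])
```

The list-wrapped titles and the doubly nested `date-parts` explain the `[0]` and `[0][0]` indexing in the script.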

The Code

import crossref_commons.retrieval
import os
import re

# Set a friendly header
os.environ['CR_API_MAILTO'] = 'yourEmail@example.com'

reference = input("Enter a DOI: ").strip()

# Check DOI is valid
# https://www.crossref.org/blog/dois-and-matching-regular-expressions/
regex = r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$"
matches = re.search(regex, reference, re.IGNORECASE)

if matches:
    # Call the API
    data = crossref_commons.retrieval.get_publication_as_json(reference)

    # Start the citation
    citation_html = '<span itemscope itemtype="http://schema.org/ScholarlyArticle"><span itemprop="citation">'

    # Get all the authors (some records have none, or only partial names)
    authors = []
    for a in data.get("author", []):
        author_html = '<span itemprop="author" itemscope itemtype="http://schema.org/Person">'

        if "ORCID" in a:
            author_html += '<link itemprop="url" href="' + a["ORCID"] + '"/>'

        author_html += '<span itemprop="name"><span itemprop="familyName">' + a.get("family", "") +'</span>, <span itemprop="givenName">' + a.get("given", "") + '</span></span></span>'

        authors.append(author_html)

    # Add the authors to the citation, separated by ampersands
    citation_html += " & ".join(authors)

    # Title and any language information
    if "title" in data:
        headline  = data["title"][0]

        if "language" in data:
            lang = data["language"]
            citation_html += ' <q><cite itemprop="headline" lang="'+lang+'"><span itemprop="inLanguage" content="'+lang+'">'+headline+'</span></cite></q> '
        else:
            citation_html += ' <q><cite itemprop="headline">'+headline+'</cite></q> '

    # Add the date
    if "published-print" in data:
        year = data["published-print"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="'+str(year)+'">' + str(year) + '</time>) '
    elif "issued" in data:
        year = data["issued"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="'+str(year)+'">' + str(year) + '</time>) '
    elif "created" in data:
        timestamp = data["created"]["date-time"]
        year      = data["created"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="'+timestamp+'">' + str(year) + '</time>) '
    elif "deposited" in data:
        timestamp = data["deposited"]["date-time"]
        year      = data["deposited"]["date-parts"][0][0]
        citation_html += '(<time itemprop="datePublished" datetime="'+timestamp+'">' + str(year) + '</time>) '

    # Page number information
    if "page" in data:
        citation_html += ' page: <span itemprop="pagination">' + data["page"] + '</span>. '

    # Publisher
    if "publisher" in data:
        citation_html += '<span itemprop="publisher" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">'+data["publisher"]+'</span></span>. '

    # Publication
    if "container-title" in data:
        citation_html += '<span itemprop="publication">'+data["container-title"][0]+'</span>. '

    # DOI link
    if "DOI" in data:
        doi = data["DOI"]
        doi_url = "https://doi.org/" + doi
        citation_html += '<a itemprop="url" href="'+doi_url+'">'+doi_url+'</a>'

    # End the citation
    citation_html += '</span></span>'

    print(citation_html)
else:
    print("That doesn't look like a valid DOI: " + reference)
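The validation step can also be pulled out into a small function, which makes it easier to test and lets you accept DOIs pasted as full links. This is only a sketch: `normalise_doi` and `PREFIXES` are names invented here, and the list of prefixes is an assumption about what people tend to paste.

```python
import re

# Crossref's suggested pattern for modern DOIs, with the dot after "10"
# escaped so that it only matches a literal dot
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

# Prefixes commonly pasted along with the DOI (assumed list, extend as needed)
PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def normalise_doi(text):
    """Return a bare DOI, or None if the text doesn't look like one."""
    candidate = text.strip()
    for prefix in PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    return candidate if DOI_PATTERN.match(candidate) else None


print(normalise_doi("https://doi.org/10.1002/spy2.23"))
print(normalise_doi("10.1002/spy2.23"))
print(normalise_doi("not a doi"))
```

Note that the pattern only covers modern DOIs. The Crossref blog post linked in the script discusses older ones that need different expressions.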

Opinions about citations

There are too many citation styles. And most of them suck. I used to hate seeing "Smith (1991)" as the only reference. Theoretically, a DOI is the only citation you need - but I discovered that some were missing from major resolvers. And, it is probably helpful to have some human readable information to aid discoverability.

I've tried to keep to author name(s), title, year, publisher, publication, page, DOI. That's more-or-less MLA. But, because of the microdata, a machine can understand the citation and you can convert to your preferred style.

Or, if you wish, you can adapt this code to pump out a different citation style.
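As a starting point for that, here is a sketch which turns the same kind of record into a plain-text author-year citation instead of HTML. The function name `plain_text_citation` is made up, the layout is a rough approximation and not a faithful implementation of any published style guide, and the example record is hand-written from the Losavio citation above.

```python
def plain_text_citation(record):
    """Format a Crossref-style record as 'Family, G. (Year). Title. Journal. URL'."""
    names = []
    for a in record.get("author", []):
        family = a.get("family", "")
        # Reduce given names to initials: "Michael M." becomes "M. M."
        initials = " ".join(part[0] + "." for part in a.get("given", "").split())
        names.append(", ".join(p for p in (family, initials) if p))

    # Same fallback order for dates as the HTML script
    year = "n.d."
    for key in ("published-print", "issued", "created", "deposited"):
        if key in record:
            year = str(record[key]["date-parts"][0][0])
            break

    parts = [" & ".join(names), f"({year})."]
    if record.get("title"):
        parts.append(record["title"][0] + ".")
    if record.get("container-title"):
        parts.append(record["container-title"][0] + ".")
    if "DOI" in record:
        parts.append("https://doi.org/" + record["DOI"])
    # Drop empty pieces, such as the author list when there are no authors
    return " ".join(p for p in parts if p)


# Hand-written example record, not real API output
example = {
    "DOI": "10.1002/spy2.23",
    "author": [{"family": "Losavio", "given": "Michael M."}],
    "created": {"date-parts": [[2018, 4, 27]]},
    "title": [
        "The Internet of Things and the Smart City: Legal challenges "
        "with digital forensics, privacy, and security"
    ],
    "container-title": ["Security and Privacy"],
}

print(plain_text_citation(example))
```

Only the string-building changes between styles. The lookups into the record stay the same.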

Source code on GitLab.
