Getting lots of BIMI images using Python

@edent — Fri, 07 Jun 2024 11:34:09 +0000

I've written before about the moribund BIMI specification. It's a way for brands to include a trusted logo when they send emails. It isn't much used and, apparently, is riddled with security issues.

I thought it might be fun to grab all the BIMI images from the most popular websites, so I can potentially use them in my SuperTinyIcons project.

BIMI images are SVGs. Links to a site's BIMI are stored in a domain's DNS records. Unless the sender picks a different selector, the BIMI record lives on a default._bimi. subdomain.

If you run dig TXT default._bimi.linkedin.com you'll receive back:

;; ANSWER SECTION:
default._bimi.linkedin.com. 3600 IN TXT "v=BIMI1; l=https://media.licdn.com/media/AAYQAQQhAAgAAQAAAAAAABrLiVuNIZ3fRKGlFSn4hGZubg.svg; a=https://media.licdn.com/media/AAYAAQQhAAgAAQAAAAAAALe_JUaW1k4JTw6eZ_Gtj2raUw.pem;"

Awesome! We can grab the .svg URL and download the file.
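To make the record's structure a bit more concrete, here's a rough sketch that splits a BIMI TXT record into its tags (v, l and a). The function name parse_bimi_record is made up for illustration, and it assumes the tags are separated by semicolons and that no URL contains a semicolon of its own.

```python
def parse_bimi_record(record: str) -> dict[str, str]:
    """Split a record like 'v=BIMI1; l=https://...; a=...;' into a dict of tags."""
    tags = {}
    # Strip the surrounding quote marks that dig adds around TXT data.
    for part in record.strip().strip('"').split(';'):
        # Split on the first '=' only, because URLs may contain '=' themselves.
        name, separator, value = part.partition('=')
        if separator:
            tags[name.strip().lower()] = value.strip()
    return tags


record = '"v=BIMI1; l=https://media.licdn.com/media/AAYQAQQhAAgAAQAAAAAAABrLiVuNIZ3fRKGlFSn4hGZubg.svg; a=https://media.licdn.com/media/AAYAAQQhAAgAAQAAAAAAALe_JUaW1k4JTw6eZ_Gtj2raUw.pem;"'
tags = parse_bimi_record(record)
print(tags['l'])
```

The l tag holds the logo and the a tag holds the certificate, so tags['l'] is the URL to download.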

Getting a list of BIMI-enabled domains is difficult. Thankfully, Freddie Leeman has done some excellent analysis work and was happy to share a list of over 7,000 domains which have BIMI.

Let's get cracking with a little Python. First up, install dnspython (pip install dnspython) if you don't already have it.

This gets the TXT record from the domain name:

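import dns.resolver

# Look up the TXT record on the BIMI subdomain.
# dnspython 2.0 renamed query() to resolve(); query() is deprecated.
response = dns.resolver.resolve('default._bimi.linkedin.com', 'TXT')
result = response.rrset.to_text()
print(result)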

There are various ways of extracting the URL. I decided to be lazy and use a regex. Sue me.

import re

# Capture everything after "l=" up to the next semicolon or quote mark
pattern = r'l=(https[^;"]*[;"])'
match = re.search(pattern, result)
if match:
    # Remove the trailing semicolon or quote mark
    url = match.group(1).rstrip(';"')
    print(f'Matched URL: {url}')
else:
    print(f'No match: {result}')
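One thing the regex can trip over: a single DNS TXT string tops out at 255 bytes, so long records are split into several quoted chunks, and a quote mark in the middle of a URL would cut the match short. dnspython exposes those chunks as a tuple of bytes in the strings attribute of each TXT rdata. Here's a small sketch of gluing them back together before searching. The helper name join_txt_strings and the example chunks are invented for illustration.

```python
def join_txt_strings(strings):
    """Concatenate the chunks of one TXT record, with nothing in between."""
    return b''.join(strings).decode('utf-8', errors='replace')


# Made-up example of a record that was split in the middle of the URL.
chunks = (b'v=BIMI1; l=https://example.com/', b'logo.svg; a=;')
print(join_txt_strings(chunks))

# With a real lookup, this would be used along the lines of:
#   for rdata in response:
#       text = join_txt_strings(rdata.strings)
```

The joined text no longer contains the quote marks, so a regex run against it would need to stop at the semicolon alone.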

Putting it all together, this reads in the list of domains, finds the BIMI TXT record, grabs the URL, and saves the SVG.

import re

import dns.exception
import dns.resolver
import requests

# Send a browser-like User-Agent with each download
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}

pattern = r'l=(https[^;"]*[;"])'

# One domain per line; skip blank lines
with open('domains.txt', 'r') as domain_file:
    domains = sorted(line.strip() for line in domain_file if line.strip())

for domain in domains:
    bimi_domain = 'default._bimi.' + domain
    try:
        # resolve() replaces query(), which is deprecated since dnspython 2.0
        response = dns.resolver.resolve(bimi_domain, 'TXT')
    except dns.exception.DNSException:
        print(f'DNS error with {bimi_domain}')
        continue

    result = response.rrset.to_text()
    match = re.search(pattern, result)
    if not match:
        print(f'No match from {domain}: {result}')
        continue

    # Remove the trailing semicolon or quote mark
    svg_url = match.group(1).rstrip(';"')
    print(f'Downloading: {svg_url}')
    try:
        svg = requests.get(svg_url, allow_redirects=True, timeout=30, headers=headers)
        # Treat 404s and other HTTP errors as failures rather than saving the error page
        svg.raise_for_status()
        with open(domain + '.svg', 'wb') as svg_file:
            svg_file.write(svg.content)
    except (requests.RequestException, OSError) as error:
        print(f'Error with {domain}: {error}')

Obviously, it could be made a lot more efficient and download the files in parallel.
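As a rough idea of what the parallel version might look like, here's a sketch using a thread pool from the standard library. It isn't what I ran. The names run_in_parallel and fake_fetch are invented for illustration, and fake_fetch is a stand-in for a function that does the DNS lookup and download for one domain.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(domains, worker, max_workers=16):
    """Call worker(domain) for each domain on a thread pool.

    Returns a dict mapping each domain to the worker's return value,
    or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, domain): domain for domain in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as error:
                # Keep going if one domain fails; record what went wrong.
                results[domain] = error
    return results


def fake_fetch(domain):
    """Hypothetical stand-in for the real lookup-and-download step."""
    if domain == 'broken.example':
        raise ValueError('no BIMI record')
    return f'{domain}.svg'


results = run_in_parallel(['example.com', 'example.org', 'broken.example'], fake_fetch)
for domain, outcome in sorted(results.items()):
    print(domain, outcome)
```

Threads suit this job because nearly all the time is spent waiting on DNS and HTTP responses rather than doing any computation.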

I found a few bugs in various BIMI implementations, including:

ted.com and homeadvisor.com use an http URL
consumerreports.org and sleepfoundation.org have a misplaced space in their TXT records
audubon.org had an invalid certificate
mac.com was blank - as were discogs.com, livechatinc.com, icloud.com, me.com, lung.org, miro.com, protonmail.ch and many others.
alabama.gov timed out - as did nebraska.gov, uclahealth.org and several others.
politico.com returned a 404 for its BIMI - as do lots of others.
coopersurgical.com is 8MB!
There are loads of SVGs which bust the 32KB maximum size requirement - some by multiple megabytes.

I might spend some time over the next few weeks optimising the code and looking for any other snafus. I didn't find any with ECMAScript in them. Yet!
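A few of those checks are easy to automate. Here's a minimal sketch, assuming the logo URL and the downloaded bytes are already to hand. The function name check_logo is made up, and the script test is deliberately crude: it only looks for a literal script tag, so it would miss things like event-handler attributes.

```python
from urllib.parse import urlparse

MAX_SVG_BYTES = 32 * 1024  # the 32KB limit mentioned above


def check_logo(url, content):
    """Return a list of problems found with one BIMI logo."""
    issues = []
    if urlparse(url).scheme != 'https':
        issues.append('URL is not https')
    if len(content) > MAX_SVG_BYTES:
        issues.append(f'file is {len(content)} bytes, over the 32KB limit')
    # Crude check: a case-insensitive search for an opening script tag.
    if b'<script' in content.lower():
        issues.append('contains a <script> element')
    return issues


print(check_logo('http://example.com/logo.svg', b'<svg><SCRIPT>1</SCRIPT></svg>'))
```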

BIMI – Terence Eden’s Blog
