Getting lots of BIMI images using Python
I've written before about the moribund BIMI specification. It's a way for brands to include a trusted logo when they send emails. It isn't much used and, apparently, is riddled with security issues.
I thought it might be fun to grab all the BIMI images from the most popular websites, so I can potentially use them in my SuperTinyIcons project.
BIMI images are SVGs. Links to a site's BIMI are stored in a domain's DNS records. All BIMI records must be published on a default._bimi subdomain.
If you run dig TXT default._bimi.linkedin.com you'll receive back:
;; ANSWER SECTION:
default._bimi.linkedin.com. 3600 IN TXT "v=BIMI1; l=https://media.licdn.com/media/AAYQAQQhAAgAAQAAAAAAABrLiVuNIZ3fRKGlFSn4hGZubg.svg; a=https://media.licdn.com/media/AAYAAQQhAAgAAQAAAAAAALe_JUaW1k4JTw6eZ_Gtj2raUw.pem;"
Awesome! We can grab the .svg URL and download the file.
Getting a list of BIMI enabled domains is difficult. Thankfully, Freddie Leeman has done some excellent analysis work and was happy to share a list of over 7,000 domains which have BIMI.
Let's get cracking with a little Python. First up, install DNSPython if you don't already have it.
This gets the TXT record from the domain name:
Python 3
import dns.resolver

response = dns.resolver.query('default._bimi.linkedin.com', 'TXT')
result = response.rrset.to_text()
print(result)
There are various ways of extracting the URL. I decided to be lazy and use a regex. Sue me.
Python 3
import re
pattern = r'l=(https[^;"]*[;"])'
match = re.search(pattern, result)
if match:
    # Remove the trailing semicolon or quote mark
    url = match.group(1).rstrip(';"')
    print(f'Matched URL: {url}')
else:
    print(f'No match: {result}')
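A regex works, but the record is just a list of key=value tags separated by semicolons, so it can also be parsed directly. Here's a sketch of that alternative (parse_bimi_record is a hypothetical helper, not part of any library):

```python
def parse_bimi_record(record):
    """Split a BIMI TXT record into a dict of its key=value tags."""
    tags = {}
    for part in record.strip('"').split(';'):
        part = part.strip()
        if '=' in part:
            # Split on the first '=' only, since the URL itself may not contain one
            key, _, value = part.partition('=')
            tags[key.strip()] = value.strip()
    return tags

record = 'v=BIMI1; l=https://example.com/logo.svg; a=https://example.com/evidence.pem;'
tags = parse_bimi_record(record)
print(tags['l'])  # https://example.com/logo.svg
```

This also gives you the a= evidence document URL for free, if you want it later.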
Putting it all together, this reads in the list of domains, finds the BIMI TXT record, grabs the URL, and saves the SVG.
Python 3
import dns.exception
import dns.resolver
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}
pattern = r'l=(https[^;"]*[;"])'

with open('domains.txt', 'r') as domain_file:
    domains = domain_file.readlines()
domains.sort()

for domain in domains:
    bimi_domain = 'default._bimi.' + domain.strip()
    try:
        response = dns.resolver.query(bimi_domain, 'TXT')
        result = response.rrset.to_text()
        match = re.search(pattern, result)
        if match:
            # Remove the trailing semicolon or quote mark
            svg_url = match.group(1).rstrip(';"')
            print(f'Downloading: {svg_url}')
            try:
                svg = requests.get(svg_url, allow_redirects=True, timeout=30, headers=headers)
                with open(domain.strip() + '.svg', 'wb') as svg_file:
                    svg_file.write(svg.content)
            except requests.RequestException:
                print(f'Error with {domain.strip()}: {result}')
        else:
            print(f'No match from {domain.strip()}: {result}')
    except dns.exception.DNSException:
        print(f'DNS error with {bimi_domain}')
Obviously, it could be made a lot more efficient and download the files in parallel.
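Since both the DNS lookups and the downloads are I/O-bound, a thread pool is the easy win. Here's a minimal sketch using concurrent.futures — fetch_bimi is a stand-in for the lookup-and-download logic above, and here just returns the selector hostname it would query:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_bimi(domain):
    # Stand-in for the DNS lookup + SVG download above;
    # returns the hostname it would query, so the sketch runs offline.
    return f"default._bimi.{domain.strip()}"

domains = ['linkedin.com', 'ebay.com', 'cnn.com']

# Run the lookups concurrently; 10 workers is an arbitrary choice
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch_bimi, d): d for d in domains}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(results['linkedin.com'])  # default._bimi.linkedin.com
```

Swap the body of fetch_bimi for the real work and the structure stays the same.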
I found a few bugs in various BIMI implementations, including:
- ted.com and homeadvisor.com use a http URL
- consumerreports.org and sleepfoundation.org have a misplaced space in their TXT record
- audubon.org had an invalid certificate
- mac.com was blank - as was discogs.com, livechatinc.com, icloud.com, me.com, lung.org, miro.com, protonmail.ch and many others.
- alabama.gov had a timeout - as did nebraska.gov, uclahealth.org and several others.
- politico.com had a 404 for their BIMI - as do lots of others.
- coopersurgical.com is 8MB!
- There are loads of SVGs which bust the 32KB maximum size requirement - some by multiple megabytes.
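Spotting the oversized files is a one-liner's worth of work once everything is downloaded. A small sketch, assuming the SVGs sit in one directory and taking 32 KB as 32 × 1024 bytes:

```python
import os

MAX_BIMI_BYTES = 32 * 1024  # the recommended size ceiling for BIMI SVGs

def oversized_svgs(directory):
    """Return (filename, size) pairs for SVGs exceeding the cap."""
    offenders = []
    for name in sorted(os.listdir(directory)):
        if name.endswith('.svg'):
            size = os.path.getsize(os.path.join(directory, name))
            if size > MAX_BIMI_BYTES:
                offenders.append((name, size))
    return offenders
```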
I might spend some time over the next few weeks optimising the code and looking for any other snafus. I didn't find any with ECMAScript in them. Yet!