Getting lots of BIMI images using Python
I've written before about the moribund BIMI specification. It's a way for brands to include a trusted logo when they send emails. It isn't much used and, apparently, is riddled with security issues.
I thought it might be fun to grab all the BIMI images from the most popular websites, so I can potentially use them in my SuperTinyIcons project.
BIMI images are SVGs. Links to a site's BIMI logo are stored in the domain's DNS records. All BIMI records must be on a default._bimi. subdomain.
If you run dig TXT default._bimi.linkedin.com
you'll receive back:
;; ANSWER SECTION:
default._bimi.linkedin.com. 3600 IN TXT "v=BIMI1; l=https://media.licdn.com/media/AAYQAQQhAAgAAQAAAAAAABrLiVuNIZ3fRKGlFSn4hGZubg.svg; a=https://media.licdn.com/media/AAYAAQQhAAgAAQAAAAAAALe_JUaW1k4JTw6eZ_Gtj2raUw.pem;"
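A BIMI record is just a string of semicolon-separated tag=value pairs, so it's straightforward to pull apart. Here's a quick sketch; parse_bimi and the example.com URLs are my own illustration, not a real lookup:

```python
# Illustrative record, same shape as the dig output above (example.com URLs, not real)
record = '"v=BIMI1; l=https://example.com/logo.svg; a=https://example.com/cert.pem;"'

def parse_bimi(record):
    """Split a BIMI TXT record into its tag=value pairs."""
    return dict(
        part.strip().split('=', 1)   # split on the first = only, in case a value contains one
        for part in record.strip('"').split(';')
        if '=' in part
    )

fields = parse_bimi(record)
print(fields['l'])  # the logo URL
```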
Awesome! We can grab the .svg URL and download the file.
Getting a list of BIMI enabled domains is difficult. Thankfully, Freddie Leeman has done some excellent analysis work and was happy to share a list of over 7,000 domains which have BIMI.
Let's get cracking with a little Python. First up, install dnspython (pip install dnspython) if you don't already have it.
This gets the TXT record from the domain name:
import dns.resolver

# Look up the BIMI TXT record. dns.resolver.resolve() replaces the
# deprecated dns.resolver.query() in dnspython 2.
response = dns.resolver.resolve('default._bimi.linkedin.com', 'TXT')
result = response.rrset.to_text()
print(result)
There are various ways of extracting the URL. I decided to be lazy and use a regex. Sue me.
import re

pattern = r'l=(https[^;"]*[;"])'
match = re.search(pattern, result)
if match:
    # Remove the trailing semicolon or quote mark
    url = match.group(1).rstrip(';"')
    print(f'Matched URL: {url}')
else:
    print(f'No match: {result}')
Putting it all together, this reads in the list of domains, finds the BIMI TXT record, grabs the URL, and saves the SVG.
import re

import dns.resolver
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
}
pattern = r'l=(https[^;"]*[;"])'

with open('domains.txt', 'r') as domain_file:
    domains = domain_file.readlines()
domains.sort()

for domain in domains:
    bimi_domain = 'default._bimi.' + domain.strip()
    try:
        response = dns.resolver.resolve(bimi_domain, 'TXT')
    except Exception:
        print(f'DNS error with {bimi_domain}')
        continue
    result = response.rrset.to_text()
    match = re.search(pattern, result)
    if match:
        # Remove the trailing semicolon or quote mark
        svg_url = match.group(1).rstrip(';"')
        print(f'Downloading: {svg_url}')
        try:
            svg = requests.get(svg_url, allow_redirects=True, timeout=30, headers=headers)
            with open(domain.strip() + '.svg', 'wb') as svg_file:
                svg_file.write(svg.content)
        except Exception:
            print(f'Error with {domain}: {result}')
    else:
        print(f'No match from {domain}: {result}')
Obviously, it could be made a lot more efficient and download the files in parallel.
I found a few bugs in various BIMI implementations, including:
- ted.com and homeadvisor.com use an http URL
- consumerreports.org and sleepfoundation.org have a misplaced space in their TXT records
- audubon.org had an invalid certificate
- mac.com was blank - as were discogs.com, livechatinc.com, icloud.com, me.com, lung.org, miro.com, protonmail.ch and many others.
- alabama.gov had a timeout - as did nebraska.gov, uclahealth.org and several others.
- politico.com had a 404 for their BIMI - as do lots of others.
- coopersurgical.com is 8MB!
- There are loads of SVGs which bust the 32KB maximum size requirement - some by multiple megabytes.
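Checking the downloaded files against that size limit is quick to script. A sketch, taking 32KB to mean 32 × 1024 bytes; oversized_svgs is my own name:

```python
from pathlib import Path

MAX_BIMI_BYTES = 32 * 1024  # the 32KB maximum mentioned above

def oversized_svgs(directory='.'):
    """Return (filename, size) pairs for SVGs over the limit, largest first."""
    offenders = [
        (svg.name, svg.stat().st_size)
        for svg in Path(directory).glob('*.svg')
        if svg.stat().st_size > MAX_BIMI_BYTES
    ]
    return sorted(offenders, key=lambda pair: pair[1], reverse=True)
```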
I might spend some time over the next few weeks optimising the code and looking for any other snafus. I didn't find any with ECMAScript in them. Yet!