python – Terence Eden’s Blog

Random File Format

@edent — Wed, 01 Apr 2026 11:34:57 +0000

This was an idea I had back in the days of Napster.

At the turn of the century, it was common to listen to an "acquired" music file only to find it was missing a few seconds at the end due to a prematurely stopped download. Some video formats would refuse to play at all if the moov atom at the end of the file was missing.

I wondered if it would be possible to make a file format which was close to impossible to read unless the entire file was intact. I don't mean including a checksum to detect download errors - I mean a layout which was intrinsically fragile to corruption.

While digging through an old backup CD, I found my original notes. I'm rather impressed at what neophyte-me had constructed. My outline was:

The file ends with a 32 bit pointer. This points to the location of the first information block.
The information block describes the length of the data block which follows it.
At the end of the data block is another 32 bit pointer. This points to the location of the next information block.
The start of the file may be a pointer, or it may be padded with random data.
There may be random data padded between the data blocks.

This ensures that a file which has been only partially downloaded - whether truncated at the end or missing pieces elsewhere - cannot be successfully read.

Here's a worked example. Start at the end and follow the thread.

Random data.
Data block size is 2.
Data
Data
EOF.
Data block size is 1.
Data.
Go to location 1.
Random data.
Go to location 5.
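
The same walk can be sketched in a few lines of Python. This is only a toy model of the example above: each list entry stands in for one location (counting from 0), and the tuple labels and payload strings are made up for illustration.

```python
# Toy model of the worked example: one list entry per location.
cells = [
    ("random",),         # 0: padding
    ("size", 2),         # 1: data block size is 2
    ("data", "second"),  # 2
    ("data", "third"),   # 3
    ("eof",),            # 4: end of the thread
    ("size", 1),         # 5: data block size is 1
    ("data", "first"),   # 6
    ("goto", 1),         # 7: go to location 1
    ("random",),         # 8: padding
    ("goto", 5),         # 9: final pointer, go to location 5
]

def follow_thread(cells):
    """Start at the final pointer and collect the data in reading order."""
    position = cells[-1][1]
    output = []
    while True:
        size = cells[position][1]
        first_data = position + 1
        for _kind, value in cells[first_data : first_data + size]:
            output.append(value)
        # Whatever follows the data is either a pointer or the EOF marker.
        trailer = cells[first_data + size]
        if trailer[0] == "eof":
            return output
        position = trailer[1]

print(follow_thread(cells))
```

Lose the last entry and there is nowhere to start; lose anything in the middle and the thread breaks part-way through.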

There are, of course, a few downsides to this idea.

Most prominently, it bloats file size. If the data block size was a constant 1MB, that would pad the size a negligible amount. But with variable data block size, it could increase it significantly. Random padding also increases the size.

If the block size is consistent and there's no random padding data, the files can be mostly reconstructed.

Depending on which parts of the file are missing, it may be possible to recover the majority of the file.

A pointer size of 32 bits restricts the file size to less than 4GB. A 64 bit pointer might be excessive or might be future-proof!

With highly structured files with predictable patterns, or text files, it may be easy to recover large chunks of information.

A malformed file could contain an infinite loop of pointers.

Perhaps a magic number should be at the start (or end) of the file?

While reading the file is as simple as following the pointers, constructing the file is more complex, especially if blocks have variable lengths.

Code

Here's a trivial encoder. It reads a file in consistently sized chunks of 1,024 bytes. It shuffles them up and writes them to a new file. The last 4 bytes contain a pointer to the first block, which says the data length is 1,024. After that, there is a 4 byte pointer to the next block location.

import random

#   Size of data, headers, and pointers.
data_length = 1024
header_length  = 4
pointer_length = 4

#   Read the file into a data structure.
original_blocks = list()
with open( "test.jpg", "rb") as file:
    for data in iter( lambda: file.read( data_length ), b"" ):
        #   Add padding if length is less than the desired length.
        padding = data_length - len( data )
        data += b"\0" * padding
        original_blocks.append( data )

#   How many blocks are there?
original_length = len( original_blocks )

#   Create a random order of blocks.
order = list( range( 0, original_length ) )
random.shuffle( order )

#   Where is the start of the file?
first_block_index = order.index( 0 )
first_block_pointer = first_block_index * ( header_length + data_length + pointer_length )

#   Loop through the order and write to a new file.
i = 0
#   Open as binary file to add the pointers correctly.
with open( "output.rff", "wb" ) as output:
    while i < original_length:
        #   Where are we?
        current_block = i
        current_block_value = order[i]
        #   Write length of data as a little-endian 32 bit integer.
        output.write( data_length.to_bytes( header_length, "little") )
        #   Write data
        output.write( original_blocks[ current_block_value ] )
        i = i+1
        #   Last block. Write an EOF marker in place of the pointer.
        if ( current_block_value + 1 >= original_length ):
            eof = 4294967295
            output.write( eof.to_bytes( pointer_length, "little") )
        else:
            next_block = order.index( current_block_value + 1 )
            #   Write pointer to next block
            next_block_location = next_block * ( header_length + data_length + pointer_length )
            output.write( next_block_location.to_bytes( pointer_length, "little" ) )
    #   At the end of the file, write the pointer to block 0.
    output.write( first_block_pointer.to_bytes( pointer_length, "little" ) )

And here is a similarly trivial decoder. It reads the last 32 bits, moves to that location, reads the block size, reads the data and writes it to a new file, then reads the next pointer.

import os
#   Size of data, headers, and pointers.
header_length  = 4
pointer_length = 4
#   File name to write to.
decoded_file = "decoded.bin"

#   Create an empty file.
with open( decoded_file, "w") as file:
    pass

#   Function to loop through the blocks.
def read_block( position, i ):
    #   Move to the position in the file.
    input_file.seek( position, 0 )
    #   Read the data length header.
    data_length = int.from_bytes( input_file.read( header_length ), "little" )
    #   Move to the data block.
    input_file.seek( position + header_length, 0 )
    #   Read the data.
    data = input_file.read( data_length )
    #   Read the pointer header.
    next_position = int.from_bytes( input_file.read( pointer_length ), "little" )
    #   If this is the final block, it may have null padding. Remove it.
    if ( next_position == 4294967295 ) :
        data = data.rstrip(b"\0")
    #   Append the data to the decoded file.
    with open( decoded_file, "ab" ) as file:
        file.write( data )
    #   If this is the final block, finish searching.
    if ( next_position == 4294967295 ) :
        print("File decoded.")
    else:
        #   Move to the next position.
        read_block( next_position, i+1 )

#   Open the file as binary.
input_file = open( "output.rff", "rb" )

#   Read the last 4 bytes.
input_file.seek( -4, 2 )

#   Get position of first block
first_block = int.from_bytes( input_file.read(), "little" )

#   Start reading the file.
seek_to = first_block
read_block( seek_to, 0 )

As I said, these are both trivial. They are a bit buggy and contain some hardcoded assumptions.
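
One example: the decoder calls itself once per block, so a file with more than roughly a thousand blocks will hit Python's default recursion limit, and a malformed file with a loop of pointers will do the same. Here's a minimal sketch of a loop-based alternative, assuming the same layout the encoder writes (4 byte length, data, 4 byte pointer, with the final 4 bytes pointing at the first block). The function name `decode_rff` and the error messages are hypothetical, and it works on bytes in memory rather than a file handle to keep the example short. It does not strip the null padding from the final block.

```python
EOF_MARKER = 4294967295   # Same value the encoder writes after the last block.
HEADER_LENGTH = 4
POINTER_LENGTH = 4

def decode_rff(blob):
    """Follow the pointers through an RFF byte string and return the data."""
    if len(blob) < POINTER_LENGTH:
        raise ValueError("File is too short to contain a pointer")
    position = int.from_bytes(blob[-POINTER_LENGTH:], "little")
    visited = set()
    chunks = []
    while True:
        #   Seeing the same block twice means the pointers form a loop.
        if position in visited:
            raise ValueError("Pointer loop detected")
        visited.add(position)
        data_start = position + HEADER_LENGTH
        if data_start > len(blob):
            raise ValueError("Pointer is outside the file")
        data_length = int.from_bytes(blob[position:data_start], "little")
        data_end = data_start + data_length
        pointer_end = data_end + POINTER_LENGTH
        if pointer_end > len(blob):
            raise ValueError("Block runs past the end of the file")
        chunks.append(blob[data_start:data_end])
        next_position = int.from_bytes(blob[data_end:pointer_end], "little")
        if next_position == EOF_MARKER:
            return b"".join(chunks)
        position = next_position

#   A hand-built two block file, stored out of order.
eof = EOF_MARKER.to_bytes(4, "little")
second_block = (2).to_bytes(4, "little") + b"lo" + eof                        # offset 0
first_block = (3).to_bytes(4, "little") + b"hel" + (0).to_bytes(4, "little")  # offset 10
sample = second_block + first_block + (10).to_bytes(4, "little")
print(decode_rff(sample))
```

To use it on a real file, something like `decode_rff( open( "output.rff", "rb" ).read() )` should do, at the cost of holding the whole file in memory.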

Here are two files encoded as "RFF" - Random File Format - an image by Maria Sibylla Merian, and the text of Romeo and Juliet.

Have fun decoding them!

Removing "/Subtype /Watermark" images from a PDF using Linux

@edent — Thu, 22 Jan 2026 12:34:02 +0000

Problem: I've received a PDF which has a large "watermark" obscuring every page.

Investigating: Opening the PDF in LibreOffice Draw allowed me to see that the watermark was a separate image floating above the others.

Manual Solution: Hit page down, select image, delete, repeat 500 times. BORING!

Further Investigating: Using pdftk, it's possible to decompress a PDF. That makes it easier to look through manually.

pdftk input.pdf output output.pdf uncompress

Hey presto! A PDF you can open in a text editor! Deep joy!

Searching: On a hunch, I searched for "watermark" and found several lines like this:

<<
/Length 548
>>
stream
/Figure <>BDC q 0 0 477 733.464 re W n q /GS0 gs 479.2799893 0 0 735.5999836 -1.0800002 -1.0559941 cm /Im0 Do Q EMC 
/Figure <>BDC Q q 28.333 300.661 420.334 126.141 re W n q /GS0 gs 420.3339603 0 0 126.1418879 28.3330078 300.6610601 cm /Im1 Do Q EMC
/Figure <>BDC Q q 16.106 0 444.787 215.464 re W n q /GS0 gs 444.7874274 0 0 216.5921386 16.1062775 -1.1281493 cm /Im2 Do Q EMC
/Artifact <>BDC Q q 0.7361145 0 0 0.7361145 113.3616638 240.8575745 cm /GS1 gs /Fm0 Do Q EMC
endstream
endobj

Those are Marked Content Blocks. In theory you can just chop out the line with /Subtype /Watermark, but each stream has a /Length value - so you'd also need to adjust that to account for what you've changed - otherwise the layout goes all screwy.

That led me to PyMuPDF which claimed to solve the problem. But running that code only removed some of the watermarks. It got stuck in an infinite loop on certain pages.

So, now that I had more detailed knowledge, I managed to get an LLM to construct something which mostly seems to work.

Does it work with every PDF? I don't know. Does it contain subtle implementation bugs? Probably. Is there an easier way to do this? Not that I can find.

import re
import pymupdf

# Open the PDF
doc = pymupdf.open("output.pdf")

# Regex of the watermarks
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>BDC.*?EMC",
    re.DOTALL
)

# Loop through the PDF's pages
for page_num, page in enumerate(doc, start=1):
    print(f"Processing page {page_num}")
    xrefs = page.get_contents()
    for xref in xrefs:
        cont = doc.xref_stream(xref)
        new_cont, n = pattern.subn(b"", cont)
        if n > 0:
            print(f"  Removed {n} watermark block(s)")
            doc.update_stream(xref, new_cont)

doc.save("no-watermarks.pdf")
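
To see what that regex actually matches, it can be run against a hand-written scrap of content stream without touching a PDF. The sample bytes below are made up for illustration (they are not taken from the real file), but they have the same shape as the marked content blocks shown earlier.

```python
import re

# The same pattern as above.
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>BDC.*?EMC",
    re.DOTALL
)

# A made-up content stream: one ordinary figure, one watermark artifact.
sample = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC q /Fm0 Do Q EMC\n"
)

cleaned, count = pattern.subn(b"", sample)
print(count)
print(cleaned)
```

Only the /Artifact block with the /Watermark subtype is removed. The /Figure block is left alone because the pattern has to start at /Artifact and the [^>]*? parts cannot cross the closing >> of the property list.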

One of the (many) problems with Vibe Coding is that trying to get an LLM to spit out something useful depends massively on how well you know the subject area. I'm proud to say I know vanishingly little about the baroque PDF specification - which meant that most of my attempts to use various "AI" tools consisted of me saying "No, that doesn't work" and the accurs'd machine saying back "Golly-gee! You're right! Let me fix that!" and then breaking something else.

I'm not sure this is the future we wanted, but it looks like the future we've got.

Improving PixelMelt's Kindle Web Deobfuscator

@edent — Sun, 19 Oct 2025 11:34:37 +0000

A few days ago, someone called PixelMelt published a way for Amazon's customers to download their purchased books without DRM. Well… sort of.

In their post "How I Reversed Amazon's Kindle Web Obfuscation Because Their App Sucked" they describe the process of spoofing a web browser, downloading a bunch of JSON files, reconstructing the obfuscated SVGs used to draw individual letters, and running OCR on them to extract text.

There were a few problems with this approach.

Firstly, the downloader was hard-coded to only work with the .com site. That fix was simple - do a search and replace on amazon.com with amazon.co.uk. Easy!

But the harder problem was with the OCR. The code was designed to visually centre each extracted glyph. That gives a nice amount of whitespace around the character which makes it easier for OCR to run. The only problem is that some characters are ambiguous when centred:

When I ran the code, lots of full-stops became midpoints, commas became apostrophes, and various other characters went a bit wonky.

That made the output rather hard to read. This was compounded by the way line-breaks were treated. Modern eBooks are designed to be reflowable - no matter the size of your screen, lines should only break on a new paragraph. This had forced linebreaks at the end of every displayed line - rather than at the end of a paragraph.

So I decided to fix it.

A New Approach

I decided that OCRing an entire page would yield better results than single characters. I was (mostly) right. Here's what a typical page looks like after de-obfuscation and reconstruction:

As you can see - the typesetting is good for the body text, but skew-whiff for the title. Bold and italics are preserved. There are no links or images.

Here's how I did it.

Extract the characters

As in the original code, I took the SVG path of the character and rendered it as a monochrome PNG. Rather than centring the glyph, I used the height and width provided in the glyphs.json file. That gave me a directory full of individual letters, numbers, punctuation marks, and ligatures. These were named by fontKey (bold, italic, normal, etc).

Create a blank page

The page_data_0_4.json has a width and height of the page. I created a white PNG with the same dimensions. The individual characters could then be placed on that.

Resize the characters

In the page_data_0_4.json each run of text has a fontKey - which allows the correct glyph to be selected. There's also a fontSize parameter. Most text seems to be (the ludicrously precise) 19.800001. If a font had a different size, I temporarily scaled the glyph in proportion to 19.8.

Each glyph has an associated xPosition, along with a transform which gives X and Y offsets. That allows for indenting and other text layouts.

The characters were then pasted on to the blank page.

Once every character from that page had been extracted, resized, and placed - the page was saved as a monochrome PNG.
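
The resizing step is simple arithmetic. Here's a minimal sketch of one way to read "in proportion to 19.8", assuming the glyph's width and height are already known in pixels. The function name and the example numbers are hypothetical and not taken from the Kindle JSON files.

```python
BASE_FONT_SIZE = 19.8   # The size most body text uses.

def scaled_glyph_size(width, height, font_size, base=BASE_FONT_SIZE):
    """Scale a glyph's pixel dimensions in proportion to the base font size."""
    scale = font_size / base
    return (round(width * scale), round(height * scale))

# Body text at 19.800001 is left effectively unchanged.
print(scaled_glyph_size(20, 30, 19.800001))
# A heading at twice the base size doubles in both directions.
print(scaled_glyph_size(20, 30, 39.6))
```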

OCR the page

Tesseract 5 is a fast, modern, and reasonably accurate OCR engine for Linux.

Running tesseract page_0022.png output -l eng produced a .txt file with all the text extracted.

For a more useful HTML style layout, the hOCR output can be used: tesseract page_0022.png output -l eng hocr

Or, a PDF with embedded text: tesseract page_0022.png output -l eng pdf

Mistakes

OCR isn't infallible. Even with a high resolution image and a clear font, there were some errors.

Superscript numerals for footnotes were often missing from the OCR.
Words can run together even if they are well spaced.
Tesseract can recognise bold and italic characters - but it outputs everything as plain text.

What's missing?

Images aren't downloaded. I took a brief look and, while there are links to them in the metadata, they're downloaded as encrypted blobs. I'm not clever enough to do anything with them.

The OCR can't pick out semantic meaning. Chapter headings and footnotes are rendered the same way as text.

Layout is flat. The image of the page might have an indent, but the outputted text won't.

What's next?

This is very far from perfect. It can give you a visually similar layout to a book you have purchased from Amazon. But it won't be reflowable.

The text will be reasonably accurate. But there will be plenty of mistakes.

You can get an HTML layout with hOCR. But it will be missing formatting and links.

Processing all the JSON files and OCRing all the images is relatively quick. But tweaking and assembling is still fairly manual.

There's nothing particularly clever about what I've done. The original code didn't come with an open source software licence, so I am unable to share my changes - but any moderately competent programmer could recreate this.

Personally, I've just stopped buying books from Amazon. I find that Kobo is often cheaper and their DRM is easy to bypass. But if you have many books trapped in Amazon - or a book is only published there - this is a barely adequate way to liberate it for your personal use.

Get alerted when your Kobo wishlist books drop in price

@edent — Thu, 01 May 2025 11:34:06 +0000

The brilliant kobodl Python package allows you to interact with your Kobo account programmatically. You can list all the books you've purchased, download them, and - as of version 0.12.0 - view your wishlist.

Here's a rough and ready Python script which will tell you when any of the books on your wishlist have dropped below a certain amount.

Prerequisites

Install kobodl following their guide.
Log in with your account by running kobodl user add
Check that the configuration file is saved in the default location /home/YOURUSERNAME/.config/kobodl.json

Get your wishlist

The kobodl function GetWishList() takes a list of users and returns a generator. The generator contains the book's name and author. The price is a string (for example 5.99 GBP) so needs to be split at the space.

Here's a quick proof of concept:

import kobodl
wishlist = kobodl.book.actions.GetWishList( kobodl.globals.Settings().UserList.users )
for book in wishlist:
    print( book.Title + " - "  + book.Author + " " + book.Price.split()[0] )

Sort the wishlist

Using Pandas, the data can be added to a dataframe and then sorted by price:

import kobodl
import pandas as pd

#   Set up the lists
items  = []
prices = []
ids    = []

wishlist = kobodl.book.actions.GetWishList( kobodl.globals.Settings().UserList.users )

for book in wishlist:
    items.append( book.Title + " - "  + book.Author )
    prices.append( float( book.Price.split()[0] ) )
    ids.append( book.RevisionId )

#   Place into a DataFrame
all_items = zip( ids, items, prices )
book_prices = pd.DataFrame( list(all_items), columns = ["ID", "Name", "Price"])
book_prices = book_prices.reset_index()  

#   Get books cheaper than three quid
cheap_df = book_prices[ book_prices["Price"] < 3 ]

Create the Message

This will write the body text of the email. It gives you the price, book details, and a search link to buy the book.

from urllib.parse import quote_plus

#   Search Prefix
website = "https://www.kobo.com/gb/en/search?query="

#   Email Body
message = ""

for index, row in cheap_df.sort_values("Price").iterrows():
    name  = row["Name"]
    #   Format to two decimal places so 2.5 is shown as 2.50.
    price = f"{row['Price']:.2f}"
    link = website + quote_plus( name )
    message += "£" + price + " - " + name + "\n" + link + "\n\n"

Send an Email

Python makes it fairly easy to send an email - assuming you have a co-operative mailhost.

import smtplib
from email.message import EmailMessage

#   Send Email
def send_email(message):
    email_user = 'you@example.com'
    email_password = 'P@55w0rd'
    to = 'destination@example.com'
    msg = EmailMessage()
    msg.set_content(message)
    msg['Subject'] = "Kobo price drops"
    msg['From'] = email_user
    msg['To'] = to
    server = smtplib.SMTP_SSL('example.com', 465)
    server.ehlo()
    server.login(email_user, email_password)
    server.send_message(msg)
    server.quit()

send_email( message )

Setting the settings

When running as a script, it is necessary to ensure the settings are correctly initialised.

from kobodl.settings import Settings

my_settings = Settings()
kobodl.Globals.Settings = my_settings

The End Result

I have a cron job which runs this every morning. It sends an email like this:

Next Steps

Some possible ideas. If you can code these, let me know!

Save the prices so it sees if there's been a drop since yesterday.
Compare prices to Amazon for eBook Arbitrage.
Automatically buy any book that hits 99p.
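
The first of those could be sketched along these lines. This is only an outline under some assumptions: the prices are held in a plain dictionary keyed by something stable (the RevisionId collected above would do), and yesterday's prices live in a JSON file. The function name, file name, and sample IDs are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def find_price_drops(current_prices, history_file):
    """Compare today's prices with the saved ones, then save today's."""
    path = Path(history_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    drops = {
        book_id: (previous[book_id], price)
        for book_id, price in current_prices.items()
        if book_id in previous and price < previous[book_id]
    }
    #   Overwrite the history so tomorrow's run compares against today.
    path.write_text(json.dumps(current_prices, indent=2))
    return drops

#   Two pretend runs with made-up IDs and prices.
history = Path(tempfile.mkdtemp()) / "kobo_prices.json"
first_run = find_price_drops({"book-a": 4.99, "book-b": 1.99}, history)
second_run = find_price_drops({"book-a": 2.99, "book-b": 1.99}, history)
print(second_run)
```

The first run has nothing to compare against, so it reports no drops. The second reports each cheaper book with its old and new price.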

Happy reading!

Automatic Kobo and Kindle eBook Arbitrage

@edent — Wed, 19 Feb 2025 12:34:43 +0000

This post will show you how to programmatically get the cheapest possible price on eBooks from Kobo.

Background

Amazon have decided to stop letting customers download their purchased eBooks onto their computers. That means I can't strip the DRM and read on my non-Amazon eReader.

So I guess I'm not spending money with Amazon any more. I'm moving to Kobo for three main reasons:

They provide standard ePubs for download.
ePub DRM is trivial to remove.
Kobo will undercut Amazon's prices!

Here's the thing. I want to buy my eBooks. It is trivial to pirate almost any modern book. But, call me crazy, I like rewarding writers with a few pennies. That said, I'm not made of money, so I want to get the best (legal) deal possible.

Kobo do a price-match with other eBook retailers. It says:

We'll award a credit to your Kobo account equal to the price difference, plus 10% of the competitor’s price.

I found a book I wanted which was £4.99 on Kobo. The Amazon Kindle price was £4.31.

4.99 - ( (4.99 - 4.31) + (4.31 * 0.1) ) = 3.88
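
As a small sketch, the same sum in Python. The function name is made up, and the 10% figure comes from the quoted Kobo policy. The formula simplifies to 90% of the competitor's price, and it assumes the competitor really is cheaper.

```python
def price_after_match(kobo_price, competitor_price, bonus_rate=0.10):
    """Effective cost of a Kobo book once the price-match credit is counted."""
    credit = (kobo_price - competitor_price) + (competitor_price * bonus_rate)
    return round(kobo_price - credit, 2)

print(price_after_match(4.99, 4.31))   # 3.88
```

Bear in mind the difference comes back as Kobo account credit rather than cash.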

I purchased the book, sent a request for a price match, and got this email a few hours later:

OK! So what steps can we automate, and which will have to remain manual?

Amazon Pricing API

Amazon have a Product Advertising API. You will need to register for the Amazon Affiliate Program and make some qualifying sales before you get API access.

In order to search for an ISBN and get the price back, you need to send:

{
    "Keywords": "isbn:9781473613546",
    "Resources": ["Offers.Listings.Price"]
}

Using the updated Python API for PAAPI:

from paapi5_python_sdk import DefaultApi, SearchItemsRequest, SearchItemsResource, PartnerType

def search_items():
    access_key = "ABC"
    secret_key = "123"
    partner_tag = "shkspr-21"
    host = "webservices.amazon.co.uk"
    region = "eu-west-1"

    api = DefaultApi(access_key=access_key, secret_key=secret_key, host=host, region=region)

    request = SearchItemsRequest(
        partner_tag=partner_tag,
        partner_type=PartnerType.ASSOCIATES,
        keywords="isbn:9781473613546",
        search_index="All",
        item_count=1,
        resources=["Offers.Listings.Price"]
    )

    response = api.search_items(request)

    print(response)

search_items()

(Add your own access key, secret key, and tag. You may need to change the host and region depending on where you are in the world.)

That returns something like:

{
    "search_result": {
        "items": [
            {
                "asin": "B09JLQHHXN",
                "detail_page_url": "https://www.amazon.co.uk/dp/B09JLQHHXN?tag=shkspr-21&linkCode=osi&th=1&psc=1",
                "offers": {
                    "listings": [
                        {
                            "price": {
                                "amount": 2.99,
                                "currency": "GBP",
                                "display_amount": "£2.99"
                            }
                        }
                    ]
                }
            }
        ]
    }
}

(I've truncated the above so it only shows the relevant information.)

Kobo ISBN & Price

Let's get the ISBN and Price of a book on Kobo. There's no easy API to do this. But, thankfully, Kobo embeds some Schema.org metadata.

Look at the source code for https://www.kobo.com/gb/en/ebook/venomous-lumpsucker-1

Getting the data from the data-kobo-gizmo-config is a little tricky.

Using Python Requests won't work because Kobo seem to run a JS CAPTCHA to detect scraping.
There is a Calibre-Web Kobo plugin but it requires you to have a physical Kobo eReader in order to get an API key.
The Rakuten API is only for the Japanese store.

So we have to use the Selenium WebDriver to scrape the data:

from selenium import webdriver
from bs4 import BeautifulSoup
import json

#   Open the web page
browser = webdriver.Firefox()
browser.get("https://www.kobo.com/gb/en/ebook/venomous-lumpsucker-1")

#   Get the source
html_source = browser.page_source

#   Soupify
soup = BeautifulSoup(html_source, 'html.parser')

#   Get the encoded JSON Schema
schema = soup.find_all(id="ratings-widget-details-wrapper")[0].get("data-kobo-gizmo-config")

#   Convert to object from JSON
parsed_data = json.loads(schema)

#   Decode the nested JSON strings
parsed_data["googleBook"] = json.loads(parsed_data["googleBook"])

#    Get ISBN and Price
price = parsed_data["googleBook"]["workExample"]["potentialAction"]["expectsAcceptanceOf"]["price"]
isbn  = parsed_data["googleBook"]["workExample"]["isbn"]
print(isbn)
print(price)

Kobo Wishlist

OK, nearly there! Given a Kobo book URL we can get the price and ISBN, then use that ISBN to get the Kindle price. But how do we get the Kobo book URL in the first place?

I'm adding all the books I want to my Kobo Wishlist.

Inside the Wishlist is a scrap of JavaScript which contains this JSON:

{
    "value": {
        "Items": [
            {
                "Title": "Venomous Lumpsucker",
                "Price": "£2.99",
                "ProductUrl": "/gb/en/ebook/venomous-lumpsucker-1"
            }
        ],
        "TotalItemCount": 11,
        "ItemCountByProductType": {
            "book": 11
        },
        "PageIndex": 1,
        "TotalNumPages": 1
    }
}

(Simplified to make it easier to understand.)

Although there's a price, there's no ISBN, so you'll need to use the "ProductUrl" to get the ISBN and Price as above.
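
Turning that JSON into a list of full book URLs is the easy part. A minimal sketch, assuming the JSON has already been pulled out of the page as a string in the simplified shape shown above. The function name is hypothetical.

```python
import json
from urllib.parse import urljoin

#   The simplified wishlist JSON from above.
wishlist_json = """
{
    "value": {
        "Items": [
            {
                "Title": "Venomous Lumpsucker",
                "Price": "£2.99",
                "ProductUrl": "/gb/en/ebook/venomous-lumpsucker-1"
            }
        ]
    }
}
"""

def wishlist_urls(raw_json, base="https://www.kobo.com"):
    """Return (title, full URL) pairs for every item in the wishlist."""
    items = json.loads(raw_json)["value"]["Items"]
    return [(item["Title"], urljoin(base, item["ProductUrl"])) for item in items]

for title, url in wishlist_urls(wishlist_json):
    print(title, url)
```

Each URL can then be fed to the Selenium scraper to get the ISBN and price.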

Sadly, unlike Amazon, there's no way to publicly share a wishlist. Getting the JSON requires logging in, so it's back to Selenium again!

This should be enough:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
browser.get("https://www.kobo.com/gb/en/account/wishlist")

#       Log in
username_box = browser.find_element(By.NAME, "LogInModel.UserName")
username_box.clear()
username_box.send_keys('you@example.com')

password_box = browser.find_element(By.NAME, "LogInModel.Password")
password_box.clear()
password_box.send_keys('p455w0rd')

password_box.send_keys(Keys.RETURN)

time.sleep(5) # Wait for load and rendering

But Kobo presents a CAPTCHA which prevents login.

There is an unofficial API which, sadly, doesn't seem to work at the moment.

Next Steps

For now, I'm saving specific Kobo book URLs into a file and then running a scrape once per day. Hopefully, the unofficial Kobo API will be working again soon.

Liberate your daily statistics from JetPack

@edent — Thu, 17 Oct 2024 11:34:16 +0000

Because Ma.tt continues to burn all of the goodwill built up by WordPress, and JetPack have decided to charge a ridiculous sum for their statistics, I've decided to move to a new stats provider. But I don't want to lose all the statistics I've built up over the years.

How do I download a day-by-day export of my JetPack stats⁰?

Luckily, there is an API for downloading all your JetPack stats!

First, get your API key by visiting https://apikey.wordpress.com/ - it should be a 12 character string. For this example, I'm going to use 123456789012. You will need to use your own API key.

There is some brief documentation on that page. Here are the bits we are interested in:

api_key     String    A secret unique to your WordPress.com user account.
blog_uri    String    The full URL to the root directory of your blog. Including the full path.
table       String    One of views, postviews, referrers, referrers_grouped, searchterms, clicks, videoplays.
end         String    The last day of the desired time frame. Format is 'Y-m-d' (e.g. 2007-05-01) and default is UTC date.
days        Integer   The length of the desired time frame. Default is 30. "-1" means unlimited.
limit       Integer   The maximum number of records to return. Default is 100. "-1" means unlimited. If days is -1, limit is capped at 500.
format      String    The format the data is returned in, 'csv', 'xml' or 'json'. Default is 'csv'.

In order to get all of the statistics from a single day, the URL is:

https://stats.wordpress.com/csv.php?api_key=123456789012
     &blog_uri=https://shkspr.mobi/blog/
     &table=postviews
     &end=2024-10-07
     &days=1
     &limit=-1

That gets all of the statistics from one specific day. The limit=-1 means it will retrieve all the records of that day¹.

That will get you a CSV which looks like:

"date","post_id","post_title","post_permalink","views"
"2024-10-09",0,"Home page","https://shkspr.mobi/blog/",59
"2024-10-09",42171,"Review: HP's smallest laser printer - M140w + Linux set up","https://shkspr.mobi/blog/2022/04/review-hps-smallest-laser-printer-m140w-linux-set-up/",7
"2024-10-09",49269,"No, Oscar Wilde did not say ""Imitation is the sincerest form of flattery that mediocrity can pay to greatness""","https://shkspr.mobi/blog/2024/01/no-oscar-wilde-did-not-say-imitation-is-the-sincerest-form-of-flattery-that-mediocrity-can-pay-to-greatness/",7
"2024-10-09",53333,"The Cleaner 🆚 Der Tatortreiniger - Series 3","https://shkspr.mobi/blog/2024/10/the-cleaner-%f0%9f%86%9a-der-tatortreiniger-series-3/",7
"2024-10-09",49943,"Solved! ""Access Point Name settings are not available for this user""","https://shkspr.mobi/blog/2024/03/solved-access-point-name-settings-are-not-available-for-this-user/",7
"2024-10-09",43690,"WhatsApp Web for Android - a reasonable compromise?","https://shkspr.mobi/blog/2022/11/whatsapp-web-for-android-a-reasonable-compromise/",5

You can also get a JSON file using &format=json, although it doesn't contain the permalinks.

[
    {
        "date": "2024-10-09",
        "postviews": [
            {
                "post_id": 0,
                "post_title": "",
                "permalink": "",
                "views": 59
            },
            {
                "post_id": 49269,
                "post_title": "No, Oscar Wilde did not say \"Imitation is the sincerest form of flattery that mediocrity can pay to greatness\"",
                "permalink": "",
                "views": 9
            },
            {
                "post_id": 42171,
                "post_title": "Review: HP's smallest laser printer - M140w + Linux set up",
                "permalink": "",
                "views": 7
            },

From there, I wrote a scrap of Python to download every single date individually.

import requests
import datetime
import os
import json

# Directory to save the JSON files
save_dir = "jetpack_stats"
os.makedirs(save_dir, exist_ok=True)

# URL of the API
base_url = "https://stats.wordpress.com/csv.php?api_key=123456789012"+\
           "&blog_uri=https://example.com/"+\
           "&table=postviews"+\
           "&days=1"+\
           "&format=json"+\
           "&limit=-1"+\
           "&end="

# Make API call and save the response
def fetch_and_save_json(date):
    # Format the date as ISO8601 (YYYY-MM-DD)
    formatted_date = date.isoformat()

    # Make the API call
    url = f"{base_url}{formatted_date}"
    response = requests.get(url)

    if response.status_code == 200:
        data = response.json()
        file_name = f"{formatted_date}.json"
        file_path = os.path.join(save_dir, file_name)
        with open(file_path, "w") as f:
            json.dump(data, f, indent=4)

        print(f"Saved {formatted_date}")
    else:
        print(f"Failed! {formatted_date} status code: {response.status_code}")

# Iterate over a date range
start_date = datetime.date(2023,  1 , 1)
end_date   = datetime.date(2024, 10, 30)

# Loop through all dates 
current_date = start_date
while current_date <= end_date:
    fetch_and_save_json(current_date)
    current_date += datetime.timedelta(days=1)

You'll need to manually find the earliest date for which your blog has statistics.

Running the code is a little slow. Expect about 3 minutes per year of data. I'm sure you could parallelise it if you really needed to.
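
If you did want to parallelise it, a thread pool is probably the least effort. This is a sketch rather than something tested against the real API: `fetch_all` and `date_range` are hypothetical helpers, and the stand-in `fake_fetch` just returns a string so the example runs without a network connection. In practice you would pass in `fetch_and_save_json` from above.

```python
import datetime
from concurrent.futures import ThreadPoolExecutor

def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)

def fetch_all(fetch, start, end, workers=4):
    """Call fetch(date) for every date in the range using a few threads."""
    dates = list(date_range(start, end))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        #   map() returns results in the same order as the dates.
        results = list(pool.map(fetch, dates))
    return dict(zip(dates, results))

#   Stand-in for the real download function.
def fake_fetch(date):
    return f"stats for {date.isoformat()}"

results = fetch_all(fake_fetch, datetime.date(2024, 10, 1), datetime.date(2024, 10, 5))
print(len(results))
```

Keep the number of workers small. Hammering someone else's API with dozens of simultaneous requests is a good way to get blocked.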

Now, for my next trick, how do I import these data into a new stats plugin? That's tomorrow's blog post!

⁰ When people ask on the official support forum, they're told to privately contact JetPack. There's a help page which shows how to download a summary. But I couldn't find anything more fine-grained than that.
¹ The maximum number of records for a specific day in my dataset was 978.

Is "Dollar Cost Averaging" a Bad Idea?

@edent — Thu, 15 Aug 2024 11:34:50 +0000

It's sometimes useful to run experiments yourself, isn't it?

New investors are often told that, when investing for the long term rather than chasing individual stocks, it is better to be invested for the longest possible time rather than trying to do "dollar cost averaging". DCA is the process of spreading out over time the purchasing of your investments. That way, you don't lose it all if the market drops the day after you invest.

Let me explain...

Imagine that it is 1994 and your rich uncle, Scrooge McDuck, has decided to gift you $1,200 per year. How generous!

He has stipulated that you must invest it in the S&P 500 - that's the top 500 companies in the world⁰.

He gives you two choices:

Put $1,200 in on the 1st of January every year.
Put $100 in on the 1st of every month.

How much money do you make in each scenario?

Get The Data

Kaggle has a download for the historic S&P 500 data. It goes from 1993 to 2024.

The data looks like this:

Date	Open	High	Low	Close	Volume	Day	Weekday	Week	Month	Year
29/01/93	24.70	24.70	24.58	24.68	1003200	29	4	4	1	1993
01/02/93	24.70	24.86	24.70	24.86	480500	1	0	5	2	1993
02/02/93	24.84	24.93	24.79	24.91	201300	2	1	5	2	1993
03/02/93	24.95	25.19	24.93	25.18	529400	3	2	5	2	1993

Experiment 1 - Time In The Market

Here's the algorithm we want to run.

Start in 1994
Set the investment as 1200
Get the Opening price of the first entry of the year
Get the Closing price of the last entry of the year
Calculate the percentage difference
Multiply the investment by the growth / fall
Add 1200 to the investment
Repeat from (3) for the next year.

Here's the code. I've made some assumptions - for example there are no trading fees, you buy at the opening price, and fractional dollars disappear. I'm aware this doesn't track perfectly but it isn't intended to; this is a rough and ready reckoner.

Open for Python Code
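As a minimal sketch of the steps listed above, and not necessarily the exact code used for the experiment: this assumes the data has already been reduced to (year, open, close) tuples in date order. The function name and its parameters are my own invention.

```python
def time_in_the_market(rows, start_year=1994, yearly=1200):
    """rows: (year, open_price, close_price) tuples, in date order."""
    first_open = {}
    last_close = {}
    for year, open_price, close_price in rows:
        if year < start_year:
            continue
        #   Keep the first Open seen for each year, and the latest Close
        first_open.setdefault(year, open_price)
        last_close[year] = close_price

    investment = yearly
    for year in sorted(first_open):
        #   Grow (or shrink) the pot; fractional dollars disappear
        investment = int(investment * last_close[year] / first_open[year])
        #   Uncle Scrooge's next gift
        investment += yearly
    return investment
```

Following the steps literally means the final figure includes one last $1,200 which hasn't had a chance to grow.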

Drawing PPM images on the Tildagon in MicroPython

@edent — Wed, 19 Jun 2024 11:34:15 +0000

The Tildagon has 2MB of RAM. That's not enough to do... well, most things you'd want to do with a computer! There's not much processing power, so running complex image decoding algorithms might be a bit beyond it.

Is there a simple image format which can be parsed and displayed? Yes! The ancient Portable PixMap (PPM) format.

The standard is beautiful in its simplicity. Here's the header:

P6
# Created by GIMP version 2.10.38 PNM plug-in
120 120
255
���t�{...

The P6 identifies it as a PPM file. The # is a comment. 120 120 says that the image's dimensions are 120 pixels horizontal, and 120 vertical. 255 is the maximum value for each colour.
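If you'd rather read the dimensions than hardcode them, a header parser might look something like this sketch. The function name is hypothetical. It assumes the header is laid out on separate lines as above, with the maximum value followed by a single newline, which is how GIMP writes it.

```python
import io

def read_ppm_header(ppm_file):
    """Return (width, height, max_value) from a binary PPM opened in 'rb' mode."""
    tokens = []
    while len(tokens) < 4:
        line = ppm_file.readline()
        if not line:
            raise ValueError("Ran out of file while reading the header")
        #   Ignore anything after a # comment marker
        line = line.split(b"#", 1)[0]
        tokens.extend(line.split())
    magic, width, height, max_value = tokens[:4]
    if magic != b"P6":
        raise ValueError("Not a binary PPM file")
    return int(width), int(height), int(max_value)
```

After it returns, the file position is at the first byte of pixel data.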

Then comes a big blob of binary data. Each byte is a value from 0 to 255.

To find the Red, Green, and Blue values of the first pixel, read the first 3 bytes. The next 3 bytes are the RGB of the next pixel. And so on.

There's no compression. Just pure pixel values.

Because of the low memory limits of the Tildagon, I found it impossible to load the entire file into memory and then paint it on the screen. Instead, I read it in chunks.

First, load the file as a read-only binary. Then skip the header and get straight to the pixel data.

#   Open a 120px x 120px Raw / Binary PPM file
with open('/apps/ppm/chrome120.ppm', 'rb') as ppm_file:
    print("Skipping Header")
    # Skip the header
    header = b''
    while True:
        line = ppm_file.readline()
        header += line
        if header.endswith(b'\n255\n'):
            break

Images on the Tildagon are drawn from the top left, which has co-ordinates -120,-120

    #   Start at the top left
    x, y = -120, -120

Next, read in a line of pixels. The image is 120px wide, each pixel has 3 values, so that's 360 bytes. Grab the pixel values and draw them to screen:

    while True:
        #   Read in 1 line at a time (3 bytes * 120px)
        chunk = ppm_file.read(360)
        if not chunk:
            break  # End of file
        #   Read the RGB, convert to float
        for i in range(0, len(chunk), 3):
            r = chunk[i]    /255
            g = chunk[i + 1]/255
            b = chunk[i + 2]/255
            #   Draw the pixel in a 2x2 square
            ctx.rgb(r, g, b).rectangle(x, y, 2, 2).fill()

The screen's resolution is 240x240, so each pixel from the 120x120 image needs to be drawn as 2x2 rectangle.

Once that's done, move to the next square to be drawn. Once a full line has been drawn, move down to drawing the next line.

            #   Move to the next square
            x += 2
            #   If a complete line has been drawn
            #   Move down a line (2px) and reset the x coordinate
            if x >= 120:
                x = -120
                y += 2
        #   Clear the chunk from memory
        del chunk
        #   Perform garbage collection
        gc.collect()

A bit of manual garbage collection doesn't hurt! And then a bit more for good luck!

del(ppm_file)
#   Final collection
print("Collecting")
gc.collect()

Setting the time on the Tildagon

@edent — Mon, 17 Jun 2024 11:34:22 +0000

I'm beginning my adventures in MicroPython in the hope that I'll have something interesting working on the Tildagon Badge for EMF2026. Here's a basic implementation of a clockface.

Here's how to set the time on the badge. There's a hardware clock which should keep time between reboots.

Install mpremote on your computer.
Connect the Tildagon to your computer using a USB-C data cable
On your computer's command line, run mpremote. You should see "Connected to MicroPython at /dev/ttyACM0" followed by "Use Ctrl-] or Ctrl-x to exit this shell".
Hold down the ctrl key on your computer. While holding it down, press the C key on your computer. This will open up a shell for you to enter commands.
Enter the following commands one at a time, followed by enter

from machine import RTC
rtc = RTC()
rtc.datetime()

That will display the time that the badge currently thinks it is.

For example: (2000, 1, 1, 5, 0, 1, 47, 984022)

This is a slightly unusual format. It is: year, month, day, weekday, hours, minutes, seconds, subseconds.

The "weekday" is 0 for Monday, 1 for Tuesday etc.

This is a tuple. So, to access individual elements of the time, you can say:

year = rtc.datetime()[0]
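You can also unpack the whole tuple in one go. Here's a small sketch which turns it into something readable. The function name and the DAYS lookup are my own, and it only uses plain string formatting so it should behave the same on the badge and on your computer.

```python
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

def format_rtc(dt):
    """Format a tuple in the order returned by rtc.datetime()."""
    year, month, day, weekday, hours, minutes, seconds, subseconds = dt
    return "%s %04d-%02d-%02d %02d:%02d:%02d" % (
        DAYS[weekday], year, month, day, hours, minutes, seconds)
```

On the badge you would call format_rtc(rtc.datetime()).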

Alternatively, you can do:

import time
time.localtime()

That will return something like: (2024, 6, 11, 15, 11, 39, 1, 163)

Which, according to the documentation, is "year, month, mday, hour, minute, second, weekday, yearday".

Setting the clock

To manually set the date and time, run:

rtc.datetime((2023, 6, 16, 1, 15, 36, 0, 0))

NTP

If you want to use NTP to synchronise the time with an Internet-based atomic clock, here's what you need to do.

Connect your badge to WiFi.
As above, connect with USB and run mpremote, then obtain a shell.
Enter the following commands one at a time, followed by enter:

import ntptime
ntptime.settime()
rtc.datetime()

That will now show you the synchronised time. Note, it will be set to UTC. There's no way to set the timezone - you'll have to deal with that in your code elsewhere.
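One crude way to deal with it is to add a fixed offset before converting to a time tuple. This is only a sketch: the offset of 1 hour is an assumption (British Summer Time), the function name is mine, and it knows nothing about daylight saving changes.

```python
import time

UTC_OFFSET_HOURS = 1  # Assumption: change this to suit where you are

def local_time(offset_hours=UTC_OFFSET_HOURS, now=None):
    """Return a time tuple shifted from UTC by a fixed number of hours."""
    if now is None:
        now = time.time()
    return time.gmtime(now + offset_hours * 3600)
```

The result is in the same order as time.localtime(), so the hour is element 3.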

You can also read the reference documentation about the Real Time Clock.

Displaying a QR code in MicroPython on the Tildagon Badge

@edent — Sat, 15 Jun 2024 11:34:16 +0000

This was a bit of a labour of love - and something I wanted to get running during EMF Camp. I'm documenting in the hope it'll be useful for EMF 2026!

Here's the end result:

Background

I'm going to assume that you have updated your badge to the latest firmware version.

You will also need to install mpremote on your development machine.

You should also have successfully run the basic Hello, World! app.

Drawing surface

The Tildagon screen is 240x240 pixels. However, it is also a circle. This gives an internal square of 170x170 pixels. The drawing co-ordinates have 0,0 in the centre. Which means the target area is the red square as shown here:

Generate a QR code

As you can see, there isn't much space here. A Version 1 QR Code is a mere 21x21 pixels. When set to "Low" error correction, it can contain up to 25 characters. A URL should start with https:// - which is 8 characters. That leaves 17 characters for the domain and path.

Use your favourite QR generator to make the tiniest QR code you can. Make sure there's no border. It should be 21x21 pixels. Here's mine:

See? Tiny!

Prepare the QR code

Next, we need to turn the QR code into a binary matrix. There may be easier ways to do this, but I used a scrap of Python:

from PIL import Image
import numpy as np

#    Load the image
image = Image.open("qr.png")

#    Convert the image to grayscale
gray_image = image.convert("L")

#    Threshold the image to get binary black and white image
threshold = 128
bw_image = gray_image.point(lambda x: 0 if x > threshold else 1, '1')

#    Convert the image to a NumPy array
pixel_array = np.array(bw_image, dtype=int)

#    Convert the array to a string with commas between the elements
array_str = np.array2string(pixel_array, separator=',', formatter={'int':lambda x: str(x)})

print(array_str)

Copy the output - we'll need it later!

Calculate size

We have a canvas of 170 pixels and a QR code of 21 pixels. 170 / 21 = 8.1 pixels. Ah. Drawing fractional pixels isn't fun. Luckily, QR codes benefit from having a safe area around them. If we make each QR pixel 7 screen pixels, that gives us (21 x 7) = 147 pixels. Which gives us enough space for a small white border.

If the QR code is to be centred, the top left corner will be in position (147 / 2) = 74. That means it will need to be drawn at position -74,-74. The top left corner of the screen is -120,-120.

So the offset used to calculate the location is (120 - 74) = 46.

(You might be able to get away with 8 pixels and an offset of 36 pixels. Try it!)

Remember those numbers!
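The same sums as a little function, in case you want to try other sizes. The name qr_layout is made up for this sketch, and it rounds the offset down to a whole pixel.

```python
def qr_layout(modules, pixel_size, screen=240):
    """Return (side, offset_size) for a centred QR code.

    modules    - width of the QR code in QR pixels (21 for Version 1)
    pixel_size - screen pixels per QR pixel
    screen     - width of the screen in pixels
    """
    side = modules * pixel_size
    #   Space left over, split evenly between both edges
    offset_size = (screen - side) // 2
    return side, offset_size
```

For 21 QR pixels at 7 screen pixels each, that gives a side of 147 and an offset of 46.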

Write the app

This reuses a lot of the Hello World code.

import app
from app_components import TextDialog, clear_background
from events.input import Buttons, BUTTON_TYPES

class QrApp(app.App):
    #   Define the colours
    black = (  0,   0,   0)
    white = (255, 255, 255)

    def __init__(self):
        self.button_states = Buttons(self)

    def update(self, delta):
        if self.button_states.get(BUTTON_TYPES["CANCEL"]):
            self.button_states.clear()
            self.minimise()

    def draw(self, ctx):
        clear_background(ctx)

        #   QR code data (21x21 matrix)
        qr_code =[[1,1,1,1,1,1,1,0,0,1,1,0,0,0,1,1,1,1,1,1,1],
                  [1,0,0,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,1],
                  [1,0,1,1,1,0,1,0,1,1,1,1,1,0,1,0,1,1,1,0,1],
                  [1,0,1,1,1,0,1,0,0,1,1,0,0,0,1,0,1,1,1,0,1],
                  [1,0,1,1,1,0,1,0,1,1,1,1,0,0,1,0,1,1,1,0,1],
                  [1,0,0,0,0,0,1,0,1,1,0,1,0,0,1,0,0,0,0,0,1],
                  [1,1,1,1,1,1,1,0,1,0,1,0,1,0,1,1,1,1,1,1,1],
                  [0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0],
                  [1,1,0,1,0,0,1,1,0,0,1,0,0,0,1,1,1,0,1,1,0],
                  [1,0,1,1,1,1,0,1,1,0,0,1,1,0,1,1,1,0,0,0,1],
                  [1,0,1,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,1,0,1],
                  [1,1,1,1,0,1,0,1,0,0,0,0,0,0,1,0,1,1,0,1,1],
                  [1,0,1,0,0,1,1,1,0,0,1,1,1,0,0,1,0,1,0,0,0],
                  [0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,1],
                  [1,1,1,1,1,1,1,0,1,0,1,0,1,1,0,0,1,1,1,1,0],
                  [1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0],
                  [1,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,1,1,0,1,1],
                  [1,0,1,1,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0,0,1],
                  [1,0,1,1,1,0,1,0,0,0,1,1,1,0,1,0,1,0,1,0,1],
                  [1,0,0,0,0,0,1,0,1,1,1,0,0,1,0,0,0,0,0,0,0],
                  [1,1,1,1,1,1,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0]]

        #   Draw background
        ctx.rgb(*self.white).rectangle(-120, -120, 240, 240).fill()

        #   Size of each QR code pixel on the canvas
        pixel_size = 7

        #   Offset size in pixels
        offset_size = 46

        #   Calculate the offset to start drawing the QR code (centre it within the available space)
        offset = -120 + offset_size

        #   Loop through the array
        for row in range(21):
            for col in range(21):
                if qr_code[row][col] == 1:
                    x = (col * pixel_size) + offset
                    y = (row * pixel_size) + offset
                    ctx.rgb(*self.black).rectangle(x, y, pixel_size, pixel_size).fill()

__app_export__ = QrApp

Installation

Follow the instructions
Run mpremote cp ~/Documents/badge/* :/apps/qr/
Restart the badge
Scroll down the app list and launch the QR app

The non-stupid way!

OK, that was the hard way - here's the easy way.

Use the MicroPython QR Generation library uQR.

If you pop that file in your project directory, and upload it to the badge, then you can import it with:

from .uQR import QRCode

The QR code has its own white margin and is a 2D array of True & Falses.

# QR code data (29x29 matrix)
qr = QRCode()
qr.add_data("https://edent.tel")
qr_code = qr.get_matrix()
qr_size = len( qr_code )

#   Draw background
ctx.rgb(*self.white).rectangle(-120, -120, 240, 240).fill()

#   Size of each QR code pixel on the canvas
pixel_size = int( 170 / qr_size ) + 1

#   Border size in pixels
border_size = ( 240 - (pixel_size*qr_size) ) / 2

#   Calculate the offset to start drawing the QR code (centre it within the available space)
offset = -120 + border_size

#   Loop through the array
for row in range( len(qr_code) ):
    for col in range( len(qr_code) ):
        if qr_code[row][col] == True:
            x = (col * pixel_size) + offset
            y = (row * pixel_size) + offset
            ctx.rgb(*self.black).rectangle(x, y, pixel_size, pixel_size).fill()

Next steps

This is hardcoded for a single QR code - mine! Perhaps it should be configurable?
Add some text to the screen?
Animations? Colour? Flashing LEDs?

Got any thoughts? Stick them in the box!

Untappd to Mastodon - Updated!

@edent — Sun, 12 May 2024 11:34:19 +0000

A few years ago, I wrote some code to post Untappd check-ins to Mastodon. I've recently updated it to also post a photo of the beer you're enjoying.

First up, you'll need a file called config.py to hold all your API keys:

instance = "https://mastodon.social"
access_token          = "…"
write_access_token    = "…"
untappd_client_id     = "…"
untappd_client_secret = "…"

Then a file called untappd2mastodon.py to do the job of grabbing your data, finding your latest check-in, then posting it to the Fediverse:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from mastodon import Mastodon
import json
import requests
import config

#  Set up access
mastodon = Mastodon( api_base_url=config.instance, access_token=config.write_access_token )

#       Untappd API
untappd_api_url = 'https://api.untappd.com/v4/user/checkins/edent?client_id=' + config.untappd_client_id + '&client_secret='+ config.untappd_client_secret

r = requests.get(untappd_api_url)

untappd_data = r.json()

#       Latest checkin object
checkin = untappd_data["response"]["checkins"]["items"][0]
untappd_id = checkin["checkin_id"]

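#       Was this ID the last one we saw?
#       If the file is missing or empty, treat the check-in as new
try:
        with open("untappd_last", "r") as check_file:
                last_id = int( check_file.read() )
except (FileNotFoundError, ValueError):
        last_id = None
print("Found " + str(last_id) )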

if (last_id != untappd_id ) :
        print("Found new checkin")
        check_file = open("untappd_last", "w")
        check_file.write( str(untappd_id) )
        check_file.close()
        #       Start creating the message
        message = ""

        if "checkin_comment" in checkin :
                message += checkin["checkin_comment"]

        if "beer" in checkin :
                message += "\nDrinking: " + checkin["beer"]["beer_name"]

        if "brewery" in checkin :
                message += "\nBy: "       + checkin["brewery"]["brewery_name"]

        if "venue" in checkin :
                if "venue_name" in checkin["venue"] :
                        message += "\nAt: "       + checkin["venue"]["venue_name"]
        #       Scores etc
        untappd_checkin_url = "https://untappd.com/user/edent/checkin/" + str(untappd_id)
        untappd_rating      = checkin["rating_score"]
        untappd_score       = "🍺" * int(untappd_rating)

        message += "\n" +  untappd_score + "\n" + untappd_checkin_url + "\n" + "#untappd"

        #       Get Image
        if checkin["media"]["count"] > 0 :
                photo_url = checkin["media"]["items"][0]["photo"]["photo_img_lg"]
                download = requests.get(photo_url)
                with open("untappd.tmp", 'wb') as temp_file:
                        temp_file.write(download.content)
                media = mastodon.media_post("untappd.tmp", description="A photo of some beer.")
                mastodon.status_post(status = message, media_ids=media, idempotency_key = str(untappd_id))
        else:   
                #       Post to Mastodon. Use idempotency just in case something went wrong
                mastodon.status_post(status = message, idempotency_key = str(untappd_id))
else :
        print("No new checkin")

You can treat this code as being MIT licenced if that makes you happy.

There should only ever be one way to express yourself

@edent — Sun, 11 Feb 2024 12:34:39 +0000

I've been thinking about programming languages and their design.

In her book about the divergence of the English and American languages, Lynne Murphy asks this question:

wouldn’t it be great if language were logical and maximally efficient? If sentences had only as many syllables as strictly needed? If each word had a single, unique meaning? If there were no homophones, so we’d not be able to mix up dear and deer or two and too?

That got me thinking about the creativity which can be expressed in code - and whether it's a good thing.

Let's take an incredibly simple and common operation - incrementing an integer variable by one. How would you do that? You've probably seen these variations:

$i = $i + 1;

$i++;

$i = 1 + $i;

$i = int( float_adder( float($i), 1.00 ) );

i1, i2 = i1^i2, (i1&i2) << 1

I'm sure you can come up with a few more esoteric methods.
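That last one is a single step of a bitwise adder: XOR adds without carrying, AND finds the carries, and the shift moves them into place. Here's a sketch of the whole loop in Python. It is purely illustrative and only behaves itself for non-negative integers.

```python
def increment(i):
    """Add one to a non-negative integer using only bitwise operations."""
    carry = 1
    while carry:
        #   XOR adds the bits, AND + shift works out what to carry
        i, carry = i ^ carry, (i & carry) << 1
    return i
```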

The Python programming language has a list of aphorisms for good programming practice. One of which is:

There should be one-- and preferably only one --obvious way to do it.

Is that right? As described in What is Pythonic?, the Python language itself has multiple ways to accomplish one thing.

But, is it a good idea?

Back to Lynne Murphy again:

No, absolutely not. No way. Quit even thinking that. What are you, some kind of philistine? If Shakespeare hadn’t played with the number of syllables in his sentences, he would not have been able to communicate in iambic pentameter.

Shakespeare wasn't writing Python though, was he?

Compressing Text into Images

@edent — Sat, 13 Jan 2024 12:34:11 +0000

(This is, I think, a silly idea. But sometimes the silliest things lead to unexpected results.)

The text of Shakespeare's Romeo and Juliet is about 146,000 characters long. Thanks to the English language, each character can be represented by a single byte. So a plain Unicode text file of the play is about 142KB.

In Adventures With Compression, JamesG discusses a competition to compress text and poses an interesting thought:

Encoding the text as an image and compressing the image. I would need to use a lossless image compressor, and using RGB would increase the number of values associated with each word. Perhaps if I changed the image to greyscale? Or perhaps that is not worth exploring.

Image compression algorithms are, generally, pretty good at finding patterns in images and squashing them down. So if we convert text to an image, will image compression help?

The English language and its punctuation are not very complicated, so the play only contains 77 unique symbols. The ASCII value of each character spans from 0 - 127. So let's create a greyscale image in which each pixel has the same greyness as the ASCII value of the character.
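As a rough sketch of that encoding step, here's one way to write such a PNG using only Python's standard library. This isn't necessarily how my image was made. The function names are hypothetical, the image is made roughly square, the text must be non-empty ASCII, and the last row is padded with NUL bytes. It doesn't use any PNG filters, so the file size will differ from a properly optimised PNG.

```python
import math
import struct
import zlib

def png_chunk(kind, payload):
    """A PNG chunk is: length, type, payload, then a CRC of type + payload."""
    return (struct.pack(">I", len(payload)) + kind + payload
            + struct.pack(">I", zlib.crc32(kind + payload)))

def text_to_grey_png(text):
    """Return the bytes of an 8-bit greyscale PNG, one pixel per character."""
    data = text.encode("ascii")
    width = math.ceil(math.sqrt(len(data)))
    height = math.ceil(len(data) / width)
    #   Pad the final row with NUL bytes
    data = data.ljust(width * height, b"\x00")
    #   Every row of pixels starts with a filter byte; 0 means "no filter"
    raw = b"".join(b"\x00" + data[y * width:(y + 1) * width]
                   for y in range(height))
    #   Width, height, bit depth 8, colour type 0 (greyscale), no interlacing
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", zlib.compress(raw, 9))
            + png_chunk(b"IEND", b""))
```

Write the returned bytes to a file opened in 'wb' mode. Any padding will come back as trailing NUL characters when the image is decoded.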

Here's what it looks like when losslessly compressed to a PNG:

That's down to 55KB! About 40% of the size of the original file. It is slightly smaller than ZIP, and about 9 bytes larger than Brotli compression.

The file can be read with the following Python:

from PIL import Image
image  = Image.open("ascii_grey.png")
pixels = list(image.getdata())
ascii  = "".join([chr(pixel) for pixel in pixels])
with open("rj.txt", "w") as file:
    file.write(ascii)

But, even with the latest image compression algorithms, it is unlikely to compress much further; the image looks like random noise. Yes, you and I know there is data in there. And a statistician looking for entropy would probably determine that the file contains readable data. But image compressors work in a different realm. They look for solid blocks, or predictable gradients, or other statistical features.

But there you go! A lossless image is a pretty efficient way to compress ASCII text.

Converting MoneyDashboard's export file to a CSV - for Firefly III and others

@edent — Tue, 24 Oct 2023 11:34:04 +0000

As I mentioned last week, MoneyDashboard is shutting down. They are good enough to provide a JSON export of all your previous transactions.

It is full of entries like this:

{
    "Account": "My Mastercard",
    "Date": "2020-02-24T00:00:00Z",
    "CurrentDescription": null,
    "OriginalDescription": "SUMUP *Pizza palace, London, W1",
    "Amount": -12.34,
    "L1Tag": "Eating Out",
    "L2Tag": "Pizza",
    "L3Tag": ""
},
{
    "Account": "American Express",
    "Date": "2019-01-11T00:00:00Z",
    "CurrentDescription": null,
    "OriginalDescription": "Work Canteen,Norwich",
    "Amount": -5,
    "L1Tag": "Lunch",
    "L2Tag": "",
    "L3Tag": ""
}

Let's write a quick bit of Python to turn that into CSV. This will turn the above into two separate files.

My Mastercard.csv:

Date,       Description,                       Destination,  Amount
2020-02-24, "SUMUP *Pizza palace, London, W1", Pizza palace, -12.34

And American Express.csv:

Date,       Description,            Destination,  Amount
2019-01-11, "Work Canteen,Norwich", Work Canteen, -5

I didn't make much use of MoneyDashboard's tagging, so I've ignored the tags. The destination (which is the name of the "opposing bank" in Firefly III speak) ignores the payment processor like SUMUP or PAYPAL and anything after the first comma.

It also sorts the CSV into date order. It's not very efficient, but you'll only run it once.

import json
import csv
import os
from datetime import datetime

#   Read the file
with open( "md.json" ) as json_file:
    data = json.load( json_file )
transactions = data["Transactions"]

#   Loop through the transactions
for transaction in transactions:
    #   Get the filename
    filename = transaction["Account"]

    #   Format the date
    date = datetime.strptime(transaction["Date"], "%Y-%m-%dT%H:%M:%SZ")
    formatted_date = date.strftime("%Y-%m-%d")

    #   The description
    description = transaction["OriginalDescription"]

    #   The destination is everything after the first " *" (if it exists) and before the first comma
    #   For example: "SUM *Pizza Place, London" becomes "Pizza Place"
    destination = description.split(',')[0]
    if " *" in destination:
        destination = destination.split(" *")[1]

    #   Monetary amount
    amount = transaction["Amount"]

    #   Create the file if it doesn't exist
    if not os.path.exists(f'{filename}.csv'):
        with open(f'{filename}.csv', mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(["Date", "Description", "Destination", "Amount"])

    #   Read the file and split the header from the existing data
    with open(f'{filename}.csv', mode='r', newline='') as file:
        reader = csv.reader(file)
        existing_data = list(reader)
        header = existing_data[0]
        data_rows = existing_data[1:]

    data_rows.append( [formatted_date, description, destination, amount] )

    #   Sort the data by the first column (string)
    sorted_data = sorted(data_rows, key=lambda x: x[0])

    #   Save the file back again
    with open(f'{filename}.csv', mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows([header] + sorted_data)
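If re-reading and re-sorting every file for every transaction bothers you, a less wasteful approach is to group everything in memory first and write each file once. Here's a sketch of the grouping step. The function name is mine, and it assumes the whole export fits comfortably in memory.

```python
from collections import defaultdict
from datetime import datetime

def group_by_account(transactions):
    """Return {account name: rows sorted by date}, ready for csv.writer."""
    accounts = defaultdict(list)
    for transaction in transactions:
        date = datetime.strptime(transaction["Date"], "%Y-%m-%dT%H:%M:%SZ")
        description = transaction["OriginalDescription"]
        #   Same rule as above: before the first comma, after any " *"
        destination = description.split(",")[0]
        if " *" in destination:
            destination = destination.split(" *", 1)[1]
        accounts[transaction["Account"]].append(
            [date.strftime("%Y-%m-%d"), description, destination, transaction["Amount"]])
    #   Sort each account's rows by the date string
    for rows in accounts.values():
        rows.sort(key=lambda row: row[0])
    return accounts
```

Then loop over the result, open each account's CSV once, and write the header followed by its rows.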

Run that against your MoneyDashboard export and you can then import the CSV files into Firefly III, GNUCash, or anything else.

How far did my post go on the Fediverse?

@edent — Tue, 26 Sep 2023 11:34:28 +0000

I wrote a moderately popular post on Mastodon. Lots of people shared it. Is it possible to find out how many different ActivityPub servers it went to?

Yes!

As we all know, the Fediverse is one big chain mail. I don't mean that in a derogatory way.

When I write a post, it appears on my server (called an "instance" in Mastodon-speak).

Everyone on my instance can see my post.

My instance looks at all my followers - some of whom are on completely different instances - and sends my post to their instances.

As an example:

I am on mastodon.social
John is on eggman_social.com
Paul is on penny_lane.co.uk
Both John and Paul follow me. So my post gets syndicated to their servers.

With me so far?

What happens when someone shares (reposts) my status?

John is on eggman_social.com
Ringo is on liverpool.drums
Ringo follows John
John reposts my status.
eggman_social.com syndicates my post to liverpool.drums

And so my post goes around the Fediverse! But can I see where it has gone? Well... sort of! Let's look at how.

A note on privacy

People on Mastodon and the Fediverse tend to be privacy conscious. So there are limits - both in the API and the culture - as to what is acceptable.

Some people don't share their "social graph". That is, it is impossible to see who follows them or who they follow.

Users can choose to opt-in or -out of publicly sharing their social graph. They remain in control of their privacy.

In the example above, if Ringo were to reshare John's reshare of my status - John doesn't know about it. Only the original poster (me) gets notified. If John doesn't share his social graph, it might be possible to work out where Ringo saw the status - but that's rather unlikely.

Mastodon has an API rate limit which only allows 80 results per request and 1 request per second. That makes it long and tedious to crawl thousands of results.

Similarly, some instances do not share their social data or expose anything of significance. Some servers may no longer exist, or might have changed names. It's impossible to get a comprehensive view of the entire Fediverse network.

And that's OK! People should be able to set limits on what others can do with their data. The code you're about to see doesn't attempt to breach anyone's privacy. All it does is show me which servers picked up my post. This is information which is already shown to me - but this makes it slightly easier to see.

The Result

I looked at this post of mine which was reposted over 100 times.

It eventually found its way to… 2,547 instances!

Ranging from 0ab.uk to թութ.հայ via godforsaken.website and many more!

And that's one of the things which makes me hopeful this rebellion will succeed. There are a thousand points of light out there - each a shining beacon to doing things differently. And, the more the social media giants tighten their grip, the more these systems will slip through their fingers.

The Code

This is not very efficient code - nor well written. It was designed to scratch an itch. It uses Mastodon.py to interact with the API.

It gets the instance names of all my followers. Then the instance names of everyone who reposted one of my posts.

But it cannot get the instance names of everyone who follows the users who reposted me - because: The only way to get a list of followers from a user on a different instance is to apply for an API key for that instance. Which seems a bit impractical.

But I can get the instance name of the followers of accounts on my instance who reposted me. Clear?

I can also get a list of everyone who favourited my post. If they aren't on my instance, or one of my reposters' followers' instances, they're probably from a reposter who isn't on my instance.

My head hurts.

Got it? Here we go!

import config
from mastodon import Mastodon
from rich.pretty import pprint

#  Set up access
mastodon = Mastodon( api_base_url=config.instance, access_token=config.access_token, ratelimit_method='pace' )

#   Status to check for
status_id = 111040801202691232
print("Looking up status: " + str(status_id))

#   Get my data
me = mastodon.me()
my_id = me["id"]
print("You have User ID: " + str(my_id))

#   Empty sets
instances_all        = set()
instances_followers  = set()
instances_reposters  = set()
instances_reposters_followers  = set()
instances_favourites = set()

#   My Followers
followers = mastodon.account_followers( my_id )
print( "Getting all followers" )
followers_all = mastodon.fetch_remaining( followers )
print("Total followers = " + str( len(followers_all) ) )

#   Get the server names of all my followers
for follower in followers_all:
    if ( "@" in follower["acct"]) :
        f = follower["acct"].split("@")[1]
        instances_all.add( f )
        if ( f not in instances_followers):
            print( "Follower: " + f )
            instances_followers.add( f )
    else :
        instances_all.add( "mastodon.social" )
        instances_followers.add( "mastodon.social" )
total_followers  = len(instances_followers)
print( "Total Unique Followers Instances = " + str(total_followers)  )

#   Reposters
#   Find the accounts which reposted my status
reposters     = mastodon.status_reblogged_by( status_id )
reposters_all = mastodon.fetch_remaining(reposters)

#   Get all the instance names of my reposters
for reposter in reposters_all:
    if ( "@" in reposter["acct"]) :
        r = reposter["acct"].split("@")[1]
        instances_all.add( r )
        if ( r not in instances_followers ) :
            print( "Reposter: " + r )
            instances_reposters.add( r )
total_reposters  = len(instances_reposters)
print( "Total Unique Reposters Instances = " + str(total_reposters)  )

# Followers of reposters     
# This can take a *long* time!   
for reposter in reposters_all:   
    if ( "@" not in reposter["acct"]) :  
        reposter_id = reposter["id"]
        print( "Getting followers of reposter " + reposter["acct"] + " with ID " + str(reposter_id) )
        reposter_followers = mastodon.account_followers( reposter_id )   
        reposter_followers_all = mastodon.fetch_remaining( reposter_followers )  
        for reposter_follower in reposter_followers_all:    
            if ( "@" in reposter_follower["acct"]) : 
                f = reposter_follower["acct"].split("@")[1]
                instances_all.add( f )
                if (f not in instances_reposters_followers) :
                    print( "   Adding " + f + " from " + reposter["acct"] )
                    instances_reposters_followers.add( f )   
total_instances_reposters_followers  = len(instances_reposters_followers)
print( "Total Unique Reposters' Followers Instances = " + str(total_instances_reposters_followers)  )

#   Favourites
#   Find the accounts which favourited my status
favourites     = mastodon.status_favourited_by( status_id )
favourites_all = mastodon.fetch_remaining(favourites)

#   Get all the instance names of my favourites
for favourite in favourites_all:
    if ( "@" in favourite["acct"]) :
        f = favourite["acct"].split("@")[1]
        instances_all.add( f )
        if ( f not in instances_favourites ) :
            print( "Favourite: " + f )
            instances_favourites.add( f )
total_favourites = len(instances_favourites)

print( "Total Unique Favourites Instances  = " + str(total_favourites) )
print( "Total Unique Reposters Instances = " + str(total_reposters)  )
print( "Total Unique Followers Instances = " + str(total_followers)  )
print( "Total Unique Reposters' Followers Instances = " + str( len(instances_reposters_followers) ) )
print( "Total Unique Instances = " + str( len(instances_all) ) )
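The same parsing of the acct value is repeated in every loop above. If you wanted to tidy it up, a helper along these lines might do. Both function names are hypothetical, and the home instance defaults to mastodon.social because that's where I am.

```python
def instance_of(acct, home="mastodon.social"):
    """Local accounts have no @ in their acct, so they live on the home instance."""
    if "@" in acct:
        return acct.split("@", 1)[1]
    return home

def unique_instances(accounts, home="mastodon.social"):
    """Return the set of instance names for a list of account dicts."""
    return {instance_of(account["acct"], home) for account in accounts}
```

For example, unique_instances(followers_all) would give the set of every instance my followers are on.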

Using Selenium & Chrome to automatically download Blob files

@edent — Fri, 15 Sep 2023 11:34:46 +0000

The Selenium WebDriver is a brilliant way to programmatically interact with websites. You can write little Python scripts which can click around inside browser windows and do "stuff".

I use it to download a file generated by a Javascript Blob and automatically save it to disk. Here's how.

Set up the WebDriver

After you've installed Selenium and the Chrome WebDriver, this is the standard boilerplate to use it in Python:

import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

Set Up Chrome

You can pass whatever options and arguments you need - I use it in headless mode which means it doesn't display a window.

chrome_options = Options()
chrome_options.add_argument('--headless=new')
chrome_options.add_argument('--window-size=1920,1080')

Set where to save the files

These options force the blob to download automatically to a specific location. Note: there is no way to set the default file name.

chrome_options.add_experimental_option("prefs", {
        "download.default_directory"  : "/tmp/",
        "download.prompt_for_download": False,
        "download.directory_upgrade"  : True,
})

Download the file

This opens the browser, finds the link, then clicks on it. Your XPATH will be different to mine.

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")

download_button = driver.find_element(By.XPATH, "//button[@data-integration-name='button-download-data-csv']")
download_button.click()

Rename from the default name

As mentioned, there's no way to set a default file name. But if you know what the file is going to be called, you can rename after it has been downloaded.

time.sleep(2) # Wait until the file has been downloaded. Increase if it is a big file
os.rename("/tmp/example.csv", "/home/me/newfile.csv")

#       Stop the driver
driver.quit()
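
A fixed two-second sleep is a guess. As a rough sketch, you could poll for the file instead. The helper name `wait_for_download` is hypothetical, and it assumes Chrome marks in-progress downloads with a `.crdownload` extension in the same directory:

```python
import glob
import os
import tempfile
import time

def wait_for_download(path, timeout=30, poll=0.5):
    """Return True once `path` exists and no .crdownload files remain
    beside it, or False if `timeout` seconds pass first."""
    directory = os.path.dirname(path)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        in_progress = glob.glob(os.path.join(directory, "*.crdownload"))
        if os.path.exists(path) and not in_progress:
            return True
        time.sleep(poll)
    return False

# Quick check using a temporary directory rather than a real browser
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.csv")
    open(target, "w").close()  # Pretend Chrome has finished the download
    finished = wait_for_download(target, timeout=1)
    missing = wait_for_download(os.path.join(tmp, "nope.csv"), timeout=0.2, poll=0.1)
```

In the script above, that would mean calling `wait_for_download("/tmp/example.csv")` in place of `time.sleep(2)` and only renaming if it returns True.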

And there you go. Stick that in a script and you can automatically download and rename Blob URLs.

Importing IntenseDebate Comment XML into Commentics

@edent — Fri, 04 Aug 2023 11:34:29 +0000

This is ridiculously niche. If this is of help to anyone other than to me... please shout!

The IntenseDebate comment system is slowly dying. It hasn't received any updates from Automattic for years. Recently it stopped being able to let users submit new comments.

So I've switched to Commentics which is a self-hosted PHP / MySQL comment system. It's lightweight, pretty good at respecting users' privacy, and very customisable. But it doesn't let you easily import comments. Here's how I fixed that.

Export From IntenseDebate

Go to your site's dashboard and click export. They'll email you a link when the process has finished. It's an XML file which looks something like this:



    
        https%3A%2F%2Fopenbenches.org%2Fbench%2F123
        The Page Title
        https://openbenches.org/bench/123
        
            
                0
                
                terence.eden@example.com
                https://example.com
                198.51.100.123
                https://en.wikipedia.org/wiki/Pokarekare_Ana  ]]>
                2017-08-01 14:51:55
                2017-08-01 14:51:55
                1
            
        
    
    ...

Understand the Commentics Table Structure

Once you've installed and configured Commentics, you will be able to replace the database with your old comments. To do that, you'll need to convert your exported XML file into three CSV files.

There are three tables you need to understand.

Users Table

This is the easiest one to understand. The columns are:

id, avatar_id, avatar_pending_id, avatar_selected, avatar_login, name, email, moderate, token, to_all, to_admin, to_reply, to_approve, format, ip_address, date_modified, date_added

Hopefully they're self-explanatory. The moderate is always set to default. The token can be any random string. A typical user will look like this:

2, 0, 0, , , Terence Eden, terence.eden@example.com, default, abc123, 1, 1, 1, 1, html, 127.0.0.1, 2017-08-01 14:51:55, 2017-08-01 14:51:55

Note: the user's URL is not part of this table! That confused me at first. It is part of the comments table.

Pages Table

Every comment is associated with a page. Therefore every page needs a row in this table.

id, site_id, identifier, reference, url, moderate, is_form_enabled, date_modified, date_added

Again, pretty straightforward. A typical page looks like:

123, 1, openbenches.org/bench/123, OpenBenches - Bench 123, https://openbenches.org/bench/123, default, 1, 2023-07-16 21:29:52, 2023-07-16 21:29:52

The id is a unique number. I've set it to be the same as my page's actual ID - but it doesn't need to be.

Comments Table

This is a slightly cumbersome table:

id, user_id, page_id, website, town, state_id, country_id, rating, reply_to, headline, original_comment, comment, reply, ip_address, is_approved, notes, is_admin, is_sent, sent_to, likes, dislikes, reports, is_sticky, is_locked, is_verified, number_edits, session_id, date_modified, date_added

You can ignore most of them - unless you really want to record someone's home town - and a typical comment looks like this:

4, 17, 123, , , , , , , , Wow! That's a great bench. And I imagine the view must be special too. , Wow! That's a great bench. And I imagine the view must be special too., , 127.0.0.1, 1, Moderating all comments., 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, abc123, 2017-08-10 17:56:50, 2017-08-10 17:56:50

The id is unique per comment. The user_id is matched to the id on the users table. And the page_id is the id on the pages table. notes appears to be hardcoded to Moderating all comments.
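
With 29 columns, it is easy to put a value in the wrong position. Here's a minimal sketch of one way to avoid that, using `csv.DictWriter` so each value is named. The column list and defaults come from the example row above. The function name `comment_row` and the placeholder session ID are my own inventions:

```python
import csv
import io

COMMENT_COLUMNS = [
    "id", "user_id", "page_id", "website", "town", "state_id", "country_id",
    "rating", "reply_to", "headline", "original_comment", "comment", "reply",
    "ip_address", "is_approved", "notes", "is_admin", "is_sent", "sent_to",
    "likes", "dislikes", "reports", "is_sticky", "is_locked", "is_verified",
    "number_edits", "session_id", "date_modified", "date_added",
]

def comment_row(comment_id, user_id, page_id, html, date, website="", session_id="abc123"):
    """Fill in the columns that matter; everything else is blank or zero."""
    row = dict.fromkeys(COMMENT_COLUMNS, "")
    for flag in ("is_admin", "sent_to", "likes", "dislikes", "reports",
                 "is_sticky", "is_locked", "is_verified", "number_edits"):
        row[flag] = "0"
    row.update({
        "id": comment_id, "user_id": user_id, "page_id": page_id,
        "website": website, "original_comment": html, "comment": html,
        "ip_address": "127.0.0.1", "is_approved": "1",
        "notes": "Moderating all comments.", "is_sent": "1",
        "session_id": session_id, "date_modified": date, "date_added": date,
    })
    return row

# Write one row to an in-memory buffer, then read it back to inspect it
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COMMENT_COLUMNS)
writer.writerow(comment_row(4, 17, 123, "Wow! That's a great bench.", "2017-08-10 17:56:50"))
fields = next(csv.reader(io.StringIO(buffer.getvalue())))
```

The writer emits the values in the order of `COMMENT_COLUMNS`, so the CSV lines up with the table no matter what order the dictionary was filled in.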

Horrible evil no-good Python

Here's a Python script I made after a few beers. It ingests the XML file and spits out 3 CSV files.

import csv
import random
import string

import xml.etree.ElementTree as ET
tree = ET.parse('IntenseDebate.xml')
root = tree.getroot()

comment_header = ["id","user_id","page_id","website","town","state_id","country_id","rating","reply_to","headline","original_comment","comment","reply","ip_address","is_approved","notes","is_admin","is_sent","sent_to","likes","dislikes","reports","is_sticky","is_locked","is_verified","number_edits","session_id","date_modified","date_added"]

i = 1

with open('comments_to_import.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # writer.writerow(comment_header)   
    for child in root:
        for children in child:
            if (children.tag == "guid") :
                guid = children.text
                page_id = int(''.join(filter(str.isdigit, guid)))
            if (children.tag == "comments") :
                for comments in children :
                    commentID = comments.attrib["id"]
                    for commentData in comments :
                        if (commentData.tag == "name") :
                            username = commentData.text
                        if (commentData.tag == "text") :
                            commentHTML = commentData.text
                        if (commentData.tag == "url") :
                            userurl = commentData.text
                        if (commentData.tag == "date") :
                            commentDate = commentData.text
                            i = i+1
                            digits  = random.choices(string.digits, k=10)
                            letters = random.choices(string.ascii_lowercase, k=10)
                            sample  = random.sample(digits + letters, 20)
                            session_id  = "".join(sample)
                            row = [i,i,page_id,userurl,"","","","","","",commentHTML,commentHTML,"","127.0.0.1","1","Moderating all comments.","0","1","0","0","0","0","0","0","0","0",session_id,commentDate,commentDate]
                            writer.writerow(row)

pages_header = ["id","site_id","identifier","reference","url","moderate","is_form_enabled","date_modified","date_added"]
with open('pages_to_import.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # writer.writerow(pages_header)
    for i in range(30000):
        row = [i,"1","openbenches.org/bench/" + str(i),"OpenBenches - Bench " + str(i),"https://openbenches.org/bench/" + str(i),"default","1","2023-07-16 21:29:52","2023-07-16 21:29:52"]
        writer.writerow(row)

users_header = ["id","avatar_id","avatar_pending_id","avatar_selected","avatar_login","name","email","moderate","token","to_all","to_admin","to_reply","to_approve","format","ip_address","date_modified","date_added"]
i = 1
with open('users_to_import.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # writer.writerow(users_header)
    for child in root:
        for children in child:
            if (children.tag == "guid") :
                guid = children.text
                page_id = int(''.join(filter(str.isdigit, guid)))
            if (children.tag == "comments") :
                for comments in children :
                    commentID = comments.attrib["id"]
                    for commentData in comments :
                        if (commentData.tag == "name") :
                            username = commentData.text
                        if (commentData.tag == "url") :
                            userurl = commentData.text
                        if (commentData.tag == "date") :
                            commentDate = commentData.text
                            i = i+1
                            digits  = random.choices(string.digits, k=10)
                            letters = random.choices(string.ascii_lowercase, k=10)
                            sample  = random.sample(digits + letters, 20)
                            session_id  = "".join(sample)
                            row = [i,"0","0","","",username,"","default",session_id,"1","1","1","1","html","127.0.0.1",commentDate,commentDate]
                            writer.writerow(row)

Note: my pages all follow a numeric sequence /1, /2 etc, hence the loop to quickly regenerate them. Your pages may be different.
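
The script gets the page ID by joining every digit it finds in the guid. That works for my URLs, but it would go wrong if a digit appeared anywhere else in the address. If your guids look different, something like this sketch may be safer. It assumes the ID is the last number in the URL. The function names and the second URL are made up for illustration:

```python
import re

def page_id_from_guid(guid):
    """Return the trailing number of a URL-like guid, or None if there isn't one."""
    match = re.search(r"(\d+)/?$", guid)
    return int(match.group(1)) if match else None

def all_digits(guid):
    """The approach used in the script above: join every digit in the string."""
    return int("".join(filter(str.isdigit, guid)))

plain  = "https://openbenches.org/bench/123"
tricky = "https://blog2.example.com/bench/123"  # Hypothetical URL with an extra digit

plain_ids  = (page_id_from_guid(plain),  all_digits(plain))
tricky_ids = (page_id_from_guid(tricky), all_digits(tricky))
```

Both approaches agree on the first URL. On the second, joining all the digits picks up the "2" from the hostname as well.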

Importing

You will need to use phpMyAdmin or a similar database manager. TRUNCATE the tables for pages, users, and comments. Then import the CSV files into each one.

If everything works, you will have all your old comments imported into your new system.

Enjoy!

Shakespeare Serif - an experimental font based on the First Folio

@edent — Sat, 29 Jul 2023 11:34:57 +0000

Disclaimer! Work In Progress! See source code.

I recently read this wonderful blog post about using 17th Century Dutch fonts on the web. And, because I'm an idiot, I decided to try and build something similar using Shakespeare's first folio as a template.

Now, before setting off on a journey, it is worth seeing if anyone else has tried this before. I found David Pustansky's First Folio Font. There's not much info about it, other than it's based on the 1623 folio. It's a nice font, but missing brackets and a few other pieces of punctuation. Also, no ligatures. And the long s is in the wrong place.

So, let's try to build a font!

You can read how it works, or skip straight to the demo.

Get some scans

There are various scans of the First Folio. I picked The Bodleian's scan as it seemed the highest resolution.

I plucked a couple of pages at random to see what I could find. Of course, a modern font can't replicate the vagaries of hot metal printing. As you can see here, each letter "y" is substantially different.

Within the plays, there are some italic characters - which could be used to make a variant font. You can also see just how poor quality some of the letters are.

There are also plenty of ligatures to choose from:

Ready? Let's go!

Extract the characters

This Python code reads in an image file. It then extracts every distinct letter, number, and punctuation mark. It then detects which character it is and saves each glyph to disk with a filename like this:

As you can see, the text detection is good, but the letter recognition is poor.

import cv2
import pytesseract
from PIL import Image

def preprocess_image(image_path):
    # Load the image using OpenCV
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Thresholding to convert to binary image
    _, binary_image = cv2.threshold(image, 128, 255, cv2.THRESH_BINARY_INV)

    # Find contours to isolate individual letters
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    return image, contours

def extract_and_save_letters(image, contours, output_directory):
    # Create output directory if it doesn't exist
    import os
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    for i, contour in enumerate(contours):
        x, y, w, h = cv2.boundingRect(contour)

        # Crop and save each letter as a separate image
        letter_image = image[y:y + h, x:x + w]

        # (Don't) Perform OCR to extract the text (letter) from the contour
        letter_text = "_"
        #letter_text = pytesseract.image_to_string(letter_image, config='--psm 10')
        #letter_text = letter_text.strip()  # Remove leading/trailing whitespace

        # Create a filename with the detected letter
        letter_filename = f"letter_{letter_text}_{i}.png"

        letter_path = os.path.join(output_directory, letter_filename)
        cv2.imwrite(letter_path, letter_image)

if __name__ == "__main__":
    input_image_path = "letters.jpg"
    output_directory = "/tmp/letters/"

    # Preprocess the image
    image, contours = preprocess_image(input_image_path)

    # Perform OCR and save individual letters
    extract_and_save_letters(image, contours, output_directory)

Something to note - findContours with RETR_EXTERNAL treats each contiguous blob of ink as a separate shape. So it loses the dots from i, j, :, and ;. But it is quick.

Detecting Dots

In order to get glyphs which are made of vertically separated parts, we need to vertically erode the image so it looks like this:

import numpy as np

# Erode the image vertically
kernel = np.array([[0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0]], dtype=np.uint8)

erode = cv2.erode(image, kernel, iterations=6)

We use this eroded image for contiguous detection - but we do the actual cropping on the original image.
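
If it isn't obvious why eroding joins a dot to its stem, here's a toy illustration using only NumPy. It is not the code I used. It stands in for cv2.erode with a hand-rolled 3x1 minimum filter, and the tiny "i" is made up. On a dark-on-white image, taking the minimum of each pixel and its vertical neighbours stretches the ink up and down:

```python
import numpy as np

def erode_vertically(image, iterations=1):
    """Grayscale erosion with a 3x1 vertical kernel: each pixel becomes the
    minimum of itself and the pixels directly above and below it."""
    out = image.copy()
    for _ in range(iterations):
        # Pad with white so the image border never adds any ink
        padded = np.pad(out, ((1, 1), (0, 0)), constant_values=255)
        out = np.minimum(np.minimum(padded[:-2], padded[1:-1]), padded[2:])
    return out

def dark_runs(column, threshold=128):
    """Count the separate vertical runs of dark pixels in one column."""
    dark = column < threshold
    # A run starts wherever a dark pixel follows a light one (or the top edge)
    starts = dark & ~np.concatenate(([False], dark[:-1]))
    return int(starts.sum())

# A crude letter "i", dark (0) on white (255): dot, two-pixel gap, stem
letter = np.full((10, 3), 255, dtype=np.uint8)
letter[1, 1] = 0    # The dot
letter[4:9, 1] = 0  # The stem

eroded = erode_vertically(letter, iterations=1)
runs_before = dark_runs(letter[:, 1])
runs_after  = dark_runs(eroded[:, 1])
```

Before eroding, the middle column has two separate runs of ink. Afterwards the gap has closed, so a contour finder would see one shape.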

As you can see, it does make some characters touch each other - which means you get occasional crops like this:

They can either be manually split, or ignored.

Put each letter into a folder

There's no automated way to do this. It's just a lot of tedious dragging and dropping. It's hard to tell the difference between o and O, or commas and apostrophes.

Ideally we want several of each glyph because we're about to...

Find the average letterform

Here's a selection of letter "e" images which were extracted.

I didn't want to make arbitrary decisions about which letters I liked best. So I cheated.

I copied all the letter "e"s into a folder. I used Python to create the average letter based on the two-dozen or so that I'd extracted. This code takes all the images in a directory, and spits out a 1bpp average letter - like this:

import os
import numpy as np
import argparse
import math
from PIL import Image

def get_arguments():
    ap = argparse.ArgumentParser()
    ap.add_argument('-l', '--letter', type=str,
                    help='The letter you want to average')
    arguments = vars(ap.parse_args())

    return arguments

def load_and_resize_images_from_directory(directory, target_size):
    image_files = [f for f in os.listdir(directory) if f.endswith(".png")]

    images = []
    for image_file in image_files:
        image_path = os.path.join(directory, image_file)
        print("Reading " + image_path)
        image = Image.open(image_path).convert("L")  # Convert to grayscale

        # Create a new white background image
        new_size = (target_size[0], target_size[1])
        new_image = Image.new("L", new_size, color=255)  # White background

        old_width, old_height = image.size

        # Center the image
        x1 = int(math.floor((target_size[0] - old_width)  / 2))
        y1 = int(math.floor((target_size[1] - old_height) / 2))

        # Paste the image at the center
        new_image.paste(image, (x1, y1, x1 + old_width, y1 + old_height))

        # Make it larger to see if that improves the curve detection  
        new_image = new_image.resize( (600,600), Image.LANCZOS)
        images.append(new_image)

    return images

def calculate_average_image(images):
    # Convert the list of images to numpy arrays
    images_array = [np.array(img) for img in images]

    # Calculate the average image along the first axis
    average_image = np.mean(images_array, axis=0)

    return average_image

def convert_to_1bpp(average_image, threshold=120):
    # Convert the average image to 1bpp by setting a threshold value
    binary_image = np.where(average_image >= threshold, 255, 0).astype(np.uint8)

    return binary_image

def save_1bpp_image(binary_image, output_path):
    # Convert the numpy array to a binary image
    binary_image = Image.fromarray(binary_image, mode="L")

    # Save the 1bpp monochrome image to the specified output path
    binary_image.save(output_path)

if __name__ == "__main__":
    args = get_arguments()
    letter = args['letter']
    input_directory   = "../letters/" + letter + "/"
    output_png_path = "../letters/" + letter + ".png"
    target_size = (75, 75)  # Set the desired target size for resizing

    # Load, resize, and add border to all images from the directory
    images = load_and_resize_images_from_directory(input_directory, target_size)

    # Calculate the average image
    average_image = calculate_average_image(images)

    # Convert the average image to 1bpp
    binary_image = convert_to_1bpp(average_image)

    # Save the 1bpp monochrome image
    save_1bpp_image(binary_image, output_png_path)

One Big Image

The next step is to create a single image which holds all of the glyphs. Our good friend ImageMagick comes to the rescue here:

montage *.png -tile 12x8 -geometry +10+10 all_glyphs.png

That takes all of the average symbol .png files and places them on a single image. It looks like this:

Trace Those Glyphs

The GlyphTracer App takes the image and generates a Spline Font Database. It isn't the most intuitive app to use. But after a bit of clicking around you can work out how to assign each image to a glyph.

GlyphTracer uses potrace which turns those raggedy rasters into smoothly curved paths.

Once done, we're on to the next step.

Forge Those Fonts!

The venerable FontForge will open the SFD and show us what the proto-font looks like:

As you can see, all the letters have been vertically centred. So double tap and edit their position - you can also adjust the curves if you like:

The final result looks something like this:

FontForge's "File" ➡ "Generate Font" will let you save the output as TTF, WOFF2, or anything else you want.

Demo!

Here's what the font looks like when rendered on the web:

Two houſeholds, both alike in dignity!
Alas poor Yorik; I knew him Horatio.
To be? Or not to be? That's the uestion.
Bump sickly, vexing wizard! Be sly, fox, and charm the dragon's breath.

TODO

Get more sample images from the 1st Folio.
Extract more letters, numbers, ligatures, and symbols.
Sort symbols into sub-directories.
Generate font with complete alphabet.
Tidy up curves.
Set correct height, ascenders, descenders, etc.
Make the ligatures automatic.
Other font stuff that I haven't even thought of yet!

Want to help out? See the source code on GitHub.

Posting Untappd Checkins to Mastodon (and other services)

@edent — Thu, 02 Mar 2023 12:34:56 +0000

I'm a big fan of Untappd. It's a social drinking app which lets you check in to a beer and rate it. Look, we all need hobbies, mine is drinking cider. You can see a list of everything I've drunk over the last 13 years. Nearly 900 different pints!

After checking in, the app automatically posts to Twitter. But who wants to prop up Alan's failing empire? Not me! So here's some quick code to liberate your data and post it elsewhere.

There are two ways - APIs and Screen Scraping.

API

First up, a big disclaimer. Untappd has an API - but isn't accepting new users:

Thank you for your interest in Untappd’s API. At this time, we are no longer accepting new applications for API access as we work to improve our review and support processes. We do not have a planned date to begin accepting new applications, so please check back soon.

If you already have an API key (and I do) you can call your own data. This code saves the ID of the most recent checkin to a file. Each time it runs, it checks if the ID in the file is the same as what's returned by the API. If they are different, the post is published and the new ID is saved.

This is rough and ready code:

#!/usr/bin/env python
from mastodon import Mastodon
import json
import requests
import config

#  Set up access
mastodon = Mastodon( api_base_url=config.instance, access_token=config.write_access_token )

#   Untappd API - grab the most recent checkin
untappd_api_url = 'https://api.untappd.com/v4/user/checkins/edent?limit=1&client_id=' + config.untappd_client_id + '&client_secret='+ config.untappd_client_secret

r = requests.get(untappd_api_url)

untappd_data = r.json()

#   Latest checkin object
checkin = untappd_data["response"]["checkins"]["items"][0]
untappd_id = checkin["checkin_id"]

#   Was this ID the last one we saw?
check_file = open("untappd_last", "r")
last_id = int( check_file.read() )
print("Found " + str(last_id) )
check_file.close()

if (last_id != untappd_id ) :
    print("Found new checkin")
    check_file = open("untappd_last", "w")
    check_file.write( str(untappd_id) )
    check_file.close()
    #   Start creating the message
    message = ""

    if "checkin_comment" in checkin :
        message += checkin["checkin_comment"]

    if "beer" in checkin :
        message += "\nDrinking: " + checkin["beer"]["beer_name"]

    if "brewery" in checkin :
        message += "\nBy: "       + checkin["brewery"]["brewery_name"]

    if "venue" in checkin :
        if "venue_name" in checkin["venue"] :
            message += "\nAt: "       + checkin["venue"]["venue_name"]
    #   Scores etc
    untappd_checkin_url = "https://untappd.com/user/edent/checkin/" + str(untappd_id)
    untappd_rating      = checkin["rating_score"]
    untappd_score       = "🍺" * int(untappd_rating)

    message += "\n" +  untappd_score + "\n" + untappd_checkin_url + "\n" + "#untappd"
    print( message )
    #   Post to Mastodon. Use idempotency just in case something went wrong
    mastodon.status_post(status = message, idempotency_key = str(untappd_id))
else :
    print("No new checkin")

This doesn't do media or badges, but it's good enough to start with.
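
One rough edge: the script falls over if the untappd_last file doesn't exist yet. Here's a sketch of a more forgiving version of the "have I seen this ID?" check. The helper names `read_last_id` and `is_new_checkin` are hypothetical, and the demo uses a temporary directory and made-up IDs rather than the real API:

```python
import os
import tempfile
from pathlib import Path

def read_last_id(path):
    """Return the stored checkin ID, or None if nothing usable is stored."""
    try:
        return int(Path(path).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None

def is_new_checkin(path, checkin_id):
    """Record `checkin_id` and return True if it differs from the stored one."""
    if read_last_id(path) == checkin_id:
        return False
    Path(path).write_text(str(checkin_id))
    return True

# Demo: the first sighting is new, a repeat is not, a later checkin is new again
with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "untappd_last")
    first  = is_new_checkin(state_file, 1001)
    repeat = is_new_checkin(state_file, 1001)
    later  = is_new_checkin(state_file, 1002)
    stored = read_last_id(state_file)
```

A missing or empty file is treated as "never seen anything", so the very first run posts the latest checkin.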

Screen Scraping

The Untappd HTML is pretty uniform, so using something like Beautiful Soup is possible.

Here's some code to get you started:

from bs4 import BeautifulSoup
import requests

#   Set the URL and headers
url = "https://untappd.com/user/edent"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'}

#   Get the HTML
response = requests.get(url, headers=headers)
page_html = response.text

#   Parse the HTML into soup
soup = BeautifulSoup(page_html, 'html.parser')

#   Grab the data from the first checkin

untappd_id = int( soup.find("div", class_="item")["data-checkin-id"] )

comment = soup.find("p", class_="comment-text").text

#   Steal the beer from the icon's alt text
beer = soup.find("a", class_="label").find("img")["alt"]

#   Rating is a data element
rating = soup.find("div", class_="caps")["data-rating"]

#   Was this ID the last one we saw?
check_file = open("untappd_last", "r")
last_id = int( check_file.read() )
check_file.close()

if (last_id != untappd_id ) :
    check_file = open("untappd_last", "w")
    check_file.write( str(untappd_id) )
    check_file.close()
    message = "..."  # Build the message as in the API example above, then post it

And from there you should have enough to start posting your checkins everywhere. Stick that code in a crontab and have it run periodically.

Cheers!

Getting Started with Mastodon's Conversations API

@edent — Thu, 17 Nov 2022 12:34:10 +0000

The social network service "Mastodon" allows people to publish posts. People can reply to those posts. Other people can reply to those replies - and so on. What does that look like in the API? Here's a quick guide to the concepts you need to know - and some code to help you visualise conversations.

When you scroll through the website, you normally see a list of replies. It looks like this:

Because it acts as a one-dimensional list, there's no easy way to figure out which post someone is replying to.

The data structure underlying the conversation is quite different. It actually looks like this:

Concepts

In Mastodon's API, a post is called a status.

Every status on Mastodon has an ID. This is usually a Snowflake ID which is represented as a number.

When someone replies to a status on Mastodon, they create a new status which has a field called in_reply_to_id. As its name suggests, it holds the ID of the status they are replying to.

Let's imagine this simple conversation:

Ada: "How are you?"
Bob: "I'm fine. And you?"
Ada: "Quite well, thank you!"

Message 2 is in reply to message 1. Message 3 is in reply to message 2.

In Mastodon's jargon, message 1 is the ancestor of message 2. Similarly, message 3 is the descendant of message 2.

  → Descendants →
1--------2-------3
   ← Ancestors ←

Branches

Now, let's imagine a more complicated conversation - one with branches!

1. Alice: What's your favourite pizza topping?
├── 2. Bette: Pineapple
│   ├── 4. Chuck: You make me sick!
│   └── 7. Dave: Yeah, I love pineapple too
└── 3. Chuck: Mushroom are the best
    ├── 5. Alice: Really?
    │   └── 6. Dave: Button mushrooms are best!
    └── 8. Elle: I like them too!

As you can see, people reply in threads. In this example, 2 is a different "branch" of the conversation than 3.

It looks a bit more complicated with hundreds of replies, but that's it! That's all you need to know!

API

If you want to download a single status with an ID of 1234 the API call is /api/v1/statuses/1234

If you want to download a conversation, it is a little bit more complicated. Mastodon's API calls a conversation a context.

Let's take the above simple example - Ada and Bob speaking. Ada's first status has an ID of 1. To get the conversation, the API call is /api/v1/statuses/1/context

That returns two things:

A list of ancestors. This is empty because 1 is the first status in this conversation.
A list of descendants. This contains statuses 2 and 3.

You will note, the context does not return the status 1 itself.

Let's suppose that, instead of asking for the context of status 1, we instead asked for 2. This would return:

A list of ancestors. This contains status 1.
A list of descendants. This contains status 3.

What about if we asked for 3? This would return:

A list of ancestors. This contains statuses 1 and 2.
A list of descendants. This is empty because 3 is the last message in this conversation.

Branches

When it comes to complex threads - like the pizza example - things become a bit more difficult. Let's see the example again:

1. Alice: What's your favourite pizza topping?
├── 2. Bette: Pineapple
│   ├── 4. Chuck: You make me sick!
│   └── 7. Dave: Yeah, I love pineapple too
└── 3. Chuck: Mushroom are the best
    ├── 5. Alice: Really?
    │   └── 6. Dave: Button mushrooms are best!
    └── 8. Elle: I like them too!

Suppose we ask for the context of the message with ID 5. This would return:

A list of ancestors. This contains statuses 1 and 3
A list of descendants. This contains status 6.

That's it!?!? Where are the rest? They are part of a different conversation branch. Even status 8 isn't returned because it's a reply to 3, not 5.

In order to get the full conversation, we need to be sneaky!

The list of ancestors contains the first message in the conversation. So we can grab that, and then call context again for its ID.
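
Before touching the real API, the behaviour can be modelled in plain Python. This is only a sketch of the idea - the REPLIES dictionary is a made-up stand-in for a server, holding the pizza thread as status ID → in_reply_to_id. The real API makes no promise that descendants arrive in sorted order. I sort them here just to make the result easy to read:

```python
# Hypothetical stand-in for a server: status ID -> in_reply_to_id
REPLIES = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 5, 7: 2, 8: 3}

def context(status_id, replies=REPLIES):
    """Mimic the shape of a context: ancestors (oldest first) and descendants."""
    # Walk up the chain of parents to collect the ancestors
    ancestors = []
    parent = replies[status_id]
    while parent is not None:
        ancestors.insert(0, parent)
        parent = replies[parent]

    # Walk down through replies, and replies to replies, for the descendants
    descendants = []
    queue = [status_id]
    while queue:
        current = queue.pop(0)
        children = [s for s, p in replies.items() if p == current]
        descendants.extend(children)
        queue.extend(children)
    return {"ancestors": ancestors, "descendants": sorted(descendants)}

def full_conversation(status_id):
    """Climb to the root via the first ancestor, then ask for its context."""
    ctx = context(status_id)
    root = ctx["ancestors"][0] if ctx["ancestors"] else status_id
    return [root] + context(root)["descendants"]

branch = context(5)            # Only the branch that 5 sits on
whole  = full_conversation(5)  # Everything, via the root status
```

Asking for the context of 5 gives ancestors 1 and 3 and descendant 6, as described above. Going via the root returns all eight statuses.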

Let's dive into some Python code to see how it works.

Code

This uses the Mastodon.py library for calling the Mastodon API and the Python treelib to create a conversation tree data structure.

This code connects to Mastodon and receives the status for a single ID.

from mastodon import Mastodon
from treelib import Node, Tree

mastodon = Mastodon( api_base_url="https://mastodon.example", access_token="Your personal access token from your instance" )

status_id =  109348943537057532 
status = mastodon.status(status_id)

Getting the conversation means calling the context API:

conversation = mastodon.status_context(status_id)

⚠ Note: Calling the context on a large thread may take a long time. The longer the conversation, the longer you'll have to wait.

If there are ancestors, that means we are only on a single branch. The 0th ancestor is the top of the conversation tree. So let's get the context for that top status:

if len(conversation["ancestors"]) > 0 :
   status = conversation["ancestors"][0]
   status_id = status["id"]
   conversation = mastodon.status_context(status_id)

Next, we need to create a data structure to hold the conversation. We'll start by adding to it the first status in the conversation:

tree = Tree()

tree.create_node(status["uri"], status["id"])

Finally, we add any replies which are in the descendants. It is possible that some earlier statuses have been deleted. So we won't add any statuses which are replies to deleted statuses:

for status in conversation["descendants"] :
   try :
      tree.create_node(status["uri"], status["id"], parent=status["in_reply_to_id"])
   except :
      #  If a parent node is missing
      print("Problem adding node to the tree")

That's it! Let's show the tree:

tree.show()

Here's what it should look like:

2022-11-14 20:02 Edent: Today I was meant to be flying in to San Francisco to attend Twitter's Developer Conference - Chirp.Twitter had paid for my flights and hotel, because I was one of their developer insiders. I planned to spend the week meeting friends old and new.Instead, Alan the Hyperprat canceled the conference. So I'm staying in the UK.So I'm going to spend the week hacking on Mastdon's #API and building cool shit.  That'll show him!You can see what I'm working on at https://shkspr.mobi/blog/2022/11/building-an-on-this-day-service-for-mastodon/ https://mastodon.social/users/Edent/statuses/109343943300929632
├── 2022-11-14 20:10 Edent: Oh! And I was meant to be attending a Belle & Sebastian gig tonight. I canceled those tickets for I could fly to SF.So far, I reckon Alan's acquisition of Twitter has cost me close to £190.Wonder if he's good for the money? https://mastodon.social/users/Edent/statuses/109343972435801664
│   ├── 2022-11-14 20:14 thehodge: @Edent reminds me of the time I was booked to speak at a conference in Munich and I excitedly booked a behind the scenes tour of the worlds largest miniature city!Then the company went under!Gutted. https://mastodon.social/users/thehodge/statuses/109343989481494630
│   ├── 2022-11-14 21:16 Janiqueka: @Edent the way my bill for him keeps increasing https://mastodon.online/users/Janiqueka/statuses/109344233355230523
│   ├── 2022-11-14 21:19 henry: @Edent I was due to be at B&S tomorrow but it’s been postponed again.. not sure if that makes it better or worse for you! https://social.lc/users/henry/statuses/109344244402822729
│   │   └── 2022-11-15 04:53 Edent: @henry again!? Ah well!Hope you get to see them soon. https://mastodon.social/users/Edent/statuses/109346031194446940
│   ├── 2022-11-15 09:18 Amandafclark: @Edent send him an invoice :) https://mastodon.social/users/Amandafclark/statuses/109347071811426672
│   └── 2022-11-15 11:29 Edent: One of the #MastodonAPI projects I'm working on is a better way to view long & complex threads.You may have seen me build something similar for the other site a while ago - demo at https://shkspr.mobi/blog/2021/09/augmented-reality-twitter-conversations/ - so I'm hoping I can do something similarly interesting.Main limitation is getting *all* of the conversation threads. It looks like the context API isn't paginated. But I might be being thick. https://mastodon.social/users/Edent/statuses/109347587353822637
│       ├── 2022-11-15 11:36 bensb: @Edent Excellent project. You might have seen, but there's also this feature request for better 🧵 handling: https://github.com/mastodon/mastodon/issues/8615 https://genomic.social/users/bensb/statuses/109347612990393791
│       ├── 2022-11-15 11:39 Edent: Cor! That @katebevan is good for engagement! Look at all those conversations she's kicked off! https://mastodon.social/users/Edent/statuses/109347627634008550
│       │   ├── 2022-11-15 11:58 Edent: Indeed, how could they be?That means that ID of a reply is different depending on where you see it.So the ID of this post is:mastodon. social /@ edent/ 123456But when you see it on your server, it might appear as:your. server /@ edent/ 987654The #MastodonAPI copes with this really well. But it is a mite confusing to get one's head around. https://mastodon.social/users/Edent/statuses/109347703064222520
│       │   │   ├── 2022-11-15 12:02 erincandescent: @Edent the numeric IDs are not part of the protocol - it's all URL based. Pleroma uses UUIDs for example https://queer.af/users/erincandescent/statuses/109347716173491502
│       │   │   │   └── 2022-11-15 12:06 Edent: @erincandescent oh! That's interesting. Thanks. https://mastodon.social/users/Edent/statuses/109347734283971306

Once you have a tree, you can format the contents however you like.
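
For example, here's a rough sketch of building a label with the date, author, and text, similar to the output above. It assumes created_at has already been parsed into a datetime (as Mastodon.py does) and uses a deliberately crude regular expression to strip the HTML. The function name `status_label` and the example status are made up:

```python
import re
from datetime import datetime, timezone
from html import unescape

def status_label(status):
    """Build a one-line label: date, author, plain-text content, URI."""
    text = re.sub(r"<[^>]+>", "", status["content"])  # Crude tag stripper
    text = unescape(text).strip()
    when = status["created_at"].strftime("%Y-%m-%d %H:%M")
    return f'{when} {status["account"]["username"]}: {text} {status["uri"]}'

# A hand-made stand-in for a status returned by the API
example = {
    "created_at": datetime(2022, 11, 14, 20, 2, tzinfo=timezone.utc),
    "account": {"username": "Edent"},
    "content": "<p>Hello &amp; welcome</p>",
    "uri": "https://mastodon.example/users/Edent/statuses/1",
}
label = status_label(example)
```

You could then pass `status_label(status)` instead of `status["uri"]` as the first argument to `tree.create_node`.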

Grab the code

You can download the code for my Mastodon API tools from CodeBerg. Enjoy!