<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
	<title>LLM &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/llm/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Sat, 21 Mar 2026 08:16:01 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>LLM &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[Removing "/Subtype /Watermark" images from a PDF using Linux]]></title>
		<link>https://shkspr.mobi/blog/2026/01/removing-subtype-watermark-images-from-a-pdf-using-linux/</link>
					<comments>https://shkspr.mobi/blog/2026/01/removing-subtype-watermark-images-from-a-pdf-using-linux/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Thu, 22 Jan 2026 12:34:02 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=63035</guid>

					<description><![CDATA[Problem: I&#039;ve received a PDF which has a large &#34;watermark&#34; obscuring every page.  Investigating: Opening the PDF in LibreOffice Draw allowed me to see that the watermark was a separate image floating above the others.  Manual Solution: Hit page down, select image, delete, repeat 500 times. BORING!  Further Investigating: Using pdftk, it&#039;s possible to decompress a PDF. That makes it easier to look …]]></description>
										<content:encoded><![CDATA[<p><strong>Problem:</strong> I've received a PDF which has a large "watermark" obscuring every page.</p>

<p><strong>Investigating:</strong> Opening the PDF in LibreOffice Draw allowed me to see that the watermark was a separate image floating above the others.</p>

<p><strong>Manual Solution:</strong> Hit page down, select image, delete, repeat 500 times. BORING!</p>

<p><strong>Further Investigating:</strong> Using <a href="https://linux.die.net/man/1/pdftk">pdftk</a>, it's possible to decompress a PDF. That makes it easier to look through manually.</p>

<p><code>pdftk input.pdf output output.pdf uncompress</code></p>

<p>Hey presto! A PDF you can open in a text editor! Deep joy!</p>

<p><strong>Searching:</strong> On a hunch, I searched for "watermark" and found several lines like this:</p>

<pre><code class="language-_">&lt;&lt;
/Length 548
&gt;&gt;
stream
/Figure &lt;&lt;/MCID 0 &gt;&gt;BDC q 0 0 477 733.464 re W n q /GS0 gs 479.2799893 0 0 735.5999836 -1.0800002 -1.0559941 cm /Im0 Do Q EMC 
/Figure &lt;&lt;/MCID 1 &gt;&gt;BDC Q q 28.333 300.661 420.334 126.141 re W n q /GS0 gs 420.3339603 0 0 126.1418879 28.3330078 300.6610601 cm /Im1 Do Q EMC
/Figure &lt;&lt;/MCID 2 &gt;&gt;BDC Q q 16.106 0 444.787 215.464 re W n q /GS0 gs 444.7874274 0 0 216.5921386 16.1062775 -1.1281493 cm /Im2 Do Q EMC
/Artifact &lt;&lt;/Subtype /Watermark /Type /Pagination &gt;&gt;BDC Q q 0.7361145 0 0 0.7361145 113.3616638 240.8575745 cm /GS1 gs /Fm0 Do Q EMC
endstream
endobj
</code></pre>

<p>Those are <a href="https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/pdfmark_Logical.html">Marked Content Blocks</a>.  In <em>theory</em> you can just chop out the line with <code>/Subtype /Watermark</code> but each block has a <code>/Length</code> entry - so you'd also need to adjust that to account for what you've changed - otherwise the layout goes all screwy.</p>
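<p>To make that bookkeeping concrete - this is my own hypothetical miniature, not taken from a real PDF - the <code>/Length</code> value in the stream dictionary must equal the stream's byte count, so any hand-edit that removes the watermark line has to recompute it:</p>

```python
# Hypothetical miniature content stream, mimicking the decompressed
# PDF above. Deleting the watermark line changes the byte count.
watermark = b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC q /Fm0 Do Q EMC\n"
stream = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC\n"
    + watermark
)
edited = stream.replace(watermark, b"")

# The dictionary's /Length entry must now be updated from
# len(stream) to len(edited), or viewers will misparse the file.
print(len(stream), len(edited))
```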

<p>That led me to <a href="https://github.com/pymupdf/PyMuPDF/discussions/1855">PyMuPDF which claimed to solve the problem</a>. But running that code only removed <em>some</em> of the watermarks. It got stuck in an infinite loop on certain pages.</p>

<p>So, now that I had more detailed knowledge, I managed to get an LLM to construct something which <em>mostly</em> seems to work.</p>

<p>Does it work with every PDF? I don't know. Does it contain subtle implementation bugs? Probably. Is there an easier way to do this? Not that I can find.</p>

<pre><code class="language-python">import re
import pymupdf

# Open the PDF
doc = pymupdf.open("output.pdf")

# Regex of the watermarks
pattern = re.compile(
    rb"/Artifact\s*&lt;&lt;[^&gt;]*?/Subtype\s*/Watermark[^&gt;]*?&gt;&gt;BDC.*?EMC",
    re.DOTALL
)

# Loop through the PDF's pages
for page_num, page in enumerate(doc, start=1):
    print(f"Processing page {page_num}")
    xrefs = page.get_contents()
    for xref in xrefs:
        cont = doc.xref_stream(xref)
        new_cont, n = pattern.subn(b"", cont)
        if n &gt; 0:
            print(f"  Removed {n} watermark block(s)")
            doc.update_stream(xref, new_cont)

doc.save("no-watermarks.pdf")
</code></pre>
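<p>As a quick sanity check - my addition, not part of the original script - the same regex can be exercised against a miniature stream modelled on the decompressed output above:</p>

```python
import re

# Same pattern as the script above: an /Artifact block marked
# /Subtype /Watermark, from its BDC up to the closing EMC.
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>BDC.*?EMC",
    re.DOTALL
)

# Hypothetical three-block page: two figures and one watermark.
sample = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC q /Fm0 Do Q EMC\n"
    b"/Figure <</MCID 1 >>BDC q /Im1 Do Q EMC\n"
)

cleaned, n = pattern.subn(b"", sample)
print(n)  # one watermark block removed; both figures survive
```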

<p>One of the (many) problems with Vibe Coding is that trying to get an LLM to spit out something useful depends <em>massively</em> on how well you know the subject area. I'm proud to say I know vanishingly little about the <a href="https://shkspr.mobi/blog/2015/11/a-polite-way-to-say-ridiculously-complicated/">baroque</a> PDF specification - which meant that most of my attempts to use various "AI" tools consisted of me saying "No, that doesn't work" and the accurs'd machine saying back "Golly-gee! You're right! Let me fix that!" and then breaking something else.</p>

<p>I'm not sure this is the future we wanted, but it looks like the future we've got.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=63035&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2026/01/removing-subtype-watermark-images-from-a-pdf-using-linux/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[LLMs are still surprisingly bad at some simple tasks]]></title>
		<link>https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/</link>
					<comments>https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Sun, 21 Sep 2025 11:34:38 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=63410</guid>

					<description><![CDATA[I asked three different commercially available LLMs the same question:  Which TLDs have the same name as valid HTML5 elements?  This is a pretty simple question to answer. Take two lists and compare them.  I know this question is possible to answer because I went through the lists two years ago. Answering the question was a little tedious and subject to my tired human eyes making no mistakes. So…]]></description>
										<content:encoded><![CDATA[<p>I asked three different commercially available LLMs the same question:</p>

<blockquote><p>Which TLDs have the same name as valid HTML5 elements?</p></blockquote>

<p>This is a pretty simple question to answer. Take two lists and compare them.  I know this question is possible to answer <a href="https://shkspr.mobi/blog/2023/09/false-friends-html-elements-which-are-also-top-level-domains/">because I went through the lists two years ago</a>. Answering the question was a <em>little</em> tedious and subject to my tired human eyes making no mistakes. So surely this is the sort of thing which an LLM excels at, right?</p>

<p>Wrong!</p>

<p>Here's how the three big beasts fared.</p>

<h2 id="chatgpt"><a href="https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/#chatgpt">ChatGPT</a></h2>

<p><a href="https://chatgpt.com/s/t_68cba42f89cc8191856689675b86e165">OpenAI's LLM does a poor job</a>:</p>

<blockquote><p>Good question! Let’s carefully check this.
</p><p>[…Skip preamble…]
</p><p>✅ Confirmed Matches
</p><p>The following <strong>TLDs</strong> exactly match <strong>valid HTML5 element names</strong>:
</p><ul>
<li>.audio → &lt;audio&gt;
</li><li>.code → &lt;code&gt;
</li><li>.link → &lt;link&gt;
</li><li>.menu → &lt;menu&gt;
</li><li>.nav → &lt;nav&gt;
</li><li>.video → &lt;video&gt;
</li></ul>
</blockquote>

<p>This is wrong on two counts. First, it misses off several matches. Second, <em>there is no <code>.code</code> TLD!</em> There is a <code>.codes</code>, but that's not a match.</p>

<h2 id="google-gemini"><a href="https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/#google-gemini">Google Gemini</a></h2>

<p><a href="https://g.co/gemini/share/06380dbba3e8">Gemini shits the bed completely</a>:</p>

<blockquote>
Here are the top-level domains (TLDs) that have the same name as valid HTML5 elements:
<ul>
<li>.address: The HTML &lt;address&gt; element is used to define the contact information for the author/owner of a document or an article.
</li><li>.article: The HTML &lt;article&gt; element specifies independent, self-contained content.
</li><li>.aside: The HTML &lt;aside&gt; element is used to define some content aside from the content it is placed in.
</li><li>.audio: The HTML &lt;audio&gt; element is used to embed sound content in a document.
</li><li>.body: The HTML &lt;body&gt; element defines the document's main content.
</li><li>[…Skip a dozen more items…]
</li></ul>
</blockquote>

<p>It just lists a bunch of HTML elements with no crossover to TLDs. Thanks Google. Thoogle.</p>

<h2 id="claude"><a href="https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/#claude">Claude</a></h2>

<p><a href="https://claude.ai/share/39177222-3eb0-4310-942b-1e352c1f2b8e">Anthropic's AI does a half-arsed job</a>.</p>

<blockquote>
Here are the TLDs that match HTML5 element names:
<ul>
<li>.audio - matches the &lt;audio&gt; element
</li><li>.video - matches the &lt;video&gt; element
</li><li>.data - matches the &lt;data&gt; element
</li><li>.link - matches the &lt;link&gt; element
</li><li>.menu - matches the &lt;menu&gt; element
</li><li>.style - matches the &lt;style&gt; element
</li><li>.select - matches the &lt;select&gt; element
</li></ul>
Some other potential matches depend on how broadly you interpret "HTML5 elements":
<ul>
<li>.app - while there's no &lt;app&gt; element in HTML5, there was discussion of it during HTML5 development
</li><li>.art - not a standard HTML element, though some frameworks use custom elements like this
</li></ul>
</blockquote>

<p>The first seven <em>are</em> correct - although it ignores a dozen more correct answers. But then it does the LLM thing of trying to over-fit the data in a simulacrum of usefulness.</p>

<h2 id="dont-believe-the-hype"><a href="https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/#dont-believe-the-hype">Don't believe the hype</a></h2>

<p>To be clear, I would expect a moderately intelligent teenager to be able to find two lists and compare them. If an intern gave me the same attention to detail as above, we'd be having a cosy little chat about their attitude to work.</p>

<p>I've discussed before why <a href="https://shkspr.mobi/blog/2025/06/why-do-people-have-such-dramatically-different-experiences-using-ai/">people have such dramatically different experiences using AI</a>. For some it is an all-knowing and ineffable source of wisdom. For others, like me, it is mostly garbage.</p>

<p>I think it comes down to how familiar you are with the domain and its constraints. When I watch a medical drama, I have no idea if they're using <a href="https://www.youtube.com/watch?v=C_AmdvxbPT8">realistic language</a>. It <em>sounds</em> good, but real doctors probably cringe at the inaccuracies.</p>

<p>AI seems <em>plausible</em> because it relies on the <a href="https://shkspr.mobi/blog/2023/02/how-much-of-ais-recent-success-is-due-to-the-forer-effect/">Barnum Effect</a> - it tells people what they <em>want</em> to hear. It produces words which <em>sound</em> correct.</p>

<p>We need a new term for <a href="https://shkspr.mobi/blog/2024/05/gell-mann-amnesia-and-purdah/">Gell-Mann Amnesia</a>. Something that describes how an AI is convincing if you don't understand its reasoning, and close to useless if you understand its limitations.</p>
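<p>For contrast, the deterministic version of the task is a couple of lines of code. Here's a sketch with abbreviated, hard-coded lists - the real job uses the full IANA TLD registry and the full HTML element index:</p>

```python
# Abbreviated stand-ins: the real inputs are the IANA TLD list
# and the HTML Living Standard's element index.
html_elements = {"a", "audio", "body", "code", "data", "link",
                 "menu", "nav", "select", "style", "video"}
tlds = {"audio", "codes", "data", "link", "menu", "nav",
        "select", "style", "video", "zip"}

# The whole "hard" problem is a set intersection.
# Note that .codes is not a match for <code> - exactly the
# sort of near-miss the LLMs tripped over.
matches = sorted(html_elements & tlds)
print(matches)
# → ['audio', 'data', 'link', 'menu', 'nav', 'select', 'style', 'video']
```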
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=63410&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/feed/</wfw:commentRss>
			<slash:comments>19</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Winners don't use ChatGPT]]></title>
		<link>https://shkspr.mobi/blog/2025/07/winners-dont-use-chatgpt/</link>
					<comments>https://shkspr.mobi/blog/2025/07/winners-dont-use-chatgpt/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 30 Jul 2025 11:34:56 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Writing Prompts]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=62288</guid>

					<description><![CDATA[If you hung around video arcades in your youth, you would have seen this message burned into the phosphor of a thousand dying CRTs.    Obviously this was a devilish psyop by those gits who wanted kids to stop sniffing glue and having fun. The bastards!  But there&#039;s a more serious side to the corny message. Are you a winner if you&#039;ve cheated?  Lance Armstrong &#34;won&#34; multiple Tours de France. It…]]></description>
										<content:encoded><![CDATA[<p>If you hung around video arcades in your youth, you would have seen this message burned into the phosphor of a thousand dying CRTs.</p>

<p><a href="https://www.flickr.com/photos/abinka/8880252938/"><img src="https://shkspr.mobi/blog/wp-content/uploads/2025/07/8880252938_2462609e08_o.jpg" alt="Arcade machine saying &quot;Winners Don't Use Drugs&quot; signed by some FBI dude. Photo CC BY-NC Megan Rosenbloom." width="512" height="384" class="aligncenter size-full wp-image-62289"></a></p>

<p>Obviously this was a devilish psyop by those gits who wanted kids to stop sniffing glue and having fun. The bastards!</p>

<p>But there's a more serious side to the corny message. Are you a winner if you've cheated?</p>

<p>Lance Armstrong "won" multiple <i lang="fr">Tours de France</i>. It wasn't him that won though, it was <a href="https://www.nytimes.com/2012/10/23/sports/cycling/armstrong-stripped-of-his-7-tour-de-france-titles.html?_r=0">a cocktail of drugs and performance enhancers</a>.</p>

<p>Milli Vanilli got to the top of the music charts <a href="https://en.wikipedia.org/wiki/Milli_Vanilli">by having someone else sing their vocals</a>.</p>

<p>You can watch the documentary <a href="https://www.imdb.com/title/tt0923752/">King of Kong</a> and decide for yourself if <a href="https://arstechnica.com/gaming/2023/02/did-billy-mitchell-use-this-illicit-joystick-to-set-a-donkey-kong-high-score/">allegations of cheating to get high-scores in Donkey Kong</a> are well founded.</p>

<p>The last one is interesting to me. I'm sure everyone here has turned to a gaming walkthrough in a moment of frustration. Ah! So <em>that's</em> where the key was hidden! Onwards!</p>

<p>But once you've popped the walkthrough cherry, how tempting is it to go back for just one more hit? Only a quick glance… And then, before you know it, you're no longer playing a game; you're watching a movie. You don't achieve anything by following a walkthrough, do you? You're little more than a monkey pressing the buttons in the order they flash.</p>

<p>That's fine if all you want to do is see the ending; but you can't really claim it as an achievement.</p>

<p>The same is true with cheats. <kbd>↑</kbd><kbd>↑</kbd><kbd>↓</kbd><kbd>↓</kbd><kbd>←</kbd><kbd>→</kbd><kbd>←</kbd><kbd>→</kbd><kbd>B</kbd><kbd>A</kbd> and now you have infinite lives and over-powered weapons. So what? You can glide through the game. You won't get any better at playing it. You won't learn. You'll just drift. You won't have any <a href="https://gizmodo.com/congratulations-to-ea-games-for-posting-the-most-hated-1820391000">pride and accomplishment</a> in what you've done.</p>

<p>Video games are better than real life. As the meme says:</p>

<p><a href="https://www.tumblr.com/venus-is-in-bloom/152286550432/i-enjoy-video-games-because-they-let-me-live-out"><img src="https://shkspr.mobi/blog/wp-content/uploads/2025/07/venus.webp" alt="Screenshot from tumblr. venus-is-in-bloom says &quot;i enjoy video games because they let me live out my wildest fantasies, like being assigned a task and then completing that task&quot;." width="1080" height="640" class="aligncenter size-full wp-image-62298"></a></p>

<p>If you had a cheat button for real life, would you press it?</p>

<p>Click here to suddenly be number one on YouTube.</p>

<p>Swipe here to get into the University of your dreams.</p>

<p>Prompt engineer your way to launching a killer app without knowing how to code.</p>

<p>No work necessary to accomplish anything. Click the cheat button and off you go! Amazing! You'd press it repeatedly, right?</p>

<p><a href="https://bsky.app/profile/nameshiv.bsky.social/post/3lv4ejbhgkc2z">Shiv Ramdas</a> recently posted this provocation about the <a href="https://www.plagiarismtoday.com/2025/04/07/james-somerton-re-emerges-plagiarizes-again-disappears/">plagiarist James Somerton</a>:</p>

<blockquote><p>There's an incredible essay that will break the internet waiting to be written about the throughline between walkthrough dependence, James Somerton and ChatGPT's effect on the human psyche</p></blockquote>

<p>This isn't that essay. I'm too lazy to write something amazing, and too aware of the limitations of outsourcing my thinking.</p>

<p>But I see the pattern in myself.</p>

<ul>
<li>I <em>could</em> learn that code's syntax, or I could press the cheat button.</li>
<li>I <em>could</em> plan a trip, or I could press the cheat button.</li>
<li>I <em>could</em> text my wife that I love her, or I could press the cheat button.</li>
</ul>

<p>I'm sure you're going to write an impassioned comment about why sometimes pressing the cheat button is probably fine. The world is complex and sometimes you need a bit of a helping hand.</p>

<p>But hold onto that hand too long and it will hold you back.</p>

<hr>

<h2 id="counterpoint"><a href="https://shkspr.mobi/blog/2025/07/winners-dont-use-chatgpt/#counterpoint">Counterpoint</a></h2>

<p>Do I <em>really</em> believe that?</p>

<p>I'm never going to spend a few years learning French - so I'm quite content to chuck an AI a bit of translation work.</p>

<p>Some people like to spend a morning baking bread. Others like using a bread machine. Is that cheating? Who cares. It is <em>your</em> choice.</p>

<p>Why should I care <em>how</em> you're living your life? It doesn't bother me if you look up spoilers for films, hide a motor in your exercise bike, or always choose easy-mode on your games.</p>

<p>Perhaps you have no imagination and would rather an AI made up a bedtime story for your child. I suspect you're acting like <a href="https://www.psychologicalscience.org/publications/observer/obsonline/harlows-classic-studies-revealed-the-importance-of-maternal-contact.html">the infamous Wire Mother</a>. A simulation of love with few of the benefits.</p>

<p>I can't imagine using an AI to woo my wife. But then, I've seen Cyrano de Bergerac.</p>

<p>Each of us has to draw a line somewhere. We each find a certain level of cheating acceptable when we do it - and despicable when others get away with it.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=62288&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/07/winners-dont-use-chatgpt/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Grinding down open source maintainers with AI]]></title>
		<link>https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/</link>
					<comments>https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Mon, 07 Jul 2025 11:34:09 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[spam]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=61265</guid>

					<description><![CDATA[Early one morning I received an email notification about a bug report to one of my open source projects. I like to be helpful and I want people who use my stuff to have a good time, so I gave it my attention.  Here&#039;s what it said:   😱 I Can&#039;t Use On This Day 😭 Seriously, What’s Going On?! 🔍 I’ve been trying to use the On This Day feature, but it’s just not working for me! 😩 Every time I input my d…]]></description>
										<content:encoded><![CDATA[<p>Early one morning I received an email notification about a bug report to one of my open source projects. I like to be helpful and I want people who use my stuff to have a good time, so I gave it my attention.  Here's what it said:</p>

<blockquote>
<h2 id="%f0%9f%98%b1-i-cant-use-on-this-day-%f0%9f%98%ad"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#%f0%9f%98%b1-i-cant-use-on-this-day-%f0%9f%98%ad">😱 I Can't Use On This Day 😭</a></h2>
Seriously, What’s Going On?! 🔍<br>
I’ve been trying to use the On This Day feature, but it’s just not working for me! 😩<br>
Every time I input my details, it says I have no posts for today, even though I know I’ve posted stuff! 🧐

<h3 id="heres-my-setup-%e2%9a%99%ef%b8%8f"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#heres-my-setup-%e2%9a%99%ef%b8%8f">Here’s My Setup: ⚙️</a></h3>

<ul>
        <li>Python 3.x 🐍</li>
    <li>Access token fully generated (I triple-checked!) 🔑</li>
        <li>Attempted on multiple instances but still nothing! 😩😩</li>
</ul>

<h3 id="could-it-be-a-bug-%f0%9f%a4%94"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#could-it-be-a-bug-%f0%9f%a4%94">Could It Be a Bug? 🤔</a></h3>

I’m really starting to doubt my posting history! 😳<br>
Is it supposed to show only specific types of posts?<br>
I’ve made some pretty epic posts before! 💥💬<br>

<h3 id="documentation-confusion-%f0%9f%93%9a"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#documentation-confusion-%f0%9f%93%9a">Documentation Confusion 📚</a></h3>

The README says to register for an access token but doesn’t clarify if it factors into this feature! 🤔❓<br>
Did I miss something REALLY important?!<br>
Help me figure this out, please!!! 😱<br>

<h3 id="feature-suggestion-%f0%9f%92%ad"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#feature-suggestion-%f0%9f%92%ad">Feature Suggestion 💭</a></h3>

If this is broken, can we at least have a debug mode to log what’s happening! 😬<br>
I need to know if it’s truly my fault or the code’s! 🔍🛠<br>
Thanks for looking into this TRAGIC situation!!! 😭💔<br>
<br>
P.S. My friends ARE posting on this day and their instances work!! 😤<br>
I feel so left out!! 😟<br>
<p>Let’s get this sorted ASAP! ⚡
</p></blockquote>

<p>OK, that's a <em>lot</em> of Emoji - too much even for me! But if one of my users needs help, I'm there for them!  As the feature works for me, I decided I'd ask for the output of the app. Maybe there'd be a clue in the minimal debugging output it had.</p>

<p>I clicked on the link to the Codeberg repository and was hit by a 404! What? I clicked on the link to the user "simpleseaport2" but that was also broken.</p>

<p>"Seriously, What’s Going On?! 🔍"</p>

<p>It looks like Codeberg has been hit by a wave of spam bug reports.  I read through the bug report again, slightly more awake, and saw just how content-free it was. Yes, it is superficially well structured, the Emoji are a bit over-the-top but not the worst I've seen, and the emotional manipulation is quite insidious.</p>

<p>A few weeks later, I got a bug report to a different repo. This one was also deleted before I could reply to it. See if you can spot that it is AI-generated:</p>

<blockquote><p>I've been trying to use the Threads tool to visualize some conversations but I'm running into a serious problem, and it's really frustrating!
</p><p>When I input the URL for a post with a substantial number of replies, the script seems to hang indefinitely. I've waited more than 15 minutes on a couple of occasions, and nothing seems to happen. This is not what I expected, especially since the README mentions large conversations may take a long time, but doesn’t specify any limits or give guidance on what users should do if it doesn’t respond at all!
</p><p>It's unclear what's actually happening here. Is the script failing silently? Is it the API timing out? Why isn’t there any sort of progress notification built into the tool? It feels like a complete dead end.
</p><p>Can you please add some kind of error handling or logging feature to the Threads script? It would be helpful if it could at least inform the user when a timeout occurs or if the API response is simply taking too long. Additionally, could you clarify the maximum number of replies that can be handled? It’s really inconvenient to have no idea if the script is still processing or if it’s just broken.
</p><p>Thanks for addressing this. I hope to see improvements soon.</p></blockquote>

<ul>
<li>The emotional manipulation starts in the first line - telling me how frustrated the user is.</li>
<li>It turns the blame on me for providing poor guidance.</li>
<li>Then the criticism of the tool.</li>
<li>Next, a request that I do work.</li>
<li>Finally some more emotional baggage for me to carry.</li>
</ul>

<p>I'm not alone in getting these - <a href="https://merveilles.town/@raboof/114589918314200123">other people have also received similar spam</a>.</p>

<p>To be fair to Codeberg, they are under attack and are trying to stop these specious complaints reaching maintainers.</p>

<blockquote class="mastodon-embed" data-embed-url="https://social.anoxinon.de/@Codeberg/114592518436361178/embed" style="background: #FCF8FF; border-radius: 8px; border: 1px solid #C9C4DA; margin: 0; max-width: 540px; min-width: 270px; overflow: hidden; padding: 0;"> <a href="https://social.anoxinon.de/@Codeberg/114592518436361178" target="_blank" style="align-items: center; color: #1C1A25; display: flex; flex-direction: column; font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Oxygen, Ubuntu, Cantarell, 'Fira Sans', 'Droid Sans', 'Helvetica Neue', Roboto, sans-serif; font-size: 14px; justify-content: center; letter-spacing: 0.25px; line-height: 20px; padding: 24px; text-decoration: none;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="32" height="32" viewBox="0 0 79 75"><path d="M74.7135 16.6043C73.6199 8.54587 66.5351 2.19527 58.1366 0.964691C56.7196 0.756754 51.351 0 38.9148 0H38.822C26.3824 0 23.7135 0.756754 22.2966 0.964691C14.1319 2.16118 6.67571 7.86752 4.86669 16.0214C3.99657 20.0369 3.90371 24.4888 4.06535 28.5726C4.29578 34.4289 4.34049 40.275 4.877 46.1075C5.24791 49.9817 5.89495 53.8251 6.81328 57.6088C8.53288 64.5968 15.4938 70.4122 22.3138 72.7848C29.6155 75.259 37.468 75.6697 44.9919 73.971C45.8196 73.7801 46.6381 73.5586 47.4475 73.3063C49.2737 72.7302 51.4164 72.086 52.9915 70.9542C53.0131 70.9384 53.0308 70.9178 53.0433 70.8942C53.0558 70.8706 53.0628 70.8445 53.0637 70.8179V65.1661C53.0634 65.1412 53.0574 65.1167 53.0462 65.0944C53.035 65.0721 53.0189 65.0525 52.9992 65.0371C52.9794 65.0218 52.9564 65.011 52.9318 65.0056C52.9073 65.0002 52.8819 65.0003 52.8574 65.0059C48.0369 66.1472 43.0971 66.7193 38.141 66.7103C29.6118 66.7103 27.3178 62.6981 26.6609 61.0278C26.1329 59.5842 25.7976 58.0784 25.6636 56.5486C25.6622 56.5229 25.667 56.4973 25.6775 56.4738C25.688 56.4502 25.7039 56.4295 25.724 56.4132C25.7441 56.397 25.7678 56.3856 25.7931 56.3801C25.8185 56.3746 25.8448 56.3751 
25.8699 56.3816C30.6101 57.5151 35.4693 58.0873 40.3455 58.086C41.5183 58.086 42.6876 58.086 43.8604 58.0553C48.7647 57.919 53.9339 57.6701 58.7591 56.7361C58.8794 56.7123 58.9998 56.6918 59.103 56.6611C66.7139 55.2124 73.9569 50.665 74.6929 39.1501C74.7204 38.6967 74.7892 34.4016 74.7892 33.9312C74.7926 32.3325 75.3085 22.5901 74.7135 16.6043ZM62.9996 45.3371H54.9966V25.9069C54.9966 21.8163 53.277 19.7302 49.7793 19.7302C45.9343 19.7302 44.0083 22.1981 44.0083 27.0727V37.7082H36.0534V27.0727C36.0534 22.1981 34.124 19.7302 30.279 19.7302C26.8019 19.7302 25.0651 21.8163 25.0617 25.9069V45.3371H17.0656V25.3172C17.0656 21.2266 18.1191 17.9769 20.2262 15.568C22.3998 13.1648 25.2509 11.9308 28.7898 11.9308C32.8859 11.9308 35.9812 13.492 38.0447 16.6111L40.036 19.9245L42.0308 16.6111C44.0943 13.492 47.1896 11.9308 51.2788 11.9308C54.8143 11.9308 57.6654 13.1648 59.8459 15.568C61.9529 17.9746 63.0065 21.2243 63.0065 25.3172L62.9996 45.3371Z" fill="currentColor"></path></svg> <div style="color: #787588; margin-top: 16px;">Post by @Codeberg@social.anoxinon.de</div> <div style="font-weight: 500;">View on Mastodon</div> </a> </blockquote>

<script data-allowed-prefixes="https://social.anoxinon.de/" async="" src="https://social.anoxinon.de/embed.js"></script>

<p>But, still, search the socials and you'll find a stream of frustrated developers.</p>

<blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:i7qrqashiejmsuo4hbvovw6l/app.bsky.feed.post/3lnkpto5rpdd2" data-bluesky-cid="bafyreigm2vqlybmy5dhawigjqzwhz2p2w3o4gatpykxkurwrzzmtbfif6a" data-bluesky-embed-color-mode="system"><p lang="en">Woke this morning to my first ever AI generated spam issue on a repo. Got it via email. When I went to check it out at Codeberg, it had already been moderated. Wonder how many others were affected.

I immediately knew it was AI spam due to the overuse of emojis…🎉<br><br><a href="https://bsky.app/profile/did:plc:i7qrqashiejmsuo4hbvovw6l/post/3lnkpto5rpdd2?ref_src=embed">[image or embed]</a></p>— Jeff Sikes (<a href="https://bsky.app/profile/did:plc:i7qrqashiejmsuo4hbvovw6l?ref_src=embed">@bsky.box464.social</a>) <a href="https://bsky.app/profile/did:plc:i7qrqashiejmsuo4hbvovw6l/post/3lnkpto5rpdd2?ref_src=embed">24 April 2025 at 15:07</a></blockquote>

<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>

<h2 id="whats-going-on%e2%81%89%ef%b8%8f"><a href="https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/#whats-going-on%e2%81%89%ef%b8%8f">What's Going On⁉️</a></h2>

<p>I can only think of a few possibilities - none of them particularly positive.</p>

<ul>
<li>Attacking the viability of Codeberg - make users abandon it for a different platform.</li>
<li>Attacking the attention of developers - make them unwilling to give attention where it is actually needed.</li>
<li>Attacking the integrity of users - make them less likely to receive help because they are mistaken for AI.</li>
<li>Maybe it is just a bored kid or an unethical researcher trying to find the limits of what a maintainer will recognise as spam.</li>
</ul>

<p>Either way, AI bug reports like this are about as welcome as a haemorrhage in a jacuzzi.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=61265&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/07/grinding-down-open-source-maintainers-with-ai/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Why do people have such dramatically different experiences using AI?]]></title>
		<link>https://shkspr.mobi/blog/2025/06/why-do-people-have-such-dramatically-different-experiences-using-ai/</link>
					<comments>https://shkspr.mobi/blog/2025/06/why-do-people-have-such-dramatically-different-experiences-using-ai/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 18 Jun 2025 11:34:09 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=61410</guid>

					<description><![CDATA[For some people, it seems, AI is an amazing machine which - while fallible - represents an incredible leap forward in productivity.  For other people, it seems, AI is wrong more often than right and - although occasionally useful - requires constant supervision.  Who is right?  I recently pointed out a few common problems with LLMs. I was discussing this with someone relatively senior who works…]]></description>
										<content:encoded><![CDATA[<p>For some people, it seems, AI is an amazing machine which - while fallible - represents an incredible leap forward in productivity.</p>

<p>For other people, it seems, AI is wrong more often than right and - although occasionally useful - requires constant supervision.</p>

<p>Who is right?</p>

<p>I recently pointed out <a href="https://shkspr.mobi/blog/2025/06/large-language-models-and-pareidolia/">a few common problems with LLMs</a>. I was discussing this with someone relatively senior who works on Google's Gemini. I explained that <em>every</em> time I get a Google AI overview it is wrong. Sometimes obviously wrong, sometimes subtly wrong.  I asked if that was really the experience of AI Google wanted to promote?  My friend replied (lightly edited for clarity):</p>

<blockquote><p>I find AI Overview to be helpful for my searches and my work. I use it all the time to look up technical terms and hardware specs.</p></blockquote>

<p>I, somewhat impolitely, called bullshit and sent a couple of screenshots of recent cases where Google was just laughably wrong.  He replied:</p>

<blockquote><p>Interesting. We are seeing the opposite.</p></blockquote>

<p>Why is that?</p>

<p>I'll happily concede that LLMs are reasonable at outputting stuff which <em>looks</em> plausible and - in many cases - that's all that's necessary. If I can't remember which command line switch to use, AI is easier than crappy documentation.  Similarly, if I don't know how to program a specific function, most AIs are surprisingly decent at providing me with something which <em>mostly</em> works.</p>

<p>But the more I know about something, the less competent the AI seems to be.</p>

<p>Let me give you a good example.</p>

<p>At my friend's prompting, I asked Gemini to OCR an old newspaper clipping.  It is a decent resolution scan of English text printed in columns. The sort of thing a million AI projects have been trained on.  Here's a sample:</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2025/06/Children.webp" alt="Scan of some text." width="437" height="174" class="size-full wp-image-61413">

<p>So what did Gemini make of it when asked to extract the text from it?</p>

<blockquote>Children at Witham's Chip-<br>
ping Hill Infants School are en-<br>
gaged in trying out all sorts of<br>
imaginations ready for October<br>
31... "And god knows what<br>
strange spirits will be abroad."<br></blockquote>

<p>That reads pretty well. It is <em>utterly</em> wrong, but it is convincing. This isn't a one-off either. Later in the clipping was this:</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2025/06/Witch.webp" alt="Scan of some text." width="425" height="229" class="size-full wp-image-61412">

<p>I'm sure a child of 6 could read that aloud without making any mistakes. Is Gemini as smart as a 6-year-old?</p>

<blockquote>All the children say halloween<br>
is fun. So it is for 6-year-old<br>
Joanne Kirby admits she will be<br>
staying up to watch on October<br>
31, just in case. She has made a<br>
paper "witch," to "tell stories<br>
about witches," she said.<br></blockquote>

<p>Again, superficially right, but not accurate in the slightest.</p>

<p>There were half a dozen mistakes in a 300 word article. That, frankly, is shit. I could have copy-typed it and made fewer mistakes. I probably spent more time correcting the output than I saved by using AI.</p>

<p>Boring old Tesseract - a mainstay of OCR - did far better. Yes, it might occasionally mistake a speck of dust for a comma or confuse two similar characters - but it has never invented new sentences!</p>

<p>Like a fool, I asked Gemini what was going on:</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2025/06/GeminiShit.webp" alt="Me: That's a really bad job. You've invented lots of words which aren't there. Try again. Gemini: I understand you weren't satisfied with the previous transcription. Unfortunately, I can't directly perform OCR on images. However, there are many apps available that can do this. You can search online for 'OCR apps' to find one that suits your needs." width="1024" height="375" class="aligncenter size-full wp-image-61411">

<p><a href="https://g.co/gemini/share/af5e6ed3d1ea">Here's a link to the conversation if you don't believe me</a>.</p>

<p>This isn't just a problem with Gemini - ChatGPT also invented brand-new sentences when scanning the text.</p>

<blockquote><p>All the children say Halloween is fun, rather than frightening. Six-year-old Joanne Kirby admits she will be “a scary little witch” on the night, but she does like ghost stories.</p></blockquote>

<p>So what's going on?</p>

<blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:a3m54c4suf2odoycgsx4fv6v/app.bsky.feed.post/3lrstc3b2l227" data-bluesky-cid="bafyreidpyzj4sdakngcubvh4hzusyw5ub54zqhr6wc3evio4eghvlnvw5q"><p lang="en">A question one has to ask of any source, including LLMs but also newspapers, influencers, podcasts, books, etc., is "how would I know if they were wrong?"This is not a prompt to doubt everything –&nbsp;down that path is denialism –&nbsp;but about reflecting on how much you rely on even "trusted" sources.</p>— <a href="https://bsky.app/profile/did:plc:a3m54c4suf2odoycgsx4fv6v?ref_src=embed">Adrian Hon (@adrianhon.bsky.social)</a> <a href="https://bsky.app/profile/did:plc:a3m54c4suf2odoycgsx4fv6v/post/3lrstc3b2l227?ref_src=embed">2025-06-17T15:39:06.772Z</a></blockquote>

<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>

<p>With OCR, it is simple. I can read the ground-truth and see how it compares to the generated output. I don't have to trust; I can verify.</p>
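<p>That verification is easy to automate. Here's a minimal sketch using Python's standard <code>difflib</code> - the two strings are made-up stand-ins for the real clipping, but the technique is the same: every word the OCR invented or dropped gets flagged.</p>

```python
import difflib

# Hypothetical strings standing in for the hand-typed ground truth
# and the LLM's "OCR" output of the same clipping.
truth = "All the children say Halloween is fun".split()
ocr = "All the children say halloween is fun rather than frightening".split()

# ndiff marks words only in the ground truth with "- " and words
# only in the OCR output with "+ " - so every invention shows up.
changes = [line for line in difflib.ndiff(truth, ocr)
           if line.startswith(("- ", "+ "))]
print("\n".join(changes))
```

Run against a genuinely hallucinated transcription, the <code>+</code> lines pile up fast.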

<p>I suppose I mostly use AI for things with which I have a passing familiarity.  I can quickly see when it is wrong.  I've never used it for, say, tax advice or instructions to dismantle a nuclear bomb. I'd have zero idea if the information it spat back was in any way accurate.</p>

<p>Is that the difference? If you don't understand what you're asking for then you can't judge whether you're being mugged off.</p>

<p>Or is there something more fundamentally different between users which results in this disparity of experience?</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2020/06/A-t-shirt-which-says-Dunning-and-Kruger-and-Gell-and-Mann.jpeg" alt="A t-shirt which says Dunning and Kruger and Gell and Mann." width="540" class="aligncenter size-full wp-image-35422">
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=61410&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/06/why-do-people-have-such-dramatically-different-experiences-using-ai/feed/</wfw:commentRss>
			<slash:comments>38</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Large Language Models and Pareidolia]]></title>
		<link>https://shkspr.mobi/blog/2025/06/large-language-models-and-pareidolia/</link>
					<comments>https://shkspr.mobi/blog/2025/06/large-language-models-and-pareidolia/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 11 Jun 2025 11:34:29 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=61327</guid>

					<description><![CDATA[Have you ever looked up at the sky and seen a face staring back at you from the clouds? Of course you have; you&#039;re human. Our delicious meaty brains are hardwired to recognise certain shapes - and faces are a useful shape to recognise. A few false positives are a worthwhile trade-off for such a powerful feature.  Mistakenly seeing faces where there are none is a phenomenon called pareidolia. If…]]></description>
										<content:encoded><![CDATA[<p>Have you ever looked up at the sky and seen a face staring back at you from the clouds? Of course you have; you're human. Our delicious meaty brains are hardwired to recognise certain shapes - and <a href="https://babyschool.yale.edu/does-my-baby-recognize-me/">faces are a useful shape to recognise</a>. A few false positives are a worthwhile trade-off for such a powerful feature.</p>

<p>Mistakenly seeing faces where there are none is a phenomenon called <a href="https://en.wikipedia.org/wiki/Pareidolia">pareidolia</a>. If you've ever used facial recognition on a computer, you'll know that <a href="https://shkspr.mobi/blog/2014/06/pareidolia-and-computer-vision/">machines also suffer from it</a>.</p>

<p>I was using an AI tool to scan all my photos. I wanted it to recognise all the human faces so that I could tag my photos with my friends' names.  One of the photos it presented for tagging was this:</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2025/06/lovelace-and-babbage.webp" alt="A photograph containing a painting of Ada Lovelace and a bust of Charles Babbage." width="512" class="aligncenter size-full wp-image-61329">

<p>Are those faces? Undoubtedly yes! Is this a mistake that a human would have made? Absolutely not!</p>

<p>But the above is a mistake generated by Machine Learning, not by Our-Lord-And-Saviour Large Language Models. Surely a language model doesn't suffer from this?</p>

<p>Because <a href="https://shkspr.mobi/blog/2025/01/google-has-no-faith-in-its-ability-to-launch-new-products/">Google has no faith in its ability to launch new products</a>, it has forcibly shoved AI into all of its services. There's no way to turn it off. You <em>will</em> use Gemini and you <em>will</em> like it.</p>

<p>At the time of writing, here's what happens if you ask Google "How many i's in teamwork?"</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2025/06/teamwork.webp" alt="Google replying &quot;The word teamwork contains the letter i one time&quot;." width="1008" height="737" class="aligncenter size-full wp-image-61330">

<p>It's easy to see how Google's LLM has gotten this so catastrophically wrong.  There are dozens of articles where some business guru ineffectually tries to argue that <a href="https://www.cam.ac.uk/research/features/fact-there-is-an-i-in-team">there <em>is</em> an "I" in team actually</a>. So the statistical model inside the LLM gives weight to that.</p>
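<p>For comparison, this is a question any programming language answers exactly, rather than statistically. A one-line Python check:</p>

```python
# Letter counting is the kind of question an LLM answers by
# statistical association; str.count answers it by actually counting.
word = "teamwork"
print(word.count("i"))  # → 0: there is no "i" in teamwork
```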

<p>Similarly, there are lots of silly articles proclaiming that <a href="https://careerpunk.com/team-player/">the I in team is in the A-hole</a>. But LLMs do not understand satire:</p>

<blockquote class="bluesky-embed" data-bluesky-uri="at://did:plc:k7xouwluizlonrdxeo5obppi/app.bsky.feed.post/3lr7odyhz7c2d" data-bluesky-cid="bafyreiaproxts5p2uzp4fhycdu3qcotcvqkz4l6ynbmlmt7ls4tbf47ap4"><p lang="en">Google and Meta search both report that Cape Breton Island has its own time zone 12 minutes ahead of mainland Nova Scotia time because they are both drawing that information from a Beaverton article I wrote in 2024</p>— <a href="https://bsky.app/profile/did:plc:k7xouwluizlonrdxeo5obppi?ref_src=embed">Janel Comeau 🍁 (@verybadllama.bsky.social)</a> <a href="https://bsky.app/profile/did:plc:k7xouwluizlonrdxeo5obppi/post/3lr7odyhz7c2d?ref_src=embed">2025-06-10T00:50:07.217Z</a></blockquote>

<script async="" src="https://embed.bsky.app/static/embed.js" charset="utf-8"></script>

<p>LLMs are hardwired to regurgitate text which statistically matches what they've seen before. Their makers believe that a few false-positives are an acceptable error rate for such a useful feature.  The LLM form of pareidolia is to recognise text as being syntactically and linguistically correct, even though the contents are rubbish.  This is an inherent feature of LLMs. No amount of manually tweaking their parameters or prompts can fix this.</p>

<p>At the moment, Artificial Intelligence - whether Machine Learning or Large Language Models - only works well on a narrowly defined set of tasks and with humans checking the output.</p>

<p>Imagine you've just hired an intern. They've graduated top of their class from the best university and, apparently, excel at what they do. Because you're the boss and they're the intern, you ask them to make you a mug of tea. White, no sugar.</p>

<p>They return with the teabag still in the mug. OK, not everyone knows the intricacies of how to serve tea.</p>

<p>The tea tastes funny. You ask them if they sniffed the milk. "Milk? I used Tipp-Ex to make it white!"</p>

<p>At which point, after throwing up, you throw them out.</p>

<p>Most people encountering Gemini's repeated and unacceptable failures will decide, perhaps rightly, that AI isn't even close to good enough yet.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=61327&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/06/large-language-models-and-pareidolia/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[How to Dismantle Knowledge of an Atomic Bomb]]></title>
		<link>https://shkspr.mobi/blog/2025/03/how-to-dismantle-knowledge-of-an-atomic-bomb/</link>
					<comments>https://shkspr.mobi/blog/2025/03/how-to-dismantle-knowledge-of-an-atomic-bomb/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Fri, 21 Mar 2025 12:34:25 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=58979</guid>

					<description><![CDATA[The fallout from Meta&#039;s extensive use of pirated eBooks continues. Recent court filings appear to show the company grappling with the legality of training their AI on stolen data.  Evidence shows an employee asking if what they&#039;re doing it legal? Will it undermine their lobbying efforts? Will it lead to more regulation? Will they be fined?  And, almost as an afterthought, is this fascinating…]]></description>
										<content:encoded><![CDATA[<p>The fallout from Meta's <a href="https://shkspr.mobi/blog/2023/07/fruit-of-the-poisonous-llama/">extensive use of pirated eBooks continues</a>. Recent court filings appear to show the company grappling with the legality of training their AI on stolen data.</p>

<p>Evidence shows an employee asking: is what they're doing legal? Will it undermine their lobbying efforts? Will it lead to more regulation? Will they be fined?</p>

<p>And, almost as an afterthought, is this fascinating snippet:</p>

<blockquote><p>If we were to use models trained on LibGen for a purpose other than internal evaluation, we would need to red team those models for bioweapons and CBRNE risks to ensure we understand and have mitigated risks that may arise from the scientific literature in LibGen.
</p><p>[…]
</p><p>We might also consider filtering the dataset to reduce risks relating to both bioweapons and CBRNE
</p><p><cite>Source: <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.391.24.pdf">Kadrey v. Meta Platforms, Inc. (3:23-cv-03417)</a></cite>
</p></blockquote>

<p>For those not in the know, <abbr>CBRNE</abbr> is "<a href="https://www.jesip.org.uk/news/responding-to-a-cbrne-event-joint-operating-principles-for-the-emergency-services-first-edition/">Chemical, Biological, Radiological, Nuclear, or Explosive materials</a>".</p>

<p>It must be fairly easy to build an atomic bomb, right? The Americans managed it in the 1940s without so much as a digital computer. Sure, gathering the radioactive material may be a challenge, and you might need something more robust than a 3D printer, but how hard can it be?</p>

<p>Chemical weapons were <a href="https://www.wilfredowen.org.uk/poetry/dulce-et-decorum-est">widely deployed during the First World War</a> a few decades previously.  If a barely industrialised society can cook up vast quantities of chemical weapons, what's stopping a modern terrorist?</p>

<p>Similarly, <a href="https://www.gov.uk/government/news/the-truth-about-porton-down">biological weapons research was widespread</a> in the mid-twentieth century. There are various international prohibitions on development and deployment, but criminals aren't likely to obey those edicts.</p>

<p>All that knowledge is published in scientific papers. Up until recently, if you wanted to learn how to make bioweapons you’d need an advanced degree in the relevant subject and the scholarly ability to research all the published literature.</p>

<p>Nowadays, "Hey, ChatGPT, what are the steps needed to create VX gas?"</p>

<p>Back in the 1990s, <a href="https://wwwnc.cdc.gov/eid/article/10/1/03-0238_article">a murderous religious cult were able to manufacture chemical and biological weapons</a>. While I'm sure that all the precursor chemicals and technical equipment are now much harder to acquire, the <em>knowledge</em> is probably much easier.</p>

<p>Every chemistry teacher knows how to make all sorts of fun explosive concoctions - but we generally train them not to teach teenagers <a href="https://chemistry.stackexchange.com/questions/15606/can-you-make-napalm-out-of-gasoline-and-orange-juice-concentrate">how to make napalm</a>. Should AI be the same? What sort of knowledge should be forbidden? Who decides?</p>

<p>For now, it is prohibitively expensive to train a large-scale LLM. But that won't be the case forever. Sure, <a href="https://www.techspot.com/news/106612-deepseek-ai-costs-far-exceed-55-million-claim.html">DeepSeek isn't as cheap as it claims to be</a>, but costs will inevitably drop.  Downloading every scientific paper ever published and then training an expert AI is conceptually feasible.</p>

<p>When people talk about AI safety, this is what they're talking about.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=58979&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2025/03/how-to-dismantle-knowledge-of-an-atomic-bomb/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[GitHub's Copilot lies about its own documentation. So why would I trust it with my code?]]></title>
		<link>https://shkspr.mobi/blog/2024/10/githubs-copilot-lies-about-its-own-documentation-so-why-would-i-trust-it-with-my-code/</link>
					<comments>https://shkspr.mobi/blog/2024/10/githubs-copilot-lies-about-its-own-documentation-so-why-would-i-trust-it-with-my-code/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 02 Oct 2024 11:34:22 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[LLM]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=53391</guid>

					<description><![CDATA[In the early part of the 20th Century, there was a fad for &#34;Radium&#34;. The magical, radioactive substance that glowed in the dark. The market had decided that Radium was The Next Big Thing and tried to shove it into every product. There were radioactive toys, radioactive medicines, radioactive chocolate bars, and a hundred other products.  The results weren&#039;t pretty.  In the early part of the 21st…]]></description>
										<content:encoded><![CDATA[<p>In the early part of the 20th Century, there was a fad for "Radium". The magical, radioactive substance that glowed in the dark. The market had decided that Radium was The Next Big Thing and tried to shove it into every product. There were <a href="https://www.theatlantic.com/health/archive/2012/10/we-used-to-put-radium-in-coffee/263408/">radioactive toys</a>, <a href="https://theconversation.com/when-energy-drinks-actually-contained-radioactive-energy-67976">radioactive medicines</a>, <a href="https://www.tastingtable.com/1087746/the-radioactive-element-that-was-historically-added-to-chocolate/">radioactive chocolate bars</a>, and a hundred other products.</p>

<p><a href="https://en.wikipedia.org/wiki/Radium_Girls">The results weren't pretty</a>.</p>

<p>In the early part of the 21st Century, there was a fad for "AI".  The magical, Artificial Intelligence which provided all the answers. The market had decided that AI was The Next Big Thing and tried to shove it into every product.  You can probably see where this is going, right?</p>

<p>I don't particularly <em>mind</em> companies experimenting with AI. It's good to explore a problem and see if it fits a user's needs. But the current crop are just <em>so</em> shit it makes me wonder whether anyone tested them.</p>

<p>GitHub has forced its new Copilot button on to every page.  The first thing I asked it was <a href="https://bsky.app/profile/raynegrimm.bsky.social/post/3kqyd5ovxd22y">whether it could be turned off</a>.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2024/10/copilot-fs8.png" alt="Me asking Copilot how I switch it off. Copilot responds with a link." width="512" class="aligncenter size-full wp-image-53392">

<p>It pointed me to this page: <a href="https://docs.github.com/en/copilot/getting-started-with-github-copilot/disabling-github-copilot">https://docs.github.com/en/copilot/getting-started-with-github-copilot/disabling-github-copilot</a></p>

<p>Except - and I hate to be a pedant - that link 404s. There's nothing there. It doesn't exist. <a href="https://knowyourmeme.com/memes/jonathan-frakes-beyond-belief-supercuts">It is made up</a>.</p>

<p>This AI, which I am supposed to trust with my code, <strong>doesn't even understand itself</strong>.</p>

<p>This isn't an <a href="https://joshwhiton.substack.com/p/the-ai-mirror-test">AI Mirror Test</a>. This isn't me trying to find out if the large-language model is conscious, aware, or has a soul.  I'm not asking <a href="https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618">complex reasoning</a>, or asking it to make an æsthetic judgement.</p>

<p>This is a <em>basic</em> functionality test.</p>

<p>Is the computer able to <em>accurately</em> provide information about itself?</p>

<p>That's it. That's all I want.  My first interaction with Copilot it lied to me about itself. Why would I trust it again?</p>

<p>You can <a href="https://github.com/orgs/community/discussions/139872#discussioncomment-10813673">leave feedback for GitHub about this problem</a>. I'm sure a human will answer you.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=53391&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2024/10/githubs-copilot-lies-about-its-own-documentation-so-why-would-i-trust-it-with-my-code/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[LLMs are good for coding because your documentation is shit]]></title>
		<link>https://shkspr.mobi/blog/2024/07/llms-are-good-for-coding-because-your-documentation-is-shit/</link>
					<comments>https://shkspr.mobi/blog/2024/07/llms-are-good-for-coding-because-your-documentation-is-shit/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Fri, 12 Jul 2024 11:34:06 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[programming]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=51036</guid>

					<description><![CDATA[That&#039;s it. That&#039;s the post.  Fine! I&#039;ll expand a little more.  Large Language Models are a type of Artificial Intelligence. They can read text, parse it, process it using the known rules of English, and then regurgitate parts of it on demand.  This means they can read and parse a question like &#34;In Python, how do I add two numbers together?&#34; and then read and parse the Python documentation.  It…]]></description>
										<content:encoded><![CDATA[<p>That's it. That's the post.</p>

<p>Fine! I'll expand a little more.</p>

<p>Large Language Models are a type of Artificial Intelligence. They can read text, parse it, process it using the known rules of English, and then regurgitate parts of it on demand.  This means they can read and parse a question like "In Python, how do I add two numbers together?" and then read and parse the Python documentation.  It will <a href="https://chatgpt.com/share/643a3718-f99b-46d6-b60d-6e60e09ce2ef">produce an output like</a>:</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2024/07/pychatgpt-fs8.png" alt="Screenshot of ChatGPT giving a brief tutorial in adding numbers." width="1024" height="724" class="aligncenter size-full wp-image-51037">
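<p>The whole of that generated tutorial boils down to a few lines - this is a sketch of the sort of snippet an LLM typically produces, not ChatGPT's exact output:</p>

```python
# Adding two numbers in Python - the entirety of the "tutorial".
a = 5
b = 3
total = a + b
print(total)  # → 8
```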

<p>What happens if you <a href="https://docs.python.org/3/search.html?q=adding+numbers">search the official Python documentation for "adding numbers"</a>? Nothing useful.</p>

<p>Now, perhaps you think that's a trivial and unfair example. Python has <a href="https://docs.python.org/3/tutorial/introduction.html#numbers">an excellent tutorial which explains addition</a> and the main documentation isn't really aimed at the casual user.</p>

<p>But how many times have you tried to use a moderately popular library, searched through its documentation, and found yourself lost in a maze of twisty passages, all alike?</p>

<p>Be honest! Reading bad documentation sucks. Sometimes it is impossible to look up the most inconsequential piece of syntax, and sometimes it is maddeningly hard to find an example which does exactly what you're trying to do.</p>

<p>We all know that <a href="https://www.zdnet.com/article/stack-overflow-ceo-on-how-it-became-the-worlds-most-popular-programming-site/">StackOverflow built its empire</a> on the back of poor documentation.  Take a look at <a href="https://stackoverflow.com/questions?sort=votes">the highest ranked questions</a> in your favourite language and see if you could answer them <em>using only the official documentation</em>.</p>

<p>Humans aren't good at doing boring things. And programmers (a sub-type of humans) <em>hate</em> the boring job of writing documentation and tutorials. It seems anathema to them.  Even when they can be arsed to write documentation, it is rarely tested with users to see if it is understandable.</p>

<p>We've had over half a century of people writing documentation which is so poor that publishers have made a fortune selling books called "Learn How To Program in ___ the Easy Way!"</p>

<p>Do LLMs occasionally get things wrong? Yes. But official tutorials are also frequently outdated.</p>

<p>Do LLMs burn through hideous amounts of electricity? Yes.</p>

<p>Do LLMs have infinite patience when being asked simple questions? Yes.</p>

<p>Do LLMs win in this arena because humans are lazy? Yes.</p>
<img src="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/info/okgo.php?ID=51036&HTTP_REFERER=RSS" alt="" width="1" height="1" loading="eager">]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2024/07/llms-are-good-for-coding-because-your-documentation-is-shit/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
