epub – Terence Eden’s Blog

Would adding Brotli Compression help shrink ePubs?

@edent — Sat, 26 Jul 2025 11:34:35 +0000

The ePub format is the cross-platform way to package an eBook. At its heart, an ePub is just a bundled webpage with extra metadata - that makes it extremely easy to build workflows to create them and apps to read them.

Once you've finished authoring your ePub, you've got a folder full of HTML⁰, CSS, metadata documents, and other resources. That folder is stored in a standard Zip file, which is then renamed to .epub. This is known as the Open Container Format (OCF).

There are actually a few different compression schemes for Zip files, but the specification says:

OCF ZIP containers MUST include only stored (uncompressed) and Deflate-compressed ZIP entries within the ZIP archive.

The Deflate algorithm is venerable¹ and, while incredible for its time, has been superseded by more modern compression schemes. For example, Brotli.

What happens if we unzip an ePub and then recompress it with Brotli? Will that dramatically reduce the file size?

Steps

Unzip the book
- unzip book.epub -d book/
Brotli files can't contain directories, so tar the directory without any compression
- tar -cvf book.tar book/
Create a Zip file with maximum compression
- zip -9 book.tar.zip book.tar
Create a Brotli file with maximum compression
- brotli -k -q 11 book.tar
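
The same kind of comparison can be sketched in-process with Node.js, whose built-in zlib module exposes both algorithms. This is only an illustration under assumptions: the function name compareCompression and the sample text are made up, and it compresses a single string rather than a tarred book, so the numbers will differ from a real ePub.

```javascript
const zlib = require("zlib");

// Hypothetical helper: compress one string with Deflate and Brotli at
// their maximum settings and report the byte counts.
function compareCompression(text) {
  const raw = Buffer.from(text, "utf8");

  // Raw Deflate at level 9, roughly the counterpart of `zip -9`.
  const deflated = zlib.deflateRawSync(raw, { level: 9 });

  // Brotli at quality 11, the counterpart of `brotli -q 11`.
  const brotlied = zlib.brotliCompressSync(raw, {
    params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 11 },
  });

  return { raw: raw.length, deflate: deflated.length, brotli: brotlied.length };
}

// Made-up sample: repetitive markup, much like a chapter of an ePub.
const sample = "<p>It was the best of times, it was the worst of times.</p>\n".repeat(200);
console.log(compareCompression(sample));
```

A sample this repetitive flatters both algorithms, so treat the output as a demonstration of the API rather than a benchmark.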

Results

I took a random(ish) sample from Standard eBooks and a few from my personal stash².

	Book 1	Book 2	Book 3	Book 4
Contents	768KB	911KB	389KB	594KB
Deflate	250KB	248KB	103KB	175KB
Brotli	190KB	187KB	82KB	137KB

The good news is that ePubs compress pretty well already! That isn't much of a surprise - compression algorithms love the repetitious nature of HTML and human-readable text. Obviously Brotli is better but, on the file sizes we're talking about, not dramatically better. Saving 60KB is OK - but in a world of terabyte sized SD cards does it matter?
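
To put those text-only figures in relative terms, the saving Brotli offers over Deflate can be worked out from the table. A small sketch (the function name brotliSaving is just for illustration):

```javascript
// Sizes in KB, copied from the table above.
const books = [
  { name: "Book 1", deflate: 250, brotli: 190 },
  { name: "Book 2", deflate: 248, brotli: 187 },
  { name: "Book 3", deflate: 103, brotli: 82 },
  { name: "Book 4", deflate: 175, brotli: 137 },
];

// Percentage saved by Brotli relative to Deflate, to one decimal place.
function brotliSaving({ deflate, brotli }) {
  return Math.round(((deflate - brotli) / deflate) * 1000) / 10;
}

const savings = books.map(brotliSaving);
console.log(savings); // [ 24, 24.6, 20.4, 21.7 ]
```

So Brotli trims roughly a fifth to a quarter off the Deflate size, which on these books is about 20KB to 60KB each.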

Brotli is also computationally harder to decompress, which makes it slightly less attractive for low-powered eReaders.

It's also possible to make a small saving by reducing the complexity and verbosity of the CSS and HTML.

However, that's not the real problem.

I lied to you

An ePub contains more than just text and text-based metadata. It can contain web fonts, images, even music. The above books had all their fonts and media stripped out. Let's run the experiment again but, this time, including everything in the original book.

	Book 1	Book 2	Book 3	Book 4
Contents	23MB	3.8MB	0.76MB	0.93MB
Deflate	22MB	1.7MB	0.46MB	0.51MB
Brotli	22MB	1.5MB	0.43MB	0.47MB

All of a sudden, Brotli makes next to no difference. Yes, the textual compression is still there, but it is overshadowed by the huge cost of the media files.

Mixed Media

The ePub 3.3 specification lays out which multimedia formats are acceptable. As well as the older formats like gif, png, and jpeg - newer formats like WebP are acceptable. Similarly, TTF fonts are listed in the standard along with WOFF2.

Modern image and font formats have better compression than their ancestors. Indeed, WOFF2 uses Brotli as its compression scheme.

The biggest filesize saving in ePubs comes from properly compressing images and fonts.

Can You Picture That?

It is a matter of opinion as to what resolution is best suited to an ePub. Most modern eReaders have, at best, 300ppi resolution. They're also normally monochrome. But eBooks aren't always read on low-resolution, black and white eInk screens - so it probably makes sense to have high-resolution colour images in order to future-proof books.

But the compression of those images is not a matter of opinion. Lossless compression algorithms are well supported for legacy and modern image formats.

Let's take a specific example. Twenty Years at Hull House is the 22MB book above. Less than a MB of that is for text, the rest is images.

The largest illustration in the book is a 1937x1971, transparent PNG weighing in at 1MB. Increasing the lossless compression level takes it down to 840KB. Reducing the palette to something more suitable takes it to 640KB. If you were releasing this as an ePub 3.3 file, using WebP would take the image to a hair over 600KB.

Basically, a 20%-40% filesize reduction with no loss of fidelity.

Across all the PNG images in the ePub, I was able to easily get the filesize from 20MB to 16MB.

Converting to lossless WebP got it down to 13MB.

What The Font?

Fonts can be shrunk in a number of ways. The most obvious way is to compress to WOFF2 which, as described above, uses Brotli compression.

Based on my quick tests, a typical ePub's TTF will see about a 50% reduction in file size. For typical "English" language fonts, that's a reduction from 30KB to 15KB. So big relative compression, but small absolute compression.

Complex decorative fonts can go from 800KB to 80KB. But it is rare for a font to exceed a megabyte.

If it does, that usually means that it has more glyphs than strictly necessary. If your book is written entirely in the Latin alphabet, do you really need all those fancy accents, Chinese ideographs, and emoji? Probably not.

I've previously written about Subsetting Fonts and the perils of excessive trimming.

Back to Basics

Brotli is magic - but changing the compression algorithm for the ePub standard is probably a false economy. The text portion of modern eBooks is already fairly small and compresses with reasonable efficiency.

The best compression gains come from either using next-generation image and font formats or, if legacy compatibility is necessary, using the most aggressive compression settings for traditional images.

⁰ OK! It is actually XHTML, but let's not quibble.
¹ That's a fancy way of saying "old".
² I couldn't be bothered automating this. Go ahead and run it on every ePub if you want something more representative.

Extracting content from an LCP "protected" ePub

@edent — Sun, 16 Mar 2025 12:34:57 +0000

As Cory Doctorow once said "Any time that someone puts a lock on something that belongs to you but won't give you the key, that lock's not there for you."

But here's the thing with the LCP DRM scheme; they do give you the key! As I've written about previously, LCP mostly relies on the user entering their password (the key) when they want to read the book. Oh, there's some deep cryptographic magic in the background but, ultimately, the key is sat on your computer waiting to be found. Of course, cryptography is Very Hard™ which makes retrieving the key almost impossible - so perhaps we can use a different technique to extract the unencrypted content?

One popular LCP app is Thorium. It is an Electron Web App. That means it is a bundled browser running JavaScript. That also means it can trivially be debugged. The code is running on your own computer, it doesn't touch anyone else's machine. There's no reverse engineering. No cracking of cryptographic secrets. No circumvention of any technical control. It doesn't reveal any illegal numbers. It doesn't jailbreak anything. We simply ask the reader to give us the content we've paid for - and it agrees.

Here Be Dragons

This is a manual, error-prone, and tiresome process. This cannot be used to automatically remove DRM. I've only tested this on Linux. It must only be used on books that you have legally acquired. I am using it for research and private study.

This uses Thorium 3.1.0 AppImage.

First, extract the application:

./Thorium-3.1.0.AppImage --appimage-extract

That creates a directory called squashfs-root which contains all the app's code.

The Thorium app can be run with remote debugging enabled by using:

./squashfs-root/thorium --remote-debugging-port=9223 --remote-allow-origins=*

Within the Thorium app, open up the book you want to read.

Open up Chrome and go to http://localhost:9223/ - you will see a list of Thorium windows. Click on the link which relates to your book.

In the Thorium book window, navigate through your book. In the debug window, you should see the text and images pop up.

In the debug window's "Content" tab, you'll be able to see the images and HTML that the eBook contains.

Images

The images are the full resolution files decrypted from your ePub. They can be right-clicked and saved from the developer tools.

Files

An ePub file is just a zipped collection of files. Get a copy of your ePub and rename it to whatever.zip then extract it. You will now be able to see the names of all the files - images, css, fonts, text, etc - but their contents will be encrypted, so you can't open them.

You can, however, give their filenames to the Electron app and it will read them for you.

Images

To get a Base64 encoded version of an image, run this command in the debug console:

fetch("httpsr2://...--/xthoriumhttps/ip0.0.0.0/p/OEBPS/image/whatever.jpg")
  .then(response => response.arrayBuffer())
  .then(buffer => {
    // Build a binary string byte by byte, then Base64 encode it.
    const base64 = btoa(
      new Uint8Array(buffer).reduce((data, byte) => data + String.fromCharCode(byte), '')
    );
    console.log(`data:image/jpeg;base64,${base64}`);
  });

Thorium uses the httpsr2 URL scheme - you can find the exact URL by looking at the Content tab.
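
Since the same fetch-and-encode pattern recurs for images and fonts, it may be convenient to wrap the encoding step in a helper. A sketch, where the name toDataUrl and its parameters are assumptions for illustration rather than anything Thorium provides:

```javascript
// Hypothetical helper: turn an ArrayBuffer into a data: URL.
function toDataUrl(buffer, mimeType) {
  const bytes = new Uint8Array(buffer);
  let binary = "";
  // Build the binary string one byte at a time before Base64 encoding.
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Two bytes spelling "Hi", standing in for a fetched file.
console.log(toDataUrl(new Uint8Array([72, 105]).buffer, "text/plain"));
// data:text/plain;base64,SGk=
```

In the debug console, the buffer would come from response.arrayBuffer() on whichever URL the Content tab shows, with the MIME type chosen to match the file.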

CSS

The CSS can be read directly and printed to the console:

fetch("httpsr2://....--/xthoriumhttps/ip0.0.0.0/p/OEBPS/css/styles.css")
  .then(response => response.text())
  .then(cssText => console.log(cssText));

However, it is much larger than the original CSS - presumably because Thorium has injected its own directives in there.

Metadata

Metadata like the NCX and the OPF can also be decrypted without problem:

fetch("httpsr2://....--/xthoriumhttps/ip0.0.0.0/p/OEBPS/content.opf")
  .then(response => response.text())
  .then(metadata => console.log(metadata));

They have roughly the same filesize as their encrypted counterparts - so I don't think anything is missing from them.

Fonts

If a font has been used in the document, it should be available. It can be grabbed as Base64 encoded text to the console using:

fetch("httpsr2://....--/xthoriumhttps/ip0.0.0.0/p/OEBPS/font/Whatever.ttf")
  .then(response => response.arrayBuffer())
  .then(buffer => {
    // Build a binary string byte by byte, then Base64 encode it.
    const base64 = btoa(
      new Uint8Array(buffer).reduce((data, byte) => data + String.fromCharCode(byte), '')
    );
    console.log(base64);
  });

From there it can be copied into a new file and then decoded.
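
One way to do the decoding, assuming Node.js is to hand (the debug console itself cannot write files), is sketched here. The helper name saveBase64 and the output filename are illustrative assumptions:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: decode Base64 text and write the bytes to a file.
// Returns the number of bytes written.
function saveBase64(base64Text, outputPath) {
  // Strip any whitespace picked up while copying from the console.
  const cleaned = base64Text.replace(/\s+/g, "");
  const bytes = Buffer.from(cleaned, "base64");
  fs.writeFileSync(outputPath, bytes);
  return bytes.length;
}

// Stand-in payload; a real run would paste the text copied from the console.
const target = path.join(os.tmpdir(), "Whatever.ttf");
console.log(saveBase64("SGk=", target)); // 2
```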

Text

The HTML of the book is also visible on the Content tab. It is not the original content from the ePub. It has a bunch of CSS and JS added to it. But, once you get to the body, you'll see something like:


    
        Book Title
        
            
                
            
        
        
        SUMMARY 
        Lorem ipsum etc.

Which looks like plain old ePub to me. You can use the fetch command as above, but you'll still get the verbose version of the xHTML.

Putting it all together

If you've unzipped the original ePub, you'll see the internal directory structure. It should look something like this:

├── META-INF
│   └── container.xml
├── mimetype
└── OEBPS
    ├── content.opf
    ├── images
    │   ├── cover.jpg
    │   ├── image1.jpg
    │   └── image2.png
    ├── styles
    │   └── styles.css
    ├── content
    │   ├── 001-cover.xhtml
    │   ├── 002-about.xhtml
    │   ├── 003-title.xhtml
    │   ├── 004-chapter_01.xhtml
    │   ├── 005-chapter_02.xhtml
    │   └── 006-chapter_03.xhtml
    └── toc.ncx

Add the extracted files into that exact structure. Then zip them. Rename the .zip to .epub. That's it. You now have a DRM-free copy of the book that you purchased.
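
One detail worth knowing when zipping: the OCF specification requires the mimetype file to be the first entry in the archive and to be stored uncompressed. A small sketch of ordering the entries before handing them to whichever zip tool you use (the function name orderForEpubZip is hypothetical, and sorting the remaining entries is a convenience, not a requirement):

```javascript
// Hypothetical helper: put "mimetype" first, then everything else sorted.
function orderForEpubZip(paths) {
  const rest = paths.filter((p) => p !== "mimetype").sort();
  return paths.includes("mimetype") ? ["mimetype", ...rest] : rest;
}

console.log(
  orderForEpubZip(["OEBPS/content.opf", "mimetype", "META-INF/container.xml"])
);
// [ 'mimetype', 'META-INF/container.xml', 'OEBPS/content.opf' ]
```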

BONUS! PDF Extraction

LCP 2.0 PDFs are also extractable. Again, you'll need to open your purchased PDF in Thorium with debug mode active. In the debugger, you should be able to find the URL for the decrypted PDF.

It can be fetched with:

fetch("thoriumhttps://0.0.0.0/pub/..../publication.pdf")
  .then(response => response.arrayBuffer())
  .then(buffer => {
    // Wrap the decrypted bytes in a Blob and trigger a download.
    const blob = new Blob([buffer], { type: "application/pdf" });
    const link = document.createElement("a");
    link.href = URL.createObjectURL(blob);
    link.download = "publication.pdf"; // filename for saving
    link.click();
    URL.revokeObjectURL(link.href);
  });

This saves the decrypted file directly, so there is nothing to Base64 decode. You'll have an unencumbered PDF.

Next Steps

That's probably about as far as I am competent to take this.

But, for now, a solution exists. If I ever buy an ePub with LCP Profile 2.0 encryption, I'll be able to manually extract what I need from it - without reverse engineering the encryption scheme.

Ethics

Before I published this blog post, I publicised my findings on Mastodon. Shortly afterwards, I received a LinkedIn message from someone senior in the Readium consortium - the body which has created the LCP DRM.

They said:

Hi Terence, You've found a way to hack LCP using Thorium. Bravo!
We certainly didn't sufficiently protect the system, we are already working on that.
From your Mastodon messages, you want to post your solution on your blog. This is what triggers my message.
From a manual solution, others will create a one-click solution. As you say, LCP is a "reasonably inoffensive" protection. We managed to convince publishers (even big US publishers) to adopt a solution that is flexible for readers and appreciated by public libraries and booksellers.
Our gains are re-injected in open-source software and open standards (work on EPUB and Web Publications).
If the DRM does not succeed, harder DRMs (for users) will be tested.
I let you think about that aspect

I did indeed think about that aspect. A day later I replied, saying:

Thank you for your message.
Because Readium doesn't freely licence its DRM, it has an adverse effect on me and other readers like me.

My eReader hardware is out of support from the manufacturer - it will never receive an update for LCP support.

My reading software (KOReader) have publicly stated that they cannot afford the fees you charge and will not be certified by you.

Kobo hardware cannot read LCP protected books.

There is no guarantee that LCP compatible software will be released for future platforms.

In short, I want to read my books on my choice of hardware and software; not yours.
I believe that everyone deserves the right to read on their platform of choice without having to seek permission from a 3rd party.
The technique I have discovered is basic. It is an unsophisticated use of your app's built-in debugging functionality. I have not reverse engineered your code, nor have I decrypted your secret keys. I will not be publishing any of your intellectual property.
In the spirit of openness, I intend to publish my research this week, alongside our correspondence.

Their reply, shortly before publication, contained what I consider to be a crude attempt at emotional manipulation.

Obviously, we are on different sides of the channel on the subject of DRMs.
I agree there should be many more LCP-compliant apps and devices; one hundred is insufficient. KOReader never contacted us: I don't think they know how low the certification fee would be (pricing is visible on the EDRLab website). FBReader, another open-source reading app, supports LCP on its downloadable version. Kobo support is coming. Also, too few people know that certification is free for specialised devices (e.g. braille and audio devices from Hims or Humanware).
We were planning to now focus on new accessibility features on our open-source Thorium Reader, better access to annotations for blind users and an advanced reading mode for dyslexic people. Too bad; disturbances around LCP will force us to focus on a new round of security measures, ensuring the technology stays useful for ebook lending (stop reading after some time) and as a protection against oversharing.
You can, for sure, publish information relative to your discoveries to the extent UK laws allow. After study, we'll do our best to make the technology more robust. If your discourse represents a circumvention of this technical protection measure, we'll command a take-down as a standard procedure.

A bit of a self-own to admit that they failed to properly prioritise accessibility!

Rather than rebut all their points, I decided to keep my reply succinct.

As you have raised the possibility of legal action, I think it is best that we terminate this conversation.

I sincerely believe that this post is a legitimate attempt to educate people about the deficiencies in Readium's DRM scheme. Both readers and publishers need to be aware that their Thorium app easily allows access to unprotected content.

I will, of course, publish any further correspondence related to this issue.

Stop treating eBooks like paper books

@edent — Fri, 14 Apr 2023 11:34:31 +0000

As part of my never-ending quest to banish this skeuomorph from the world…

I was reading a fascinating eBook recently which was, sadly, designed to mimic a legacy / paper book. To the point where the authoring software had hard-coded in page numbers and forced them to be displayed.

Here's what it looked like:

There are two abominations here. There's no need to interrupt the reading experience by bisecting a page and displaying the page numbers. And there's no need to put footnotes at the actual foot of the artificial page.

The whole point of an eBook is to free the reader from the tyranny of the publisher's choices. If the reader wants to justify the text, change the font, hide all footnotes, or has strong opinions about widows and orphans - they can choose a reading experience which suits their needs.

Let's take a look at the code behind the page and how it should have been written.

Page numbers

The HTML the book uses to show the page numbers is:

   ...
   end p.16
    


   
      
         ...

The CSS is:

div.page-break {
    text-align: right;
    line-height: 1px;
    font-size: 10px;
    color: #aca368;
    border-bottom: 1px solid #706650;
    padding: 3px 4px 8px 0px;
    margin: 0px 2px 10px 0px;
}

So, what should it be? I've previously written about the support ePub has for page numbers. And the answer is well documented in the specification.

It's simply this:

That inserts an invisible pagebreak. The reader can choose to render one page per physical screen. Or they can choose to display a page number. Or they can choose to ignore the suggestion.

HTML Footnotes

Here's the code as presented in the ePub:

   Not only was that result in disagreement with the other trials made by the committee but also it was the direct opposite of the observations by Adams and De Luc, 


   15. See De Luc 1772, 1:219-221, §408.


   16. See ...

end p.16
...

There's no need to put footnotes alongside the text. If you do that, you're basically telling the reader that you know better than them how they want to read the book. Most eReaders will pop-up a footnote, making it easy to read and easy to close:

The footnote link should have a specific ePub type, and the footnote itself is semantically represented as an aside element with its own type:

The use of dilithium crystals was discouraged1.


    1. Scott, M (2257).

The footnote text can be placed at the end of the chapter or the end of the book. No need to force it into the reader's eyeline.

There's also similar advice from Kindle.

What have we learned today

Books are magical. And, in my humble opinion, eBooks are better than legacy paper books. I can boost the font size of my eBook rather than having to buy an expensive large-print version. I can navigate by searching, or by semantic features, rather than grubbing around with page numbers. And I can choose to follow a footnote or ignore it as the whim strikes me.

But all that requires the publishers actually understanding how to take advantage of the format.

How do you raise a software bug with a book publisher?

@edent — Mon, 23 Nov 2020 12:26:12 +0000

Recently, I bought an eBook which has a bug. I'd like to explain what the bug is, why it is a problem, and how I'm trying to get it corrected.

Amazon sells eBooks in KF8 format. That is an ePub with some proprietary extras. ePub is a standard based off HTML5. You can read the ePub 3 specification but, basically, it is a .zip of HTML files. If you unzip an eBook, you can read the source code behind it.

When trying to read a Kindle book on a non-Kindle device, I noticed a bug. Some words were not displaying. I took a look at the underlying source code, and found this:

Sometimes words and letters were wrapped with a pagebreak span like this:

‘But of course!’

When I tried to read the book using KOReader the word "‘But" didn't appear. Why? Let's take a look at the ePub3 specification concerning page breaks:

pagebreak
A separator denoting the position before which a break occurs between two contiguous pages in a statically paginated version of the content.
HTML usage context: phrasing and flow content, where the value of the carrying elements title attribute takes precedence over element content for the purposes of representing the pagebreak value

Here's the problem - eBooks can have page numbers. Despite "Page Numbers in eBooks Considered Harmful" lots of publishers still use them. I guess it is kind of useful if you want to refer to something on a printed page - but eReaders allow you to change font size and line spacing, so the concept of a page is somewhat nebulous.

The way the spec is written means that you can write something like:

You use the id for internal linking and the title attribute for the value.

Because of this, most eReaders do not display the physical page number inside the span. It has no semantic content for the reader, and breaks flow. If they did display it, you might end up reading text like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean vel

9

risus at metus molestie tincidunt. Donec aliquet aliquam lorem, ...

So KOReader deliberately ignores any text which is wrapped with epub:type="pagebreak".
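
To see why the word vanished, here is a rough sketch of a renderer that drops pagebreak elements along with whatever they contain. This is not KOReader's actual code; the function name, the regular expressions, and the sample markup (including its id and title values) are all assumptions made for illustration:

```javascript
// Hypothetical renderer: remove pagebreak spans, content and all,
// then strip the remaining tags to leave the visible text.
function visibleText(xhtml) {
  const pagebreak =
    /<span\b[^>]*\bepub:type="pagebreak"[^>]*?(?:\/>|>[\s\S]*?<\/span>)/g;
  return xhtml.replace(pagebreak, "").replace(/<[^>]+>/g, "");
}

// Readable text wrongly wrapped in the pagebreak marker.
const wrapped =
  '<p><span epub:type="pagebreak" id="page9" title="9">‘But</span> of course!’</p>';
// An empty, self-closing marker placed before the text.
const selfClosing =
  '<p><span epub:type="pagebreak" id="page9" title="9"/>‘But of course!’</p>';

console.log(visibleText(wrapped));     // the opening word is lost
console.log(visibleText(selfClosing)); // the full sentence survives
```

A regular expression is a blunt tool for markup and would trip over nested spans; it is used here only to keep the sketch short.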

If you look at the example ePubs provided by the International Digital Publishing Forum, you'll see that the majority of their spans are self-closing.

Very occasionally, you see something which has just a page number in it:

169

I checked with KOReader and they confirmed that they were following the spec. I agree with them. There's no reason to wrap readable content in a metadata span like that.

I've checked several other books from different publishers. None of them abuse pagebreaks this way. I think Penguin Random House are doing it wrong and I would like to correct them.

Reporting it

I've previously reported buggy ebooks to vendors. But because the Kindle app doesn't exhibit this problem, I thought it was futile going via Amazon. So I thought I'd try going directly to the publisher.

Sadly, Penguin UK's GitHub repo is dead. Their dedicated digital publishing team haven't tweeted in 2 years.

Their contact page has a suggestion for what to do if there is an error in an eBook:

It is best if you return the book to the original bookshop from which it was purchased; they should be happy to exchange it for a perfect copy. If you have any difficulty with this then please return the cover and title page of the book to us.

Hmmm....

I dropped them an email, and got back this very reasonable reply:

Thank you for reaching out and bringing this to our attention. The distinction you’ve made is already a part of our specification; this was an oversight which we’re looking into as a result of your input—it bypassed checks both because it validates and also passes visual checks on all the major platforms we screen for. This isn’t reflective of our entire library and should be limited to specific titles which we’re currently investigating.

That's fair enough. The rendering quirk is specification compliant - but hard to spot because of the Kindle monoculture.

Change the spec, change the world

I've made a suggestion on GitHub that the spec should be clarified. I don't think it's particularly obvious that content in a pagebreak may not be displayed.

Most resources agree that the content of a pagebreak should either be blank, or be the page number.

If you include the page numbers as text content within a span or div, the pages will be more easily accessible to both sighted users and users using assistive technologies. This method has been employed in previous DAISY standards. The potential downside, however, is that mainstream user agents will not provide equivalent functionality to turn off unwanted content, forcing users to hear and view the page numbers.

Digital Accessible Information System (DAISY)

Whose fault is it anyway?

This is a tricky one. I think Penguin have undoubtedly made a mistake with the way they publish ePubs. But, so far, KOReader is the only rendering engine I've found which suppresses the content of pagebreaks by default.

Generally speaking, a user wouldn't want to display page numbers on an eBook. Software could have a user defined toggle to switch them on or off. Luckily, KOReader has a variety of style-sheets for rendering eBooks - so I picked one which displayed pagebreak content.

Software is hard.

If The Kindle is Sold at Break-Even, Why Doesn't Amazon Sell ePub?

@edent — Thu, 15 Nov 2012 07:10:11 +0000

Amazon claims that it makes no money from the sale of Kindle eReader hardware.

Looking at the prices of eink devices at wholesalers, this looks broadly accurate. They do seem to be selling at around wholesale cost - customers also get Amazon's fabulous support, free software updates, and high quality manufacturing.

Yet there is a curious anomaly. Why aren't Amazon selling ePub books?

Terminology

A quick diversion into the terminology used in this article.

eReader - the physical hardware. Kindle, Kobo, nook, etc.
eBook - the electronic file containing the words & pictures. ePub, Mobi, PDF, etc.

Background

There are, broadly speaking, two main formats for eBooks - ePub and MobiPocket. Think of them like the difference between 8-Tracks and cassette tapes - they both hold music, but play on different systems.

ePub works on just about every eReader on the planet - with the notable exception of the Kindle.

MobiPocket (or Mobi, for short) only works on the Kindle⁰.

Wikipedia has a fairly comprehensive comparison of which device can handle which format.

So, we have a problem. The books you buy from Amazon can't be read on your Sony, Kobo, Nook, or Generic eReader. Well, they can, but you have to remove the DRM, convert the book to ePub, and hope that everything works ok.

What a pain in the arse.

What Would Happen If...

Now, I'm not suggesting that the Kindle should be able to read ePub books. Obviously, it's technically capable of doing so - but it would mean that Amazon customers could compare prices with other retailers and start to leave the Amazon ecosystem.

What I'm suggesting is that Amazon should say "Buy this ebook for your Kobo" and deliver an ePub to those poor, unfortunate souls who haven't been blessed with a Kindle.

There are lots of statistics regarding eReader share. Some suggest that Amazon have a ~47% share of the eReader market in the US whereas the Kobo eReader has a 46% share in Canada, and 50% share in France.

Let's say that the Kindle has a worldwide share of 50%. Amazon has two options:

It can aggressively pursue that market share by producing more innovative, cheaper hardware, and hope to convert users to the Amazon flock
It can accept that some people don't want its hardware and start selling books directly to those users

Amazon claims that it makes more money from eBook sales than hardware. So why doesn't it expand its market to the 50% of eReaders which are currently not served by its store?

At the moment, customers with Kobo, nook, and other eReaders can compare prices across a number of eBook stores. What would happen if they could add Amazon to the list of shops they could compare with?

⁰ I'm talking specifically about the DRM'd form of Mobi which is sold by Amazon.