Quick and dirty way to rip an eBook from Android
I recently purchased a book for my MSc which was only available via a crappy Android app. There was no obvious way to decrypt it to read on a more sensible device, so I resorted to the ancient art of screen scraping.
This is a quick-and-dirty way to grab images of the pages and convert them to a standard PDF using Linux. There's a lot more you can do to make the finished book more useful, but this'll get you started.
Lots of Screen Shots
With a USB cable plugged into my phone and laptop, I wrote a horrible little bash script:
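#!/bin/bash
# Capture 555 pages; zero-padded names keep the files in page order.
for i in {00001..00555}; do
  # Save a PNG screenshot of the current page.
  adb exec-out screencap -p > "$i.png"
  # Tap the bottom right of the screen to turn the page.
  adb shell input tap 1000 2000
  # Give the page a second to redraw.
  sleep 1
done
echo "All done"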
This runs a loop 555 times. Each pass takes a screenshot, names it after the loop number with padded zeros, taps the bottom right of the screen, then waits for a second to ensure the page has refreshed. Slow and dull, but it works reliably.
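If a tap is missed or the page is slow to redraw, two neighbouring screenshots can end up showing the same page. Here is a minimal sketch of a check, assuming that a repeated page produces a byte-identical PNG (which may not hold if, say, a clock is visible in the shot). `find_repeats` is a hypothetical helper name, not part of adb:

```shell
#!/bin/bash
# find_repeats [DIR]
# Print each PNG in DIR (default: current directory) that is byte-identical
# to the file before it in sorted (page) order.
# Assumption: a page that failed to turn yields an identical screenshot.
find_repeats() {
  local dir="${1:-.}" prev="" f
  for f in "$dir"/*.png; do
    [ -e "$f" ] || continue   # skip the literal pattern if there are no PNGs
    if [ -n "$prev" ] && cmp -s "$prev" "$f"; then
      echo "$f repeats $prev"
    fi
    prev="$f"
  done
}
```

Calling `find_repeats .` in the screenshot directory lists any suspect files. No output only means no exact duplicates were found, not that every page turned.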
Images range from 200KB to 2MB depending on complexity. Back them up before doing the next bit.
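One way to take that backup, as a minimal sketch. The helper name `backup_pngs` and the directory name `originals` are arbitrary choices for illustration:

```shell
#!/bin/bash
# backup_pngs [DEST]
# Copy every PNG in the current directory into DEST (default: "originals").
backup_pngs() {
  local dest="${1:-originals}"
  mkdir -p "$dest"
  # -p preserves timestamps; -- stops odd file names being read as options.
  cp -p -- *.png "$dest"/
}
```

Because the later commands only match `*.png` in the current directory, copies kept in a subdirectory are left alone.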
Cropping
The screenshots are all 1080x2160, but the page only takes up part of that. The top left corner is at 50x432 and the bottom right is at 1028x1726, which makes the page 978 pixels wide and 1294 pixels high.
This command crops all the images. It is destructive, so make sure you have a backup.
mogrify -crop 978x1294+50+432 +repage *.png
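The crop geometry is the width and height of the page followed by the offset of its top left corner. Here is a small sketch of that arithmetic, using the corner coordinates measured above. `crop_geometry` is a hypothetical helper, not an ImageMagick command:

```shell
#!/bin/bash
# crop_geometry LEFT TOP RIGHT BOTTOM
# Print an ImageMagick geometry string: WIDTHxHEIGHT+LEFT+TOP.
crop_geometry() {
  local left=$1 top=$2 right=$3 bottom=$4
  echo "$((right - left))x$((bottom - top))+${left}+${top}"
}

crop_geometry 50 432 1028 1726   # prints 978x1294+50+432
```

If your screen or app layout differs, measure the two corners on one screenshot and feed them in to get the matching `-crop` argument.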
It's also useful to trim the images to remove any whitespace from the borders, which makes for a smaller file.
mogrify -trim *.png
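Images can be shrunk with pngquant. By default it writes new files with a -fs8.png suffix next to the originals, which the *.png glob in the next step would also pick up, so tell it to overwrite in place:
pngquant --ext .png --force *.png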
PDF and OCR
Sticking all the images together into a single PDF is pretty easy:
convert *.png +repage output.pdf
The +repage option resets the canvas information left over from cropping and trimming, so each PDF page keeps the size and aspect ratio of its trimmed image.
But there's no text to search. There are a bunch of OCR programs on Linux; I like PDF Sandwich:
pdfsandwich -rgb -nopreproc output.pdf
That'll get you a colour PDF with OCR'd text embedded in it. The text is "sandwiched" behind the image of the page, so you can't see it but can search for it.
You can also use OCRmyPDF which may result in a smaller file:
ocrmypdf -l eng output.pdf output_ocr.pdf
And that's it. I now have a searchable PDF which I can read on any device.
What have we learned?
DRM on textbooks is an annoyance. For computer science books, it's little more than a fig-leaf.
matoken said on twitter.com:
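Grab Android screenshots with the adb command and turn the pages, converting e-books that are hard to read because of proprietary formats and the like into images, then into a PDF with OCR. I've done something similar before by driving the device over BT HID, but this way looks better 🤖 "Quick and dirty way to rip an eBook from Android" shkspr.mobi/blog/2021/12/q…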