Screenscraping Album Artwork From The Linux Command Line
Like many people, I've collected a fair number of CDs over the years. As hard drives and MicroSD cards have got larger and cheaper, I've gradually been ripping them to FLAC. Most CD rippers automatically tag the music files with the correct metadata and, nowadays, they will also download and embed album artwork.
(As an aside, it always boggled my mind that CDs don't come with metadata burned onto the disc. Even a single spare megabyte would be enough to hold detailed track listing, artwork, etc.)
Back when I started, there was no way to get album artwork. Most media players will recognise that if a .jpg is in a folder with music, then it should be treated as the album artwork. This file is usually called "cover.jpg" or "albumart.jpg" - but that's only convention; any name will do.
So, rather than re-rip all my CDs, I wrote a quick bash script to scrape the images from albumart.org. First the script, and then some notes about the choices I made when writing it.
#!/bin/bash -e
# get_coverart.sh
#
# This simple script will fetch the cover art for the album information provided on the command line.
# It will then download that cover image, and place it into the child directory.
#
# Usage: ./get_coverart.sh <album directory>
#
# ./get_coverart.sh "Beatles/Sgt Pepper"
#
# ./get_coverart.sh Beatles/Sgt_Pepper
#
# ./get_coverart.sh "Beatles - Sgt_Pepper"
#
# To auto-populate all directories in the current directory, run the following command
#
# find . -type d -exec ./get_coverart.sh "{}" \;
albumpath="$1"
# Escape any problematic characters
encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$albumpath")"
# Skip if a cover.jpg exists in the directory
if [ -f "$albumpath/cover.jpg" ]
then
echo "$albumpath/cover.jpg already exists"
exit
fi
# Tell the user what is going on
echo ""
echo "Searching for: [$1]"
# scraping AlbumArt.org
url="http://www.albumart.org/index.php?skey=$encoded&itempage=1&newsearch=1&searchindex=Music"
echo "Searching ... [$url]"
# Grab the first Amazon image without an underscore (usually the largest version)
coverurl="$(wget -qO - "$url" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg')"
echo "Cover URL: [$coverurl]"
# Save the image
wget "$coverurl" -O "$albumpath/cover.jpg"
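The script relies on Perl's URI::Escape module to encode the search term, and that module may not be present on a minimal install. As a rough sketch only, the same percent-encoding can be done with bash built-ins. The function name urlencode is made up for this illustration, and it assumes plain ASCII input; accented or other multi-byte characters would need extra handling.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): percent-encode a
# string using only bash built-ins. Assumes ASCII input.
urlencode() {
    local input="$1" output="" char i
    for (( i = 0; i < ${#input}; i++ )); do
        char="${input:i:1}"
        case "$char" in
            # Unreserved characters pass through unchanged
            [a-zA-Z0-9.~_-]) output+="$char" ;;
            # Everything else becomes %XX; the leading quote makes printf
            # use the character's numeric value
            *) output+="$(printf '%%%02X' "'$char")" ;;
        esac
    done
    printf '%s\n' "$output"
}

urlencode "Beatles/Sgt Pepper"
```

For the example above this prints Beatles%2FSgt%20Pepper, which could be assigned to the encoded variable in place of the perl call.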
Notes
I originally suggested this as an enhancement for the popular ABCDE Linux ripper. It's based off this older, now obsolete, script.
AlbumArt.org uses images from Amazon.com - why not just use the Amazon API?
The Amazon API is great but it requires that you get an account with Amazon and include an API key with every request. That means you can't just dump the script on a box and start downloading - you'd need to configure it first.
Why the change from XPATH?
I love XPATH and use it regularly. What I found when deploying this script to a new Ubuntu install was that xmllint wasn't installed by default. On the other hand, grep is installed on practically every Linux machine. Since the Amazon image URLs follow a fixed pattern, a regular expression works just fine.
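To see how the grep approach picks out the full-size image, here is a small sketch that runs a similar pattern against some made-up HTML. The sample markup, the image filenames and the function name extract_cover_url are all invented for illustration; the real page on albumart.org may look different. The pattern is a slightly tightened variant of the one in the script, with the dots escaped.

```shell
#!/bin/bash
# Hypothetical sample of a results page: one thumbnail (with underscores in
# its filename) followed by two full-size images.
sample_html='<img src="http://ecx.images-amazon.com/images/I/51abcDEF12L._SL160_.jpg">
<a href="http://ecx.images-amazon.com/images/I/51abcDEF12L.jpg">View larger</a>
<a href="http://ecx.images-amazon.com/images/I/41zzzYYY99L.jpg">View larger</a>'

# -o prints only the matching text, -m 1 stops after the first matching line.
# The character class has no underscore, so the thumbnail URL cannot match.
extract_cover_url() {
    grep -m 1 -o 'http://ecx\.images-amazon\.com/images/I/[%0-9a-zA-Z.,-]*\.jpg'
}

printf '%s\n' "$sample_html" | extract_cover_url
```

With this sample input the thumbnail line is skipped and the command prints http://ecx.images-amazon.com/images/I/51abcDEF12L.jpg.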
What if there are multiple results from a search?
This will automatically download the first one. As this is a command line tool, there's no practical way to display the various images. I did look at ASCII art conversion, but that's problematic. Some albums work well - e.g. Little Mix's DNA
.=====+++O887?+++.+===~~INMNMMMMMN?~==~.
.=7I+=~~NMM8ND$=+.====~OMMMMMMMMMMMD~=~.
.~=+=ONMD8ZNNN88I.~~~~:MMMM=?MMMMMMMN=~.
.~==ZNN+:,,,N8ODO.:~~~=NMMZZ:$MMMMMMN=~.
.===NZNII..+O88ONI.=~:=,DM+$::7NMMMMN:~.
.+=7DD8,::~,,DD88O.OOIZOMMM~=7.+MMND+~,:
.=+8DND+,:~,=DD8DD~,8DZ+:NMZ?~?8MMMNNMO~
.==DNNNNI=++NDDNNDZ$MNOOOON7++IMMM8?==~M
.+?NMNNN78N8MNNNNNOI?+???I?I+=+IM8$+++I:
.=NMMMNDN==MNNMNMN$=+?8$I+?+==+???III==+
.8NMMMNN?=+M$IIII:I,77II7D?+~~=??:ONMNZ?
.MNMNMMZ7~+MO7~II:7=7???77$,......,:??=M
.MMMM8=.......,~=~=.I:I,7,I=+==+=====+=.
...,~=?8D878DO+~~~=?~=7++NNM8NNMD===+==.
.==~=~DD8DNNON8+7I=+?=78ND~:::~NNNI===~.
.==~~~NI:,,:I88+:~~~.~=8M?:,?,=+NOM7+=~.
.=~=~~8~,...,?8~~~~~.+=MDND:,I::8MMZ=+=.
.==~~~ZDD+.+7+$~~~~~,,~MM7:~=.,=DNNMI+=.
.~~~~:?:,~,,,7=:~::~~.+NMMI7:~:~NDNNN+=.
.~~==~:I,,,,~D::~:~~=.DMDNNI::O+NN8DN+=.
.====~=:,,,~N~:::~~==.N8NDDN+=~+DOODN$=.
.+====~D+,M~~,.,:====.ND8NNN+::=DD8OND=.
.??,++++I~,?+:,,:::~+87D8DNM:::~MNODND=.
.?=,.~=~,+,$+?+??++~IMNNNMNN$,,,MNNMNNOI
..$+~~=.:~,Z====++++7MMNMNM,~,,,MMNNNMM.
Whereas Sgt Pepper is hard to make out.
:::::::::::::::::::::::~~~~~~::::::,,,,,
:,:::::,~::==:::O7~~:~8~=~~=Z+~:I:::,,,.
~+.:+8~D7:::~+8??:==I:O?~$:~$ZI=7IO:,,~:
:I$=,N$=:8O7?7+I?=.8?78OI:~+I~7O$Z$O+?$:
,O$.~D8$~~Z+ZD7$$DOO=8ZZ?OI8?:D7=887Z8Z?
~87+Z$+:~8+,:OD8O~=NDD7OO$+Z8D:ZDDDD88Z?
$O88Z8?DDD?ZZN+ODI,D?88D887$7=D8D+D8O8Z=
,7ID=88II7IOZ7ZZZ7$:===$77~I:~88ZIIO88$?
~..DDDDDDDDOIN=7$7~?I?=OIZ$I:$Z7+?Z+88ZI
Z+,8DDDDDDD+D7O?8ZZ777?8Z8I==OD~8~D$7OZ?
,,D8DDDDNDDDO8Z=$Z8+IZDOZ+ZI+D?88+78O8~=
I?ZDDDDNDN8N=+O+7II$7?ZO+Z$7I8DDOZ?DDII=
=~DDDONNNNNDZZOZZI+?Z+87OOZOZ$8D888D8+II
+=D8NNDDDNDD8N+Z,?OZ~?~88ODOZ+ODDD88DI:?
~~DDDDD8DNDDDO~??:,+7~=DD8DNDDD8788:D,::
~=D8==8DDNNNDO=$=?O$~I:DDO$?DDDDDDD=?OZ+
OD8DZONNDNNDNNN=~7~Z+,DNN8?$DNZZ?DDDO8$7
7O=O8:D78DZNZZN77$7OD8?7$I7=?~D$O8O~~Z7I
=Z$O88D$N$O8N88ZZZ$ON8NDNNZO7ON$7OD8?Z$~
IZD8ZOD8O$OOD$8DDNZNN$DDNN$8ZDNDO$DDZ8OI
7ZDO8ODDO8DOD8N88$ZNDDNDNN8DD8N8$OOZ8O$?
IZDNNNNN8NNNNNNNMZ7N=+Z7O=O7ZI:~DD8DD8OI
Z8DND+$Z+OD+NNNDN8DNDNNNN++I+$=ZDDDDDD8?
ONNDDNO8=D8NNNNDDZD8~NNNONNNNNN8NDDDDDD$
8D88ZNNNOONNNNNDDD$ODNDD8ZMNNDN8D8NNDDDI
There are amazing tools like aview - but again, that's an extra program which the user might not have.
If your album directories are sensibly named, the first hit is usually good enough.
Hang on! There's a mistake!
Quite probably. This is a quick and dirty script. I'm sure there are lots of edge cases and (no doubt) some poor coding practices. If you wish to contribute a patch, please drop it in the comments.
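As one example of the kind of edge case a patch might cover, the sketch below checks that the target is really a directory and that the search returned something before any download is attempted. The function name can_save_cover and its messages are invented for this illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical guard, not part of the original script: returns 0 only when
# it looks safe to attempt the download.
can_save_cover() {
    local albumpath="$1" coverurl="$2"
    if [ ! -d "$albumpath" ]; then
        echo "Skipping: [$albumpath] is not a directory" >&2
        return 1
    fi
    if [ -z "$coverurl" ]; then
        echo "Skipping: no cover found for [$albumpath]" >&2
        return 1
    fi
    return 0
}

# Example: an existing directory with an empty search result is rejected.
if can_save_cover "." ""; then
    echo "would download"
else
    echo "nothing to download"
fi
```

In the script itself, a check like this could sit just before the final wget call, so that an empty search result never produces a zero-byte cover.jpg.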