Screenscraping Album Artwork From The Linux Command Line


Like many people, I've collected a fair number of CDs over the years. As hard-drives and MicroSD cards have got larger and cheaper, I've gradually been ripping them to FLAC. Most CD rippers automatically tag the music files with the correct metadata and, nowadays, they will also download and embed album artwork as well.

(As an aside, it always boggled my mind that CDs don't come with metadata burned onto the disc. Even a single spare megabyte would be enough to hold detailed track listing, artwork, etc.)

Back when I started, there was no way to get album artwork. Most media players will recognise that if a .jpg is in a folder with music, then it should be treated as the album artwork. This file is usually called "cover.jpg" or "albumart.jpg" - but that's only convention; any name will do.
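As it happens, that convention makes a collection easy to audit. If you want to see which album folders are still missing artwork, something like this does the trick (a sketch - it assumes the "cover.jpg" name and checks every directory under the current one):

```shell
# Print album directories that are still missing a cover.jpg.
# The test runs in a tiny sh so that {} stays a standalone argument,
# which keeps this portable across find implementations.
find . -type d ! -exec sh -c 'test -f "$1/cover.jpg"' _ {} \; -print
```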

So, rather than re-rip all my CDs, I wrote a quick bash script to scrape the images from albumart.org. First the script, then some notes about the choices I made when writing it.

#!/bin/bash -e
# get_coverart.sh
#
# This simple script will fetch the cover art for the album information provided on the command line.
# It will then download that cover image and save it into the album's directory as cover.jpg.
#
# Usage (quote the argument if it contains spaces):
#
# ./get_coverart.sh "Beatles/Sgt Pepper"
#
# ./get_coverart.sh Beatles/Sgt_Pepper
#
# ./get_coverart.sh "Beatles - Sgt_Pepper"
#
# To auto-populate all directories in the current directory, run the following command
#
# find . -type d -exec ./get_coverart.sh "{}" \;
albumpath="$1"

# Escape any problematic character
encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$albumpath")"

# Skip if a cover.jpg exists in the directory
if [ -f "$albumpath/cover.jpg" ]
then
    echo "$albumpath/cover.jpg already exists"
    exit
fi

# Tell the user what is going on
echo ""
echo "Searching for: [$1]"

# scraping AlbumArt.org
url="http://www.albumart.org/index.php?skey=$encoded&itempage=1&newsearch=1&searchindex=Music"
echo "Searching ... [$url]"

# Grab the first Amazon image without an underscore (usually the largest version)
coverurl=`wget -qO - "$url" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg'`

echo "Cover URL: [$coverurl]"

# Save the image
wget "$coverurl" -O "$albumpath/cover.jpg"
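As an aside - the only non-standard dependency above is perl's URI::Escape module. If that isn't available, the encoding step can be approximated in pure bash. This is a rough sketch, not part of the script itself, and `urlencode` is a hypothetical helper name:

```shell
#!/bin/bash
# Percent-encode a string using only bash built-ins - an approximation
# of perl's uri_escape for machines without the URI::Escape module.
urlencode() {
    local s="$1" out="" c i
    for (( i = 0; i < ${#s}; i++ )); do
        c="${s:i:1}"
        case "$c" in
            # Unreserved characters pass through untouched
            [A-Za-z0-9.~_-]) out+="$c" ;;
            # Everything else becomes %XX (hex of the byte value)
            *) printf -v c '%%%02X' "'$c"; out+="$c" ;;
        esac
    done
    printf '%s\n' "$out"
}

urlencode "Beatles/Sgt Pepper"   # Beatles%2FSgt%20Pepper
```

Note that this only handles single-byte characters correctly; for anything beyond ASCII, stick with URI::Escape.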

Notes

I originally suggested this as an enhancement for the popular ABCDE Linux ripper.
It's based off this older, now obsolete, script.

AlbumArt.org uses images from Amazon.com - why not just use the Amazon API?

The Amazon API is great, but it requires that you get an account with Amazon and include an API key with every request. That means you can't just dump the script on a box and start downloading - you'd need to configure it first.

Why the change from XPATH?

I love XPath and use it regularly. What I found when deploying this script to a new Ubuntu install was that xmllint wasn't installed by default. On the other hand, grep is installed on every machine. Since the Amazon image URLs follow a fixed pattern, a regular expression works just fine.
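To see the regular expression in action, here's a self-contained sketch run against a fabricated scrap of HTML. The first URL contains underscores (Amazon's resized thumbnails are named that way), which aren't in the character class, so only the full-size image is returned:

```shell
# The same grep pattern as the script, run over a sample page fragment.
# Underscored filenames don't match the character class, so grep skips
# the thumbnail and returns only the full-size image URL.
html='<img src="http://ecx.images-amazon.com/images/I/51SAKEc0HEL._SL160_.jpg">
<img src="http://ecx.images-amazon.com/images/I/51SAKEc0HEL.jpg">'
echo "$html" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg'
# -> http://ecx.images-amazon.com/images/I/51SAKEc0HEL.jpg
```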

What if there are multiple results from a search?

This will automatically download the first one. As this is a command line tool, there's no practical way to display the various images.
I did look at ASCII art conversion, but that's problematic.
Some albums work well - e.g. Little Mix's DNA

.=====+++O887?+++.+===~~INMNMMMMMN?~==~.
.=7I+=~~NMM8ND$=+.====~OMMMMMMMMMMMD~=~.
.~=+=ONMD8ZNNN88I.~~~~:MMMM=?MMMMMMMN=~.
.~==ZNN+:,,,N8ODO.:~~~=NMMZZ:$MMMMMMN=~.
.===NZNII..+O88ONI.=~:=,DM+$::7NMMMMN:~.
.+=7DD8,::~,,DD88O.OOIZOMMM~=7.+MMND+~,:
.=+8DND+,:~,=DD8DD~,8DZ+:NMZ?~?8MMMNNMO~
.==DNNNNI=++NDDNNDZ$MNOOOON7++IMMM8?==~M
.+?NMNNN78N8MNNNNNOI?+???I?I+=+IM8$+++I:
.=NMMMNDN==MNNMNMN$=+?8$I+?+==+???III==+
.8NMMMNN?=+M$IIII:I,77II7D?+~~=??:ONMNZ?
.MNMNMMZ7~+MO7~II:7=7???77$,......,:??=M
.MMMM8=.......,~=~=.I:I,7,I=+==+=====+=.
...,~=?8D878DO+~~~=?~=7++NNM8NNMD===+==.
.==~=~DD8DNNON8+7I=+?=78ND~:::~NNNI===~.
.==~~~NI:,,:I88+:~~~.~=8M?:,?,=+NOM7+=~.
.=~=~~8~,...,?8~~~~~.+=MDND:,I::8MMZ=+=.
.==~~~ZDD+.+7+$~~~~~,,~MM7:~=.,=DNNMI+=.
.~~~~:?:,~,,,7=:~::~~.+NMMI7:~:~NDNNN+=.
.~~==~:I,,,,~D::~:~~=.DMDNNI::O+NN8DN+=.
.====~=:,,,~N~:::~~==.N8NDDN+=~+DOODN$=.
.+====~D+,M~~,.,:====.ND8NNN+::=DD8OND=.
.??,++++I~,?+:,,:::~+87D8DNM:::~MNODND=.
.?=,.~=~,+,$+?+??++~IMNNNMNN$,,,MNNMNNOI
..$+~~=.:~,Z====++++7MMNMNM,~,,,MMNNNMM.

Whereas Sgt Pepper is hard to make out.

:::::::::::::::::::::::~~~~~~::::::,,,,,
:,:::::,~::==:::O7~~:~8~=~~=Z+~:I:::,,,.
~+.:+8~D7:::~+8??:==I:O?~$:~$ZI=7IO:,,~:
:I$=,N$=:8O7?7+I?=.8?78OI:~+I~7O$Z$O+?$:
,O$.~D8$~~Z+ZD7$$DOO=8ZZ?OI8?:D7=887Z8Z?
~87+Z$+:~8+,:OD8O~=NDD7OO$+Z8D:ZDDDD88Z?
$O88Z8?DDD?ZZN+ODI,D?88D887$7=D8D+D8O8Z=
,7ID=88II7IOZ7ZZZ7$:===$77~I:~88ZIIO88$?
~..DDDDDDDDOIN=7$7~?I?=OIZ$I:$Z7+?Z+88ZI
Z+,8DDDDDDD+D7O?8ZZ777?8Z8I==OD~8~D$7OZ?
,,D8DDDDNDDDO8Z=$Z8+IZDOZ+ZI+D?88+78O8~=
I?ZDDDDNDN8N=+O+7II$7?ZO+Z$7I8DDOZ?DDII=
=~DDDONNNNNDZZOZZI+?Z+87OOZOZ$8D888D8+II
+=D8NNDDDNDD8N+Z,?OZ~?~88ODOZ+ODDD88DI:?
~~DDDDD8DNDDDO~??:,+7~=DD8DNDDD8788:D,::
~=D8==8DDNNNDO=$=?O$~I:DDO$?DDDDDDD=?OZ+
OD8DZONNDNNDNNN=~7~Z+,DNN8?$DNZZ?DDDO8$7
7O=O8:D78DZNZZN77$7OD8?7$I7=?~D$O8O~~Z7I
=Z$O88D$N$O8N88ZZZ$ON8NDNNZO7ON$7OD8?Z$~
IZD8ZOD8O$OOD$8DDNZNN$DDNN$8ZDNDO$DDZ8OI
7ZDO8ODDO8DOD8N88$ZNDDNDNN8DD8N8$OOZ8O$?
IZDNNNNN8NNNNNNNMZ7N=+Z7O=O7ZI:~DD8DD8OI
Z8DND+$Z+OD+NNNDN8DNDNNNN++I+$=ZDDDDDD8?
ONNDDNO8=D8NNNNDDZD8~NNNONNNNNN8NDDDDDD$
8D88ZNNNOONNNNNDDD$ODNDD8ZMNNDN8D8NNDDDI

There are amazing tools like aview - but again, that's an extra program which the user might not have.

If your album directories are sensibly named, the first hit is usually good enough.

Hang on! There's a mistake!

Quite probably! This is a quick and dirty script. I'm sure there are lots of edge cases and (no doubt) some poor coding practices. If you wish to contribute a patch, please drop it in the comments.

9 thoughts on “Screenscraping Album Artwork From The Linux Command Line”

  1. I use beets (http://beets.radbox.org/) which does almost all the management of my 80+GB music library. It tags, fetches art, manages the directory structure and is totally customizable.

    Your script works well, but there's an error in the last line.

    1. I've started using beets as well. Bit of a pain to configure, but once it works it's incredible. I have found that it does miss some album art though - hence the script.

      Thanks for the correction - now amended.

  2. Albumart.org doesn't like it if you send a query that contains parentheses. For example: "Devo-Recombo_DNA_(disc_1_of_2_-_Sequence_A)" won't yield any results. Abcde names directories like this for multi-disc packages. An apparently decent way around this is to lop off the string submitted to albumart at the first paren encountered. Another enhancement is to refrain from writing a cover.jpg file when it didn't find a cover image. Here's a sloppy diff-ish thing of what I'm talking about:

    albumpath="$1"
    +title=`echo $albumpath | cut -d'(' -f1`
    ...
    -encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$albumpath")"
    +encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$title")"
    ...
    -echo "Searching for: [$1]"
    +echo "Searching for: [$title]"
    ...
    +if [ -z "$coverurl" ] ; then
    + echo "Unable to find cover art for $title."
    + exit
    +fi
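    [Editor's note: the cut trick from this diff is easy to verify in isolation - a quick sketch using the directory name from the comment above:]

```shell
# Truncate an abcde-style multi-disc directory name at the first '('
# so albumart.org gets a query it can actually match.
albumpath="Devo-Recombo_DNA_(disc_1_of_2_-_Sequence_A)"
title=$(echo "$albumpath" | cut -d'(' -f1)
echo "$title"   # Devo-Recombo_DNA_
```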

  3. A little different concept ...

    #!/bin/bash

    if [ $# != 1 ]
    then
    echo "Need Path for Search!"
    exit 1
    fi

    cd $1
    find . -maxdepth 2 -mindepth 2 -type d | while read -r dir
    do
    albumpath="$dir"

    # Escape any problematic character
    encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$albumpath")"

    # Skip if a cover.jpg exists in the directory
    if [ -f "$albumpath/cover.jpg" ]
    then
    echo "$albumpath/cover.jpg already exists"
    continue
    fi

    # Tell the user what is going on
    echo ""
    echo "Searching for: [$albumpath]"

    # scraping AlbumArt.org
    url="http://www.albumart.org/index.php?searchkey=$encoded&itempage=1&newsearch=1&searchindex=Music"
    echo "Searching ... [$url]"

    # Grab the first Amazon image without an underscore (usually the largest version)
    coverurl=`wget -qO - "$url" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg'`

    if [ "x" == "x$coverurl" ]
    then
    albumpath=`dirname "$albumpath"`
    echo "Retrying with '$albumpath'"
    encoded="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$albumpath")"
    url="http://www.albumart.org/index.php?searchkey=$encoded&itempage=1&newsearch=1&searchindex=Music"
    coverurl=`wget -qO - "$url" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg'`
    fi

    if [ "x" != "x$coverurl" ]
    then
    echo "Cover URL: [$coverurl]"
    # Save the imager
    wget "$coverurl" -O "$dir/cover.jpg" 2> /dev/null
    fi
    done

  4. I had to use the following URL format for it to work today:
    url="http://www.albumart.org/index.php?searchk=abba+gold&itempage=1&newsearch=1&searchindex=Music"

  5. And to get the above format with +'s in it, I found it useful to do this (since all my directory names use _ not spaces)

    GUTS OF MY SCRIPT
    ----------------------------
    albumpath="$1"

    # Split albumpath into artist and album
    artist_input="${albumpath%/*}"
    album_input="${albumpath##*/}"

    # I need to replace the underscores with a '+' character
    artist=${artist_input//[_]/+}
    album=${album_input//[_]/+}

    echo "artist: $artist"
    echo "album: $album"

    search_terms="$artist+$album"

    # Scrape AlbumArt.org
    url="http://www.albumart.org/index.php?searchk=$search_terms&itempage=1&newsearch=1&searchindex=Music"

    echo "Searching ... [$url]"
    coverurl=`wget -qO - "$url" | grep -m 1 -o 'http://ecx.images-amazon.com/images/I/*/[%0-9a-zA-Z.,-]*.jpg'`

    echo "Cover URL: [$coverurl]"

    # Save the image jpg file
    wget "$coverurl" -O "$DEST_DIR/cover.jpg"

    RUNNING IT
    ----------------
    And running it like this:

    GetCoverArt.sh james_morrison/undiscovered

    gives this output:
    artist: james+morrison
    album: undiscovered

    Searching for album art for: [james_morrison/undiscovered]
    Searching for: [james+morrison+undiscovered]
    Searching ... [http://www.albumart.org/index.php?searchk=james+morrison+undiscovered&itempage=1&newsearch=1&searchindex=Music]
    Cover URL: [http://ecx.images-amazon.com/images/I/51SAKEc0HEL.jpg]
    --2014-12-06 09:48:32-- http://ecx.images-amazon.com/images/I/51SAKEc0HEL.jpg
    Resolving ecx.images-amazon.com (ecx.images-amazon.com)... 54.230.198.142, 54.230.199.201, 54.230.199.90, ...
    Connecting to ecx.images-amazon.com (ecx.images-amazon.com)|54.230.198.142|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 35448 (35K) [image/jpeg]
    Saving to: ‘/home/music/flac/james_morrison/undiscovered/cover.jpg’

    100%[======================================>] 35,448 --.-K/s in 0.02s

    2014-12-06 09:48:32 (1.85 MB/s) - ‘/home/music/flac/james_morrison/undiscovered/cover.jpg’ saved [35448/35448]

    DO ALL MUSIC
    ---------------------
    And finally, to get cover art for ALL my music in one fell swoop, I ran this:

    for i in *; do for j in $i/*; do GetCoverArt.sh $j; done; done

  6. You may be interested to see that abcde now has the capability to download album art, I apologise for not using the approach that you suggested! The eventual successful patches came from the same thread on GoogleCode where your patch was suggested...

    Currently available only in the git version but it will go mainstream when 2.6.1 is released. The abcde FAQ in git has some detailed information on how to get it all working although sane defaults should guarantee a good result anyway. I am planning a web page with more detailed information, this will come out in a week or so...
