Using Soundex to find Duplicate Database Entries

by @edent | # # #

Our community website - OpenBenches - has over seventeen thousand crowd-sourced entries. The nature of user-generated content is that there are bound to be duplicates. Especially around popular walking routes. Here's how I culled around 200 duplicates using the awesome power of SOUNDEX!

Soundex is a clever algorithm for reducing a string of characters into a string which roughly represents its pronunciation. The "roughly" is key here. We could just search the database for identical entries, but users often mistype entries. So comparing sounds is a good way to gloss over those mistakes.

This gives us the Soundex of every inscription in the database:

SELECT soundex(`inscription`) FROM `benches` WHERE 1 

Which results in:

|soundex(`inscription`)                               | 
|=====================================================|
|D531323161545613216323                               |
|J51616565324126143216316531414                       |
|D232624132635253263532154                            |
|H6432534215361312345252342525321625342323153423163...| 
|A2316215635                                          |

Effectively, it's like a fuzzy hash for text.

So, here's how to get a list of all the Soundexs which have duplicates:

SELECT `inscription`, SOUNDEX(`inscription`), COUNT(SOUNDEX(`inscription`))
    FROM    `benches`
    WHERE `published` = 1
    GROUP BY SOUNDEX(`inscription`)
    HAVING COUNT(SOUNDEX(`inscription`)) > 1
    LIMIT 0 , 1024"

Which I can then use to produce a list of duplicates:
List of duplicates on a website.
And I can then get a list of all benches with a specific Soundex:

SELECT `benchID`, `inscription`, `address`
    FROM    `benches`
    WHERE  SOUNDEX(`inscription`) = "A123"
    AND  `published` = 1

Benches on a  website. One is called "Bertie" the other "Bert".

In this case, I can see that they're similar but not identical - so I don't need to merge them.

I started off with a thousand possible duplicates and, after going through each of them, merged a couple of hundred dupes. That was a fun weekend!

Other Strategies

There were a couple of other things I thought of, and then discarded, to deal with dupes.

Ask the user on upload

This is probably the simplest. If a new upload has a similar Soundex to a nearby bench, ask the user if it a duplicate.
But, actually, that's fraught with complexity. It puts a lot of pressure on a new user to get it right. And, frankly, we're a tiny community which needs all the users it can get. So I decided to put the pressure back on me, the admin.

There are quite a few benches near each other which have identical inscriptions. So asking a user to check which bench they meant could also be complicated.

Which leads on to...

Check proximity

There are lots of benches near each other. That's understandable. But calculating how near any two benches are is a bit more complex. They're stored in the database with separate latitude and longitude values. So, we can use the Haversine formula to find all benches within, say, 250 metres of position 1.23,4.56:

SELECT
(
    6371 * ACOS(COS(RADIANS(1.23)) *
    COS(RADIANS(`latitude`)) *
    COS(RADIANS(`longitude`) -
    RADIANS(4.56)) +
    SIN(RADIANS(1.23)) *
    SIN(RADIANS(`latitude`)))
)
AS distance, benchID, latitude, longitude, inscription, published
FROM benches
WHERE published = true AND present = true
HAVING distance < .25
ORDER BY distance

That's fine - but when combined with Soundex, it becomes much more complex. Effectively you then have to group similar sounds by geographic closeness. And, frankly, I really can't be bothered!

If you'd like to help out, the code to OpenBenches is open source.

Leave a Reply

Your email address will not be published. Required fields are marked *