Using Soundex to find Duplicate Database Entries
Our community website - OpenBenches - has over seventeen thousand crowd-sourced entries. The nature of user-generated content is that there are bound to be duplicates. Especially around popular walking routes. Here's how I culled around 200 duplicates using the awesome power of SOUNDEX!
Soundex is a clever algorithm for reducing a string of characters into a string which roughly represents its pronunciation. The "roughly" is key here. We could just search the database for identical entries, but users often mistype entries. So comparing sounds is a good way to gloss over those mistakes.
This gives us the Soundex of every inscription in the database:
SQLSELECT soundex(`inscription`) FROM `benches` WHERE 1
Which results in:
|soundex(`inscription`) |
|=====================================================|
|D531323161545613216323 |
|J51616565324126143216316531414 |
|D232624132635253263532154 |
|H6432534215361312345252342525321625342323153423163...|
|A2316215635 |
Effectively, it's like a fuzzy hash for text.
So, here's how to get a list of all the Soundexs which have duplicates:
SQLSELECT `inscription`, SOUNDEX(`inscription`), COUNT(SOUNDEX(`inscription`))
FROM `benches`
WHERE `published` = 1
GROUP BY SOUNDEX(`inscription`)
HAVING COUNT(SOUNDEX(`inscription`)) > 1
LIMIT 0 , 1024"
Which I can then use to produce a list of duplicates:
And I can then get a list of all benches with a specific Soundex:
SQLSELECT `benchID`, `inscription`, `address`
FROM `benches`
WHERE SOUNDEX(`inscription`) = "A123"
AND `published` = 1
In this case, I can see that they're similar but not identical - so I don't need to merge them.
I started off with a thousand possible duplicates and, after going through each of them, merged a couple of hundred dupes. That was a fun weekend!
Other Strategies
There were a couple of other things I thought of, and then discarded, to deal with dupes.
Ask the user on upload
This is probably the simplest. If a new upload has a similar Soundex to a nearby bench, ask the user if it a duplicate. But, actually, that's fraught with complexity. It puts a lot of pressure on a new user to get it right. And, frankly, we're a tiny community which needs all the users it can get. So I decided to put the pressure back on me, the admin.
There are quite a few benches near each other which have identical inscriptions. So asking a user to check which bench they meant could also be complicated.
Which leads on to...
Check proximity
There are lots of benches near each other. That's understandable. But calculating how near any two benches are is a bit more complex. They're stored in the database with separate latitude and longitude values. So, we can use the Haversine formula to find all benches within, say, 250 metres of position 1.23,4.56
:
SQLSELECT
(
6371 * ACOS(COS(RADIANS(1.23)) *
COS(RADIANS(`latitude`)) *
COS(RADIANS(`longitude`) -
RADIANS(4.56)) +
SIN(RADIANS(1.23)) *
SIN(RADIANS(`latitude`)))
)
AS distance, benchID, latitude, longitude, inscription, published
FROM benches
WHERE published = TRUE AND present = TRUE
HAVING distance < .25
ORDER BY distance
That's fine - but when combined with Soundex, it becomes much more complex. Effectively you then have to group similar sounds by geographic closeness. And, frankly, I really can't be bothered!
If you'd like to help out, the code to OpenBenches is open source.
Thomas Steiner :chrome: said on toot.cafe:
@Edent @openbenches.org That's neat. I wasn't aware that this is a native SQL function. (You may want to mention this in your blog post.) Here's the MySQL docs for it, maybe saving someone else a Web search: https://dev.mysql.com/doc/refman/8.4/en/string-functions.html#function_soundex.
MySQL :: MySQL 8.4 Reference Manual :: 14.8 String Functions and Operators
dusoft says:
These three functions might be also useful: https://www.php.net/manual/en/function.metaphone.php https://www.php.net/levenshtein https://www.php.net/manual/en/function.similar-text.php
I was experimenting with them at some point (along soundex) to match up old and new URLs to automatically create redirects.