Our community website - OpenBenches - has over seventeen thousand crowd-sourced entries. The nature of user-generated content is that there are bound to be duplicates. Especially around popular walking routes. Here's how I culled around 200 duplicates using the awesome power of SOUNDEX!
Soundex is a clever algorithm for reducing a string of characters into a string which roughly represents its pronunciation. The "roughly" is key here. We could just search the database for identical entries, but users often mistype entries. So comparing sounds is a good way to gloss over those mistakes.
This gives us the Soundex of every inscription in the database:
SELECT soundex(`inscription`) FROM `benches` WHERE 1
Which results in:
|soundex(`inscription`) | |=====================================================| |D531323161545613216323 | |J51616565324126143216316531414 | |D232624132635253263532154 | |H6432534215361312345252342525321625342323153423163...| |A2316215635 |
Effectively, it's like a fuzzy hash for text.
So, here's how to get a list of all the Soundexs which have duplicates:
SELECT `inscription`, SOUNDEX(`inscription`), COUNT(SOUNDEX(`inscription`)) FROM `benches` WHERE `published` = 1 GROUP BY SOUNDEX(`inscription`) HAVING COUNT(SOUNDEX(`inscription`)) > 1 LIMIT 0 , 1024"
Which I can then use to produce a list of duplicates:
And I can then get a list of all benches with a specific Soundex:
SELECT `benchID`, `inscription`, `address` FROM `benches` WHERE SOUNDEX(`inscription`) = "A123" AND `published` = 1
In this case, I can see that they're similar but not identical - so I don't need to merge them.
I started off with a thousand possible duplicates and, after going through each of them, merged a couple of hundred dupes. That was a fun weekend!
There were a couple of other things I thought of, and then discarded, to deal with dupes.
This is probably the simplest. If a new upload has a similar Soundex to a nearby bench, ask the user if it a duplicate.
But, actually, that's fraught with complexity. It puts a lot of pressure on a new user to get it right. And, frankly, we're a tiny community which needs all the users it can get. So I decided to put the pressure back on me, the admin.
There are quite a few benches near each other which have identical inscriptions. So asking a user to check which bench they meant could also be complicated.
Which leads on to...
There are lots of benches near each other. That's understandable. But calculating how near any two benches are is a bit more complex. They're stored in the database with separate latitude and longitude values. So, we can use the Haversine formula to find all benches within, say, 250 metres of position
SELECT ( 6371 * ACOS(COS(RADIANS(1.23)) * COS(RADIANS(`latitude`)) * COS(RADIANS(`longitude`) - RADIANS(4.56)) + SIN(RADIANS(1.23)) * SIN(RADIANS(`latitude`))) ) AS distance, benchID, latitude, longitude, inscription, published FROM benches WHERE published = true AND present = true HAVING distance < .25 ORDER BY distance
That's fine - but when combined with Soundex, it becomes much more complex. Effectively you then have to group similar sounds by geographic closeness. And, frankly, I really can't be bothered!
If you'd like to help out, the code to OpenBenches is open source.