How we organise and store thousands of UGC photos
Our community project, OpenBenches, is going really well. At the time of writing, we have 33,211 photos, taking up over 100GB.
Cameras and phones all have different ways of naming the photos they save. Some files are named with a datestamp - 2019-12-25_01.jpg. Others are sequential - photo_0001.jpg. Or they might have a system generated name - 7bba245908_k.jpg.
Storing all those photos in a single directory gives us a problem. What if two photos have the same name? Even if we split directories by username, or some other factor, we could still get clashes.
This is the solution we came up with:
- Take the hash of the photo. e.g. A9342C5C39E5AE5F0077AECC32C0F81811FB8193
- Rename the photo to the hash, add a .jpg extension.
- Move the file to the /photos/A/9/ directory. i.e. The first directory is the first letter of the hash, the sub-directory is the second letter of the hash.
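The steps above can be sketched in a few lines of Python. This is only an illustration, assuming SHA1 as the hash and upper-case hex for the file name. The function names hashed_path and store_photo are hypothetical and are not taken from the OpenBenches code.

```python
import hashlib
import shutil
from pathlib import Path


def hashed_path(photo_bytes: bytes, root: Path) -> Path:
    """Return root/<1st char>/<2nd char>/<HASH>.jpg for the given photo bytes."""
    # Assumption: SHA1, written as upper-case hexadecimal.
    digest = hashlib.sha1(photo_bytes).hexdigest().upper()
    return root / digest[0] / digest[1] / f"{digest}.jpg"


def store_photo(source: Path, root: Path) -> Path:
    """Move an uploaded file into its hash-derived location and return the new path."""
    destination = hashed_path(source.read_bytes(), root)
    # Create /photos/A/9/ (or whichever pair applies) if it doesn't exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(destination))
    return destination
```

Because the name is derived from the file's contents, the camera's original file name never matters.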
Why do this?
Naming files by their hash takes care of clashes. There's also a practical reason to split files into sub-directories. What's the maximum number of files you can put in a directory?
For FAT32, it's 65,534, and fewer if the files have long names. We're about halfway there! OK, let's hope I never have to move this code to an ancient Windows box!
Linux filesystems like ext2 have an apparent limit of 31,998 files per directory - but suffer performance issues over 10,000.
Even a default ext3 partition seems to suffer after about 2,000 files per directory.
Applying dir_index seems to improve things, but still limits us to "easily" 200,000 files per directory.
I don't know what software this might run on in the future. But it seems obvious to me that splitting files into 16 primary subdirectories (one per hexadecimal character), each with 16 sub-directories of its own, reduces the risk of poor performance.
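To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes the hash is written in hexadecimal (so each level has 16 possible characters) and that photos spread evenly across directories, which a good hash should roughly deliver.

```python
HEX_CHARS = 16     # assumption: hex hashes, so each level is one of 0-9 or A-F
LEVELS = 2         # /photos/A/9/ uses two levels
photos = 33_211    # the figure quoted at the top of this post

leaf_dirs = HEX_CHARS ** LEVELS
per_dir = photos / leaf_dirs
print(f"{leaf_dirs} leaf directories, about {per_dir:.0f} photos in each")

# How many photos in total before an average directory reaches a given size?
for limit in (2_000, 10_000):
    total = limit * leaf_dirs
    print(f"{limit:,} per directory is reached at roughly {total:,} photos")
```

On those assumptions, each directory currently holds around 130 photos, and the 2,000-file mark would not be reached until the collection passes half a million.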
Or, perhaps we won't get billions of images of benches submitted, and this is the ultimate word in premature optimisation.
Is SHA1 the best hash?
Probably. SHA1 is no longer considered cryptographically secure. A well funded adversary could create two different files which share the same SHA1 hash.
But I don't care. Our system just rejects any file with an existing hash already in the system. So the risk of attack is low. The chance of two legitimate photos having an identical SHA1 hash is minuscule.
And SHA1 is quick and efficient.
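The "reject anything we've already got" check might look something like this. It is a sketch, not the real implementation: sha1_of_file and is_new_photo are hypothetical names, and it assumes stored files follow the /photos/A/9/ layout described earlier.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large photo is never loaded into memory whole."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest().upper()


def is_new_photo(upload: Path, root: Path) -> bool:
    """Return False when a file with the same hash is already stored under root."""
    name = sha1_of_file(upload)
    return not (root / name[0] / name[1] / f"{name}.jpg").exists()
```

Since the hash is also the file name, spotting a duplicate is a single existence check rather than a search.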
A word about metadata
We were originally planning to store the data in date-based directories. Or perhaps username directories. We even considered a separate directory structure depending on camera manufacturer.
In the end, it was easier to store the metadata in a database rather than relying on the vagaries of directory names.
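As an illustration of the idea, here is a tiny sketch using SQLite from Python. The table and column names are invented for this example, as are the uploader and camera values. The real schema is not shown in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for illustration
conn.execute(
    """
    CREATE TABLE photos (
        sha1        TEXT PRIMARY KEY,  -- also the file name on disk
        uploader    TEXT,
        taken_at    TEXT,              -- ISO 8601 date string
        camera_make TEXT
    )
    """
)
conn.execute(
    "INSERT INTO photos VALUES (?, ?, ?, ?)",
    ("A9342C5C39E5AE5F0077AECC32C0F81811FB8193", "example_user", "2019-12-25", "ExampleCam"),
)

# Questions a directory layout would make awkward become simple queries.
rows = conn.execute(
    "SELECT sha1 FROM photos WHERE camera_make = ?", ("ExampleCam",)
).fetchall()
```

Date, username and camera manufacturer all become columns, so none of them needs to be baked into the path.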
The Future
I don't know how many photos we'll end up with. There are lots of benches.
This storage strategy means that I could mount a separate disk for each directory. In the future, when every bench in the world has a dozen photos, we could match storage needs to directory layout.
I'm sure there are better ways of organising large amounts of UGC - and I'd love to hear about them.