The number can have leading zeros (e.g. 00123).
What data type do you use?
I am very interested in your opinion on this.
Imagine that you work at a company which sells widgets. Each widget has a unique serial number. The number is a fixed length, and can contain leading zeros. That is, the following are all valid identifiers:
00001
01010
12345
What data type would you use to store these data in a database?
This is one of those strong opinions, weakly held. I'm not sure there's a right answer to it. A quick survey on Twitter0 was inconclusive:
Here, I attempt to provide some of the opinions on both sides of the argument. Feel free to supply your own arguments.
Can you do maths on it? If not, it isn't a number. It's a text string that happens to only contain numbers.
For example, asking "what is the average of these three serial numbers?" is a meaningless question. "What is this serial number divided by two?" again - that has an answer with no semantic content.
What if the serial has some semantic content in it? For example even numbers were made in one factory, odds in another. Or prime numbers are demonstration units. It might be easier to grab information from a number type. But it's probably better to pull that data into separate fields.
If this field was for the item's price, it's is always going to be a number. Sure, it might be a float
or an int
- but it'll be a number. Same if this was the item's height, width, or weight. Those values can only be numbers.
But there's nothing inherently "numbery" about a serial number. At any point, the boss could recommend adding Greek letters to it.
Given we can't do maths on it, and the "number" doesn't have any semantic content, there's no need to artificially restrict our database to a "number" type.
The same is broadly true of any constraint. The maximum length could change. The validation rules might be updated. Modern databases need to be able to cope with changes in business requirements.
I don't know of any number type which stores leading zeros. Please enlighten me if I'm wrong. That means any data retrieved from the database has to be formatted before displaying it to the user.
That also means that searching the database requires data to be pre-formatted.
All data should go through validation and sanitation before being displayed to the end user. Numbers aren't special in that regard. Similarly, data should be carefully checked before being searched for.
We all know how Microsoft Excel will look at any number and try to interpret it as a date. It also has a tendency to strip leading zeros. And some formatters will automatically add thousands separators to any number they see. Keeping the data as text reduces the risk of this happening.
Excel will mangle anything that looks even vaguely like a number. Storing as text does not guarantee that it will be interpreted as text.
OK - I think that's the majority of the argument for not treating this as a number. What are the arguments on the other side?
If we allow text in this field, what happens if someone types a letter O
rather than a number 0
? Having this be an int
prevents these errors creeping in. Yes, have checks at the front end, but this provides defence in depth.
As above. Data should be checked by the application before submitting to the database. The database should be checking the data before storing it.
Suppose we want to get all widgets with a serial number >=
a specific ID. A number type is much better than string for those sorts of operations.
This again assumes that the IDs have some semantic content. For example, 00012
was manufactured after 00008
. This may or may not be the case. Such operations are best performed on a field like "Manufactured Date".
It is quicker and computationally cheaper to store integers rather than text. Searching is faster, disk space requirements are lower.
It isn't the 1970s. We're not paying per bit. Unless we're storing billions of rows, or working on constrained hardware, this isn't a practical concern for most users.
It looks like a number. A human, on encountering one of these IDs is going to assume it is a number. Anything which makes it harder for humans to understand is going to cause problems.
Things like phone numbers look like integers, but they aren't. Momentary human confusion is preferable to mangled or imprecise data.
I lean to the side that says that this is a string with specific constraints. Namely /^[0-9]{5}$/
(please supply your own regex meme).
When a user enters a new serial number, it should be checked that it meets the constraints. If it doesn't, refuse to submit it until the user has formatted it correctly. When submitted, the database should check against the constraints, and refuse to accept non-matching strings.
The speed of searching and sorting is not meaningfully degraded by storing as a string.
But, I'm very aware that I could be wrong. This is my strong opinion, but if you can supply a better argument, I will drop my weak grip on it.
Which has obvious limitations! ↩︎
William Brooks is immunologically enhanced said on twitter.com:
Numbers with leading zeros aren't numbers. They are strings.
If a data element must be a string limited to numeric characters, then front-end validation (and, less ideally, back-end validation/quality rules) can suffice for enforcement, just like any other string formatting.
bigattichouse said on twitter.com:
chef's kiss I generally use hashed uuids for unique keys, for serial I like the int + validate on the front end (or really, take advantage of autoincrement) and then just say "we format the output with leading zeros". Interesting analysis of most arguments. Thanks for posting.
Tim Haynes said on twitter.com:
You should consider presentation.
Store in the lowest common denominator, ie number, potentially with other fields for letters etc, and define output formatting functions for various uses.
The alternative is parsing strings every time you do the maths, which is anathema.
Alison W (♿️⚾☕🍣♀️🏳️🌈🛠️🇪🇺🇬🇧🇯🇴🌊🛤️🎹) said on twitter.com:
If it is a "number" then I don't, leading zeroes are presentational. If it's a reference (SKU etc) then as alpha.
Richard says:
Can you expand on why sorting is easier with a numeric field than a text field? Assuming an agreed order of alphanumeric characters and symbols, and good string functions, isn't sorting pretty similar? The number of different characters possible shouldn't make a difference because we are talking about an arbitrarily long field.
@edent says:
For some languages, that is true. But sorting Chinese numbers is different because their order in Unicode is based on stroke-order rather than semantic order.
Alphabetic text is (slightly) slower to sorting than integers. Although that's not a major concern unless you have billions of rows.
Which means that retrieving all serial numbers between 00123 - 09876 is slightly more complex.
Like with most of the points here, it's fairly minor.
Keith كيث कीथ キース said on twitter.com:
Chaotic Evil: Fixed-width binary coded decimal.
TBH I'd probably just use a string.
Eric Andersen says:
I'm a proponent of serial numbers being text fields. Can you do math on it, or would you want to, should be the guiding principle. I use that guidance for my personal spreadsheet development.
Merton says:
It has been a VERY long time since I had to deal with such things AND I'm not a coder or in any way techie. I solved this problem in 3rd party software, things like customer number, and other "numbers"I had control over.
To solve problem of leading zeros, I just started the numbers with 10000, for example. This first customer number was 10000, the next 10001, 10002, etc. Non-important byproduct was that you always knew how many customers you had in the database!
Not very elegant but it solved the problem. Being unable to use numbers less than 10000 was never a poblem - always lost of permutations available.
When you are not a coders or techie you have to get clever/sneaky.
Daniel says:
Whatever people are doing, they’re doing it wrong. My postal code is a four digit number with a leading zero. Quite a few online forms and stores explodes when I enter it. Even if they allow me to enter it, the shipping label is often wrong. The leading zero is often printed as a binary blob of nonsense, a Unicode replacement character (U+FFFD), or another weird Unicode symbol. Or the zero gets unceremoniously moved from the start to the end of the postal code. It’s not uncommon for the zero to be dropped and the other three other digits to be repeated, creating a new non-existing six digit postal code. I almost forgot that once a store insisted I enter my postal code to confirm my login, but it refused to let me login. I eventually guessed that the Unicode replacement character (U+FFFD) followed by the remaining three digits would let me login. (My last order had contained that character on the shipping label.) Programmers suck at planning for and handling the leading zeros.
Philip Bragg said on twitter.com:
I have designed a few serial number generators that are still in use. They store the padded number as a string, actually in all cases they also store some other characters. The generating integer is also stored in a separate column. I'd always stick with text for a serial number.
Deeznuts said on bitcoinhackers.org:
@Edent In Postgres I would make a custom type... maybe?