When designing a database, I have a standard of always creating a clustered primary key in each table of type INT IDENTITY(1,1) NOT NULL
.
CREATE TABLE Cars( ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Cars PRIMARY KEY CLUSTERED, Brand NVARCHAR(20) NOT NULL, RegistrationNumber NVARCHAR(10) NOT NULL, MadeIn NVARCHAR(20) NOT NULL ) |
It is simple to create and for 90% of all tables this will give good performance and a small storage footprint. Those 9% of tables that are link tables for many-to-many relationships get a composite primary key, made up of the primary keys from the two related tables. Two simple rules handles 99% of the cases for primary keys. In this post I will go through some considerations for primary keys and show why an INT IDENTITY(1,1) NOT NULL
field almost always is a good choice for a clustered primary key. The examples and technical details in this article are written for MS SQL Server, but the concepts should apply to other database engines as well.
Clustered Index
Before looking into the different considerations for choosing the primary key I would like to briefly go through the concept of clustered and nonclustered indexes. Inside the database, each table is stored like an excel document, with several sheets consisting of rows and columns. The first sheet is the table itself, with all the columns. The sorting of the table itself is the clustered key. The other sheets are secondary indexes or keys, which are ordered by something else. For the Cars table we could add a key on the registration nunmber.
CREATE UNIQUE INDEX UK_Cars_RegistrationNumber ON Cars (RegistrationNumber) |
This will create an additional “excel sheet” ordered by the registration number. Not all columns of the main table are present in the index “sheet”. The key columns are duplicated into the index and then there is a reference back to the row in the main table for retrieving the rest of the data. More on that reference later.
There is also the case where there is no sorting at all to the table. With no sorting, there is no clustered index and the table is referred to as a heap. In a heap there is no way to make a direct reference to a single row, all access to the table requires searching through all rows. Such a search is of course slow.
Unique Clustered Index
When there is a clustered index on a table SQL server requires that it is unique. Remember the secondary index on the cars table having a reference back to the main table? That reference is made up of the clustered index key. If the clustered index is not unique, that wouldn’t work. If a non-unique clustered index is created SQL Server adds another hidden 4-byte column to the table to make each value unique. Since the extra column will be there anyway, we might as well create it ourselves to be able to access it and use it to uniquely identify rows in the table. We can use it as a primary key.
Non Changing Primary Keys
So far we have covered that the we should have a primary key in each table and that a primary key is unique. Now it is time to look into the properties of a good primary key. Remember that the table itself is sorted on the primary key. What if the primary key of a row changes values? The row have to be moved. The UPDATE
statement effectively turns into a DELETE
at the old location followed by an INSERT
in the new location. The more data there is in each row the slower it will be. With a non-changing primary key that is avoided.
Never a Business Key
The non-changing primary keys rule has a related rule: Never use a business key as primary key. A business key is something that has some kind of meaning to the or user of the system. An invoice number is an example of a business key. It is required to be unique so it fulfils the unique requirement, but there is another problem. The rules of the business key is owned by the business. The rules can be changed. Changing business rules often requires system redesign, but you don’t want to have to redesign fundamental principles of your database just because of a business rule change. The rules of the primary key should be subject to technical considerations only.
For the invoice example, this requires a separate invoice number column in the database which is fine. That number might be the only one ever displayed to the user which is also fine. To guarantee uniqueness a unique index can be added to the invoice number column. The primary key field can be completely hidden from the user interface.
Small Data Type
The data type of a primary key should be kept small. We have left the times where each single bit costed a fortune to store, but still storage space is valuable. It is not only an economic value, but also a performance value. Having a smaller footprint on each row means that more rows can be read together and fit in memory together. The primary key is not only stored in the table itself, it is also stored once more in all indexes and in all tables having an foreign key into the table. Generally a 4-byte integer is small enough, but in some cases like dimensions in a data warehouse it makes sense to use the 2-byte smallint or 1-byte tinyint if there is a limited number of rows in the table.
Ever Increasing Key Reduces Fragmentation
When a new row is added to a database table it has to be put in the right place in the clustered index. If the keys are generated in random order, the right place might be in the middle of the table. If the page that stores that part of the table is full, it will have to be splitted into two pages, which are each filled to 50%. The splitting itself is an expensive operation and the nonfilled pages will vaste space. With an ever increasing key generation algorithm the new row is always added to the end of the table, leaving behind a 100% filled pages. The built in IDENTITY(1,1)
construct creates ever increasing values automatically.
Many to Many Relationships
One special case where I don’t follow my own rule is link tables for many to many relationships. We might want to link our car to those people frequently driving it. Each person can be a frequent driver of several cars and each car can have several frequent drivers.
CREATE TABLE FrequentDrivers ( CarID INT NOT NULL CONSTRAINT FK_FrequentDrivers_Car REFERENCES Cars (ID), PersonID INT NOT NULL CONSTRAINT FK_FrequentDrivers_Person REFERENCES Persons (ID), CONSTRAINT PK_FrequentDrivers PRIMARY KEY CLUSTERED (CarID, PersonID) ) |
This table requires a unique key on the CarID, PersonID combination anyways, so we might as well use it as the primary key.
A special Note on GUIDs
After having presented all the arguments for a INT IDENTITY(1,1) NOT NULL
clustered primary key, I would like to present some arguments against using GUIDs as primary keys. From the criteria above, GUIDs fail on:
- Small Data Type
- Ever Increasing Key Reduces Fragmentation.
- Never a Business Key (an identity column is unique within the database, so a GUID has to be some business requirement).
If there are requirements on GUIDs for replication, go a ahead and add a GUID column to the table. You might even use that column as the target for foreign keys referencing the table, but don’t use it as the primary key.
Conclusion
For 99% of the cases, a clustered primary key of type INT IDENTITY(1,1) NOT NULL
will do well. It handles all the basic requirements of a primary key well. Take it as a habit to start with and change to something else if there are specific reasons to do so, but only after having properly considered all the implications.