For performance reasons, clustered indexes are necessary on almost all tables. We assume that the reader is aware of this. This article isn’t going explain the why’s and wherefore’s of that statement. We just want to focus on how to choose the best columns(s) for the clustered index.
The right choice of the clustering key for your table is a critical element in creating a high-performance database. Many times, too little thought is put into the selection of the clustering key at design time, which can lead to performance problems as the database grows in size and number of users.
The columns included in the clustered index are often referred to as the ‘clustering key’. The clustered index is called just that, the ‘clustered index’. The important difference is that ‘clustered index’ refers to a single specific index. ‘Clustering Key’ refers to a column or set of columns that are automatically added to all nonclustered indexes on the table. We will use both terms in this article. We also use the notation ‘column(s)’ to mean ‘one or more columns’.
Considerations for Choosing the Clustering Key
Despite the existence of ‘Best Practices’, there are no absolute rules for choosing the clustered index for a table. However, there are a few primary considerations to take into account when you design a table:
The clustering key should be sequentially increasing. Sequentially increasing values prevent page splitting which can be a performance issue and can cause rapid fragmentation of the index.
The clustering key should be static, i.e. once the row is inserted, the value of the clustering key column(s) should not change. Changing values in the clustering key can cause page splitting and rapid fragmentation.
The clustering key should be small. It becomes a part of every nonclustered index created on the table. A large clustering key can cause all indexes to be larger and less efficient.
The column or combination of columns in the clustering key should be unique. If the clustered index is not naturally unique, SQL server will add a 4 byte integer to the key to make it unique. This will increase the size of the nonclustered indexes on the table as well as the size of the clustered index. (more about this later)
One Size Does NOT fit all
Notice that I use the word ‘should’ in all of the considerations mentioned above. Any or all of these considerations can be overridden in specific cases. To select the best clustering key it is necessary to understand your schema and how your tables will be accessed through your SQL code. Many times you will be faced with a judgment call, balancing potential performance benefits against the maintenance cost of a clustered index that does not meet all the considerations discussed above.
The first thing to understand about best practices is that there aren’t any. There are always cases where the ‘best practice’ is not the best choice. A simple set of rules can’t possibly cover the multitude of complexities involved in most decisions regarding SQL Server. A Best Practice is a safe choice for those who do not have the background or training to make correct decisions in this matter. If you are reading this article, presumably you want to be someone who does not have to rely on best practices and who knows how to make the correct decisions based on their understanding of the principals involved.
The best practice promulgated by Microsoft in the case of clustered indexes is to create a clustered primary key on a sequentially increasing identity column. This, in fact, is not a bad idea in many cases. It creates a clustered index on a small, sequentially increasing, unique, static value and makes it the primary key. That takes care of all the considerations we mentioned above with a few clicks of your mouse. What’s not to like?
The problem is that this cookie-cutter approach cannot take into consideration circumstances that might affect performance dramatically.
To take a very common example, you might know that your huge Orders table will be accessed primarily by queries returning many rows and the query will be filtered on the sequentially increasing OrderDate column, e.g. (e.g. Where OrderDate > 9/1/2011). This may be a case where the clustered index should be focused on the OrderDate column rather than an identity value because of the vastly better performance of range queries on a clustered index. It might be best to create the clustered index on the OrderDate column plus another column that would make the index unique.
Why a Second Column?
Even though you can create a non-unique clustered index on a non-unique column, a clustered index must be internally unique. A date field is not usually unique. Theoretically at least, multiple records could be created at exactly the same time. If you create a clustered index on a non-unique column, SQL Server adds a hidden 4 byte integer column called a ‘uniquifier’ to make the index internally unique but it does not mark the index itself as unique.
If there is no other small column that provides a uniqueness to the date column, it can be useful to create a sequentially increasing identity field. This will allow you to create the clustered index on the orderdate column and the identity column using the UNIQUE attribute. And that will allow the query optimizer to make better decisions about execution plans.
In this case it is very important that the date field be the FIRST column in the index definition and the identity be the second. If you don’t remember why this is so, you might want to review this article: Four Indexing Myths
With the 4 byte integer added, the clustering key will be the same size as if the index were created as non-unique and SQL Server had created its own 4 byte ‘uniquifier’. So, the identity column is essentially free.
Primary Keys, Candidate Keys and Surrogate Keys
Another problem with the existing best practice in regard to clustered indexes is that its widespread usage has created a myth that the Primary Key index must be clustered. That, of course, is not true. In many cases it may be better to create a nonclustered Primary Key and save your clustered index for a column where it will provide better performance.
Yet another problem with the widespread use of the best practice is that it muddies up the concept of uniqueness with the concept of the primary key. The primary key is not necessarily the only unique column or set of colums in the table. Those who blindly follow best practices often do not realize that the identity column they create is often a surrogate for a naturally occurring unique column or columns that should exist in every table in a normalized schema. These are referred to as a candidate keys and there may be more than one of them in a table. It is usually important that the candidate keys also have a unique index. This will not only enforce data consistency but it will also give the optimizer useful information for creating the optimal execution plan.
Candidate keys are often a better place for the clustered index. Joins and filters are often made on the naturally occurring candidate key column(s) rather than the Primary Key declared on an identity value.
The OrderDetail Table
For example we will take the case of the OrderDetail table. The business rule enforced by the schema is that a given invoice can have no more than one line item for the same product. This implies that the combination of invoice id and product id must be unique in the table. This combination of invoice and product fulfills the requirements for a Primary Key. However, we often see that the primary key has been declared as a clustered index on an identity column. Let’s call it OrderDetailID. In all likelihood the clustered index has been wasted on this column.
The OrderDetail table has two principal tables to which it joins, Order and Product. Your queries will join to those tables on the OrderID and the ProductID respectively because the clustered OrderDetailID column is completely irrelevant to the way these tables are accessed. In all likelihood, queries against this table will also be filtered on invoice id or product id or both. The clustered index on the identity field might be scanned for page lookups but it will not play much of a role in the overall performance when querying this table.
The choice of the clustering key is critical for best performance. No static set of rules can assess all the complexities involved in making the choice. There is no substitute for understanding not the rules, but the principles involved, and applying them using your human intelligence.
A Best Practice is a set of safe choices for those who do not understand the issues involved. Hopefully this article has given you a clearer view of those issues so that you can make the choice intelligently.