SQL Server Database Corruption

The data in your database is going to be corrupted someday. It is not a question of IF. It is just a question of WHEN.

Corruption is caused by errors and failures in your storage media and\or power supply. Hard disks, like all mechanical devices, will fail someday for some reason. Solid state storage is not mechanical, but it will also fail eventually. The most common cause of corruption for both mechanical and solid state media is a power failure during a write to disk.

How Does Data Get Corrupted?

In SQL Server a page of data is always 8 kilobytes. If a disk failure or power failure occurs when SQL Server is writing a page of data to disk, a failure called a “torn page” can result.

Take the case when SQL Server has written 3 kb of the 8kb page to disk when a sudden failure occurs. The page will have 3 kb of the new data and 5 kb of old data.

Losing 5 kb of data is bad enough, but perhaps worse, the linkage between this page and others may be corrupted and many other pages may be unreachable by processes following that page chain.

Protect Yourself from Data Loss Due to Corruption

Most data recovery plans anticipate events like a system hardware failure or a sql injection attack that destroys the data in a very noticeable way. Risk is managed by implementing a backup plan that guarantees data recoverability with an acceptable amount of data loss.

Sometimes these plans do not take into account undetected corruption of the data. While having the ability to recover your data is absolutely essential to a successful recovery from data corruption, sometimes it is not enough.

Corruption can go undetected for long periods of time when the corrupt data is not queried frequently. When it finally becomes a recognized problem, days, weeks or months may have passed since the incident that caused the corruption. Even if you have backup files that go back to when the corruption occurred, normally you can’t afford to lose that much data by restoring a “last known good” weeks-old database backup.

Having to tell the boss you lost the last six weeks of mission-critical data is what SQL guy Grant Fritchey calls an “RGE” or “Resume Generating Event”.

This is why you must check data consistency regularly using the built-in maintenance plan task “Check Database Integrity” or by executing “DBCC CHECKDB dbname” for your production databases. It is absolute essential to detect and deal with corruption as soon after it occurs as possible.

As well as running this command regularly, you also need to run it as soon as possible after a power failure, a power-off shutdown of the server, san or disk array or any hint of a glitch in your storage media.

Other Things That Can Help

Normally, a disk drive for database data has a small battery pack in the controller that will allow it to complete the page write correctly if external power is cut suddenly. Using a disk or array like this is a best practice but it is not foolproof because sometimes the disk failure is mechanical and unrelated to a power outage.

In versions of SQL Server later than SQL 2000, the database option ‘page_verify_option’ defaults to 2 (or CHECKSUM). This option should be left at the default setting. If the database was migrated from SQL 2000 or before, you might have to change the existing setting to CHECKSUM.

When the CHECKSUM setting is on, SQL Server will throw error 824 to let you know whenever it tries to access a corrupt page. Just as importantly, this setting will run a checksum on a page BEFORE SQL Server writes it to disk. This insures that the corruption cannot come from SQL Server.

Repairing Corruption

Fixing database corruption is an article in itself, so I will concentrate on the basic things you need to know.

There is no entirely satisfactory way to fix serious corruption. It is far better to detect corruption early and restore the “last known good” backup. However, if you have no choice, there is a method that might fix the corruption, but it comes with a lot of risk.

Repair Options

If you have not already done so, run DBCC CHECKDB WITH NO_INFOMSGS. It will return information about the corrupt pages as well as telling you what level of repair is possible on them. The two options are “repair_rebuild” and “repair_allow_data_loss“.

Before running either one of these repair options read the information returned by CHECKDB about the corrupt pages carefully. You may find that the corrupt pages belong to an index. You might fix these pages just by dropping and recreating the index. If so, you are very lucky. Thank your lucky stars and immediately go back and fix your disaster recovery plan so you will never have to think about repair options again.

If the checkdb will allow you to run the checkdb statement again with the “repair_rebuild” option you are also lucky because this possibility doesn’t come up very often. The “repair_rebuild” option will not delete data or otherwise endanger your database. It just shuffles a few pointers around.

However, if CHECKDB tells you the minimum repair level is “repair_allow_data_loss” you have some serious thinking to do. Using this option has been likened to taking a chainsaw to your database. It simply deallocates corrupt pages without any thought to what they might contain. It might well delete records from a parent table and leave the child records orphaned and essentially useless. It might deallocate sql server metadata. There is no guarantee that your database will be usable after running “repair_allow_data_loss“.

If there is absolutely no other choice and you have to run this command, I wish you good luck. If it works out OK,go fix your DR plan so you never have to take this gamble again.

Conclusion

Database corruption can destroy a database if it is not detected early when a recovery from the current backup is still practical. Make sure your disaster recovery plan includes regular data consistency checks. Pay attention to and follow up on error messages mentioning torn page or checksum errors. It is far better to recover from the “last known good” database backup than to have to gamble on repair_allow_data_loss.

Additional menu