In a recent post, I mentioned a customer that takes some perverse pleasure in undermining point-in-time-recoverability of their databases. Here are some more details on the bad practice and what I wish they would do instead.
The customer has a two-node failover cluster with two instances of SQL Server 2005. Each clustered instance has dozens of databases, nearly all of them with identical schema (they have a Data-Dependent Routing/shard architecture), and each is in the full recovery model. They don’t have enough disk space allocated for the instances and, much too frequently, they encounter 9002 errors from SQL Server (when one or more of the transaction log files fill up and don’t have room to grow):
Msg 9002, Level 17, State 1, Line 1
The transaction log for database '____' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases
The customer reacts by looping through all of their databases and running BACKUP LOG … WITH NO_LOG against each. Let’s walk through the steps that they should be taking, instead of this knee-jerk overreaction.
The first problem is that, despite the hint in the 9002 error message, the customer doesn’t use the [log_reuse_wait_desc] column of the [sys].[databases] catalog view to determine which databases need attention. Instead, they resort to shotgun troubleshooting and truncate the transaction logs of all databases.
Next, the customer should perform appropriate capacity planning and provision enough disk space for their databases (now and in the future).
Since the customer won’t acquire sufficient disk space, I have asked whether their databases really belong in the full recovery model. They expect point-in-time-recoverability, but their frequent and ill-advised reaction to 9002 errors trashes the databases’ backup log chains. It will be an upsetting day when they need to restore to a certain time and cannot do so. It would be better to review the Recovery Point Objective (RPO) for each database (or define it for the first time) and then place the majority of the databases in the simple recovery model, saving the resources and administrative overhead associated with transaction log backups. Differential backups could be used to reduce data loss to levels consistent with the RPOs (YMMV).
As bad as it is to rely on sabotaged point-in-time-recoverability, the bigger issue is that the customer will follow-up the transaction log truncations with a round of shrinking the transaction log files! I don’t have time right now to delve into the perils of shrinking transaction log files, so here are some great posts from others in the community:
If, despite careful planning and database administration, the transaction log for one of your databases fills up (and you receive a 9002 error), take a deep breath and simply back up the transaction log for that database. It takes slightly longer than throwing away all of the committed transaction log records (via BACKUP LOG … WITH NO_LOG/TRUNCATE_ONLY) but it also avoids the problems discussed in this post and the last one. If absolutely necessary, discard the committed transactions using the following process (which still works even after the wretched BACKUP LOG … WITH NO_LOG/TRUNCATE_ONLY functionality was removed from SQL Server):
-
Put the database in simple recovery model:
ALTER DATABASE [YourDatabase] SET RECOVERY SIMPLE;
-
Wait for the transaction log to automatically truncate… or force it:
USE [YourDatabase];
GO
CHECKPOINT;
GO
-
If the database truly belongs in full recovery model:
-
Change the recovery model back:
ALTER DATABASE [YourDatabase] SET RECOVERY FULL;
- Immediately perform a full backup of the database to reestablish the backup chain
Note: While performing research for this post, I came across a similar one from Brent Ozar (blog, Twitter). Despite the overlap, I decided to publish anyway, hoping that we rank higher in search engine results than all of the bad Internet advice that advocates forcibly clearing the transaction log!