If you use transactional replication, I have no doubt that from time to time you are asked to explain why there is an increased latency between the publisher and the subscriber. More often than not, this end-to-end latency is caused by latency in distributing the commands from the distribution database to the subscriber. So I’ll focus on the distribution latency in this post.
If you are monitoring the latency of your transaction replication using SQL Server native alert system, you may get an alert such as the following:
SQL Server Alert System: 'Replication Warning: Transactional replication latency (PUBLISHER-publication-SUBSCRIBER-780)' occurred on \\DISTRIBUTOR
DATE/TIME: 6/4/2009 5:03:30 PM
DESCRIPTION: The SQL Server performance counter 'Dist:Delivery Latency' (instance 'PUBLISHER-publication-SUBSCRIBER-780') of object 'SQLServer:Replication Dist.' is now above the threshold of 10000.00 (the current value is 40160.00).
This tells you that the distribution agent, named PUBLISHER-publication-SUBSCRIBER-780, has crossed the latency alert threshold.
Now what?
Well, there can be more root causes for an increase in the distribution latency. A transaction volume increase is one of the most common causes. Hey, it justs longer to pump more commands through the system. So you want to quickly determine if that’s the cause or to rule it out quickly so that you can focus your attention on some other potential causes.
More specifically, you want to determine how many transactions, and most usefully, how many commands are being pumped into the distribution database for the distribution agent to apply to the subscriber. Note that a single update that modifies 100,000 rows—that is, one transaction with 100,000 commands—can easily send the distribution latency through the roof, at least, temporarily.
In order for you to determine if the transaction volume is the culprit and if the problem still persists, you need to be able to answer the following questions concerning the current state of replication and the state of replication just prior to the reported latency:
· What are the replication perfmon counter values related to the distribution agent?
· What are the undistributed vs. distributed command counts?
· What is the pending command count?
· What is the recent traffic pattern of the incoming transactions (e.g. transaction count by minute and transaction count to command count ratio over the past 20 minutes)?
· What is the recent traffic pattern of the incoming commands from the log reader (e.g. command count by minute by article over the past 20 minutes)?
· What are the to-be-distributed command counts by article?
· Is the distribution agent being blocked on the subscriber? Okay, this is not a volume issue, but nevertheless is an important piece of information to review.
You need a script to collect the stats if you want to get to the answers quickly, for instance, while your client is pestering you on the phone for an explanation.
The attached script accepts a distribution agent id—which you can get from the alert—as one of the parameters, and gives you the stats to determine if the latency is caused by a transaction/command volume surge.
A number of things to note about the script.
First, it is a Perl script. But that’s just because I wanted to streamline things a bit, and to be able to have better control on the output and the parameter passing. You can easily extract the SQL statements and put them into a T-SQL script with not much additional work.
Secondly, the script does accept parameters in addition to a distribution agent id. For instance, you can specify how far back in minutes you want the script to collect the stats on the incoming transaction and command traffic patterns. But other than a distribution agent id, all the parameters have default values that you can set in the script to match your environment.
Thirdly, it can be expensive to run some queries against the MSrepl_transactions and MSrepl_commands tables. So for instance, by default the script does not print the sample commands. You can control which query to run or not to run through the script parameters. Look up the comments in the script for details.
Finally, the script currently gets the stats primarily from the distributor. Although I have found this script to be very handy in many occasions, it can be significantly improved. One area I want to improve is to collect more stats from the publisher and the subscriber. For that, I’ll probably convert the script into a C# program because Perl is not good for multi threading and I do want to collect stats from the publisher, the distributor, and the subscriber in parallel from different threads.
If you use the script, I appreciate any feedback you may have.