THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Linchi Shea

Checking out SQL Server via empirical data points

Distribution latency in transactional replication: Is a volume surge the culprit?

If you use transactional replication, I have no doubt that from time to time you are asked to explain why there is an increased latency between the publisher and the subscriber. More often than not, this end-to-end latency is caused by latency in distributing the commands from the distribution database to the subscriber. So I’ll focus on the distribution latency in this post.

 

If you are monitoring the latency of your transaction replication using SQL Server native alert system, you may get an alert such as the following:

 

SQL Server Alert System: 'Replication Warning: Transactional replication latency (PUBLISHER-publication-SUBSCRIBER-780)' occurred on \\DISTRIBUTOR

 

DATE/TIME:  6/4/2009 5:03:30 PM

 

DESCRIPTION: The SQL Server performance counter 'Dist:Delivery Latency' (instance 'PUBLISHER-publication-SUBSCRIBER-780') of object 'SQLServer:Replication Dist.' is now above the threshold of 10000.00 (the current value is 40160.00).

 

This tells you that the distribution agent, named PUBLISHER-publication-SUBSCRIBER-780, has crossed the latency alert threshold.

 

Now what?

 

Well, there can be more root causes for an increase in the distribution latency. A transaction volume increase is one of the most common causes. Hey, it justs longer to pump more commands through the system. So you want to quickly determine if that’s the cause or to rule it out quickly so that you can focus your attention on some other potential causes.

 

More specifically, you want to determine how many transactions, and most usefully, how many commands are being pumped into the distribution database for the distribution agent to apply to the subscriber. Note that a single update that modifies 100,000 rows—that is, one transaction with 100,000 commands—can easily send the distribution latency through the roof, at least, temporarily.

 

In order for you to determine if the transaction volume is the culprit and if the problem still persists, you need to be able to answer the following questions concerning the current state of replication and the state of replication just prior to the reported latency:

 

·        What are the replication perfmon counter values related to the distribution agent?

·        What are the undistributed vs. distributed command counts?

·        What is the pending command count?

·        What is the recent traffic pattern of the incoming transactions (e.g. transaction count by minute and transaction count to command count ratio over the past 20 minutes)?

·        What is the recent traffic pattern of the incoming commands from the log reader (e.g. command count by minute by article over the past 20 minutes)?

·        What are the to-be-distributed command counts by article?

·        Is the distribution agent being blocked on the subscriber? Okay, this is not a volume issue, but nevertheless is an important piece of information to review.

 

You need a script to collect the stats if you want to get to the answers quickly, for instance, while your client is pestering you on the phone for an explanation.

 

The attached script accepts a distribution agent id—which you can get from the alert—as one of the parameters, and gives you the stats to determine if the latency is caused by a transaction/command volume surge.

 

A number of things to note about the script.

 

First, it is a Perl script. But that’s just because I wanted to streamline things a bit, and to be able to have better control on the output and the parameter passing. You can easily extract the SQL statements and put them into a T-SQL script with not much additional work.

 

Secondly, the script does accept parameters in addition to a distribution agent id. For instance, you can specify how far back in minutes you want the script to collect the stats on the incoming transaction and command traffic patterns. But other than a distribution agent id, all the parameters have default values that you can set in the script to match your environment.

 

Thirdly, it can be expensive to run some queries against the MSrepl_transactions and MSrepl_commands tables. So for instance, by default the script does not print the sample commands. You can control which query to run or not to run through the script parameters. Look up the comments in the script for details.

 

Finally, the script currently gets the stats primarily from the distributor. Although I have found this script to be very handy in many occasions, it can be significantly improved. One area I want to improve is to collect more stats from the publisher and the subscriber. For that, I’ll probably convert the script into a C# program because Perl is not good for multi threading and I do want to collect stats from the publisher, the distributor, and the subscriber in parallel from different threads.

 

If you use the script, I appreciate any feedback you may have.

Published Friday, June 05, 2009 12:25 PM by Linchi Shea
Filed under: ,

Attachment(s): replStats.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

noeldr said:

90% of the time is volume surge!

June 5, 2009 2:20 PM
 

Linchi Shea said:

Agreed. The issue then is what exactly is that volume surge. Is it on a particular table?

June 5, 2009 2:55 PM
 

noeldr said:

I use MSdistribution_status for that :-)

June 5, 2009 4:04 PM
 

Linchi Shea said:

Sure! But there are a few problems with that view (if you rely on it alone):

(1) MSdistribution_status only provides a current view, and may not help answer the question about what happened a few minutes ago, which is often the case when you got an alert or got a call from the client.

(2) At times, I find the MSdistribution_status view can be expensive to query. I do use it in the script to get a count of the currently undistributed commands, but have that feature disabled by default.

(3) It doesn't give you the details of the tran/command flow by minute. I find this info to be most useful because you can see clearly that, for instance, between a specific time period there is a clear surge of transactions or commands.

June 5, 2009 4:36 PM
 

noeldr said:

I agree on all points

But the simplicity of it is what made me mention it.

True, Agent history tables do provide further info, I was only answering to the case you mentioned about a particular article(table).

June 5, 2009 5:27 PM
 

Chad Miller said:

You may want to look into Powershell + RMO. I've written a Powershell RMO function library that basically returns the same type of information you would see in the Replication Monitor GUI. The library is available from http://sqlpsx.codeplex.com. Just source the LibraryRMO.ps1 file.

Here's a few examples

$replMon = Get-ReplMonitor 'Z002\Sql1'

$publisherMon = Get-ReplPublisherMonitor 'Z002\Sqlqa1

$pubMon = Get-ReplPublicationMonitor 'Z002\Sql1'

$pubMon | Get-ReplTransPendingCommandInfo

$publisherMon | Get-ReplEnumPublications

$publisherMon | Get-ReplEnumPublications2

$pubMon | Get-ReplEnumSubscriptions

$pubMon | Get-ReplEnumSubscriptions2

$pubMon | Get-ReplenumlogReaderAgent

$pubMon | Get-ReplenumSnapshotAgent

June 6, 2009 9:52 AM
 

Linchi Shea said:

Thanks for the link. Will take a look.

Replication Monitor is a good tool, but it is not sufficient in answering the question of whether a volume surge was the root cause. It gives you only the current status. In addition, I don't want to run any expensive queries (limiting everything to a specific distribution agent helps).

June 7, 2009 1:19 AM
 

Seamus O Heinous said:

We use replication in a very high throughput system.

We often see latency at subscribers particularly the source for the admin website, commonly this is IO related, the commands cannot be pushed into the subscriber fast enough due to contention between users and distribution agent, however IO is rarely the cause, and the cause is usually bad coding or execution plans changing due to bad stats.

Randy Dyess's tools here http://blogs.solidq.com/rdyess/Post.aspx?ID=7&title=query+wait+option+and+query+timeouts

report query cost and have been very useful in identifying unstable query plans. Cheers

August 5, 2011 10:48 AM
 

SQL Crazy said:

Please try writing in SQL Script that will be very helpful for most of the DBAs like me.

June 4, 2013 5:49 PM

Leave a Comment

(required) 
(required) 
Submit

About Linchi Shea

Checking out SQL Server via empirical data points

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement