We recently worked on a transactional replication setup that involved a very active VLDB and a subscriber located in a different datacenter. What made it even more interesting is that the WAN link was not particularly fast. In this post, I would like to mention a few of the challenges we faced and how we got past them, in the hope that our experience can help you in your future endeavours.
The Problem with Slow Distribution Servers
If your distribution server is slow, your replication performance will tank. You will fall behind on transactions and might never catch up. In our scenario, the distribution server was outdated: it was running Windows Server 2003 and SQL Server 2005, and this turned out to be our biggest issue. After we moved to a new distribution server running Windows Server 2008 R2 and SQL Server 2008 R2, performance improved greatly. One of the biggest benefits of moving to Windows Server 2008 R2 is the set of enhancements made to the TCP/IP stack - particularly to the send/receive TCP windows. For more information, see this article on Technet.
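To see why TCP window sizing matters so much on a high-latency link, here is a rough back-of-the-envelope sketch in Python. It is not taken from our environment: the function names and the link figures used in the usage note are made up for illustration. It computes the bandwidth-delay product (how many bytes must be in flight to keep the link full) and the throughput ceiling that a fixed window imposes on a single connection.

```python
def bandwidth_delay_product_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    return bytes_per_second * (rtt_ms / 1000)


def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP connection's throughput for a fixed window.

    At most one window of data can be unacknowledged per round trip.
    """
    bits_per_round_trip = window_bytes * 8
    return bits_per_round_trip / (rtt_ms / 1000) / 1_000_000
```

As a hypothetical example, a 45 Mbps link with an 80 ms round trip needs roughly 450 KB in flight to stay full, while a 65,535-byte window (the classic limit without TCP window scaling) caps a single connection at about 6.5 Mbps over that same round trip. Your numbers will differ, so plug in your own link speed and latency.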
Careful with WAN Accelerators
While WAN accelerators can be fantastic in a myriad of scenarios, in our testing we noticed that the device wasn't really optimizing replication traffic - and it was actually adding latency. With the help of our Networking team, we made sure that traffic from the distributor to the subscriber bypassed the WAN accelerator. Obviously, your mileage will vary - so test accordingly.
Use Pull Subscriptions
Our initial setup used a Push subscription. This was a big mistake. While Push subscriptions are easier to set up and maintain, their performance across WAN links is plain dismal. The MSDN team at Microsoft put out a great White Paper on Geo-Replication Performance which is, in my opinion, required reading for anyone replicating to other datacenters. We saw huge performance gains when we switched to Pull: orders of magnitude faster. Put simply - never use Push subscriptions across WAN links.
One design decision in our scenario that I would like to point out: we intentionally kept the distribution database near the publisher (i.e., in the same datacenter). The reason is simple. If you don't have much confidence in your WAN link, the main concern becomes the Log Reader agent and getting the Transaction Log to clear reliably and constantly. With the distributor local to the publisher, the Log Reader agent can keep moving transactions into the distribution database even when the WAN link is struggling.
Initialize the Subscriber from a Backup
With a WAN link and high latency involved, initializing the subscriber from a backup is your best bet. We saved ourselves a lot of headaches by doing it. Creating and transferring a snapshot of a VLDB is out of the picture when you're concerned with WAN latency.
In our scenario, the publisher was running SQL Server 2005 and backups were being taken using LiteSpeed. We transferred the most recent full backup to the remote datacenter using robocopy (we could also have used FTP), plus the latest differential taken after changing the properties of the publication to allow initialization from backup. We restored at the subscriber using the LiteSpeed tools, and then used the Extractor utility to create native-format backup files to initialize from. This extra step is needed because you cannot initialize from a LiteSpeed backup: SQL Server needs to read LSN information from the backup file, and it uses a system stored procedure for that purpose.
Here are some tips: you only need the first backup file created by Extractor to initialize. Also, you don't have to initialize with a differential backup - you can use a T-Log backup just as well.
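For readers who have not scripted this before, the two replication calls involved can be sketched as follows. This is a minimal illustration, not the exact script we ran: it uses Python only to render the T-SQL text, and the publication, server, database, and file names in the usage note are hypothetical placeholders. The stored procedures and parameters shown (sp_changepublication with allow_initialize_from_backup, and sp_addsubscription with the 'initialize with backup' sync type) are the standard ones, but check the documentation for your SQL Server version before relying on them.

```python
def tsql_literal(value: str) -> str:
    """Quote a string as a T-SQL Unicode literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def allow_init_from_backup(publication: str) -> str:
    """T-SQL to run at the publisher, in the publication database."""
    return (
        "EXEC sp_changepublication\n"
        f"    @publication = {tsql_literal(publication)},\n"
        "    @property = N'allow_initialize_from_backup',\n"
        "    @value = N'true';"
    )


def add_subscription_from_backup(
    publication: str, subscriber: str, destination_db: str, backup_path: str
) -> str:
    """T-SQL to run at the publisher once the subscriber has been restored.

    backup_path must point to a native-format backup file that the
    publisher can read, since the LSN is taken from that file.
    """
    return (
        "EXEC sp_addsubscription\n"
        f"    @publication = {tsql_literal(publication)},\n"
        f"    @subscriber = {tsql_literal(subscriber)},\n"
        f"    @destination_db = {tsql_literal(destination_db)},\n"
        "    @subscription_type = N'pull',\n"
        "    @sync_type = N'initialize with backup',\n"
        "    @backupdevicetype = N'disk',\n"
        f"    @backupdevicename = {tsql_literal(backup_path)};"
    )
```

For example, print(add_subscription_from_backup("SalesPub", "REMOTESQL01", "SalesReplica", r"D:\Backups\Sales_native_1.bak")) produces a statement you can review before running it. A pull subscription also needs to be registered at the subscriber (sp_addpullsubscription and sp_addpullsubscription_agent), which this sketch leaves out.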
Here is a good post on Initialization from Backup at ReplTalk that might be helpful if you run into issues.
There are other replication features that help reduce the number of commands sent across to the subscriber. Namely: