THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Aaron Bertrand

Aaron is a Product Evangelist for SentryOne, makers of Plan Explorer and a performance monitoring and event management platform for the Microsoft Data Platform and VMware. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features, and also blogs at blogs.sentryone.com and SQLPerformance.com; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at major conferences, user group meetings, and SQL Saturday events worldwide.

Mixing OLTP and reporting using indexed views

I'm not a big fan of denormalizing, nor of repeating redundant redundant information in a database when those facts can already be derived from other information.  A classic example of the latter is when I see questions on newsgroups, forums or StackOverflow that ask how they can update a table's rank column to reflect the current rank based on some other criteria in the table.  The problem with this is that you have to run the update every time any DML operation touches any row in the table, and if you can calculate that rank in an UPDATE query, you can also calculate that rank in a SELECT query, so why not just figure it out in real time?  An example of the former is when we want to maintain a second table with aggregates from another table.  This is where I want to spend my time today.

In reality, many of us deal with OLTP systems that must also serve as the reporting source, so it's not always feasible to calculate aggregates in real time against a constantly moving target.  In several of these scenarios I've used indexed views, where we have tables that are inserted often, read often for statistics, but rarely or never updated... sure, you pay a little hit up front on the insert, but the benefit achieved during real-time reporting was worth it.  Let me give you an idea of what I was dealing with initially, and how I have fixed it over time to perform even better.

One of our systems is an SaaS e-mail platform, where we offer the ability for our customers to distribute messaging to their customers (mostly via e-mail, but we also support fax, SMS, widgets/gadgets, and social media).  For brevity, let's call these people "recipients."  If you stripped the schema down to the bare essentials for metadata, you'd see this (I'm going to leave out the foreign key and other constraints, as they are either obvious or irrelevant):

CREATE TABLE dbo.Domains
(
  
DomainID INT PRIMARY KEY,
  
DomainName VARCHAR(255)
);

CREATE TABLE dbo.Recipients
(
  
RecipientID INT PRIMARY KEY,
  
DomainID INT,
  
LocalPart VARCHAR(64)
  
/* , ... other columns ... */
); 

Obviously we store many other details about recipients, such as demographic information, preferences, opt-in status, subscriptions, etc.  But for this discussion, the above is sufficient.  Notice that for space savings we don't actually store the e-mail address of the recipient, though it is required and is enforced to be unique via a constraint on (DomainID, LocalPart); this information can easily be derived using a view (you can even make this an indexed view if you want quicker access to e-mail addresses, but I'll leave that for another day):

CREATE VIEW dbo.vRecipientDetails
AS
    SELECT
        
r.RecipientID,
        
EmailAddress = r.LocalPart '@' d.DomainName
    
FROM
        
dbo.Recipients AS r
    
INNER JOIN
        
dbo.Domains AS d
        
ON r.DomainID d.DomainID;

RecipientID is the primary key because it is used in a lot of related tables, most importantly, stats tables.  We record every single transaction for a recipient: when a message is attempted to be sent to them; when a successful delivery occurs; when a bounce occurs; when the recipient opens the message or clicks on a link in the message; when the recipient changes their preferences; or, when the user declines to receive further communications from our customer.  The thought of storing LocalPart + DomainID (or the fully composed e-mail address) in all of these other tables that would grow and grow over time made a surrogate representation a very easy choice for the primary key.  I didn't want to make the discussion revolve around this, but I've seen several raised eyebrows in the past about surrogate primary keys and wanted to assure you that, in this system, I think it is the best choice.  (If you want to debate that, let's have a different discussion, as long as it doesn't turn religious.)

Okay, so I mentioned this is an OLTP system, and I mentioned that we record all of the individual transactions for a recipient in various stats tables.  Let's take a quick look at how one of these stats tables looked for the first, oh, 5 years of their existence:

CREATE TABLE dbo.Deliveries
(
  
MessageID INT-- references a table dbo.Messages
  
RecipientID INT,
  
EventDate SMALLDATETIME
  
/* , ... other columns... */
);

So a customer would send out a message targeted to, say, 50,000 recipients.  During the next few hours we would fill up this table with successful deliveries (and other similar tables with bounces, opens, clicks, etc).  Soon after inception of the system, we found that our customers wanted to pull real-time statistics on how the messages were going... mostly on the lookout for high bounce rates (well, low delivery rates too, I guess).  Of course this was very contentious if they pulled reports while the tables were loading.  So in order to give a much more satisfactory experience to the user pulling reports in our web UI, we created indexed views that would automatically maintain the number of deliveries, bounces, etc.:

CREATE VIEW dbo.vMessageDeliveries
WITH SCHEMABINDING
AS
  SELECT
    
MessageID,
   
c = COUNT_BIG(*)
  
FROM
    
dbo.Deliveries
  
GROUP BY
    
MessageID;
GO

CREATE UNIQUE CLUSTERED INDEX m
  
ON dbo.vMessageDeliveries(MessageID);

So now the inserts were a little more expensive but, as mentioned above, this was much better for overall system performance (at least perceived performance) and, more importantly, customer happiness.

Then customers wanted more details on their statistics.  They wanted to see how their messages were getting into, say, Hotmail compared to AOL.  We have this information, but it was quite expensive to retrieve.  Remember earlier I noted that we don't typically store redundant information, but this means that in cases like this, we need to do a lot of joins.  So for example, to provide a responsive report that showed real-time message status per domain, we created indexed views like this:

CREATE VIEW dbo.vMessageDomainDeliveries
WITH SCHEMABINDING
AS
  SELECT
    
d.MessageID,
    
r.DomainID,
   
c = COUNT_BIG(*)
  
FROM
    
dbo.Deliveries AS d
  
INNER JOIN
    
dbo.Recipients AS r
    
ON r.RecipientID d.RecipientID
  
GROUP BY
    
d.MessageID,
    
r.DomainID;
GO

CREATE UNIQUE CLUSTERED INDEX md
  
ON dbo.vMessageDomainDeliveries(MessageIDDomainID);

Of course this again added more strain to the insert process on the deliveries table, since it now had to maintain two indexed views.  In isolation, this still yielded better overall performance than essentially expanding that view and trying to get at the domain aggregates in real time.  But it introduced a new problem to the system: severe blocking.  Picture the case where deliveries are being recorded while a recipient is trying to update their preferences or change their e-mail address.  Since both processes need to lock and potentially update the index on the indexed view, one has to wait for the other.  This can lead to a sad face on an end user.

What I ended up doing was to recant my disdain for storing redundant information, and store the domain information in the deliveries table.  Yes, this was an extra 4 bytes, but as we were moving the whole system to a new 2008 cluster with a much faster I/O subsystem, we would get much more than that back with page compression.  So starting over, we still have the same Domains and Recipients tables, but now domain-based reports are driven from objects that look like these:

CREATE TABLE dbo.Deliveries
(
  
MessageID INT,
  
RecipientID INT,
  
DomainID INT,
  
EventDate SMALLDATETIME,
  
/* , ... other columns ... */
);
GO

CREATE VIEW 
dbo.vMessageDomainDeliveries
WITH SCHEMABINDING
AS
  SELECT
    
MessageID,
    
DomainID,
   
c = COUNT_BIG(*)
  
FROM
    
dbo.Deliveries
  
GROUP BY
    
MessageID,
    
DomainID;
GO

CREATE UNIQUE CLUSTERED INDEX md
  
ON dbo.vMessageDomainDeliveries(MessageIDDomainID);

The up-front cost of calculating the domain of each recipient during insert is likely very equivalent to the system doing the same lookup while maintaining the indexed view (except we can use snapshot isolation in our query, but can't really enforce the same during clustered index updates).  And in addition to no longer causing blocking on updates to the recipients table, we also get a much more accurate representation of history: the delivery is marked with the domain the recipient had on that day, since their e-mail address could have changed 20 times since then.  All in all it has turned out that storing the domain information multiple times has helped contribute to making a very busy system suddenly seem almost idle.

The primary lesson learned here: indexed views can be a very helpful tool in your arsenal, but they can bring you down if you try to over-use them.  Another lesson learned: do not assume that redundancy is the devil.  It can help out in many scenarios, even if it feels a little dirty.

Published Tuesday, December 29, 2009 10:40 AM by AaronBertrand

Comments

 

jamiet said:

Aaron,

I can identify with the "feels a little dirty" sensation. I always get a pang of guilt when I introduce redundancy into a model no matter what Ralph Kimball may have to say on the subject (that's not a sleight against Kimball by the way).

Just a question about the processing of the deliveries. I assume (perhaps wrongly) that its a more-or-less synchronous process; by that I mean a record gets inserted into [dbo].[Deliveries] as soon as an email is sent. I wonder if there is any scope to decouple those 2 activities e.g. An email gets sent and the insertion into [dbo].[Deliveries] is done at a different time (perhaps by employing a messaging infrastructure such as Service Broker). That way maybe you could batch-up more of the insertions and incur less strain on [dbo].[Deliveries].

Of course you then get into what exactly do the users mean when they say they want "real-time"? and that's another can of worms in itself!

Just a thought.

-Jamie

December 29, 2009 10:24 AM
 

Wes W. said:

Too bad SQL Server indexed views can't be bound to synonyms.

December 29, 2009 10:29 AM
 

AaronBertrand said:

Jamie, yes, the inserts are performed by batch.  An external process actually sends the mail, and logs deliveries to a flat file.  These logs are bulk inserted into a work table, processed there, then inserted into the main deliveries table on a schedule.  This makes it very easy to schedule stats loading around periods of other activity, though for the most part this process runs every 10 minutes throughout the day.

Wes, since indexed views are schema-bound, I'm curious what you expect to gain from utilizing synonyms (other than, perhaps, simpler object names in the view definition).  Even if synonyms were supported, it's not like you could magically point the synonym at a different object and expect the view to automatically "catch up"...

December 29, 2009 12:14 PM
 

jamiet said:

Gotcha. I figured you'd already be on top of that sort of setup.

I too am curious to see what Wes meant about synonyms.

December 30, 2009 4:28 AM
 

Alex K said:

Hey Aaron,

I think that for better concurrency instead of the indexed view we could log the relevant changes in a queue or insert them into a log table. Another process could read messages from that queue and modify the summary table manually. Service Broker could shine in this scenario. What do you think?

December 30, 2009 6:31 PM
 

AaronBertrand said:

Alex, a lot of the changes are already processed in a queue pretty much exactly how you describe, and service broker is employed for those.  However there are a few methods that need to be relayed back to the user immediately - instant feedback kind of things, like change this or that preference.

December 30, 2009 6:44 PM
New Comments to this post are disabled

About AaronBertrand

...about me...

This Blog

Syndication

Archives

Privacy Statement