|
|
|
|
-
Here's a tidbit for those who might have SQL server in their environments, maybe without knowing all the nitty gritty low-down: if you try to use file system replication (robocopy, xcopy, repli-whatever) to maintain a DR server from your production SQL Server, you might be in for a nasty surprise.
I recently had to troubleshoot a scenario like this: the app owner came to me and said, "the DB on our disaster recovery system is in suspect mode, what should we do? (and what happened?)" I immediately went to disk corruption, etc., but it turns out the issue was really simpler: it turns out there was nothing wrong with the machine other than copying files onto it.
Huh?
Yes, copying files had corrupted the database. Here's how:
The (well meaning) group involved had established replication of files from a production server to this failover server, thinking about disaster recovery. For almost all "vanilla" files, this works great. In practically all other cases outside SQL Server, each file is a separate entity, working alone. If the file is unlocked at any point in time, then it's pretty safe to copy it off to another server.
SQL Server databases, however, are always composed of at least two separate files that work together: the data file and the log file. In addition, those two files are typically locked whenever the SQL Server service is running and has them open. Two issues with file replication, then: the database files are always locked, so they probably won't get replicated very often, but more importantly, you have to copy both of them at the same time to end up with a working database at the other end of the process. If the log and data files log sequence numbers (LSN's) don't line up, the recovery process doesn't work, and the database will not function.
What I think happened with my app owner is this: the source system and destination went offline for a brief period (probably patching), so the original MDF and LDF files were unlocked. The replication thingamagig managed to pick up and copy the small log file but not the big data file to the its target server. When the target server's SQL Server service fired up, it found a mismatched log and data file, and so, of course, failed recovery and would not bring the DB online.
The takeaway - SQL Server has great technology for disaster recovery, including the simple/basic log shipping method. Copying raw files from disk, however, won't give you DR!
|
-
I was inspired by Aaron's post on a similar topic a while back to dig deeper into how I could automagically adjust the memory allocation for multiple instances of SQL Server in a cluster-failover scenario. So, suppose you have:
- A two-node cluster
- Four SQL Server instances, two intended to run on each physical server (call it "deluxe active-active")
- An environment where it's feasible to run all four instances on one node, in case of an emergency or in a maintenance window
- An expectation that some performance degredation is acceptable in the emergency scenario, and you want to take full advantage of your hardware investment when both nodes are up and running
- Stuff running that is tolerant of reconfiguration of max memory in SQL Server "on the fly"
A typical deployment would have each SQL Server instance's memory limited to about 1/4 the usable RAM on one node (that is, 1/4 the RAM remaining after the overhead required for OS and processes outside what is covered by SQL Server's max memory configuration) so that in a failover, their memory footprint, collectively, would not overwhelm one physical server. This is certainly safe, but you also have to buy a whole lot of extra memory that will sit idle practically all the time.
Instead, I wondered if I could set each instance to use half the usable RAM on each physical host, but in the event of a failover, automatically adjust the max memory setting on each instance to an approriate value for them to share a node and still be "happy." This would be a bit like a new person bellying up to a crowded bar: some of the other customers have to "move over" - have their RAM allocation reduced - to make space.
In pseudo-code, I want to:
-
Detect that a failover has occurred
-
Check the max memory setting on each instance on the cluster against a reasonable value for where they are running
-
If needed, change the max memory setting on each instance, to prevent the total memory allocated from exceeding the usable memory on each host
I'd been mulling this over during the past few months as we design and test a SQL Server consolidation / data center move at work. I finally settled on a design like this:
-
Create a PowerShell script that can check and set the max memory value for all instances on a cluster
-
Create a SQL Server Agent job to call the PowerShell script on Agent startup, running under a proxy account that has access to the other instances
Disclaimer I don't know that this is a great idea, I only have it in test at this point. Your mileage may vary. Never dowload and use scripts from the internet. If you experience dizzyness, nausea or you turn purple or your hair falls out while using this script, see your doctor right away. Some people taking this script have reported dry mouth, memory loss or profuse bleeding from the nose. All bananas shall be placed on top of refrigerators for ripening.
So, the PowerShell script starts with a function to go get the memory configuration and host name for a running SQL Instance:
function Get-SQLInstanceConfig( [string]$SQLInstance, [ref]$hostName, [ref]$maxServerMemory ) {
# Function to establish a connection to a clustered SQL Server instance,
# read max server memory configuration value and current physical host name
$con = New-Object System.Data.SqlClient.SqlConnection( `
"Data Source=$SQLInstance;Initial Catalog=master;Integrated Security=SSPI;")
$q = "SELECT ( SELECT serverproperty('ComputerNamePhysicalNetbios') ), " `
+ "value_in_use FROM sys.configurations " `
+ "WHERE Name = 'max server memory (MB)';"
$cmd = New-Object System.Data.SqlClient.SqlCommand( $q, $con )
$con.open()
$reader = $cmd.ExecuteReader()
if( $reader.read() ) {
$hostName.Value = $reader.GetValue(0)
$maxServerMemory.Value = $reader.GetValue(1)
}
$reader.close()
$con.close()
}
Next, we need a function that can change the memory setting for an instance if we find that it's incorrect:
function Set-SQLInstanceMemory( [string]$SQLInstanceName, [int]$maxMemSetting ) {
# Function to set max server memory on a given SQL instance
write-host " Reconfiguring" $SQLInstanceName to $maxMemSetting
$con = New-Object System.Data.SqlClient.SqlConnection( `
"Data Source=$SQLInstanceName;Initial Catalog=master;Integrated Security=SSPI;")
$q = "EXEC sys.sp_configure N'show advanced options', N'1'; " `
+ "RECONFIGURE WITH OVERRIDE; " `
+ "EXEC sys.sp_configure N'max server memory (MB)', N'" + $maxMemSetting + "'; " `
+ "RECONFIGURE WITH OVERRIDE; " `
+ "EXEC sys.sp_configure N'show advanced options', N'0'; " `
+ "RECONFIGURE WITH OVERRIDE; "
$cmd = New-Object System.Data.SqlClient.SqlCommand( $q, $con )
$con.open()
$cmd.executeNonQuery()
$con.close()
}
Having established those two functions, it's then a matter of "walking" all the SQL instances to check their currently running values, then correcting any that are inappropriate for the location of instances on cluster node one or two. I need a couple of containers to store the results and make decisions about what to change:
# List the instances on the cluster:
$SQLInstances = @( "SQL01", "SQL02", "SQL03", "SQL04" )
# Make a hashtable to sort out which SQL instance is running on which host,
# with how much memory. One table entry for each host, with each entry containing
# an empty hash table we will load with SQL instance configuration info later:
$hostsTable = @{ "NODE1HOST" = @{}; "NODE2HOST" = @{} }
# Allowed memory for SQL Server = total server memory (e.g. 16 GB), less 3 GB to leave for OS
# and SQL tasks outside the SQL configured boundary:
$aggregateMemory = ( 16 - 3 ) * 1024
Working from those variables, we can then use two loops, one to collect the current state from each instance, and a second to validate/correct the values based on what instance is running on what node:
# Get the configs for each SQL Server instance
foreach( $SQLInstance in $SQLInstances ) {
$hostName = $null
$maxServerMemory = $null
Get-SQLInstanceConfig $SQLInstance ([REF]$hostName) ([REF]$maxServerMemory)
# If we really can't see one of the instances, it's best to bail at this point,
# rather than reconfiguring memory settings without complete information
if( $hostName -eq $null ) {
throw ( "Could not connect to one of the SQL instances on the cluster." )
}
# Put the current SQLInstance and its max memory value into the right slot
# in the hosts table, to classify the instance by host
$hostsTable[$hostName].Add( $SQLInstance, $maxServerMemory )
}
# For each physical cluster node, calculate a reasonable memory limit
# per SQL instance, then verify or correct the max memory value for
# each SQL Server instance running on that host
foreach( $hostEntry in $hostsTable.getEnumerator() ) {
write-host
write-host $hostEntry.Name
write-host " Num SQL instances on this host:" $hostEntry.Value.count
if( $hostEntry.Value.count -gt 1 ) {
[int]$maxMemSetting = $aggregateMemory / $hostEntry.Value.count
} else {
[int]$maxMemSetting = $aggregateMemory / 2
}
write-host " Max memory allowed per SQL instance:" $maxMemSetting
write-host
foreach( $SQLInstanceEntry in $hostEntry.Value.getEnumerator() ) {
write-host " " $SQLInstanceEntry.Name is set to $SQLInstanceEntry.Value
if ( $SQLInstanceEntry.Value -ne $maxMemSetting ) {
Set-SQLInstanceMemory $SQLInstanceEntry.Name $maxMemSetting
}
}
}
In implementing this on my test cluster, I went to each SQL Instance, and added a proxy account and then a SQL agent job to call this PowerShell script. The script had to be copied into an identical location on the physical disks in each cluster node, like the SQL Server binaries, so that it could be run by any SQL Agent, regardless of which node the agent is running on.
I then set the schedule for the jobs to execute once on SQL Agent start up.
Lastly, I reasoned that at least one common scenario could make this process fail (among other things): when a four-instance/two node cluster has a node "croak," the other node will attempt to bring two instances online as quickly as possible. If one failed SQL instance comes up before the other one, and this script runs immediately, the script will fail to connect to the second failed SQL instance, and if allowed to complete would set incorrect values. When the second node came up, then it would call the script again and correct the problem, but you've then reconfigured all the instances twice. It's a race condition, but in order to try to get this right most of the time, I put a "wait" step in each of these agent jobs, so that the agent comes up, the job fires, then there's a delay to give the other instances time to come online before the script runs.
As extra insurance, I also trap for a failed connection in the script and have it terminate. That way, if the agent runs it and one of the instances is still down, it should have no effect. When the down instance comes online, then it'll execute the script and succeed.
So far in testing, this seems to work OK. The main concern I have, with what I know about our workload and applications, isn't that we'll have a problem running on one node in an emergency, it's that there could be some side effect from changing max server memory, on the fly, while the instances are running.
I'd welcome comments or suggestions, if you've tried this or perhaps see a hole in my logic.
|
-
Mac++ | C:\macOSX > bliss It’s been about eight years (!) since my wife and I bought a computer for personal use. Last week we took delivery of a new MacBook Pro 13 2.26. And it’s fantastic! Sometimes if I work on a machine that’s slightly newer than the previous -- at work, for example -- there’s a little bump but the novelty wears off pretty quick. But get this: we went directly from a much loved Power Mac G4 Cube built about 2001 to a brand new Intel-based MacBook. Bam! Bye Bye Cube Still runs! Picture of the Cube, taken with the MacBook’s iSight camera. Note the old iSight camera on top of the monitor… The old machine has a 450 mhz one core PowerPC processor, and shipped with 64 MB RAM and an 18 GB hard drive. We got one of the early Apple LCD displays, an Apple Studio 17. I upgraded the memory a few years ago to a whopping 320 MB, and got a 60 GB external firewire drive. I also replaced the DVD ROM with a burner and bought a Microsoft Arc mouse. The thing is, other than those updates, the Cube has run flawlessly all this time. It was expensive when we bought it, but we got eight years of good use out of it. It upgraded without issue all the way from the original OS to Mac OS 10.3. It’s also one of the few computer models in the industrial design collection of the New York MOMA, and industrial design is a weakness of mine. I have a real, personal attachment to it, and I’ll be sad to give it up. In November we finally hit a wall in terms of software upgrades with a deprecated version of iPhoto. C’est la vie. I’ll want to keep it with all my relic Macs as a keepsake; I’m sure my wife will want to be rid of it :-). [Edit: she’s sitting right here and says “No,” she wants to keep the Cube and continue to use it as a second machine!] Hello Mac OS X 10.6 + Windows 7 Picture of the new MBP from the old Cube’s iSight The upgrade experience has been really exceptional. I can remember (dating myself here) when you had to worry, especially with Windows, about things like interrupts and drivers and autoexec.bat. Macs were better, but they had issues too (remember Extensions? MultiFinder? Classic?). I am not really an Apple fanboy these days, having developed a fondness for, and then a career around SQL Server -- though about 1995 I’m sure I was. On the other hand, we’ve always had Macs at home, from my first $4,700 Mac II in 1988 right up to the present. Today I try to be OS agnostic, so the prospect of running Windows 7 and OS X integrated together on Apple hardware is almost too good to be true. I am giddy. Here’s what we did, in case anyone else is headed down this road. There are other, equally good methods, but this was great: - Mirrored the Cube’s original hard drive to our external Firewire disk drive using the shareware SuperDuper! (No joke – I’ve had this for years as our backup solution. It makes a complete, bootable image of the original hard drive.)
- Ejected and unplugged the external drive from the Cube and connected its USB interface to the new MacBook.
- On the MacBook, ran the Apple Migration Assistant to bring all the data, user profile information and vintage apps across. Here I picked the option to make renamed user profiles on the new machine from the old profiles, while keeping a “clean” native user profile on the MacBook. I just like to start from scratch, especially when I am imagining eight-year-old bits.
- Some old apps we have are PowerPC binaries, which won’t be as fast on the new hardware, but I installed Apple’s Rosetta to run them anyway. They still work.
- Copied a pile of other files such as music and pictures from another partition on the external disk to the MacBook’s internal drive.
- I then downloaded vmware Fusion and installed it.
- Atop that, I bought Windows 7, downloaded and installed it from .ISO.
- Virus Protection for the Windows side, Windows Update patches
- Installed apps from our CDs on both OS’s
- Hooked adapters up from Apple to continue using our vintage Studio 17, driven by the MacBook
What’s incredible and satisfying: it all works. First time, no hassle. Two operating systems, one integrated desktop/UI. Wireless, multi-touch trackpad, everything. Amazing. The very next day my wife took the MacBook on a trip, taking with her real work requirements that had to be accomplished in both OS’s. She came back happy and reporting 100% success. Wow. I wonder if because I have been at this a while (30 years fooling with computers, but who’s counting?) I have a greater appreciation for the engineering that made this possible. All these components, hardware, software libraries, chips, interfaces and standards converge to make something elegant, powerful – amazing. Here I sit writing in Windows Live Writer, on my Mac, connected to sqlblog.com – with no issues. High five to Intel, Apple, vmware, Microsoft, Adobe. Wow.
|
-
-
So, I am just writing because I am excited about some new toys we're setting up at work: I've got three new Nehalem-based SQL servers connected to our first storage array with solid state disks (it actually has three tiers of storage: SSD, FC and SATA). Looking forward to screaming fast reads for our data warehouse/BI setup, as well as consolidation of a sprawly old collection of SQL Servers onto a nice, consolidated set of new clusters.
Technically the new storage array will support a cross-platform mix of HP/UX Itanium servers running Oracle and MS Windows clusters running SQL Server, side by side on the SSD and fiber channel disks. The design work has given me the chance to work closely with one of our Oracle DBAs and the SAN admins (all friendly people! Amazing!), and I'm learning a lot.
All the metal bits are in place in the data center, lights a-blinking, and we're on to configuring the array and perf testing in the next two weeks. Sweet!
|
-
With all the talk about cloud computing, I felt a little sheepish for being worried about mundane things like backup, and whether many users might underestimate what they need to do after turning over all their data to some service.
http://www.codinghorror.com/blog/archives/001315.html
The Provider says, "There there, we have everything backed up, your data is perfectly safe." Really?
I might just be paranoid. I am sorry to see this become so real for someone. Ouch.
|
-
This has bugged (pun?) me for some time, and I wonder if I am just nuts: I hesitate to log in to a production server with Remote Desktop and run SSMS there, preferring to do all I can with SSMS running on my local workstation instead. However, in both SSMS 2005 and 2008 there's this strange behavior when setting up an Integration Services package for SQL Server Agent: the GUI demands the package be specified using a local disk instead of a disk from the server, which of course is impossible:
- Open SSMS and connect to a remote server
- Create a new SQL Server Agent job
- Create a step with type SQL Server Integration Services Package
- Choose the File System as the source (we deploy SSIS packages as files)
- Click the Ellipsis (...) button
- I then get the rather unhelpful My Documents folder on my C: drive.
Most other server-related dialogs give you server filesystems (backup or restore, for example.) Is this just a long-overlooked bug, or am I crazy?
|
-
This is probably old news to many people, but sometimes I'm a slow learner.
So far, getting neat, color-coded code from Management Studio or Visual Studio into passable Blog-ready HTML has ... well ... mostly made me go, "Aaarrrrrggghh grrrr @*%$^#&! grrrr aaarrrgg <repeat>." I've tried several online code prettifiers, and even Word (yech) but it's painful to take code from one tool, paste it into a formatter, format it, take source out again, correct it with elaborate regex find/replace, repeat. Redgate has a nice tool for T-SQL code formatting inside Management Studio, but it doesn't, in my version at least, help with the HTML bit at all.
I know many people are very, very careful to meticulously hand format their code, and while I completely appreciate the desire for neat code and try to write neatly myself, I can't quite stomach all the extra hours spent pressing the spacebar, when a tool could do a very nice job. In C# I'm the first to admit I am completely spoiled by Ctrl+K Ctrl+D. High formatting value from minimum biological effort.
After my last post, which was long on code and very frustrating to format, I am onto this tool: http://copysourceashtml.codeplex.com/ which looks more like it. Visual Studio > Highlight > Copy > HTML Editor or Browser > Paste > Aah. What computers are supposed to do.
We'll see if this lessens my pain on the next post.
|
-
Partitioning for Fun and Profit
Over the past year or so I’ve been playing with a toy SQL monitoring application, just to sharpen my dev skills a bit. Part of the app is a SQL database to store historical performance data, in order to do trending. Fooling with the design of the performance database allowed me to look at EAV (Entity Attribute Value) versus other patterns and do some real performance measurements.
I went into this design with some preconceptions about EAV:
- I figured EAV could be a good fit for this problem, given that one would not know which of the hundreds of available performance counters the app would track until run time. At design time, all I know is that the database has to store some unknown, user-defined collection of performance counters. It seems like the classic EAV problem of run time, user-defined attributes.
- I predicted that EAV would be more convenient to implement than a dynamic / multiple table design because the schema would not change at run time. The question is, how much more convenient? Conversely, how much effort and risk is there in extending the schema at run time?
- I thought there would be tradeoffs, where EAV is bigger and slower than having tables built out, especially when it comes to reporting queries. Generally it's possible to boost database performance when you can partition sub-sets of rows into physically separate structures, rather than having them interleaved together in one huge table. The questions here are, how much bigger, and specifically which kinds of queries are slower? Are there concrete rules that can indicate when EAV will work well or not?
To set the stage for this, I am planning to store the performance data in a way that aligns with how Windows performance counters are structured: there’s a tree of counters labeled performance objects / counters / instances in perfmon or counter categories / counters / instance names in C#. It’s pretty easy to dump out a tabular list of available counters with a simple C# or PowerShell snippet:
using System;
using System.Text;
using System.Diagnostics;
namespace ListPerfCounters
{
class Program
{
static void Main(string[] args)
{
using (System.IO.StreamWriter outfile
= new System.IO.StreamWriter("c:\\CounterList.txt"))
{
PerformanceCounterCategory[] cats
= PerformanceCounterCategory.GetCategories();
foreach (PerformanceCounterCategory cat in cats)
{
// Exclude category "Thread" to eliminate an insane number of counters:
if (cat.CategoryName != "Thread")
{
// Single instance categories get labled as instance = "Default":
if (cat.CategoryType == PerformanceCounterCategoryType.SingleInstance)
{
PerformanceCounter[] counters = cat.GetCounters();
foreach (PerformanceCounter counter in counters)
{
outfile.WriteLine(cat.CategoryName + "\t"
+ counter.CounterName
+ "\tDEFAULT");
}
}
// Multi-instance categories are listed out per instance:
else if (cat.CategoryType == PerformanceCounterCategoryType.MultiInstance)
{
String[] instances = cat.GetInstanceNames();
foreach (String instance in instances)
{
PerformanceCounter[] counts = cat.GetCounters(instance);
foreach (PerformanceCounter counter in counts)
{
outfile.WriteLine(cat.CategoryName + "\t"
+ counter.CounterName + "\t"
+ instance);
}
}
}
//Ignore CategoryType = unknown
else
{
// do nothing
}
}
}
}
Console.WriteLine("Press the 'Any' key...");
Console.ReadKey();
}
}
}
The output is a tab-delimited list of most available counters:
|
Processor |
% Processor Time |
_Total |
|
Processor |
% User Time |
_Total |
|
Processor |
% Privileged Time |
_Total |
|
Processor |
Interrupts/sec |
_Total |
|
Processor |
% DPC Time |
_Total |
|
Processor |
% Interrupt Time |
_Total |
|
Processor |
DPCs Queued/sec |
_Total |
|
Processor |
DPC Rate |
_Total |
|
Processor |
% Idle Time |
_Total |
|
Processor |
% C1 Time |
_Total |
|
Processor |
% C2 Time |
_Total |
|
Processor |
% C3 Time |
_Total |
|
Processor |
C1 Transitions/sec |
_Total |
|
Processor |
C2 Transitions/sec |
_Total |
|
Processor |
C3 Transitions/sec |
_Total |
|
Processor |
% Processor Time |
0 |
|
Processor |
% User Time |
0 |
|
Processor |
% Privileged Time |
0 |
|
Processor |
Interrupts/sec |
0 |
|
Processor |
% DPC Time |
0 |
|
Processor |
% Interrupt Time |
0 |
|
Processor |
DPCs Queued/sec |
0 |
|
Processor |
DPC Rate |
0 |
|
Processor |
% Idle Time |
0 |
|
Processor |
% C1 Time |
0 |
|
Processor |
% C2 Time |
0 |
|
… etc … |
|
|
First the EAV schema
The EAV pattern essentially says that since we don’t know what attributes (what specific performance counters, in this case) need to be stored, we store the data in labeled rows, instead of in columns – rows being really easy to change at run time, while columns are not. The simplest implementation of this is with two tables, one small table of monitored servers, and one huge table of labeled performance counter values. If a “sample” is defined as a set of specific counter values at one point in time, then this big table will have a set of rows for each server, for each time point, where each row is labeled per performance counter:
CREATE DATABASE [CountersEAV]
GO
ALTER DATABASE [CountersEAV] SET RECOVERY SIMPLE
GO
USE [CountersEAV]
GO
CREATE TABLE dbo.[Servers](
ServerID int IDENTITY(1,1) NOT NULL,
ServerName varchar(128) NOT NULL,
[Description] varchar(255) NOT NULL,
CONSTRAINT PK_Servers PRIMARY KEY CLUSTERED ( ServerID ASC )
);
GO
CREATE UNIQUE NONCLUSTERED INDEX UniqueServerNames ON dbo.[Servers]
(
ServerName ASC
);
GO
CREATE TABLE dbo.CounterSamples(
SampleTime datetime NOT NULL,
ServerID int NOT NULL,
PerfObject nvarchar(30) NOT NULL,
Counter nvarchar(50) NOT NULL,
Instance nvarchar(30) NOT NULL,
Value sql_variant NOT NULL,
CONSTRAINT PK_CounterSamples PRIMARY KEY CLUSTERED
(
SampleTime ASC,
PerfObject ASC,
[Counter] ASC,
Instance ASC,
ServerID ASC
)
);
GO
ALTER TABLE dbo.CounterSamples
WITH CHECK ADD CONSTRAINT FK_CounterSamples_Servers FOREIGN KEY( ServerID )
REFERENCES dbo.[Servers] ( ServerID );
GO
ALTER TABLE dbo.CounterSamples CHECK CONSTRAINT FK_CounterSamples_Servers;
GO
Inserting the counter data into this table is trivial:
CREATE PROCEDURE saveCounterSample (
@SampleTime datetime,
@ServerID int,
@PerfObject nvarchar(30),
@Counter nvarchar(50),
@Instance nvarchar(30),
@Value float
) AS
BEGIN
SET NOCOUNT ON;
INSERT INTO dbo.CounterSamples (
SampleTime,
ServerID,
PerfObject,
[Counter],
Instance,
Value
) VALUES (
@SampleTime,
@ServerID,
@PerfObject,
@Counter,
@Instance,
@Value
);
END GO
Next, EAV Sample Data
In order to see how this will perform at scale, we need to populate it with a lot of sample data, certainly more than will fit in memory. I’ve got a script that can do this by just creating random values for a reasonable set of different counters, in bulk:
INSERT INTO dbo.[Servers] ( ServerName, [Description] )
VALUES ( 'MyServer', 'Test Server' );
INSERT INTO dbo.[Servers] ( ServerName, [Description] )
VALUES ( 'BigServer','Big One' )
INSERT INTO dbo.[Servers] ( ServerName ,[Description] )
VALUES ( 'ProdServer' ,'Some Server' )
INSERT INTO dbo.[Servers] ( ServerName ,[Description] )
VALUES ( 'SQLProd01' ,'Yet Another Server' )
INSERT INTO dbo.[Servers] ( ServerName ,[Description] )
VALUES ( 'SQLProd02' ,'Still another' )
GO
Sample counter values (note: this makes a decent quantity of data, four weeks of one minute sample intervals for 194 counters on five imaginary servers 4 * 24 * 60 * 5 * 194 ~= 39,000,000 rows)
-- Populate with test data
-- Create rows for many counters representing a 24 hour period
DECLARE @now datetime;
SET @now = '2009-01-01';
DECLARE @val float;
WHILE @now < '2009-01-02' BEGIN
BEGIN TRANSACTION;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Memory', N'Pages/sec', N'Default', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'PhysicalDisk', N'Avg. Disk Queue Length', N'_Total', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'_Total', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'0', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'1', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'2', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'3', @val;
/* … repeat for many sample counters … */
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'MSSQL$TEST2000:General Statistics', N'User Connections', N'DEFAULT', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'MSSQL$TEST2000:Locks', N'Average Wait Time (ms)', N'_Total', @val;
COMMIT;
SET @now = DATEADD( minute, 1, @now );
END;
GO
-- Use the first day's values and a random multiplier to mock test data for a month:
DECLARE @i int;
SET @i = 1;
WHILE @i < 28 BEGIN
INSERT INTO dbo.CounterSamples (
SampleTime,
ServerID,
PerfObject,
[Counter],
Instance,
Value
) SELECT
DATEADD( DAY, @i, SampleTime ) as SampleTime,
ServerID,
PerfObject,
[Counter],
Instance,
CAST ( Value as float ) * ( RAND() * 1.5 ) as Value
FROM dbo.CounterSamples
WHERE '2009-01-01' <= SampleTime and SampleTime < '2009-01-02';
SET @i += 1;
END;
GO
-- Use the whole month's data and a random multiplier to generate mock data for more servers
DECLARE @i int;
SET @i = 2;
WHILE @i <= 5 BEGIN
INSERT INTO dbo.CounterSamples (
SampleTime,
ServerID,
PerfObject,
[Counter],
Instance,
Value
) SELECT
SampleTime,
@i as ServerID,
PerfObject,
[Counter],
Instance,
CAST ( Value as float ) * ( RAND() * 1.5 ) as Value
FROM dbo.CounterSamples
WHERE ServerID = 1;
SET @i += 1;
END;
GO
Index rebuild at the end, as the table is almost certainly highly fragged.
This design does function, but there are some challenges:
- The CounterSamples table is obviously huge, and highly repetitive because of the row labels.
- A query is most likely to “want” the data from one counter or another, as separate sets, but practically never from a mixture of different counters. Directly aggregating the “total % processor used” values with, for example, the “percent log used” values makes no sense, because the counters measure two unrelated metrics. So it seems like we have rows mixed together that most queries will want to separate.
Counterpoint: Dynamic Table Solution
Another possible schema is driven by the notion that at query time this data “wants” to be partitioned by counter, so all the values for each specific counter are in one place. Since we don’t know the counter list until run time, it means we have to code around the fact that tables need to be added dynamically to the schema. This does add to the complexity of the code, while possibly paying dividends in storage requirement and performance.
First, we can use the same Servers table as in the first example, so assume we have that one. Next, we will need to be able to dynamically generate a table to store counter samples for any specific performance counter. A template for such a table might look like this:
CREATE TABLE dbo.[LogicalDisk_% Disk Read Time__Total](
SampleTime datetime NOT NULL,
ServerID int NOT NULL,
Value float NOT NULL,
CONSTRAINT [PK_LogicalDisk_% Disk Read Time__Total] PRIMARY KEY CLUSTERED
(
SampleTime ASC,
ServerID ASC
)
);
GO
ALTER TABLE [dbo].[LogicalDisk_% Disk Read Time__Total]
WITH CHECK ADD CONSTRAINT [FK_LogicalDisk_% Disk Read Time__Total_Servers] FOREIGN KEY([ServerID])
REFERENCES dbo.[Servers] ([ServerID])
GO
ALTER TABLE dbo.[LogicalDisk_% Disk Read Time__Total] CHECK CONSTRAINT [FK_LogicalDisk_% Disk Read Time__Total_Servers]
GO
But, code has to be written to take a performance counter and generate a table like this at run time. In a “real” scenario, it probably makes sense to do that client side, but to mock this up I am going to make a fake, dynamic SQL driven stored procedure that will generate these tables:
CREATE PROCEDURE addCounterTable (
@PerfObject nvarchar(30),
@Counter nvarchar(50),
@Instance nvarchar(30)
) AS
BEGIN
DECLARE @tableName varchar(128);
SET @tableName = @PerfObject + '_' + @Counter + '_' + @Instance;
IF NOT EXISTS (
select * from INFORMATION_SCHEMA.TABLES
where TABLE_SCHEMA = 'dbo'
and TABLE_NAME = @tableName
) BEGIN
-- Requested table is missing, so create it
DECLARE @qtablename nvarchar(128)
DECLARE @qpkname nvarchar(128)
DECLARE @qfkname nvarchar(128)
SET @qtablename = QUOTENAME( @tableName );
SET @qpkname = QUOTENAME( 'PK_' + @tableName );
SET @qfkname = QUOTENAME( 'FK_' + @tablename + '_Servers' );
declare @createStmts nvarchar(max);
set @createStmts =
'CREATE TABLE dbo.' + @qtablename + '( ' + CHAR(10)
+ ' SampleTime datetime NOT NULL, ' + CHAR(10)
+ ' ServerID int NOT NULL, ' + CHAR(10)
+ ' Value float NOT NULL, ' + CHAR(10)
+ 'CONSTRAINT ' + @qpkname + ' PRIMARY KEY CLUSTERED ' + CHAR(10)
+ ' ( ' + CHAR(10)
+ ' [SampleTime] ASC, ' + CHAR(10)
+ ' [ServerID] ASC ' + CHAR(10)
+ ' ) ' + CHAR(10)
+'); ' + CHAR(10)
+ 'ALTER TABLE dbo.' + @qtablename + ' ' + CHAR(10)
+ 'WITH CHECK ADD CONSTRAINT ' + @qfkname + ' ' + CHAR(10)
+ 'FOREIGN KEY( ServerID ) ' + CHAR(10)
+ 'REFERENCES dbo.Servers ( ServerID ); ' + CHAR(10)
+ 'ALTER TABLE dbo.' + @qtablename + ' ' + CHAR(10)
+ 'CHECK CONSTRAINT ' + @qfkname + '; ' + CHAR(10)
-- PRINT @createStmts
EXEC( @createStmts )
END
END
GO
Similarly, client-side code would have to be written to insert counter samples into the database by intelligently using the correct table for each counter. Technically this isn’t very difficult, as one could put a check for a matching table into the constructor of a client-side object, which would validate that an appropriate table exists (or create one), and then use that table from that point on. I can also mock this in T-SQL with a dynamic SQL stored procedure, though I would not implement it this way on a real system:
CREATE PROCEDURE saveCounterSample (
@SampleTime datetime,
@ServerID int,
@PerfObject nvarchar(30),
@Counter nvarchar(50),
@Instance nvarchar(30),
@Value float
) AS
BEGIN
SET NOCOUNT ON;
DECLARE @tableName varchar(128);
SET @tableName = @PerfObject + '_' + @Counter + '_' + @Instance;
DECLARE @checkedTableName nvarchar(128);
SELECT @checkedTableName = TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo'
and TABLE_NAME = @tableName;
IF( @checkedTableName is not null )
BEGIN
DECLARE @insertcmd nvarchar(max);
SET @insertcmd =
N'INSERT INTO dbo.' + QUOTENAME( @checkedTableName ) + N' ( ' + char(10)
+ N' SampleTime, ServerID, Value ' + char(10)
+ N') VALUES (' + char(10)
+ N' @SampleTime, @ServerID, @Value ' + char(10)
+ N');';
DECLARE @paramlist nvarchar(1000);
SET @paramlist = N'@SampleTime datetime, @ServerID int, @value float';
--PRINT @insertcmd;
EXECUTE sp_executesql
@insertcmd,
@paramlist,
@sampleTime = @sampleTime,
@serverID = @ServerID,
@Value = @Value;
END
ELSE
BEGIN
RAISERROR('The required table does not exist.', 10, 1);
END
END
GO
Dynamic Tables Sample Data
So, in a real app the tables would be created on demand. In order to load sample data, though, it’s simpler just to initialize the database with the whole sample set of tables:
-- Setup for Sample Data
EXEC addCounterTable N'Memory', N'Pages/sec', N'Default'
EXEC addCounterTable N'PhysicalDisk', N'Avg. Disk Queue Length', N'_Total'
EXEC addCounterTable N'Processor', N'% Processor Time', N'_Total'
EXEC addCounterTable N'Processor', N'% Processor Time', N'0'
EXEC addCounterTable N'Processor', N'% Processor Time', N'1'
EXEC addCounterTable N'Processor', N'% Processor Time', N'2'
EXEC addCounterTable N'Processor', N'% Processor Time', N'3'
EXEC addCounterTable N'Processor', N'% Processor Time', N'4'
EXEC addCounterTable N'Processor', N'% Processor Time', N'5'
EXEC addCounterTable N'Processor', N'% Processor Time', N'6'
EXEC addCounterTable N'Processor', N'% Processor Time', N'7'
EXEC addCounterTable N'Process', N'% Processor Time', N'sqlservr'
EXEC addCounterTable N'LogicalDisk', N'% Disk Read Time', N'_Total'
EXEC addCounterTable N'LogicalDisk', N'% Disk Time', N'_Total'
/* ... More executions to make req’d tables ... */
EXEC addCounterTable N'MSSQL$TEST2000:Locks', N'Average Wait Time (ms)', N'_Total'
Then generate the same quantity of sample data as in the first example:
-- Populate with Sample Data
-- Create 24 hours of samples
DECLARE @now datetime;
SET @now = '2009-01-01';
DECLARE @val float;
WHILE @now < '2009-01-02'
BEGIN
BEGIN TRANSACTION
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Memory', N'Pages/sec', N'Default', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'PhysicalDisk', N'Avg. Disk Queue Length', N'_Total', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'_Total', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'0', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'1', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'2', @val;
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'Processor', N'% Processor Time', N'3', @val;
/* ... More executions for all counters ... */
SET @val = RAND() * 10;
EXEC saveCounterSample @now, 1, N'MSSQL$TEST2000:Locks', N'Average Wait Time (ms)', N'_Total', @val;
COMMIT
SET @now = DATEADD( minute, 1, @now );
END
-- Use the first day's values and a random multiplier to mock test data for a month:
DECLARE @i int;
SET @i = 1;
WHILE @i < 28 BEGIN
DECLARE @insertstmt nvarchar(max);
SET @insertstmt = '';
SELECT @insertstmt = @insertstmt +
'insert into dbo.' + QUOTENAME( table_name ) + '( '
+ 'SampleTime, '
+ 'ServerID, '
+ 'Value '
+ ') select dateadd( day, ' + cast(@i as varchar(5) ) + ', SampleTime) as SampleTime, '
+ 'ServerID, '
+ 'Value * ( RAND() * 1.5 ) as Value '
+'from dbo.' + QUOTENAME( table_name ) + ' '
+ 'where ''2009-01-01'' <= SampleTime and SampleTime < ''2009-01-02'';' + CHAR(10)
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME != 'Servers';
-- print @insertstmt;
EXEC( @insertstmt );
SET @i += 1;
END
GO
-- Use the month's data and a random multiplier to generate mock data for more servers
DECLARE @i INT;
SET @i = 2;
WHILE @i <= 5 BEGIN
DECLARE @insertstmt NVARCHAR(max);
SET @insertstmt = '';
SELECT @insertstmt = @insertstmt +
'INSERT INTO dbo.' + QUOTENAME( table_name ) + '( '
+ ' SampleTime, '
+ ' ServerID, '
+ ' Value '
+ ') SELECT SampleTime, '
+ cast( @i as varchar(5) ) + ' as ServerID, '
+ ' Value * ( RAND() * 1.5 ) as Value '
+ 'FROM dbo.' + QUOTENAME( table_name ) + ' '
+ 'WHERE ServerID = 1;' + CHAR(10)
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME != 'Servers';
-- PRINT @insertstmt;
EXEC( @insertstmt );
SET @i += 1;
END;
First Impressions
The first thing I note when looking at the sample data is the size discrepancy: the first model requires 5.5 GB to store the counter data, where the second fits in 1.2 GB. Most of that difference has to do with the fact that the EAV “labels” for this case are quite wide, so there’s a lot of repeated text in the EAV model.
The second thing of note is the interleave of different types of perf counters in the one large table of samples, which is what we’re going to examine for performance. This depends some on the order of the cluster key, but essentially it looks like this:
|
2009-01-01 00:00:00.000 |
1 |
LogicalDisk |
% Disk Read Time |
_Total |
1.8793138675398 |
|
2009-01-01 00:00:00.000 |
2 |
LogicalDisk |
% Disk Read Time |
_Total |
2.52991540434083 |
|
2009-01-01 00:00:00.000 |
3 |
LogicalDisk |
% Disk Read Time |
_Total |
1.22021347428242 |
|
2009-01-01 00:00:00.000 |
4 |
LogicalDisk |
% Disk Read Time |
_Total |
0.696864840326046 |
|
2009-01-01 00:00:00.000 |
5 |
LogicalDisk |
% Disk Read Time |
_Total |
0.645544958044297 |
|
2009-01-01 00:00:00.000 |
1 |
LogicalDisk |
% Disk Read Time |
C: |
2.26820505352887 |
|
2009-01-01 00:00:00.000 |
2 |
LogicalDisk |
% Disk Read Time |
C: |
3.05343721676384 |
|
2009-01-01 00:00:00.000 |
3 |
LogicalDisk |
% Disk Read Time |
C: |
1.47271534391143 |
|
2009-01-01 00:00:00.000 |
4 |
LogicalDisk |
% Disk Read Time |
C: |
0.841068849517578 |
|
2009-01-01 00:00:00.000 |
5 |
LogicalDisk |
% Disk Read Time |
C: |
0.779129213808746 |
|
< … 100's of rows … > |
|
|
|
|
|
|
2009-01-01 00:01:00.000 |
1 |
LogicalDisk |
% Disk Read Time |
_Total |
7.72396922675191 |
|
2009-01-01 00:01:00.000 |
2 |
LogicalDisk |
% Disk Read Time |
_Total |
10.3979378149299 |
|
2009-01-01 00:01:00.000 |
3 |
LogicalDisk |
% Disk Read Time |
_Total |
5.01507038723846 |
|
2009-01-01 00:01:00.000 |
4 |
LogicalDisk |
% Disk Read Time |
_Total |
2.86411050056799 |
|
2009-01-01 00:01:00.000 |
5 |
LogicalDisk |
% Disk Read Time |
_Total |
2.65318607846297 |
|
2009-01-01 00:01:00.000 |
1 |
LogicalDisk |
% Disk Read Time |
C: |
9.49033453385117 |
|
< … &c > |
|
|
|
|
|
Rows that I imagine one query “cares” about (green) are dispersed through the table, with a lot of other rows in between (gray).
So, tune in for the next post where I’ll have report query performance information, and a look at the buffer cache contents when reading from these two designs.
|
-
So, found this incredible freeware program on Codeplex with a really simple UI to handle performance optimization for SQL Server:

I was gonna post the link here but I lost it, and now the site seems to have been taken down...
OK, I am obviously lying. Wouldn't that be sweet, though? I do have a hunch there are developers working on, or at least philosophically aiming for, that level of automation around index tuning and partitioning, etc, but it's a long way off.
It's worth considering this dichotomy, though, when doing any kind of physical database design: call it the "Pay me Now or Pay me Later" principle. Basic information theory has, among other implications, the notion that it takes more effort to put a system into a high degree of organization (low entropy) than a lower degree of organization. And, of course, there is the related principle that it's possible to trade computation (time) for storage, and vice versa. In very concrete terms, this means that when designing a database at the physical storage level, perhaps the most important theme we work with is, "How organized will the data be in its stored state, to provide optimal efficiency."
At one end of the slider above, we have a theoretical limit of the most highly organized, most read-optimal state, where we have traded maximum storage for minimum compute time. A database way at that end of the spectrum would have, hypothetically, a covering index for every possible query, all joins pre-resolved, and aggregates pre-built and stored for every possible aggregation. The time taken to query the data would essentially be only the time required to fetch the data and send it back, with zero computation. The cost, then, is that any time information is added to such a database, a huge amount of computation is required to "file" it into that very detailed model. This is the essence of "Pay me Now." Inserts are expensive, in order to store the data in a highly organized state, thereby reducing the computation necessary to read results out later.
At the other end, all the data we are attempting to store is in a uniform, undifferentiated "heap." Inserts are really fast and straightforward, rather like slinging laundry onto a pile, because no computation is required to add data to the store. This is "Pay me Later." Laundry goes onto the pile easily, but when it's time to find all the matching socks, you have some work ahead of you.
So, "Duh," you may be saying. Of course.
The place where things get sticky is that the in-between states in this dichotomy don't fall on a nice, smooth gradient. In fact, the specific methods used and the technologies employed in SQL Server - all the details of table organization, keys, indexes, normalization, partitioning - make that in-between space all wrinkly and convoluted. Some specific techniques help speed some types of queries but not other types; some have a large impact and some a small impact. Some choices (which clustered index?) are inherently limited, where you have to trade one thing for another. Some introduce data integrity or redundancy questions that require complicated code to resolve.
Because of that complexity, or "wrinkly landscape," we often get into conversations like Aaron Bertrand's recent, excellent discussion on EAV, or my rant about TPH related to ORM and polymorphism.
At a high level, if I can use the EAV example, I find that it's useful to apply the measure above, at least in an abstract way. If you have a performance problem of some kind with a database, it might be because you're in the wrong place on that slider. And when evaluating some design pattern for how to arrange data (Kimball star schema, or TPH, or conventional normalization, or EAV, partitioning, etc.), taking into consideration data integrity, the main question is, "will the design pattern get you to the right position on that slider?" What's wrong with EAV? Well, it depends. Part of the famous It Depends comes from the wrinkly landscape between Pay me Now and Pay me Later. It depends on the granular details of your application, and where EAV would put you on the pay-me-now or pay-me-later meter. It might be just the thing, or it could turn into a real performance or data integrity challenge, depending on how it applies to the problem at hand.
|
-
Spoiler: TPH is an evil trap
In the last installment I pleaded for the equal treatment of the database schema and object model when implementing an application with a database, on the grounds that the failure of either means the failure of the whole system. I think it follows that a decent application ought to have both a good quality database schema and a fabulous code base. If the database is an embarrassment, on relational grounds, then it compromises the whole, just as if the app code is an embarrassment on OO grounds.
(I am working from the assumption that we do try to create good quality software; there's the whole "sales driven design" notion that says we can't ever create anything of value because it's too costly; I reject that notion as ... well, boring. There would not be much to write here if we simply give up trying to make anything good.)
So, for the sake of argument let's assume we've picked an O/R Mapper to help create the app by automating much of the plumbing of repetitive SQL queries, and moving data to and fro. I am going to invoke NHibernate, because it seems decent, and mature, and so on. I think Entity Framework has the potential to turn into something great, if built out correctly in upcoming versions, yet it seems so far to be over-inflated with a lot of marketing noise, while it catches up to other solutions. Then again, I am not really an expert with ORM, and others could to a better job of comparing. Assume an appropriate level of abstraction is maintained between the database and the application by some implementation of either views, functions and stored procs, or by a nice, clean data access layer, which you can do with a mapper if you implement it correctly.
The most trivial bits of O/R mapping aren't that interesting - a simple atomic object corresponds to a simple row in a simple table using a surrogate key. A collection like List<T> in the app is a collection of such objects representing a subset of the rows from a table. "Product Object" implies "select ProductID, Name, Price from Products where ProductID = n." That's all quite basic, and something one would probably create manually using traditional coding techniques. The place where things get interesting and hairy is in the more complicated cases where the object model and the relational model don't match up so neatly. I'll cut straight to the chase: polymorphism is where things get ugly, and that's where we should start to be careful. And if you look up polymorphism related to popular ORM's, an issue rears its ugly head immediately: TPH or Table Per Heirarchy persistence model for polymorphism. Consider these descriptions:
NHibernate Inheritance Mapping docs
https://www.hibernate.org/hib_docs/nhibernate/html/inheritance.html
Oren Eini (Ayende Rahien)'s very concise blog description:
http://ayende.com/Blog/archive/2009/04/10/nhibernate-mapping-ndash-inheritance.aspx
Alex James excellent version pertaining to Entity Framework
http://blogs.msdn.com/alexj/archive/2009/04/15/tip-12-choosing-an-inheritance-strategy.aspx
I don't have anything against the smart, hard-working folks that put this stuff together, nor am I anti-ORM in any way. Still, amazingly, all these documents and blogs about both NHibernate and Entity Framework generally give about equal billing to three methods for representing polymorphic objects in a relational schema -- even though one, TPH (which, horribly, seems always to come first in every such list) violates the most basic best practice for relational database design. It's not normalized even in a trivial sense, it makes it impossible to enforce integrity on the data, and what's most "awesome:" it is virtually guaranteed to perform badly at a large scale for any non-trivial set of data.
Luckily, there is a workable implementation in most of these tools, Table per Type. What's missing from the docs and recommendations around TPH is some big, bold print that says,
"Never do TPH, because it's a horrible nightmare in the relational world, a violation of 30 years of best practice in database design, and probably makes a DBA somewhere cry inside. Always do TPT. TPH is presented here only as a twisted curiosity, and is to be avoided."
Lack of that warning about TPH means people might be tempted to actually use it, which is unfortunate.
I imagine all DBA's will hate TPH. Some Devs I am certain will complain that the DBA's are being obstinate or unreasonable, but here's the thing: TPH is a bad idea, and the DBA's know that. If the shoes and feet were reversed, so to speak, where the DBA's were demanding that Developers violate some fundamental programming principles in the design of their code -- without even really understanding it -- they might be a bit miffed as well, and would have valid reasons. The sad thing is that blunders like TPH can sour DBAs on the entire notion of ORM, based simply on the notion that ORM tools can seem to encourage bad database design. It really ought to be the other way around. But let's set the cultural argument aside and look at technical reasons why this method is to be avoided.
I worked with an ISV-provided piece of software for years that, by coincidence, had important, busy tables designed on the TPH principle. Ironically it didn't use anything as elaborate as an O/R mapper; the developers had just adopted the same notion for polymoprhism. In this case the application has an accounting function, among other things. Each financial transaction in the accounting part of the system, of every type (charges, payments, checks, invoices, etc.) is stored in one massive Transactions table. The table is 120 columns wide and practically every column is nullable, and might or might not have a value depending on what type of transaction was indicated - you guessed it - in the "discriminator" column. There are 16 types of transactions, if I recall. This massive table also stored both posted and un-posted transactions (hey, they're all "transactions," right?). It also contained the header records for batches of transactions that were entered at the same time. In order to "post" un-posted transactions, an elaborate process would change the discriminator/record type and set various dates and flags in the table, through a ... erm, "complex" ... procedure. A massive amount of this huge, largely empty table (nulls take space, even without values) had to be kept in cache on the SQL Server all the time. It was basically filling the RAM on the server with blank space, and the odd value here or there at random. Ironically, this design provided no advantage whatsoever. It's wasn't like there was a trade off of this advantage for that disadvantage. It was simply, purely horrible.
The application did not perform well, at all, at scale, and was just a constant source of expense and pain for us. Not because of some purist ideal of normalization, but because all the financial data was mixed into a great disorganized pile. The frustrating thing about it was that had it been based on a reasonable schema, there would have been no performance issue at all, and it would easily have worked for hundreds or a few thousands of end users on fairly modest hardware.
Of the docs and blog posts I have been able to find on this topic, most give only a cursory overview, along the lines of "you might want to do this, or that, it's not that important." Alex James does more in the way of explanation about how they work and why to pick one. Interestingly, there's a table on his blog page ranking them by different criteria, and the TPT, relational-friendly option he marks "winner" in every category except performance. I had the privilege of attending a session at the PASS Summit that he gave, which was excellent, and he is a very sharp guy with an understanding of the database side of this issue. Yet I do have to wonder about the "performance" argument, as it's not been my experience at all. Maybe I am jumping to conclusions, but I wonder if it's the old myth about "avoid joins to enhance performance" coming back around.
I suspect this: for a trivial case, where you have five or six object types that are quite similar, then stuffing them into one table might avoid some joins - as long as the cost of the interleaving of rows of different types, which reduces the efficacy of indexes and takes more memory and slows many types of queries down, doesn't cancel the intended optimization. However for real world cases with large sets of data, design-by-join-avoidance isn't really a good strategy. Better performance is gained, in most cases, by the partitioning of data into smaller, more highly sorted and organized sets. Yes, the server has to perform joins, but the cost of joins is much less than the cost of brute-force searching through a high volume of interleaved, mixed, sparse rows of different "types" in one massive table.
Perhaps a clearer analogy: suppose we had one uber-table, called "Object." Everything inherits from Object, so why not one massive, ultimate Table Per Heirarchy with Object as the root. We just put a discriminator that indicates what type of object each row represents, and a nullable column for every property of every object. Simple! The issue is that everything, in this case literally everything, is mixed together. Should perform swimmingly, because there are no joins!
But wait. There are other issues over and above joins. How does the server locate data in this table? How much cache would be required? What does an index look like? Clearly the server would perform better if the data were partitioned sensibly, into smaller groups, so that it could find related information. The answer, of course, is tables. Having multiple tables not just convenient, it's a performance optimization. In fact, there are many performance issues that can be solved by adding tables, not removing tables - most of the time if unrelated rows can be partitioned into separate sets (tables), it makes it easier and faster for the server to locate them.
So if performance is the one argument for TPH, the one that Alex says "trumps these other concerns," then I have to say I'm not quite buying it. In fact, I can't see why anyone would ever want to do TPH for anything other than a small toy of a system. In fact, can we turn that thing off in any of these O/R mappers? Where's the OFF button?
Back to the larger point, which is, "what are the major design issues for connecting apps to data?" I think the points I want to make are: a. the database schema still deserves careful attention and design, even in 2010, and b. the use of modern methodologies (agile, ORM, etc.) can be aligned with best practices but you have to be intentional about it. You have to be able to say things like, "TPH sounds neat-o, but it violates normalization fundamentals. What's up with that? That seems like a bad idea, even according to my olde-time gray-haired DBA over here. Why do that? What are my other options? Which ones don't make the database suck?" The end goal should be "and" - great database AND great application, no compromises.
|
-
This is the third part of a series ( Part 1, Part 2) thinking out loud about the decision making around data access for applications. Once you've considered how tightly bound your application code can safely be to tables, I would like to put two related thoughts out there.
First, it's counterproductive, over the long term, to think of only the application as the focus of all development effort, and the database just as "back end" or "persistence." In the current environment where practically all applications are OO, and all storage is relational, you will have two models to design and maintain, one object oriented and one relational. Is that ideal? Perhaps not, but ignoring one or the other, or wishing the distinction away only causes future pain and suffering.
Second, getting to effective performance and scalability is 90% smart design and only 10% capable hardware.
In case the connection between these statements is not obvious, consider the following pattern of failure that's repeated in IT departments in businesses everywhere:
- Developers and SME's or Domain Experts in an ISV work out super duper code that exactly fits business needs and wows users with elegant interface design and indispensable features. Sweet!
- Caught up in the heady fun of design and coding C# (or pick a language), they grudgingly come to realize that it'll be necessary to store the data from their app. In a halfhearted way, they figure out how to stuff the data into a database. "Darn that antiquated SQL and those restrictive tables! I need polymorphism! I need inheritance!" Some fundamental blunder is made like choosing EAV, or hopelessly wide tables of nullable columns, or a schema that mixes unrelated rows in the same huge, index-hostile table, or something else that cripples an otherwise very capable RDBMS.
- As product development continues through version 2 and version 3, the schema of the database becomes hopelessly fixed, as the app code is glued tighter and tighter to every detail of the database schema. The product sells well and gains market (again, it's awesome-looking from the front end, and has great features) and with the increased market share come some large customers.
- The product then hits a wall, whose foundation was laid back in step 2. The problem is unfixable, short of a rewrite, and large customers are gravely unhappy.
- Some hapless DBA working for a large customer is then spec'ing an unbelievably overpowered SQL server, to overcome, by brute force, the unsolvable design issue. Said DBA is shaking his/her head in disbelief. Dev's on the project start yelling "Tokyo Cabinet! SQL Server doesn't scale!"
- Because scalability via good coding is cheap, and scalability by massive hardware is not, a whole pile of cash is wasted.
Here I'm afraid I have to resort to a cornball analogy, so please bear with me. Today's apps are always two-headed beasts, like two dancers. Think Apolo Anton Ohno & Julianne Hough doing the Paso Doble. They only work well when, at some level, both the database and the application know how to boogie. One might lead, maybe the other follows, but they both have to dance. And the failure illustrated above is simple: one partner can dance and the other one can't. Maybe when they audition doing the hokey-pokey with 10,000 rows of data, everything's fine, but when it comes time for the Big Time Ballroom Finals, it makes no difference how well one dancer can dance if the other one is constantly tripping and falling.
When examined closely, the critical mistake is actually two-fold: first, the design of one of the two required data models (relational and OO) was ignored in preference for the other. Further, the application was bound tightly to the database schema such that any correction to that schema becomes a large-scale, breaking change. The team is truly backed into a corner.
To avoid this common disaster scenario, it's important, then, to take two precautions:
- Don't neglect the database design. Modern RBDMS's work well, and they do scale when designed correctly; it's easy, however, to make a schema that circumvents the very features that make databases perform. The relational model in your two-model system might not be an exact match for the OO object graph, if the system is to perform well. If your team is dev-centric, and you don't have a database wonk in house, find one anyway. At least have someone capable advise about the structure of the DB, before the system is stuck with an unworkable schema.
- Even if you don't use the database itself to encapsulate data access (see the previous post), do encapsulate it somehow in a clean, separate layer of code, so that the database has the possibility to be re-factored. Use something like the repository pattern. If you use an object/relational mapper, make sure it can adequately abstract the details of the database schema so that changes are possible over time. Review your design with the phrase, "How would I change this code if the underlying table changes in the database..."
|
-
Mercy, I think I made it. Today was the post-con for the PASS Summit, and I was in a great session hosted by Louis Davidson and Paul Nielsen. I say "hosted" because it was really more a guided, interactive discussion than a lecture, which is great for PASS, and especially great for the end of PASS when everyone's eyes are a little glazed over. It was great to get quality time with these two powerful brains. I also got to meet a couple more great folks - Aaron Bertrand and John Paul Cook, among others.
The facility folks were breaking down all the banners and booths during the day, which is always a little bittersweet. I was powerfully reminded of this feeling I used to have when I was a kid, watching TV, the credits would roll at the end of Dukes of Hazzard and Waylon Jennings would start in singing the theme song. I'd looked forward to the show all week, and it was awesome, and yet over too soon, which made me a little sad.
[If you are not familiar with the Dukes of Hazzard, it was one of the worst and most successful television programs in the U.S. in the early '80s. It appealed, completely transparently and without even attempting sophistication, to the American boy's ideal of muscle-car-driving southern Robinhood/outlaw. Plus, they jumped a Dodge Charger over something in every episode. I was eleven, OK? :-) ]
|
-
OK kids, I am tired. Spanked. Tapped. Fried. So much pure, I.V. delivered Awesome(R) in one week, and I am getting too old to absorb it all. Again all the Uber-Bloggers (you know who you are) have reported the play-by-play from the Summit, but still I have to shout out to a few good people from today: Dave Fackler's session on real-world SSIS was indeed great - stories about developing a real, working, geographically dispersed, complex ETL process with SSIS, both good and bad. Nothing like real world scenarios to sharpen things up. I'm not sure I would do exactly what he's doing, but that's why we have a Summit, to compare notes, and it was excellent and informative.
Neat lunch with a program manager from the storage engine and a bunch of SQL CAT and Field Support guys, my apologies for not remembering all their names. Also, here's a surprise - a session with the excellent Alex James on the new version of Entity Framework, which I think will be very, very good. Yes, you heard me correctly. I am a professional T-SQL-loving DBA and I'm putting it out there: Entity Framework is going to be a Very Good Thing (VGT(R)). Paul and Louis can beat me up in their post-conference session on DB design tomorrow. If I survive.
|
-
The summmit is rolling into it's last day (post-conference sessions aside). I got to see some amazing stuff yesterday. Hillarity at the Quest breakfast -- picture five DBAs, in a clown car -- but great content interleaved with the jokes. The MVP Deep Dives book seems to be a huge success, and the booksigning was very cool.
I got to meet more of my heros, like:
Erland Sommarskog (http://www.sommarskog.se/) who is interesting not only because he's Swedish (most good things come from Sweden) but because he was basically blogging about SQL Server before blogging was cool, perhaps even before "blog" was a word. Erland wrote some indispensable articles on error handling patterns and dynamic SQL that are a must-read for every DBA and database dev. If you don't know those articles, stop reading right here and open a tab. Seriously.
Brad McGehee -- this is a guy who can distill the best performance practices, no matter how complex, into coherent, understandable language for any DBA to fully understand. Also lives in Hawaii, lucky guy. I'd had some email correspondence with Brad, but it was nice to meet him face to face.
Brent Ozar, all around mover and shaker. How does he DO that?
Gail Shaw. Period. Awesome. Wonderwoman of the core SQL engine, indexes, you name it.
Buck Woody, Allen White, Kevin Kline -- too many to list them all.
In line at the book signing, by pure luck, I also bumped into Dave Fackler, whose SSIS session, from his description, sounds like it will be fantastic, and exactly what I need for a project at work.
Other high points - enjoyed Craig Utley's session on SSAS best practices, which was a great precursor for Chris Webb's Aggregation Design talk. Also Red Gate booth small talk. Love those folks. Project Madison at the HP booth was very cool. Got a great demo of SQL Sentry (unsolicited plug here) which looks terrific.
|
|
|
|
|
|