|
|
|
|
Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.
-
INSERT and EXEC: relational brothers in arms, helping you get your
data and then put it somewhere. But like all brothers, their
relationship has its ups and downs and sometimes you need to look just
below the surface to see the real issues. In this post I will cover
hidden penalties that a decision--seemingly purely aesthetic--can bring
to the fore.
To illustrate the first of these examples I will be using the following table:
USE tempdb GO
CREATE TABLE dbo.MyTable ( [SalesOrderID] [int] NOT NULL, [SalesOrderDetailID] [int] NOT NULL, [CarrierTrackingNumber] [nvarchar](25) NULL, [OrderQty] [smallint] NOT NULL, [ProductID] [int] NOT NULL, [SpecialOfferID] [int] NOT NULL, [UnitPrice] [money] NOT NULL, [UnitPriceDiscount] [money] NOT NULL, [LineTotal] [money] NOT NULL, [rowguid] [uniqueidentifier] NOT NULL, [ModifiedDate] [datetime] NOT NULL ) GO
Consider the following queries, disregarding the fact that this
isn't a good dynamic SQL example (I'm trying to keep it simple), and
take a moment to think about the differences between them from a
behavioral point of view:
EXEC (' INSERT dbo.MyTable SELECT * FROM AdventureWorks.Sales.SalesOrderDetail ') GO
INSERT dbo.MyTable EXEC (' SELECT * FROM AdventureWorks.Sales.SalesOrderDetail ') GO
It's quite common to insert some data into a table from dynamic SQL,
and for as long as I can remember, I've believed the choice between
these two forms of the query to be primarily an issue of style. My
preference has been for the latter style, as I feel that it's a bit
more readable. It also has the benefit of better supporting ownership
chains, but that's something that, in most cases, we don't have to
worry about any longer in SQL Server 2005 or 2008. But let's move
beyond style and potential security issues and get to the substance.
What actually happens when we run these two queries?
In the first case--insert done inside of the EXEC--the first step is
that the outer statement--the EXEC itself--is parsed and validated. Now
we jump down one level of context, into the string that was passed to
EXEC, and that string is parsed and validated. The referenced objects
are resolved, and the plan cache is checked. If there is no plan, the
query is compiled. And then it's executed. Data is streamed from
SalesOrderDetail into MyTable and when the query inside of the EXEC is
done control returns to the calling context--EXEC--and assuming there
were no errors, the EXEC is now completed.
In the second case, something quite a bit different occurs. The
insert is parsed and validated, and an INSERT EXEC plan is generated.
This plan does not involve SalesOrderDetail, as the query optimizer
doesn't know at this point in the process where the data will be
inserted from, since that step is dynamic. So the plan references
something called the "Parameter Table" (plan truncated for simplicity):
|--Table Insert(OBJECT:([tempdb].[dbo].[MyTable])) |--Top(ROWCOUNT est 0) |--Parameter Table Scan
Once this plan has been generated, we again jump down one level of
context, and the inner string is parsed and validated, the referenced
object resolved, the cache checked, and the compiled plan executed. But
what is this Parameter Table thing?
Here's where things start really diverging. Data is not, at this
point, streamed from SalesOrderDetail directly into MyTable. Rather, it
is streamed from SalesOrderDetail into the Parameter Table. And the
Parameter Table, as it turns out, is in actuality a hidden temporary
table. And not until all of the data has streamed into that hidden
temporary table is control resumed by the EXEC context, and only then
does the data start moving into its final home.
The natural question you might be asking yourself at this point is,
just how much overhead does this Parameter Table introduce into the
equation? The answer might just send you racing to check your stored
procedures: the additional cost for the Parameter Table is well over
100% as compared with doing the insert inside of the EXEC. The fact
that all of the data must be spooled to the Parameter Table before the
insert can even begin tells us that this must be true, and it can be
verified using a simple check against the sys.dm_exec_requests DMV, as
in the following example:
EXEC (' INSERT dbo.MyTable SELECT * FROM AdventureWorks.Sales.SalesOrderDetail ')
SELECT r.writes AS [INSERT inside EXEC writes] FROM sys.dm_exec_requests r WHERE r.session_id = @@SPID GO
INSERT dbo.MyTable EXEC (' SELECT * FROM AdventureWorks.Sales.SalesOrderDetail ')
SELECT r.writes AS [INSERT EXEC writes] FROM sys.dm_exec_requests r WHERE r.session_id = @@SPID GO
So that's that. We should avoid INSERT EXEC and try to do our inserts in the same context in which the SELECT is running--right?
Well, yes and no. There is another element at play here which I
haven't yet mentioned. What if we were only inserting a few rows, and
the table we were inserting into looked something like the following:
CREATE TABLE #MyTable ( [SalesOrderID] [int] NOT NULL, [SalesOrderDetailID] [int] NOT NULL, [CarrierTrackingNumber] [nvarchar](25) NULL, [OrderQty] [smallint] NOT NULL, [ProductID] [int] NOT NULL, [SpecialOfferID] [int] NOT NULL, [UnitPrice] [money] NOT NULL, [UnitPriceDiscount] [money] NOT NULL, [LineTotal] [money] NOT NULL, [rowguid] [uniqueidentifier] NOT NULL, [ModifiedDate] [datetime] NOT NULL ) GO
The only difference between this table and the previous one is that
this is a temporary table and the other is not. But temporary tables
have their own interesting little twists, especially when it comes down
to one of the key enemies in a highly-transactional system:
recompilation. As it turns out, doing the insert inside the EXEC will
cause the internal statement to recompile every time a new temp table
is encountered. This means that if you have a stored procedure that
creates a temp table, puts together a bit of dynamic SQL, and does an
insert inside of that dynamic SQL, you'll now have yourself a
recompilation problem.
To see this illustrated, try the following script. Here only a
single row is inserted as a result of the dynamic query, but it's
complex enough that the compile time more than overshadows the overhead
of the Parameter Table:
USE tempdb GO
DBCC FREEPROCCACHE GO
CREATE TABLE #AvgTimes ( CPU_time DECIMAL(19,4) NOT NULL, insert_type VARCHAR(25) NOT NULL ) GO
CREATE TABLE #MyTable ( [SalesOrderID] [int] NOT NULL, [SalesOrderDetailID] [int] NOT NULL, [CarrierTrackingNumber] [nvarchar](25) NULL, [OrderQty] [smallint] NOT NULL, [ProductID] [int] NOT NULL, [SpecialOfferID] [int] NOT NULL, [UnitPrice] [money] NOT NULL, [UnitPriceDiscount] [money] NOT NULL, [LineTotal] [money] NOT NULL, [rowguid] [uniqueidentifier] NOT NULL, [ModifiedDate] [datetime] NOT NULL )
EXEC (' INSERT #MyTable SELECT TOP(1) sod.* FROM AdventureWorks.Sales.SalesOrderDetail sod WHERE sod.UnitPrice > 10 AND sod.LineTotal > 100 AND EXISTS ( SELECT * FROM AdventureWorks.Sales.SalesOrderHeader soh JOIN AdventureWorks.Person.Contact c ON c.ContactID = soh.CustomerID WHERE soh.SalesOrderID = sod.SalesOrderID AND c.LastName LIKE ''Ab%'' ) AND EXISTS ( SELECT * FROM AdventureWorks.Production.Product p WHERE p.ProductID = sod.ProductID AND p.Color IS NULL ) AND NOT EXISTS ( SELECT * FROM AdventureWorks.Sales.SalesOrderHeader soh JOIN AdventureWorks.Sales.SalesPerson sp ON soh.SalesPersonID = sp.SalesPersonID WHERE soh.SalesOrderID = sod.SalesOrderID AND sp.CommissionPct > 50 ) ')
INSERT #AvgTimes SELECT r.cpu_time, 'INSERT inside EXEC' FROM sys.dm_exec_requests r WHERE r.session_id = @@SPID
DROP TABLE #MyTable GO 5
CREATE TABLE #MyTable ( [SalesOrderID] [int] NOT NULL, [SalesOrderDetailID] [int] NOT NULL, [CarrierTrackingNumber] [nvarchar](25) NULL, [OrderQty] [smallint] NOT NULL, [ProductID] [int] NOT NULL, [SpecialOfferID] [int] NOT NULL, [UnitPrice] [money] NOT NULL, [UnitPriceDiscount] [money] NOT NULL, [LineTotal] [money] NOT NULL, [rowguid] [uniqueidentifier] NOT NULL, [ModifiedDate] [datetime] NOT NULL )
INSERT #MyTable EXEC (' SELECT TOP(1) sod.* FROM AdventureWorks.Sales.SalesOrderDetail sod WHERE sod.UnitPrice > 10 AND sod.LineTotal > 100 AND EXISTS ( SELECT * FROM AdventureWorks.Sales.SalesOrderHeader soh JOIN AdventureWorks.Person.Contact c ON c.ContactID = soh.CustomerID WHERE soh.SalesOrderID = sod.SalesOrderID AND c.LastName LIKE ''Ab%'' ) AND EXISTS ( SELECT * FROM AdventureWorks.Production.Product p WHERE p.ProductID = sod.ProductID AND p.Color IS NULL ) AND NOT EXISTS ( SELECT * FROM AdventureWorks.Sales.SalesOrderHeader soh JOIN AdventureWorks.Sales.SalesPerson sp ON soh.SalesPersonID = sp.SalesPersonID WHERE soh.SalesOrderID = sod.SalesOrderID AND sp.CommissionPct > 50 ) ')
INSERT #AvgTimes SELECT r.cpu_time, 'INSERT EXEC' FROM sys.dm_exec_requests r WHERE r.session_id = @@SPID
DROP TABLE #MyTable GO 5
SELECT AVG ( CASE a.insert_type WHEN 'INSERT inside EXEC' THEN a.CPU_time ELSE NULL END ) AS [CPU time - INSERT inside EXEC], AVG ( CASE a.insert_type WHEN 'INSERT EXEC' THEN a.CPU_time ELSE NULL END ) AS [CPU time - INSERT EXEC] FROM #AvgTimes a GO
DROP TABLE #AvgTimes GO
So what have we learned today? The choice between INSERT EXEC and
INSERT inside of EXEC is not purely stylistic and has definite
performance implications. Here are the rules I'll be following from now
on:
- When working with permanent tables, always avoid INSERT EXEC if
possible. There are some cases where it won't be possible to avoid. I
only showed dynamic SQL in this post, but INSERT EXEC also applies to
stored procedures. Can you safely rip apart all of the stored
procedures in your system in order to avoid this issue? Maybe not quite
as easily as you can rip apart the dynamic SQL within stored procedures.
- When working with temporary tables, evaluate the complexity of the
operations, the amount of data to be inserted, and most importantly,
test every solution! The more rows that are inserted as a result of the
INSERT EXEC, the more the overhead you'll get from the Parameter Table.
On the flip side, the more complex the dynamic statement, the more
overhead you'll get from recompilations. Every scenario is bound to be
different and you may just learn something about your processes by
doing this extra bit of analysis.
Enjoy!
|
-
Several months ago I asked Joe Sack why his team, the SQL Server Premier Field Engineers, wasn't blogging and sharing the insights that result from their work with a number of high-end SQL Server customers. He replied that he thought it was a great idea, and then I heard nothing until last week, when I was happy to receive an e-mail from Joe saying that the blog had finally been created. The team has already put up a few interesting posts and I'm sure there are a lot more to come, so check it out! I'll be adding this blog to the SQLblog External Feeds Roller shortly, if you would rather grab the content from there.
|
-
On Tuesday night I was honored to present a new talk to the Cape Cod .NET Users Group that meets in Plymouth, MA. The talk is called "T-SQL Power! Learning to Harness the Under-Used OVER Clause". Following is the abstract for the talk: First introduced in SQL Server 2005, the OVER clause is an ANSI SQL
enhancement that gives you tremendous control when dealing with
aggregations. By using the OVER clause, query writers can
simultaneously aggregate columns based on multiple groups. The feature
also enables the query engine to provide windowing mechanisms for
ranking and row numbering. Leveraging these powerful language
enhancements allows you to solve a surprisingly large number of
difficult query problems—including custom paging schemes, data
de-duplication, "top-N" problems, and complex statistical calculations.
Even better, this feature can be creatively applied to help with
performance optimization of certain tough queries. In this session, you
will learn all of these techniques and see why, after applying the OVER
clause in dozens of projects since the release of SQL Server 2005, I
consider it to be one of the most powerful T-SQL features available.
I always enjoy trying out a new talk for the first time, especially when the talk is on a topic that I really enjoy. Based on some of the feedback and reactions I got, I have a few tweaks to make, but I think the talk is on the right track. My next scheduled delivery of this talk is in November, at the SQL Server Connections conference. Should be a great time! As promised on Tuesday night, demos are attached to this blog post. Enjoy!
|
-
After weeks of putting it off, I finally found the time and spent the last day and a half judging the Grouped String Concatenation Challenge. I would like to congratulate the winner, Peter Larsson, who submitted a great query and walks away with a shiny new MSDN Premium subscription. For those who are interested, following is a breakdown of the judging process, along with some commentary: Submission Process
To begin with, e-mails. As I mentioned in the first post, I ignored all e-mails that didn't follow the directions. Luckily this was only a few submissions. I felt it rather odd that people would spend a not insignificant amount of time working up a solution only to not bother to read the guidelines thoroughly. But that's human nature, I suppose. Round 1
Once I collected all of the queries that followed the e-mail rules (all of which are included in the attached ZIP file), I began testing against an expanded version of AdventureWorks (the script for that is also included). I decided to eliminate any queries that did not produce the correct output data based on my sample set, or which took longer than 30 seconds to complete. The majority of queries did complete in a reasonable amount of time, and many were eliminated because the output simply wasn't correct. The biggest issue was ordering of the elements in the comma-delimited sets. I also deducted points from one person's entry because of invalid column names, but I decided to let the entry ride to the next round. An important side note is that I created this competition with the sole intention of discovering new and different ways to do grouped string concatenation, and my hope was that someone would come up with a clever, fast solution. Unfortunately, that didn't happen, and every submission that used any technique except FOR XML PATH was eliminated in the first round of testing. I received some extremely creative solutions from a couple of people and I would like to mention them here:
- Alejandro Mesa's submissions made use of various XQuery techniques, and are very interesting to look at, although fairly slow
- Dean Cochrane's submission used an interesting idea of doing a MAX(CASE ...) pivot for the lists. Alas, the product names lists were not correct, so the submission didn't make it to the stress testing phase
- Scott Coleman tried a similar technique, actually using the PIVOT keyword. Unfortunately, this ran for over 200 seconds, so it was eliminated
Also interesting to note is that a few people tried recursive CTE solutions. These were all cancelled at the 300 second mark. Recursive CTEs, as mentioned before on here on SQLblog, simply do not scale in their current implementation. Round 2 After tabulating the Round 1 results I was left with 18 queries, and some obvious contenders. I ran each query through a SQLQueryStress session with 10 threads running 5 iterations each. In this phase the queries were separated into fairly distinct groups: Those that ran for around 5 minutes, those that ran for around 7-8 minutes, and those that ran longer. These groups were based, not surprisingly, on how much attention was paid by the query writers to the little details. For example, Peter Larsson's winning query cut down on logical reads dramatically by doing some of the grouping in a derived table, rather than in the outermost query as some of the other submissions did. Lesson learned: When doing aggregations, especially when joining a lot of tables, think about what you're really aggregating, and do the aggregation as early as possible. For example, if you need to aggregate sales per customer and get customer names, do the aggregation of the sales numbers first, then join out to get the customer names. Otherwise the query processor is forced to do more work than it has to do, and your query won't be as fast. Peter and a few other contestants understood this distinction and wrote queries that were much faster as a result. Round 3 Round 2 eliminated 4 queries, leaving me with 14 to judge based on query style. In order to judge consistently, I came up with 10 factors. A query was allotted 500 points to start, and failure to meet each factor resulted in a 50 point penalty. These factors were: - Consistent Indentation
- Does the query use the same rules for indentation in all parts? This is huge for readability and helps people understand where each section of the query starts and ends.
- Consistent Capitalization
- Does the query use the same rules for capitalization throughout? For example, keywords should be either all capitalized, or all lowercase.
- Capitalize Keywords
- I like to see keywords capitalized.
- Use AS for Alias Names
- AS is optional, and I've left it out in many queries I've written. But the more of other peoples' code I read, the more I realize that it really does help on the readability front. Use it. Always.
- Follow Capitalization of Base Tables/Columns
- If the base table is called OrderHeader, I want to see it used as OrderHeader when referenced in your query, rather than orderheader. A trainer I know found this out the hard way, when he reinstalled SQL Server on his laptop shortly before a training session, and used a case-sensitive collation rather than his previously-installed case-insensitive collation. He had been careless in adhering to capitalization for his training materials, and discovered the issue in front of the class. Oops.
- Long Horizontal Lists
- I don't like horizontal scrolling, and I find long lists difficult to read.
- Consistent Vertical Lists
- Put either a comma after each element or before each element, not both. Indent your lists the same way throughout. If you indent some items below the SELECT, don't put other items on the same line as the SELECT (or GROUP BY, or ORDER BY, etc)
- Alignment of Delimiters
- I follow a .NET-inspired style where I put delimiters on their own lines, and line them up vertically. This gives my code what I feel is an airy, easy-to-read feel. When reading others' code I look for some kind of alignment. Failure to align delimiters makes it very difficult to understand, again, where one section begins and another ends. By the way, common delimiters for this challenge included both parens and CASE...END.
- Comments
- Does the query have comments? Are the comments useful in understanding the logic?
- Aesthetics
- This is perhaps the most subjective. My general feeling on how I enjoyed reading the code.
All in all, the queries were pretty good. I would like to call out Rick Halliday, who had the highest score in this round with some very well formatted and highly readable code.
Round 4
After judging Round 3 I tallied all of the scores and was left with a tie for top 3: - Rick Halliday
- Leonid Koyfman
- Peter Larsson's query #4
All three of these queries were well thought out, but only one could win, so I took another pass through each. Rick's query, though extremely well written and readable, was eliminetad first due to the fact that it performed worse than the other two. This left Leonid and Peter. It was a tough choice, but I had to give the prize to Peter for taking the time to really think through the problem and figure out exactly how best to do the aggregations. Leonid was a very, very close second, and I really wish I had a consolation prize for him.
The End And that's that. Thank you to everyone who participated in the challenge. I hope it was as much a learning experience for you as it was for me. Congratulations again to Peter. All of the materials are attached in the ZIP file; please let me know if you have any questions, comments, etc.
|
-
-
Whew! Another TechEd session completed, and now I'm done for the week. Thanks to everyone who joined me today; I had a fantastic time talking about the ins and outs of Extended Events. Demos are attached. I realize how complex some of these topics are, so please feel free to drop me a note in the comments here with any questions or concerns. Also stay tuned--I'm continuing to work with and improve these samples and will be publishing updates from time to time in the coming months.
|
-
A huge thank you to everyone who attended today's session on exception handling! I had a great time and got some very good, on-point questions from the audience. As promised, the demos are attached to this blog post. Please let me know in the comments if you have any questions (whether or not you attended--everyone is allowed to join the party).
|
-
I'm not entirely certain what color that is, but follow me here (if you dare). In keeping with the Twitter mentality I'll keep this post ultra-short.
|
-
Two days ago, after posting what I thought was a pretty solid SQLCLR string splitting method, I received a comment telling me about a big thread on SQLServerCentral dedicated to the question of how best to split strings. So I jumped in, and went back and forth, and back and forth, and back... and forth...
Many, many messages and several revisions and re-revisions later, I am happy to present the newer, better, more bug-free SQLCLR string splitting function. Not only is it faster than the previous one I posted, but it also handles multi-character delimiters, ignores consecutive delimiters as well as pre and post delimiters, and best of all, it won't get itself stuck in an infinite loop in certain cases (always a really great stability feature).
This version walks the SqlChars character array rather than using the IndexOf method on the string, which we found to be a somewhat faster technique--and it's certainly the most scalable and memory efficient method I can imagine.
Special thanks to SQLServerCentral member Florian Reischl, who was the main person keeping the thread going with me during the last couple of days, and who managed to re-write my versions and eke out even better performance by modifying the algorithms. Great stuff--it was definitely the most fun I've had on a technical forum in quite a long time.
But without further ado, the code:
using System; using System.Collections; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction( FillRowMethodName = "FillRow_Multi", TableDefinition = "item nvarchar(4000)" ) ] public static IEnumerator SplitString_Multi( [SqlFacet(MaxSize = -1)] SqlChars Input, [SqlFacet(MaxSize = 255)] SqlChars Delimiter ) { return ( (Input.IsNull || Delimiter.IsNull) ? new SplitStringMulti(new char[0], new char[0]) : new SplitStringMulti(Input.Value, Delimiter.Value)); }
public static void FillRow_Multi(object obj, out SqlString item) { item = new SqlString((string)obj); }
public class SplitStringMulti : IEnumerator { public SplitStringMulti(char[] TheString, char[] Delimiter) { theString = TheString; stringLen = TheString.Length; delimiter = Delimiter; delimiterLen = (byte)(Delimiter.Length); isSingleCharDelim = (delimiterLen == 1);
lastPos = 0; nextPos = delimiterLen * -1; }
#region IEnumerator Members
public object Current { get { return new string(theString, lastPos, nextPos - lastPos); } }
public bool MoveNext() { if (nextPos >= stringLen) return false; else { lastPos = nextPos + delimiterLen;
for (int i = lastPos; i < stringLen; i++) { bool matches = true;
//Optimize for single-character delimiters if (isSingleCharDelim) { if (theString[i] != delimiter[0]) matches = false; } else { for (byte j = 0; j < delimiterLen; j++) { if (((i + j) >= stringLen) || (theString[i + j] != delimiter[j])) { matches = false; break; } } }
if (matches) { nextPos = i;
//Deal with consecutive delimiters if ((nextPos - lastPos) > 0) return true; else { i += (delimiterLen-1); lastPos += delimiterLen; } } }
lastPos = nextPos + delimiterLen; nextPos = stringLen;
if ((nextPos - lastPos) > 0) return true; else return false; } }
public void Reset() { lastPos = 0; nextPos = delimiterLen * -1; }
#endregion
private int lastPos; private int nextPos;
private readonly char[] theString; private readonly char[] delimiter; private readonly int stringLen; private readonly byte delimiterLen; private readonly bool isSingleCharDelim; } };
Enjoy!
|
-
It seems like every couple of months we see yet another post on SQLCLR string splitting routines. Many bloggers, I suppose, are still struggling, even three years later, to find that "perfect" use case for SQLCLR. Is string splitting it? Probably not. And with SQL Server 2008 table-valued parameters now available, SQLCLR string splitting has become an even less interesting exercise than it was before. None the less, a recent post on the Less Than Dot site has inspired me to counter with my own best SQLCLR string splitting method, for those of you who are still interested in solving this problem.
I've noticed that almost invariably, the methods posted online stress how very easy it is to do string splitting in .NET, thanks to the String.Split method. And while this easy method tends to work pretty well for small strings and on workloads that don't need to scale, it quickly breaks down when any amount of load is introduced (something that, unfortunately, most writers don't bother considering). The Less Than Dot writer, "onpnt" did do some testing, and discovered that--surprise, surprise--String.Split isn't all that great.
The issue? Well, it all comes down to large memory allocations and the art of scalable .NET programming--an area about which many SQL Server developers can (and do) remain blissfully naĂŻve. In .NET, reduction of memory utilization--especially large allocation done en masse--is king, and String.Split does exactly the wrong thing. It takes the input string, breaks it into N chunks, and allocates all of the memory needed to store those chunks and pointers to those chunks, in one big huge operation. This can't possibly scale, and indeed it doesn't. I did a quick SQLQueryStress test of a TVF based on String.Split and saw fairly good performance when the input sentences were small (in the 40-400 byte range--see below), but after a certain point the AppDomains began recycling and performance became abysmal. Protections put in place for stability of the CLR host include memory leak detection, and this kicks in quite readily when we force allocation of so much memory in one shot--a good thing for the SQL Server instance as a whole, but not great when we're trying to really split a huge string.
Students of .NET who are concerned with scale (and really, everyone should be) are urged to look at the way problems are handled in LINQ. Here the vast majority of requests are internally handled using streaming iterator patterns, rather than moving around huge chunks of memory. This turns out to a much more scalable option for several reasons: Lower in-flight memory utilization, fewer large object heap allocations, and better access by the garbage collector to collect intermediate data that is no longer needed.
So how can we apply streaming to the string splitting problem? Rather than break the string up into all of its component parts upfront, we can walk the string step-by-step, only finding the next piece as required. In order to handle this, I created the following worker class:
public class splitIt : IEnumerator { public splitIt(string theString, char delimiter) { this.theString = theString; this.delimiter = delimiter; this.lastPos = -1; this.nextPos = -1; }
#region IEnumerator Members
public object Current { get { return theString.Substring(lastPos, nextPos - lastPos).Trim(); } }
public bool MoveNext() { if (nextPos >= theString.Length) return false; else { lastPos = nextPos + 1;
if (lastPos == theString.Length) return false;
nextPos = theString.IndexOf(delimiter, lastPos); if (nextPos == -1) nextPos = theString.Length;
return true; } }
public void Reset() { this.lastPos = -1; this.nextPos = -1; }
#endregion
private int lastPos; private int nextPos; private string theString; private char delimiter; }
This class is a simple enumerator implementation that looks for the next delimiter on each iteration, only when requested. Splitting strings in this way, rather than using String.Split, means that no huge upfront allocation takes place. Aside from the sentence itself, only one "chunk" is in play at any given time, and any chunks that have already been used can be garbage collected as needed when memory is tight.
Wiring this class up in a SQLCR TVF is just about as simple as when using String.Split:
[Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillIt", TableDefinition = "output nvarchar(4000)")] public static IEnumerator faster_split ( SqlChars instr, [SqlFacet(IsFixedLength=true, MaxSize=1)] SqlString delimiter ) { return ( (instr.IsNull || delimiter.IsNull) ? new splitIt("", ',') : new splitIt(instr.ToSqlString().Value, Convert.ToChar(delimiter.Value))); }
public static void FillIt(object obj, out SqlString output) { output = (new SqlString((string)obj)); }
I've enhanced this example slightly compared with most of the usual suspects: A SqlFacet attribute is used to make sure that the delimiter is only a single character, and I've added a bit of additional code in the main method to deal with NULL inputs. The scalability difference between this method and the String.Split method is staggering in the simple tests I ran today on my SQL Server 2008 test server. With small sentences, even under moderate load (100 concurrent threads), each method performs more or less equivalently. But as sentence size increases to 50KB, the String.Split method begins slowing, taking almost 2 seconds per iteration, and the occasional AppDomain recycle is seen in the SQL Server log. The streaming method, on the other hand, continues to complete its job in just over 1/10th of a second, and causes no AppDomain recycles. Increasing to 500KB sentences, String.Split causes numerous AppDomain recycles and time per iteration increases to over 30 seconds, while the streaming method averages just 16 seconds per iteration. Jumping to 5MB sentences, String.Split causes almost continuous AppDomain recycles, and each iteration takes almost 6 minutes to complete. Yet with the streaming method, even with sentences of this size I am still unable to cause an AppDomain recycle to occur, and iterations complete in around 55 seconds. The test I did was quite simple, and I won't post too many details here as I prefer that you test with your own workloads and draw your own conclusion about how this method fares when compared with T-SQL versions or the naĂŻve String.Split method. I hope that if you do test you'll post back here with your results so that we can all learn the best way to handle these problems--whether or not string splitting really is all that interesting in the post-SQL Server 2005 world.
|
-
Update: Please respond here instead. Thanks for the great idea, Alex!
Please respond and let me know: A) Whether you change it at all
and B) If so, whether you use a fixed width font (the default font is non-fixed width) Background, in case you're curious: I modified a newer version of Who is Active?, based on some feedback I received from Aaron Bertrand asking for right-padding of the numbers in the result. I did this modification based on my assumption that most people won't change the default grid font, and even if they do change it they'll still use a non-fixed width font. I tested my modification on my end and sent it over to Aaron to see what he thought, and low and behold he has changed his own SSMS settings to use a fixed width font. Oops! I would be very interested in finding out how many of you do the same. If the vast majority of people do indeed leave the font unchanged, or at least in the same font family, perhaps I'll leave the right-padding as-is (I happen to like it, personally -- sorry, Aaron!). If I'm mistaken in my assumption, I'll rip it out. Thanks for letting me know, either way.
|
-
TechEd North America 2009 is just a few short weeks away, and I'm really looking forward not only to the show, but also the opportunity to visit Los Angeles for the first time in several years. I'm busy putting the finishing touches on my Extended Events talk, DAT402, "Microsoft SQL Server 2008: Performance Profiling and Troubleshooting with Extended Events", and I thought I should write a quick note to anyone out there who is thinking of attending. What's noteworthy about this talk is that the TechEd team modified the title a bit and removed the word "Advanced", but left the talk at 400-level and also left the original abstract untouched: Imagine tracking and aggregating wait statistics at the session level
rather than the server level. Imagine seeing exactly how long each step
in your query took and figuring out the real cost of that index scan.
Imagine capturing an exception with an associated callstack--no more
guessing about exactly what component failed and why. When you're done
imagining, open your eyes and attend this session to learn all of these
techniques and more, all thanks to Extended Events (XEvents)--the
powerful new tracing infrastructure in SQL Server 2008. Designed for
DBAs and developers who already understand the basics of XEvents, this
session goes from 0 to 60 in the first few minutes. See a number of
code examples and gain an understanding of how to maximize XEvents for
performance profiling and troubleshooting purposes. If you're serious
about making your SQL Servers fly, this is one session not to be missed.
When I gave this same talk earlier this year at SQL Server Connections in Orlando, some attendees apparently didn't understand that it really is an advanced talk, and showed up without quite enough background to follow along as well as they could have. I'm hoping to minimize that phenomenon at TechEd; so please keep in mind that although I do a quick (five-minute) refresher, that's it, and Extended Events is a much bigger topic than can be fully introduced in five minutes. Once the refresher part is over I jump into four fairly intense, demo-heavy deep-dive sections, so please, if you want to get your money's worth, make sure that you read one or more of the following in the next few weeks: Reading through these sources will give you the background so that you can focus on the performance troubleshooting aspects of the talk, rather than sitting there bogged down in the unfamiliar Extended Events syntax and terminology.
I'm looking forward to meeting some SQLblog readers at the event; if you're attending, even if you decide to skip my Extended Events talk, stop by the Database Practices TLC and say hi. Looking forward to seeing you in LA!
|
-
Ever tried writing a more or less database-agnostic administrative script?
If the experience didn't make your head spin, you either have a stronger constitution than most DBAs or your script doesn't actually work properly. The fact is, due to the way most of the metadata views are currently scoped in SQL Server, pulling data from multiple databases requires a rather annoying set of techniques, usually involving temporary tables, dynamic SQL, and cursors. For a real-world example, just take a look at my Who is Active? script.
This could all be so much easier if the product simply exposed global views, available from anywhere, and keyed with a database_id column. The views could be internally partitioned to ensure great performance if you only need data from a single database, yet still flexible enough to provide data across the entire server. I think this feature would make the overall administrative experience much nicer, so I've created a Connect item, and I hope that you will vote for it (5, of course!): http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=432689 Thanks for your support!
|
-
It has been only a month and a half since I posted the last version of Who is Active? (v7.30),
but in that time I've made a huge number of changes, fixes, and
enhancements. The new version, v8.40, is faster, more robust, and
includes a few exciting features. Click here to download Who is Active? v8.40
Following is a list of some of the
things I've done with it, in no particular order, along with
descriptions where applicable. Added a collection interval
option, @DELTA_INTERVAL. This option causes the script to collect
numeric metrics -- such as reads, writes, and context switches --
twice. The script will wait between collections, based on the duration
passed into the parameter (in seconds). The script will then calculate
the differences (deltas), between the first and second collections, for
any requests that were active for each collection. These deltas will be
output in a new set of columns, reads_delta, writes_delta,
context_switches_delta, etc. For example, to use the default options and wait two seconds between collections, you would use the following: EXEC sp_WhoIsActive @DELTA_INTERVAL = 2;
If
you played with Who is Active? in the past you know that the script
outputs a lot of data in a lot of different columns. I started feeling
that the column list was getting unwieldy, so I implemented dynamic column lists
as a fix. There are two changes here. First of all, the default column
list will only contain columns associated with those features you have
turned on. So for example, if you don't have @GET_LOCKS enabled, the
locks column won't appear. The second part of this fix is a new feature to make things even more dynamic: custom column lists. I sent a slightly earlier test version to Jimmy May,
and his response was something along the lines of, "it's pretty good,
but I really don't like the column order." To make Jimmy happy, I
implemented the @OUTPUT_COLUMN_LIST option. To use this option, simply
pass in whatever columns you would like to see, in whatever order you
would like to see them, and the script will do the rest. Note that the
output will be the intersection of whatever options are enabled and
whatever columns are found in the list. Here's an example I've been using in my own work recently: EXEC sp_WhoIsActive @OUTPUT_COLUMN_LIST = '[login_name],[dd hh:mm:ss.mss],[sql_text],[reads],[writes], [context_switches],[physical_io],[wait_info],[blocking_session_id], [tempdb_writes],[tran_log_writes],[query_plan],[session_id]';
Another feature I added that makes things a bit more dynamic and customizable is the ability to sort the output
based on whatever column and direction you would like. Interested in
seeing the requests that are currently blocked? No need to scroll, just
ask for them to sort high: EXEC sp_WhoIsActive @SORT_COLUMN = '[blocking_session_id]', @SORT_COLUMN_DIRECTION = 'DESC';
While
viewing the results on the screen in SSMS is great, sometimes I like to
capture the results to a table for later analysis. So I added a few
features in this most recent version to support automated collection. To
begin with, I realized that the default, nicely-formatted output of the
script, while great for on-screen viewing, isn't good if you want to do
your own analysis later. So I added a new option, @FORMAT_OUTPUT, that
lets you control whether the output is or is not formatted. The default value for the parameter is 1, keeping with the previous behavior. I
also figured that if you want to do some later analysis it might be
nice to know what time period you're working with. So I added a column
called collection_time, which outputs a datetime instance representing
the time that the script finished running. Taking things
one step further, I decided that with all of the dynamic column options
it would be a pain to put together a table matching the output schema
if you start playing with different settings. So I made things easy; no
need to figure out what the output looks like; the script will write a CREATE TABLE statement for you. Simply enable the @RETURN_SCHEMA option and collect the output value from the @SCHEMA parameter. Finally,
due to nested INSERT/EXEC limitations, it's impossible to insert the
result into a table when calling the stored procedure. No worries, the
script will insert its output into a table for you. Just tell it where to send the data, via the @DESTINATION_TABLE option. Here's
an example of asking for the unformatted output schema, creating a
destination table based on the current day, and collecting data 10
times, waiting 15 seconds between each collection: DECLARE @destination_table VARCHAR(4000); SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112);
DECLARE @schema VARCHAR(4000); EXEC sp_WhoIsActive @FORMAT_OUTPUT = 0, @RETURN_SCHEMA = 1, @SCHEMA = @schema OUTPUT;
SET @schema = REPLACE ( @schema, '<table_name>', @destination_table );
EXEC(@schema);
DECLARE @i INT; SET @i = 0;
WHILE @i < 10 BEGIN; EXEC sp_WhoIsActive @FORMAT_OUTPUT = 0, @DESTINATION_TABLE = @destination_table; SET @i = @i + 1; WAITFOR DELAY '00:00:15' END; GO
One of the key reasons I created the original version of Who is Active? was to help identify blocked and blocking SPIDs. And I thought it worked pretty well for that purpose until recently, when I discovered that sys.dm_exec_requests fails to properly identify blocking when queries go parallel. Oops. Fixed in the new version; it now uses the waiting tasks DMV for this purpose, instead of the requests DMV. As
part of this process I discovered that the joins to the tasks DMVs were
slightly flawed in v7.30, so I fixed them up. And I was able to play
some games to improve performance--enough that I decided to eliminate
the @TASK_INFO option, making this the default behavior. This means
that you'll always get correct blocker identification from Who
is Active? v.8.40, along with a few bonus task-based metrics including
physical I/O stats and the number of context switches. Interested in information about transactions? So are most DBAs, which is why I promoted the transaction start time, which was previously embedded in the tran_log_writes column, to a top-level column
of its own. This will still populate only with transactions that have
actually done a write, but from what my testers tell me this is okay
behavior. Let me know if you disagree. While I was in there I played
even more performance games, and greatly improved the speed of the
query when the @GET_TRANSACTION_INFO option is enabled. A few smaller fixes are also worth mentioning here. First of all, I tested the script against a case-sensitive instance, and fixed a few issues there. So it should work for you no matter what collation you're using. Second, thanks to a discussion I had with Roman Nowak, I was able to finally solve the entitization problem that plagued the "clickable" query output produced by the script. Third, I put on my security hat and added some code to verify input arguments, as well as did a review of all of the dynamic SQL used by the script to make sure that it is not injectable. Thank
you to everyone who sent me feedback and/or feature requests! If you
have an idea for th script, please make sure to send it my way. I have
quite a bit of momentum at the moment and want to keep going and see
just how far I can take this thing. I can only do this with your help. Enjoy! Click here to download Who is Active? v8.40
|
-
Over the past few days you might have noticed that the SQLblog external feeds roller had mysteriously vanished. We had some technical difficulties on Tuesday, things appeared to have fixed themselves by Wednesday morning, and then the problems returned Wednesday evening. So I decided to disable the feature. Honestly, I didn't think anyone would even notice, but I received some e-mails from some concerned parties; apparently there are people out there who actually use the roller. The bad news is that we have no idea why this issue cropped up, but the good news is that I've been watching it now for three days and it seems fine. I've just turned it back on, and as part of the process of cleaning up the mess I even added a few new blogs to the list. As always, I'm looking for more great blogs to add, so if you have any ideas leave a comment here. Enjoy, and please let me know if you see any further issues.
|
|
|
|
|
|