|
|
|
|
Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.
-
I started blogging approximately six years ago. I don't know the exact date due to the fact that my prior blog site no longer exists and the archives have long since been wiped, but it was sometime in 2004, probably in late summer or early fall--so right about now. And to be honest, it doesn't really matter anyway, except perhaps for late-night bragging rights at conferences. Since starting down the blogging path I've had plenty of peaks and valleys in my overall level of activity. Some months I've posted multiple times a week, and once I went over six months without a post. Even at the low points it seems as though readers have stuck with me. And for that I am grateful. Recently I've been reflecting a bit on this whole blogging thing, and I've started to wonder just who "you" are. What are my readers into? Where are they from? I know you're out there, and I know some basics: my posts get a respectable number of hits, and I have some idea of who you are by looking at web traffic stats. But most of you don't leave comments; the highest ratio of hits to comments I've ever seen on a post is probably 1%. The fact is, it wouldn't be much fun to write blog posts if there were no readers, and going forward into the next six years I would like to get a head start on writing stuff you'd actually like to read. In other words, I would love to know more about you. So here's my request: Leave me a comment below. Tell me a bit about yourself--as much or as little as you'd like. This can include your name (first? last? first and last?), where you're from (country? city?), who you work for, what you do for your company, your favorite SQL Server feature... Or anything else you'd like to share.
Thanks for reading these last six years.
|
-
-
If you've been working with SQL Server for long, you are no doubt familiar with the work of Andy Leonard. Andy is a true master of the art and science of data extraction, transformation, and loading. And as his recent blog series on the software business shows, he has also fine-tuned the art of dealing with the many types of people you're likely to meet when putting together enterprise-class solutions. What does it mean to take this combined experience and apply it to an SSIS training course? Simple: You're guaranteed both the career enhancement of learning ETL from one of the best in the industry, and the benefit of war stories from someone who has truly seen it all.
Whether you're new to ETL or have been using SSIS for years, Andy's "From Zero to SSIS" course will help you get to the next level. Andy will start with basic material and quickly progress into some of the most advanced SSIS training material available in a public class. This is a rare opportunity to take a full journey through a topic that is at once both easy to use and deeply complex. This class will take place October 6 - 8 at the Microsoft Technology Center in scenic Waltham, MA, a 20-minute drive from downtown Boston. And if the weather behaves you can expect to experience the class amidst the beautiful color spectrum of fall in New England.
From now through the end of the month, you can take advantage of a $150 discount on the course fee. Feel free to leave a comment below if you have any questions about the class. Enjoy!
|
-
I am honored to have been selected for a second year in a row to present a full-day seminar at the PASS Summit. This year's seminar will be a "post-con", and will be delivered on Friday, November 12. The title of the seminar is "A Day of Doing Many Things at Once: Multitasking, Parallelism, and Process Distribution," and the focus is on maximizing performance by taking full advantage of your server's CPU infrastructure. As IT professionals we're all used to the ever-increasing nature of server resources, but things have changed over the past few years. Next-generation server hardware is shipping with more CPU power, but in the form of more CPUs rather than faster CPUs. This means that we can no longer expect that new hardware will automatically make all of our existing processes faster and better; we need to understand how to tune our software to properly leverage these resources. A one-paragraph abstract and registration information is available here. What follows is a basic outline of what will be covered over the course of the day. Feel free to leave me a comment below if you have any questions about the seminar and what will or will not be included. My goal is to deliver a seminar that will give you a number of useful tools and techniques that you can take back to the office and immediately apply, and I think that the list below encompasses the majority of what you'll see as you work with parallel processing in currently shipping versions of SQL Server. Part 1: Background
- A Brief History Lesson
- CPU Evolution Over Time
- Heat and Dissipation
- A Brief Theory Lesson
- Moore's Law
- Amdahl's Law
- Gustafson's Law
- How This All Fits Together
- Windows Process/Thread Internals
- SQL Server Scheduler Internals
Part 2: Query Processor Parallelism
- How Queries are Processed in Parallel
- Parallel Iterators
- Non-Parallel Iterators
- Row Distribution Strategies
- How to Read and Mine Data From Parallel Query Plans
- Controlling Parallelism at the Query Level
- Ideal Parallel Query Patterns
- Parallelism Inhibitors and Workarounds
Part 3: Administration
- Server Settings that Influence Parallelism
- Affinity Masks
- Worker Threads
- MAXDOP
- Cost Threshold
- Resource Governor
- How to Configure Server Settings
- OLTP vs. OLAP vs. Mixed Workload Considerations
- Monitoring Parallel Processes
- Parallel Task Architecture and the Tasks DMV
- Parallel Waits and the Waiting Tasks DMV
- Finding Out How Much Work is Being Done
- Finding Parallel and Serial Plans in the Cache
- Diagnosing Intra-Query Parallel Deadlocks
Part 4: Alternative Solutions for Parallel Data Processing
Looking forward to seeing many of you in Seattle!
|
-
Last month a new PASS Virtual Chapter
was introduced, one with a theme that is near and dear to much of what I
like to work on: performance. Tomorrow, Tuesday August 3rd at noon Eastern time, the chapter
will have its second meeting, and I will be doing the presentation. The
topic is Parallelism and Performance. Here's the abstract: In
today's multi-core-driven world, query performance is very much
determined by how well you're taking advantage of the processing power
at your disposal. Are your big queries using every available clock tick,
or are they lagging behind? And if your queries are already going
parallel, can they be rewritten for even greater speed? In this session
you will learn the background necessary to take full advantage of
parallelism. We'll cover what parallelism is, why it's important, and
the basics of how to read parallel query plans. Examples will be shown
to illustrate some of the huge performance gains that can be had when we
learn to properly control SQL Server's parallel processing
capabilities. This session is a small preview of some of the material
that will be covered in Adam Machanic's full-day PASS Summit post-con, "A Day of Doing Many Things at Once."
This is a free webcast. Click here for more information and to register. Looking forward to seeing you there (well, virtually speaking).
|
-
Apologies for the late reminder; tomorrow is T-SQL Tuesday. This month's event is hosted by Robert Davis over at SQL Server Central and the theme is "Gettin' Schooled." How do you learn? Or how do you teach? Have you learned anything interesting recently that you'd like to share? Post about it! For full details see Robert's blog. By the way, you can always find the current T-SQL Tuesday post by going to http://www.tsql2sday.com. Big thanks to Aaron Nelson for starting and maintaining that link.
|
-
Almost six years ago--in November of 2004--I posted what would turn out to be one of my most popular blog posts in terms of number of reads, "Performance: ISNULL vs. COALESCE." (If you're curious, the post is dated July 2006 because I was too lazyit was difficult to transition the publication dates over with the posts when I transferred to SQLblog from my previous blog location.) In this post I set out to determine whether ISNULL is faster than COALESCE, even though I admitted that I didn't particularly care about the results: "Before getting to my own tests, I'd like to jump off on a quick
tanget. COALESCE vs. ISNULL? Who cares! This isn't a performance
question, this is a question of standards-conformant vs. proprietary
code. ISNULL is non-standard and provides less functionality than
COALESCE. Yet a lot of SQL Server developers love to use it, I suspect
because it's a lot easier to remember (and spell). So learn a new word
and type two extra characters and you'll end up with more maintainable,
more functional code. Sounds good to me -- which is why I am a big fan
of COALESCE."
I still agree 100% with what I said in this paragraph, and despite the fact that my tests showed ISNULL to be slightly faster than COALESCE, I didn't switch over and start using ISNULL. I use the tool that I'm comfortable with, that is standard, and that has more functionality. COALESCE. Seems like a no-brainer. And yet, that other post, as I mentioned, is one of my most popular ever. It gets loads of hits, usually from Google, Bing, and the like. People want to know which is faster. Can they give their code that extra edge? Recently a reader named Kit chastised me for not updating the body of the post, given that there was some new and "important" information disclosed by another reader in the comments. And this made me think about this issue yet again... What follows are my current views on this topic, based on what I've learned in the past six years, much of which has been spent doing performance consulting. There is too much to focus on to bother with this issue. In my original tests I noted a 10% time benefit of ISNULL over COALESCE. Those tests were done on SQL Server 2000, and I don't have a server to test on today, but in 2005 and 2008 the difference is closer to 1%. And even a 10% difference isn't enough to get me excited. In most of my performance engagements I start by looking for areas where I can get at least a 100% improvement in performance by making simple changes, and usually there so many of these "low-hanging fruit" that I don't even need to bother looking for lesser areas of improvement. Furthermore, the 10% benefit is applicable only to the ISNULL or COALESCE function itself, and will be totally overshadowed by the rest of the query. Each call to these functions takes approximately 0.002ms on my test server. Think about how much time data access from disk or even from memory takes. Do you really think you'll see an appreciable difference in total query performance by switching between these functions?
Sometimes switching between these two functions will appear to fix a problem--but it has nothing to do with relative performance. ISNULL and COALESCE seem like they're equivalent aside from the fact that ISNULL accepts only two arguments while COALESCE accepts any number. However, there is a subtle difference between the two. The output data type of ISNULL is the same data type as the first input argument, whereas the output data type of COALESCE is determined by the argument with the highest data type precedence. Consider the following two expressions: ISNULL('1', 2) COALESCE('1', 2)
The first expression, which uses ISNULL, returns a value typed as VARCHAR, since that's the type of the first argument. The second expression, on the other hand, will return a value typed as INTEGER, since given both of the arguments--one VARCHAR and one INTEGER--the INTEGER has a higher precedence. What does this have to do with query performance? Sometimes, when using ISNULL or COALESCE as part of a predicate, a user may end up with a data type mismatch that is not implicitly convertable and which therefore causes a table scan or other less-than-ideal access method to be used. A user that is using trial-and-error and guess work to solve performance problems may swap ISNULL for COALESCE or the other way around and discover that suddenly the query will appear to run much faster. The conclusion that will be made is that one function is faster than the other, but that's clearly not the case. In reality, highly selective index seeks tend to be a lot faster than index scans, and that's what swapping out the function caused to occur. When performance tuning, don't guess. Collect evidence, form hypotheses, and test thoroughly. SQL Server 2005 and 2008 gives us plenty of information to diagnose the vast majority of issues; there is no reason we should have to poke or prod in an attempt to find an answer. ISNULL has a really cool use case. If you're anything like me, you use SELECT INTO... a lot. And you may have discovered that sometimes you need to make special accommodations to control the resultant data type, size, precision, scale, and other metadata for each output column. The most difficult of these to control is nullability, which is mostly implicitly derived by the query engine and for which there does not exist a conversion. Take, for example, the following query: SELECT Color INTO C FROM Production.Product WHERE Color IS NOT NULL
We can clearly see that the Color column in our new table, C, will contain no NULLs, because that's what has been specified in the WHERE clause. But the query optimizer doesn't seem to make this leap, and should you run this query followed by a quick check using sp_help or the catalog views, you'll notice that C.Color is in fact nullable. I used to get around this by running ALTER TABLE ... ALTER COLUMN, but when working with a billion-row table I discovered that this is hardly a cheap process, and I went looking for a better solution. The answer to my problem? ISNULL: SELECT ISNULL(Color, '') AS Color INTO c FROM Production.Product WHERE Color IS NOT NULL
This query will produce a non-nullable Color column, exactly what I want and expect. And COALESCE won't work here. Why? I don't know, it just won't. Finally, a solid reason to use ISNULL that has nothing to do with performance.
What have we learned today? There is nothing to be gained from micro-optimization. If you have time to worry about ISNULL vs. COALESCE, you should probably consider an early retirement, because your job is done. One should never guess when performance tuning. It only leads to improper conclusions and less than ideal practices further down the road. Don't completely dismiss a tool. You may discover a great way to use it once you look further afield. Thank you for reading and as always, enjoy!
|
-
Ever since I started working with SQL Server, I've been seeing advice to filter system views based on session identifiers, in order to return only user processes. The general advice is to look for session IDs (or SPIDs) greater than 50. And this seems relatively safe if you look at the system views on an average server. A recent conversation on an MVP mailing list revealed that this magic number, while perhaps once a legitimate filter, is certainly not safe to use in SQL Server 2005 or SQL Server 2008. Several system features can--and will--use session IDs greater than 50, because there is simply not enough room otherwise. Examples include: - Large servers that use soft NUMA, because there is one checkpoint and lazy writer thread per NUMA node
- Asynchronous statistics updating, again (and especially) on larger servers
- Database mirroring, especially if a large number of databases are involved
- Service Broker activation, when a large number of activation tasks are being used
And there may be other cases as well. The point is, the number 50 is no longer a valid way to filter out system session IDs. So what is the correct way to proceed? Two options: If you're still using the legacy sysprocesses view--which I do, and which Who is Active does--you can use the fact that system processes have a blank host name, and filter on the hostname column as in the following query (note that all versions of Who is Active already do this; I didn't realize just how dangerous the magic number was in this case, but I always avoid them anyway and this situation only supports that mindset): SELECT * FROM sys.sysprocesses WHERE hostname > ''
UPDATE: Hostname is not bulletproof--see Dan's comments below. Turns out that an external process can tell SQL Server what its host name is, and can choose to send a blank string. Another column is not settable, however, and that's hostprocess. Same rules; look for the blank (and Who is Active has been updated accordingly, as of v9.89): SELECT *
FROM sys.sysprocesses
WHERE
hostprocess > ''
If you're using the DMVs, the sys.dm_exec_sessions view has a handy and well-named column, is_user_process: SELECT * FROM sys.dm_exec_sessions WHERE is_user_process = 1
So there you have it. User sessions, and only user sessions, the right way. Enjoy!
|
-
Thank you to everyone who attended my three sessions at this year's TechEd show in New Orleans. I had a great time presenting and answering the really great questions posed by attendees. My sessions were: DAT317 T-SQL Power! The OVER Clause: Your Key to No-Sweat Problem Solving Have you ever stared at a convoluted requirement, unsure of where to
begin and how to get there with T-SQL? Have you ever spent three days
working on a long and complex query, wondering if there might be a
better way? Good news: The OVER clause, first introduced in Microsoft
SQL Server 2005, can be used to quickly and easily solve a number of
problems that were previously very difficult or seemingly impossible. In
this session, learn to leverage aggregations and windowing operations
to gain insight without losing information, enabling you to answer a
number of interesting business problems with ease. Several demos are
shown to highlight the utility of the OVER clause for solving a large
number of difficult--yet common--query problems, including custom paging
schemes, data de-duplication, "top-N" problems, and complex statistical
calculations. You also learn how to creatively apply the feature to
help with performance optimization of certain classes of tough queries.
If you're tired of writing queries that just don't seem good enough,
attend this session to get to the next level.
DAT318 Auditing, Tracking, and Change Monitoring Technologies in Microsoft SQL
Server 2008 Regulatory bodies...end-users...your boss. They all want answers. Many
questions are easy enough to deal with: "Did someone drop my view?"
Others are a bit trickier: "What was the previous value of this row?"
And some are seemingly impossible: "Who selected the data from this
table over the past week?" For many DBAs, the answer to some or all of
these questions is often "Umm..." But don't blame yourself; getting this
information in SQL Server has never been especially easy—until now. SQL
Server 2008 ships with several new technologies designed to help you
track and report on exactly what happened, who did it, and when. In this
session, learn about SQL Server 2008 features: Change Tracking, Change
Data Capture, and SQL Server Audit, each of which provides a distinct
set of capabilities and has specific strengths and weaknesses. Looking
at each of these technologies in turn, you will see how they work and
where you might want to leverage them in your SQL Server infrastructure.
If you're used to saying "Umm..." get ready to say "I'll be right back
with the answer."
DAT03-INT Best Practices for Integrating Common Language Runtime in Microsoft SQL Server (co-presented with Pedro DeRose)
This session provides best practices, tips, and pros and cons of using
SQL Server CLR integration.
The attached ZIP file contains demos from each of the sessions. Enjoy, and as usual let me know if you have questions. Thanks again!
|
-
This month’s T-SQL Tuesday is hosted by Jorge Segarra, the “SQL Chicken.” The topic is rather open ended: What is your favorite new(ish) SQL Server feature? Love the DACPAC? Can’t wait for PDW? Post about it and tell us why! In other T-SQL Tuesday news, we now have a logo. Those of you who are participating in the event, take notice; the rules have changed. Now that we have a logo we’re simplifying the linkback and subject guidelines a bit. Henceforth you can title your post however you want. It just has to include the logo at the top, and the logo has to link back to the hosting blog. This should help de-clutter participating posts, while still identifying them as part of the T-SQL Tuesday event. Here’s the logo:
 Our supercool logo was created by JP Jones of Paige1Media. I hope that if you need some design work done you’ll contact Paige1Media. JP was extremely easy to work with and I think the output speaks for itself.
|
-
Congratulations! You've made it back for the the third and final installment of Parallelism Week here at SQL University. So far we've covered the fundamentals of multitasking vs. parallel processing and delved into how parallel query plans actually work. Today we'll take a look at the settings and options that influence intra-query parallelism and discuss how best to set things up in various situations. Instance-Level Configuration Your database server probably has more than one logical processor. As a matter of fact, as of the time of this writing--May, 2010--I would be shocked to see any new piece of hardware requisitioned for use as a "database server" that contains fewer than eight cores. Processing power is relatively plentiful and thanks to SQL Server's socket-based licensing scheme and the prevalence of multi-core CPUs, it's almost ridiculously cheap to put together a server with a huge number of logical processors. So there you are with your oh-so-powerful server. You've done your disk configuration, set up Windows appropriately, and installed SQL Server. You're almost done! But before you mark this project as finished, it's important to ponder the impact of a few instance-level settings that may impact parallelism: - Affinity Mask controls which logical processors are used by the SQL Server instance. This may not be considered a parallelism setting per se, but it's an important part of the story. Setting this option to too low a number obviously restricts what SQL Server can do in parallel. But it's important to consider this option, especially for servers running multiple SQL Server instances.
- Maximum Degree of Parallelism dictates, at an instance level, the maximum number of logical processors that will be used by a given query. More on this in the next section.
- Cost Threshold for Parallelism is the primary threshold used by the query optimizer in determining whether it should consider parallel plans as part of the optimization process. More on this in the How Much Does it Cost section.
I won't discuss Affinity Mask any further at this point, except to mention that the sp_configure "affinity mask" setting is listed in Books Online as a deprecated feature. It's been replaced with an option called ALTER SERVER CONFIGURATION SET PROCESS AFFINITY. As far as I can tell, the end result is the same. 98 Degrees (Because how often, really, do you get to name a section of a technical article after what is described in Wikipedia as an "adult contemporary boy band?" And doesn't "boy band" eliminate the possibility of said band making "adult" music? Or the other way around? But I digress...) Maximum Degree of Parallelism (MAXDOP) seems simple enough. You set it at the instance level and it controls the maximum number of logical processors that a given query can use, assuming that the query hasn't overridden the setting. But here things start getting trickier. First there's the question of what the "correct" setting actually is. And then there's the monitoring question, and the confusion around tasks vs. processor utilization. I'll clarify both of these as best I can. How do you set the thing? The default value is "0," which means "use up to as many logical processors as have been assigned to the SQL Server instance by the Affinity Mask option." This is, in my opinion, one of a few default settings that the SQL Server team really got wrong. SQL Server instances are usually used by a lot of people at once, and one of our goals as DBAs or DB developers is to maximize concurrency. This means getting all of the queries that are running at the same time to finish as quickly as possible, and that means being able to distribute the available resources in a relatively even manner. Processor time is one of the core resources that every query needs to use, and if a decent number of queries are each using every core simultaneously, processor time is going to quickly become the most limited resource on the system. In Part 1 of this week's series I mentioned "thrashing," and if you have 10 queries active on your system, each of which is using all 16 cores, thrashing is exactly what you'll have. Each query's tasks will spend as much--or more--time switching on and off of schedulers, and waiting to get back onto schedulers, as actually doing work. And that's clearly what you don't want to have happen in a highly concurrent environment. A lot of articles recommend an instance-wide MAXDOP setting of 1/2 of the number of logical processors allocated to the instance. I think that's a good starting point, especially for instances supporting data warehouses and mixed workloads, but it's certainly not the final answer. Some articles I've read recommend never going above a value of "8," and I think that's too Draconian a prescription. And a lot of people running OLTP workloads like to set the value to "1." Which can also be a bit severe. The fact is, too much parallelism will hurt concurrency, and too little will hurt the performance of big queries. So how do you figure out what setting to use? There's really only one surefire way: baselining, preferably in conjunction with load testing. Run a load test--or look at your production system's metrics if you have no test environment--and establish a CPU and query performance baseline. Change the MAXDOP setting and try again. Rinse and repeat until you find something optimal. For OLTP workloads, a setting of "1" is, indeed, quite often a good idea for most queries. Unfortunately, this value is problematic because the MXDOP setting also controls non-query operations such as index rebuilds. You usually want your index rebuilds to process in parallel whenever possible because they'll finish more quickly that way, thereby freeing resources for your users. Index operations and individual queries can override the instance-level MAXDOP setting by using the MAXDOP query hint. (A very well-named hint!) So on an instance with the sp_configure "maximum degree of parallelism" setting set to "1," I could tell the query optimizer to use up to 10 logical processors for the following query by using the hint: SELECT * FROM Tbl WHERE Col > 5 OPTION (MAXDOP 10)
Since you can override the setting on an individual query basis, the instance-wide value should be set in such a way as to positively influence the greater majority of queries running on the instance. But how do you find those that need special attention via hints? My friend Jimmy May pointed out the solution to me in a recent conversation: run a load test with a non-zero instance-level MAXDOP setting, baseline performance metrics for each query, then flip the value to "1," test again, and compare the results. Or the other way around. The queries which perform better using the setting you don't want to set at the instance level are the ones that need a hint. Simple enough!
"I have MAXDOP set to '4.' Why is my query spinning up 13 tasks?" As you become more familiar with how to monitor SQL Server you will invariably start looking at the sys.dm_os_tasks DMV and will quickly notice that MAXDOP does not seem to directly impact the number of tasks that a given request can use. This can be a bit confusing, since you might expect that a request using more tasks equates to a request using more logical processors. But that's not exactly how it works. In truth, a query's degree of parallelism determines how many tasks--at most--each individual iterator can spin up. And several iterators can run concurrently at any given time during a request's life cycle. The setting also determines the maximum number of schedulers all of those tasks can--by virtue of their associated workers--be bound to. A query with a degree of parallelism of "4" that is using 13 tasks most likely has three active iterators, each of which is using four tasks. So why 13 and not 12? Because there is always a task dedicated to coordinating the other tasks. In sys.dm_os_tasks and other task-related DMVs it will be the task with an exec_context_id (execution context identifier) of "0". The others will have a non-zero identifier. By the way, just because you ask for a given degree of parallelism doesn't meant you'll get it. The query processor may use a DOP lower than that which you requested. To find out the actual degree of parallelism for your query, take a look at the DegreeOfParallelism attribute of the QueryPlan node in the showplan XML.
How Much Does it Cost? SQL Server uses cost-based optimization to determine how best to process your queries. This means that the optimizer considers a number of potential query plans and assigns each a cost, or a score that acts as a relative means by which to compare different plans. This number is calculated by using a fixed price for the estimated number of times every given operation that is a part of the plan is expected to take place in order to satisfy your query. This same cost-based model is used to figure out whether to process your query serially or in parallel. Query optimization is an expensive process, so it happens in phases. Each phase takes a bit longer than the previous and during the latter phases the query optimizer considers more complex plans that are more difficult to cost. Parallel plans are certainly more complicated than serial plans, and so are reserved for the later phases of the optimization process. And this is where the Cost Threshold for Parallelism setting comes into play. The parallel plan selection process can be described fairly simply: During the early parts of the optimization process, the optimizer considers serial plans. Should these plans exceed the set cost threshold, parallel plans may be considered at later phases of the process. If one of the parallel plans is cheaper than the serial version, it's used. Modifying this setting can get a bit confusing, since it's difficult to see from parallel plans what impact changing the setting will have. If your parallel plan has an estimated cost of "6.58," setting the Cost Threshold to "7" may or may not cause the plan to be processed serially. What matters is not the parallel cost, but rather the cost of the alternative, serial version of the plan. And to discover that you must either change the instance-level MAXDOP setting, or use the MAXDOP query hint, to see how the estimated costs change. The default value of this setting is "5," and tweaking it is done in a similar fashion to what I described for the instance-level MAXDOP setting--with the exception that Cost Threshold for Parallelism cannot be overridden on per-query basis. For this reason it's even more important that you get it right, and many DBAs don't touch it at all. An interesting way that it can be used is by setting it to a higher-than-default value (some articles suggest using a value of "20" or higher), as an alternative to an instance-wide MAXDOP setting of "1." This guarantees that only queries estimated to be significantly expensive will be parallelized, and index rebuilds and other non-query processes will still be able to make use of parallelism without having to be overridden. Parallelism Inhibitors To finish things up I would like to point out one of the most frustrating parts of the parallelism game. You have your server configured the way you think it should be, you've written your massive query, and you're ready to scale it across multiple CPUs. But alas, the optimizer just won't generate a parallel plan. What's going on? Assuming that the query's estimated cost is higher than than Cost Threshold and that the MAXDOP setting isn't "1," you're most likely encountering one of the various features that inhibit parallelism. The list below is taken from a presentation written by Craig Freedman, a member of the SQL Server query processor team. The following features force the entire plan to be processed serially:
- All T-SQL UDFs*
- CLR UDFs with data
access
- Miscellaneous built-ins such as:
- OBJECT_ID(),
ERROR_NUMBER(), @@TRANCOUNT, …
- Dynamic cursors
The following features force a "serial zone", or a part of the plan to be processed serially even though other parts can be parallelized:
- System table scans
- Sequence functions
- TOP
- "Backward” scans
- Recursive queries
- TVFs*
- Global scalar aggregate
- Multi-consumer spool
* By "UDFs," Craig meant the scalar variety. And by "TVFs," he meant the multi-statement flavor. Inline TVFs do not have this issue. If you're having problems getting your query to go parallel, take a step back. Double-check the instance-wide settings. Double-check your query. Are you doing anything on the list above? If not, break your query into smaller component parts. Can you get them to go parallel? Are other queries operating in parallel as expected? This can be both tricky and frustrating, but be patient--you'll get there. And your patience will be rewarded. When properly leveraged, parallelism makes a huge difference, allowing some queries to perform several times better than they do when processed serially.
An End, and A Beginning
When I was approached by Jorge Segarra and asked to write a week's worth of posts, I chose a topic that I happened to be studying at the time and thought that it would be pretty easy to dash out a few articles. After investing over 12 hours in writing these three posts, I have to admit that I only covered around half of the topics I was hoping to, and some subjects at a much higher level than I originally intended. So while this is the end of my SQL University Parallelism Week contribution, you can expect several more posts on the topic of parallelism from me in the coming months. I have some very interesting things to share, and I hope you'll keep reading. This topic, at least for me, gets more and more interesting the more I learn about it, and we've only scratched the surface.
Acknowledgments
I would like to thank the following people, each of whom who helped me in some way, either directly or indirectly, as I researched and wrote this series of posts:
Jimmy May, Paul White, Craig Freedman, Conor Cunningham, Mladen Prajdic And thank you for reading. Enjoy, and as usual please let me know what points I can clarify.
|
-
Welcome back for the second part of Parallelism Week here at SQL University. Get your pencils ready, and make sure to raise your hand if you have a question.
Last time we covered the necessary background material to help you understand how the SQL Server Operating System schedules its many active threads, and the differences between its behavior and that of the Windows operating system's scheduler. We also discussed some of the variations on the theme of parallel processing. Today we'll take a look at the various elements you can expect to see in parallel query plans. After reading this post you will have a solid foundation for interpreting the parallel aspects of your plans, and you will be able to investigate what the query processor is actually doing as your query is evaluated. So without further ado I present… A Primer on Parallel Query Plans As mentioned in the prior post in this series, the query optimizer makes the initial "go" or "no-go" decision when it comes to parallelism. This is done by taking a number of factors into consideration in trying to determine whether it's worth the effort to process the query in parallel. But more on that in the next post; for now suspend your disbelief and assume that a parallel plan already has been generated. Like this one: A lot of things are going on in this plan, and we’ll take it step by step. If you’d like to take a look at the plan on your end, check out the .SQLPLAN file attached to this post. First of all, how do we know that this is a parallel plan? Aside from looking at the showplan XML, the answer is simple: the little yellow icon. Each iterator in the plan that may be processed in parallel has a circular icon with two arrows, superimposed over its bottom right corner. Notice that not every iterator in this plan is graced by the presence of the magic icon. We may refer to a given plan as "parallel" or not, but in actuality parallelism happens on a per-iterator basis. Some iterators can be processed in parallel, and others can’t. Furthermore, the optimizer may decide that a certain subtree of a plan should be processed in serial rather than in parallel. So it’s quite common to see a combination of serial and parallel iterators in your plans. In addition to the fact that not all iterators can be processed in parallel, most are not actually parallel-aware. In truth, the only two operations able to inherently deal with parallelism are index scans and exchange operations. Note that in this context “index scans” refers to both full scans--as shown in the plan above--and range scans labeled as “Index Seek.” Exchange operators--the primary components that control the flow of parallel data in the query processor--are labeled as “Parallelism” in the graphical showplan representation. They come in three flavors: - Gather Streams iterators accept multiple input sets of rows on a number of threads and output a single set of rows on a single thread. These iterators can be either merging or non-merging. The merging version takes multiple sets of sorted rows and merges them to output a single set of rows in the same sorted order.
- Distribute Streams iterators accept a single input set of rows and breaks the set into multiple sets of rows on multiple threads, based on one of a number of distribution schemes.
- Repartition Streams iterators are responsible for reading in multiple sets of rows and re-distributing them on different threads using a different distribution scheme than whatever was used for the input threads.
Data is distributed between threads of a parallel plan either by page or by row. As mentioned above, a number of schemes are available. The most common of these are: - Hash partitioning distributes rows based on a hash value calculated on a set of columns from the input stream. Based on the value, the row is sent to a specific output thread.
- Round-robin partitioning distributes rows relatively evenly across threads by sending the next available row to the next thread in the list, then moving to the next thread, and so on, before moving back to the first thread once every thread has received a row. This pattern is continued until there are no remaining rows to distribute.
- Broadcast partitioning distributes all input pages to every output thread.
To see which scheme is being used by an iterator in a plan you’re working with, investigate the showplan XML (right-click on the graphical showplan and click “Show Execution Plan XML”). Within the XML, search for Parallelism nodes and look at the PartitioningType attribute to see which distribution scheme is being used. And when applicable, refer to the set of ColumnReference nodes under the PartitionColumns subtree to see the columns that are being used in a row-based distribution scheme. Table scan operations use a demand partitioning scheme whereby threads are spun up and begin requesting pages from the scan iterator. After each thread finishes its downstream work with the rows in the previous page it received, it requests another page, until the iterator has no more pages left to distribute. This scheme helps to eliminate imbalance problems where one thread ends up with a lot more work to do than another due to data skew or other issues.
To gain more insight into how parallel plans work, I will walk you through the plan above, starting with the iterators at the top right. To begin, we’ll tackle the following piece, which is generated from a derived table that selects the top 1000 rows from SalesOrderDetail, ordered descending by SalesOrderDetailId: - The index scan iterator internally creates a number of threads, determined by a combination of the MAXDOP (maximum degree of parallelism) for the plan and a runtime decision by the query processor (more on this in the next post). Pages are read from the SalesOrderDetail table and are distributed to each of the threads using the demand scheme. On my dual-core notebook two threads are used, one of which receives approximately 55,000 rows, and the other approximately 65,000.
- The Top N Sort iterator, as mentioned before, is not parallel-aware, so the data on each of the two threads is sorted independently of the other. Because some or all of the top 1000 rows could be in either of the two streams, 1000 rows are returned by each, for a total of 2000 rows output by the iterator.
- In order to find the top 1000 rows, the two streams must be merged into one. The Gather Streams exchange iterator takes care of this, reducing the two input streams to one for input into the serial Top iterator.
- The Top iterator finds the top 1000 rows and outputs them to the next iterator downstream.
Information on the actual number of rows processed by each thread is available in the graphical showplan viewer. Select the iterator you’re interested in and press F4 on your keyboard. Expand Actual Number of Rows and you’ll see output like the following:  Next we’ll cover the segment of the plan downstream from the Top iterator:  - Due to the fact that the Top iterator was processed serially, its output is a single stream on a single thread. In order to continue processing the data in parallel, the stream is distributed into two sets of rows using the Distribute Streams exchange iterator. The sets are determined based on a hash of the SalesOrderId column.
- A bitmap is created for each set of rows, and each set feeds into a hash build. This means that in actuality two hash tables are built--one for each thread.
Keep that hash table in mind and consider the lower subtree: - First the SalesOrderHeader table is scanned. Just like the scan of SalesOrderDetail discussed above, this one is parallel-aware, and the pages are also partitioned using the demand scheme.
- Recall that the data in the previous segment we evaluated is partitioned based on a hash of SalesOrderId. Two hash tables have been built, each of which corresponds to a unique set of the IDs. In order to effect the join between the two tables, the rows from this lower segment of the plan will be used to probe the hash tables to find matching rows. Before this can happen, the rows must be sent through a Repartition Streams exchange iterator, such that streams can be created that align with the sets of data in the hash tables. So in this particular case, a hash distribution scheme is used and the rows are distributed based on a hash of SalesOrderId. Once this has been taken care of the hash match operations on either thread can be done independently of the operations on the other thread.
So What Did We Just See? If you get nothing else out of this blog post, I hope that you’ll remember the following bits: - A “parallel plan” will include one or more parts that can be processed in parallel, but these can be interspersed with other parts that are processed serially. The plan can, essentially, go in and out of a parallel state.
- Only a couple of types of operators are actually parallel-aware, and they know how to work with the threads in some way. The rest of the iterators can be run on multiple threads, but neither know nor care that parallelism is being used.
- A lot of information on parallelism is hidden both in the SSMS graphical showplan and the showplan XML itself. Getting comfortable with the F4 button--for more information on selected graphical showplan nodes--and a good editor for reading the underlying XML are both key to query plan mastery.
Next Time This marks the end of the second post in my three-part SQL University series on parallelism. The final post will deal with options and settings that impact parallelism, and we’ll also take a brief look at some of the parallelism metadata available via DMVs. Your homework is to look at parallel plans on your servers and let me know what questions you have. As usual, I’ll monitor the comments below and answer as soon as I can. Until we meet again, thanks for reading and enjoy!
|
-
Welcome to Parallelism Week at SQL University. My name is Adam Machanic, and I'm your professor.
Imagine having 8 brains, or 16, or 32. Imagine being able to break up complex thoughts and distribute them across your many brains, so that you could solve problems faster. Now quit imagining that, because you're human and you're stuck with only one brain, and you only get access to the entire thing if you're lucky enough to have avoided abusing too many recreational drugs.
For your database server, on the other hand, that multi-brain dream is reality. It's common for newer servers to have 16 or more logical processors, and numbers above 64--once the hallmark of extremely high-end hardware--are today no reason to raise an eyebrow. SQL Server is designed to take advantage of servers with a high number of processors, utilizing two or more at the same time to help it solve big problems faster. This is called parallelism, and understanding how it works and why is one the keys to making your database applications scale.
This first post for Parallelism Week details the foundation material you need to understand before we dive in and investigate some of the query processing components in more detail. The next post will discuss how parallel query processing works and the various components you'll see when you read parallel query plans, and the final post of the week will get into some of the means by which you can control parallelism at both the server and query level.

Figure 1: A four-brained computer, frying an egg.
Background: Threads, Threading, and Other Sewing Analogies
Each CPU in your computer can do only one thing at a time, albeit very quickly. But if forced to finish an entire task before moving on to the next, certain tasks would seem choppy and sluggish even running on extremely fast processors. To get around this issue modern operating systems are designed to be able to do lots of things at once (concurrently). This is done by taking advantage of the fact that the CPU can switch back and forth between tasks almost as quickly as it can get them done. Each program on your computer runs under a process space--a virtual construct to which resources such as memory and CPU time can be granted. And each process can use one or more logical threads to do work on the CPU. Under the context of individual logical threads, a program is able to appear to do more than one thing at once. In reality the way this works is via time slicing.
The Windows operating system includes a component called a scheduler, whose job it is to control which programs get CPU time and when. Operating system designers have come up with various methodologies that can be used to control CPU scheduling, and Windows uses a system called preemption. The idea behind preemption (or preemptive scheduling) is that by default each active thread (i.e., a thread that actually has some work to do) gets an equal amount of time on the CPU. The amount of CPU time is split equally, and the CPU switches back and forth between the various tasks vying for its attention. Each equal CPU time slice is called a quantum. In reality, some threads are more important than others, and these are said to have a higher priority. In a preemptive environment, the higher priority threads are able to force the lower priority threads to yield, in order that the higher priority threads can get more of the CPU time and finish their work more quickly.
Imagine yourself sitting at your desk, and imagine 10 of your co-workers standing around your desk, each asking you to help them with a different project. Now imagine that each minute you abruptly stop whatever conversation you were having and turn to the next co-worker, and then the next, over and over, until all of the projects are complete. This would quickly become confusing and overwhelming for us, but it's effectively what the CPU does as it works on multiple concurrent threads using time slicing. Each change from one thread to another is called a context switch, and although the CPU can do so much more effectively than the human brain, it's not free. If a CPU is overloaded with logical threads and is doing too many context switches, it is said to be thrashing, and can spend more time switching back and forth than getting actual work done.
SQL Server runs on top of the Windows operating system, and so it must work within the confines of the operating system's preemption model. However, internally SQL Server uses its own scheduler, which is one of the components of the SQL Server Operating System (SQLOS for short). Unlike Windows, the SQLOS scheduler uses a cooperative (also called non-preemptive) scheduling model. In this model each thread (abstracted within SQL Server by an entity called a worker) can spend as much time as it needs on the CPU until its task is finished. The idea is that preemption should not be necessary because threads will eventually have to stop and acquire non-CPU resources (e.g. data from the disk system) and will enter wait states, thereby liberating the CPU for other tasks. SQL Server's cooperative model is somewhat limited in its flexibility since again, as a Windows application, it must also abide by the operating system's preemptive model. This means that SQL Server's threads will still be subject to time slicing if there are competing processes running on the server. But in optimally-configured environments where that’s not the case, SQL Server’s cooperative scheduler will effectively be in control of the full lifetime of its threads, keeping them active until they are either finished or need to enter a wait state.
Note that for brevity I've simplified descriptions of the inner workings of SQLOS and scheduling in general. I have also completely bypassed much of the background on wait states. For more information on these topics, especially on the various wait states and how they impact performance, I highly recommend reading Tom Davidson's white paper, "SQL Server 2005 Waits and Queues."
Multitasking and Parallel Processing
Much of the detail around the execution models described in the previous section involves interaction between multiple programs, which the operating system makes us believe are running on the same CPU at the same time. This is known as multitasking, and enables me to write this post in my Web browser while two instances of SQL Server, PowerPoint, Word, and a number of other background processes are all doing work on my laptop. The quanta are so fast that I don't even notice, as I type this sentence, that the Web browser is being switched on and off of the CPU to which it's assigned.
SQL Server internally participates in a form of multitasking. Each worker within the process space is bound to a scheduler, which handles the cooperative scheduling for a single logical processor. Each scheduler can have many associated workers, but just like in the operating system only one worker at a time can actively work with the CPU. Therefore, it's possible on a dual-core server to have, e.g., 10 concurrent queries running. What's really happening is that the various queries--by virtue of the workers actually doing the processing--are switching in and out of wait states, so only up to two queries are actually consuming CPU time at any given moment. This assumes that each query uses only a single worker--which we'll see is not a safe assumption.
Above and beyond simple multitasking, SQL Server can break certain queries into component parts, each of which can be processed on a separate logical CPU. This is known as parallel processing, or, more simply, parallelism. When the query optimizer determines that a query--or, more accurately, one or more components of a query--can participate in parallelism (or can "be parallelized" or "go parallel"), two or more workers are used to do the processing necessary to satisfy the query. The various workers for all of the queries running concurrently within SQL Server are each subject to the cooperative scheduling model, and go on and off the schedulers and in and out of wait states as necessary. This means that large systems can often support hundreds or thousands of workers serving tens or hundreds of concurrent queries, even with comparatively few logical processors actually doing the work.
The Pros and Cons of Parallel Processing
Parallelism, as it turns out, is quite the double-edged sword.
On one hand, taking all of the work that a query has to do and doing it in parallel on several CPUs rather than on a single one can mean extreme performance gains. Certain queries can scale almost linearly as more CPUs are introduced. This means that for each additional CPU, run-time decreases proportionally, so a query running on 10 CPUs will run in approximately 1/10th as much time as the same query running on one CPU.
Certain operations, such as sorts, are especially well-suited to parallel processing due to the way their algorithms work. The best sort algorithms operate based on a predictable scale of N * log(N) operations per set, where N is the number of elements in the set. We can show mathematically that parallel sorts can in fact use less overall CPU time than the same sort on a single processor. If C is the number of CPUs across which a sort is being processed, then for any value of N or C greater than 1, C * ((N/C) * log(N/C)) is less than (N * log(N)). This means that as long as the cost of merging the sorted sub-ranges from each worker thread is not too high, the overall sort operation will require fewer clock ticks done in parallel than done in serial. More on this in a future post.
Taking an opposite stance on the performance question, consider that splitting up work and binding a single query to multiple workers does require some overhead. There is memory overhead in maintaining state for each worker, CPU overhead due to a greater number of context switches that must occur, and wait state overhead at the points at which worker threads must join, or synchronize with one another. In addition to these basic costs associated with actually processing queries in parallel, SQL Server has a limited number of pooled (reusable) threads to work with and if too many queries go parallel at the same time the thread pool can be exhausted, causing any new queries to be forced to wait until an active thread becomes available.
To combat these situations, SQL Server's query optimizer only creates parallel plans when parts of the query exceed a given cost threshold, meaning that the optimizer believes that there should be sufficient return to warrant the parallelism investment. This system eliminates unnecessary parallelism in many--but certainly not all--cases, but also inhibits parallelism in some situations where it may be beneficial. A DBA or developer concerned with creating a well-tuned system must understand parallelism well enough to monitor for its proper use, and must know how to modify settings appropriately to guide the decisions made by the optimizer. In this way, a balance can be achieved such that parallelism is used to make the biggest queries run faster, and not used for small queries for which there would be no gain.
Looking Forward
This concludes the first of three parts of parallelism week here at SQL University. Part two will cover parallel query plans, and part three will get into the various ways that we, as end-users, can control parallelism in the query engine. If you have any questions on what's covered in the post, please leave a comment below and I'll answer as soon as possible.
Enjoy!
|
-
I was tagged
by master blogger Aaron Bertrand and asked to identify five things
that should be removed from SQL Server. Easy enough, or so I thought...
1)
Tempdb. But I should qualify that a bit. Tempdb is absolutely
necessary for SQL Server to properly function, but in its current state
is easily the number one bottleneck in the majority of SQL Server
instances. Many other DBMS vendors abandoned the "monolithic,
instance-scoped temporary data space" years ago, yet SQL Server soldiers
on, putting more and more "stuff" into tempdb with every new version.
Some form of global tempdb may be required for temp tables and other
database-agnostic features, but for features like the version store,
index rebuilds, and DBCC, we should have a lot more control in order to
build the highest degree of performance and reliability into our
database solutions. Tempdb should be taken out back, shot a
few times, and rebirthed as a smaller, leaner global store for only
very specific types of data. Everything else should go into
specially-marked, database-scoped filegroups so that DBAs can have
finely-grained control over how and why the space is used. 2)
T-SQL Scalar UDFs. Scalar UDFs suck. There. I said it. And I'll
say it again. Did you know that scalar UDFs suck? I was just reminded of
this fact by a webcast done by Simon Sabin, but I've been saying
it for years. Scalar UDFs, at least of the T-SQL variety, are
painfully slow. And both varieties inhibit parallelism. Since their
SQLCLR counterparts aren't quite as painfully slow I'll leave them out
of this rant, but the T-SQL versions should be removed from the product.
Sure, they seem to be great during development, but as soon as the
application needs to scale they become a nightmare. Such timebombs
should not exist. SQL Server team, please defuse this situation and
replace the current UDFs with inline scalars.
3)
User-Defined Types (again, of the T-SQL variation). User-defined
"alias" types were never exactly a panacea. Oh, joy. Instead of
"VARCHAR(36)" I can refer to "AddressLine". But at least, prior to SQL
Server 2005, they could be bound to "rules" in certain cases. This, in
and of itself, was wrought with problems--I won't bore you with the
details--but then the "rules" were deprecated. So now we're left with
these aliases, and it feels much like the proverbial hammer waiting for a
nail to show up. Every once in a while I'll run into a database where
some well-meaning developer discovered alias types and went crazy. And
then--always after it was too late--discovered the numerous problems
they bring to the table (pun absolutely intended). SQL Server should
toss these out with the rest of the garbage and bring in what should
have been implemented to begin with: ANSI domains.
4)
SQL Server Management Studio. I'm going to have to agree with my
friend Buck
Woody on this one. SQL Server Management Studio should never have
been created. If I told you that I was going to take Microsoft Word,
cripple it until it became a weak text-editor lacking even spell check
support, and sell it as part of some other product, you would tell me I
was insane. Well, that's exactly what Management Studio is. Someone
decided that DBAs don't like working in Visual Studio, so the team
created a dumbed-down version of the shell, popped in a few extensions
for various SQL Server features, and there you had Management Studio.
The only problem? No one asked a real DBA what they thought, and since
all of the BI features are in Visual Studio every DBA I talk to already
has it installed anyway, in the form of BIDS. Management Studio should
be ripped out of Management Studio and brought back to where it makes
sense: As a Visual Studio plugin. This would instantly simplify the
lives of DBAs everywhere by reducing the number of environments they
need to work with, while automatically enabling a number of features,
such as proper support for plugins. A true win-win.
5)
Database Diagrams. I actually had trouble finding five things to
put onto this list. I'm generally pretty happy with SQL Server's
features. So I took a look around and this is the best thing I could
find. Honestly, my overall opinion of database diagrams is kind of
"meh." The feature has never been great. I don't use it, favoring
various third-party products or even Visio when I'm desperate. I don't
know why this feature was never enhanced to become a full ER tool, but
in its current state it's a joke. Might as well just give it up. And
that's that. Enjoy.
|
-
Thank you to everyone who attended today's 24 Hours of PASS webcast on Dynamic Management Objects! I was shocked, awed, and somewhat scared when I saw the attendee number peak at over 800. I really appreciate your taking time out of your day to listen to me talk.
It's always interesting presenting to people I can't see or hear, so I relied on Twitter for a form of nearly real-time feedback. I would like to especially thank everyone who left me tweets both during and after the presentation. Your feedback is extremely helpful.
Per the request of a few people, the deck and demos are attached. Note that these demos cover slightly more depth than what I did in the talk, since I only had 40 minutes and decided to go demoless (which I prefer in webcasts). This talk is also available in a live format, with full demos, for user groups and conferences; if you arrange either and are interested, contact me through the blog and we can discuss. Again, thank you so much for your time. Feel free to leave a comment here if you have any followup questions on the material, and I'll do my best to answer.
|
|
|
|
|
|