Linchi Shea

Checking out SQL Server via empirical data points

There is a limit to set based solutions

It typically goes like this.

 

You started with procedural solutions because that’s a ‘natural’ way of solving a problem. Then, you read articles and books and went to presentations, and everyone was preaching set-based solutions. You felt ashamed and started experimenting with set-based solutions. Your initial success plus even more bombardment from set-based solution best practice preachers led you to apply set-based solutions everywhere, or more modestly whenever possible. But as you ran into more real-world problems, you started to bump into cases that led you to question whether set-based solutions were really the panacea to cure the world’s SQL performance problems.

 

Well, like life, there is a limit to everything. The limit to set-based solutions in the real world is the limit of the SQL optimizer. The set-based best practice preachers assume that the optimizer is almighty. You don’t need to tell it how; just tell it what you want in the form of a SQL query. It’ll figure out the how for you.

 

A case in point is the use of views. Ideally, a view encapsulates what you want in its resultset. And then you want something else, so you just use a view, or two, in another view, and the latter gives you what you want without your being bothered with the how question. As you heap more and more views on top of each other, you get great satisfaction from the quintessential set-based solution because at the top of the view stack you can just do a simple SELECT ABC FROM v_YourView, and let SQL Server figure out how. Simple and elegant, exactly what you expect from a set-based solution!
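
To make this concrete, here is a minimal sketch of the pattern, with made-up table and view names (not from any real system):

CREATE VIEW dbo.v_OrderTotals AS
    SELECT o.CustomerID, SUM(od.Quantity * od.UnitPrice) AS OrderTotal
    FROM dbo.Orders AS o
    JOIN dbo.OrderDetails AS od ON od.OrderID = o.OrderID
    GROUP BY o.CustomerID;
GO

CREATE VIEW dbo.v_CustomerRevenue AS
    SELECT c.CustomerID, c.Region, t.OrderTotal
    FROM dbo.Customers AS c
    JOIN dbo.v_OrderTotals AS t ON t.CustomerID = c.CustomerID;
GO

CREATE VIEW dbo.v_YourView AS
    SELECT Region, SUM(OrderTotal) AS RegionRevenue
    FROM dbo.v_CustomerRevenue
    GROUP BY Region;
GO

-- The "simple and elegant" query at the top of the stack:
SELECT Region, RegionRevenue FROM dbo.v_YourView;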

 

Except that you may reach a point where the simple SELECT runs for much longer than you can tolerate. So you start to investigate, and start to add a bit of procedural solution here and there: a hint here or a hint there. And then you realize that a little bit of procedural nudge isn’t enough to get you the performance you need, and you start to break the seemingly simple/elegant but really complex query into smaller pieces and stitch them together with procedural code.
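
The decomposition typically ends up looking something like this (again, a sketch with made-up names): materialize the expensive intermediate result once, index it, and join to it.

SELECT o.CustomerID,
       SUM(od.Quantity * od.UnitPrice) AS OrderTotal
INTO   #OrderTotals
FROM   dbo.Orders AS o
JOIN   dbo.OrderDetails AS od ON od.OrderID = o.OrderID
GROUP  BY o.CustomerID;

CREATE CLUSTERED INDEX cx_OrderTotals ON #OrderTotals (CustomerID);

SELECT c.Region, SUM(t.OrderTotal) AS RegionRevenue
FROM   dbo.Customers AS c
JOIN   #OrderTotals AS t ON t.CustomerID = c.CustomerID
GROUP  BY c.Region;

DROP TABLE #OrderTotals;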

 

After all, some procedural solutions are not so bad!

 

Now, it’s time for me to dodge the set-based solution crowd.

Published Friday, December 12, 2008 1:18 PM by Linchi Shea


Comments

 

Jack Corbett said:

I would consider myself part of the "set based" crowd most of the time, but I also am a realist and believe you design a solution that meets the need.  That may mean a set-based solution, a procedural one, or a combination.  

December 12, 2008 1:38 PM
 

Greg H said:

"As you heap more and more views on top of each other"...

"the simple SELECT now runs for much longer than you can tolerate"...

"start to add a bit of procedural solution here and there by adding a hint or a hint there"...

"start to break the seemingly simple but really complex query into smaller pieces and stitch them together with procedural code"

That's your primary problem!  It sounds like you're trying to bake a cake without following the recipe.  Also, how does "adding a hint or a hint there" = procedural solution?  You sound very confused...  I'm a big believer in set-based solutions and only in rare scenarios should you need to use a procedural-based solution (b/c a set-based solution just won't work).  

Please don't lead others to believe that set-based solutions aren't optimal.  I deal with too many of those people today as-is!

December 12, 2008 1:48 PM
 

Michael Swart said:

Just yesterday I put preliminary results into a temporary table before using it in a query later. We shouldn't be afraid of any particular solution, just because it comes from a different point of view.

I'm reminded of the normalization/denormalization debate, for example. Nobody has trouble with indexes or indexed views, even though these are just denormalization techniques (talk about redundant data!) that we can all feel good about.

By the way, I've soooo been there with the nested views. People who design queries usually mean to simplify queries, but I've seen the opposite effect.

December 12, 2008 1:52 PM
 

Linchi Shea said:

> That's your primary problem!  It sounds like you're trying to bake a cake without following the recipe.

Not my problem, just the problems I have to deal with.

December 12, 2008 2:03 PM
 

Linchi Shea said:

> Also, how does "adding a hint or a hint there" = procedural solution?

I don't know any set theory that has hints as its basic construct. When you give a hint, you are giving a 'procedural' directive on how to do it. Sure, this is more semantic than substantial. But that's how arguments go in this realm.
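
For illustration (the names are made up), the statement below says only what is wanted; the OPTION clause is a directive about how to get it:

-- Declarative part: what we want.
SELECT c.Region, SUM(t.OrderTotal) AS RegionRevenue
FROM   dbo.Customers AS c
JOIN   dbo.v_OrderTotals AS t ON t.CustomerID = c.CustomerID
GROUP  BY c.Region
-- Procedural nudge: how to do it (join algorithm, degree of parallelism).
OPTION (HASH JOIN, MAXDOP 1);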

December 12, 2008 2:23 PM
 

Adam Machanic said:

Really interesting argument, Linchi.  I had never thought of hints that way before, and now that I am I agree 100%.  The whole point of "set-based" is to program declaratively rather than imperatively and hints ruin the abstraction.  

Also, agreed that hints just can't get you there sometimes.  I was tuning a complex query recently that was generating a really horrible plan and taking around 30 minutes to run.  I started hinting it up and eventually hit a wall where the QO said that it couldn't produce a plan due to the hints.  So I broke it apart into some temp tables and now the thing runs in around 30 seconds... Nothing wrong with temp tables--or even the occasional cursor--used judiciously, in the right ways.  The problem is that most people use them liberally, in all the wrong ways :-)

December 12, 2008 2:47 PM
 

Alexander Kuznetsov said:

Linchi,

The following is very well written:

"The limit to set-based solutions in the real world is the limit of the SQL optimizer."

Yet I also think that the limit of us humans also limits set-based solutions, and at about the same threshold. About the time the query becomes too big to be understood easily, the optimizer starts having problems with it too, and the reason is simple: as the complexity of the query grows linearly, the time it takes to understand and optimize it grows exponentially.  Then it is time to split the monster query up...

December 12, 2008 3:16 PM
 

Ward Pond said:

Great post, Linchi.  And yes, I too am a member of the set-based mafia.

I've always preached that our ultimate goal should be to generate code that speaks lovingly to the optimizer.  There is certainly more than one route to that goal, and our journey is obviously complicated by the number of people who have run with scissors in the past (i.e., designed a schema and/or an API) without really being qualified to do so.

IMO, set-based thinking should be our first and primary approach... but rule number one is that there are always exceptions.  If there weren't, DECLARE CURSOR would've been deprecated a long time ago.

December 12, 2008 3:21 PM
 

Gianluca said:

To me it is very simple: you favor the set-based solution, then you optimize only what really needs to be optimized, using the most effective technique. The best thing, if possible, would be to leave the original implementation somewhere, with an explanation of why you changed it. Optimizers get better, and eventually you will be able to go back to the original implementation. Trying to be smart everywhere can render the code less understandable and will eventually prevent future versions of the optimizer from exploiting new techniques. To be honest, I think most people from the "set based" crowd know when to make compromises. OTOH I see a lot of problems (performance or less maintainable/readable code) caused by people going down the procedural road just because they don't know how to do something using a set-based solution.

December 13, 2008 12:31 PM
 

Scott R. said:

Linchi,

Glad to see you continue the side discussion we started in your earlier post (http://sqlblog.com/blogs/linchi_shea/archive/2008/12/04/executing-a-large-number-of-sql-scripts.aspx).  This topic is definitely worth exploring further.

The comment by Gianluca about favoring set-based solutions and optimizing after the fact only if needed seems sensible to me.  It is possible that an overly complex “monolithic” set-based solution may perform poorly (as Adam M. mentioned), and would benefit from being manually decomposed into simpler set-based solutions using temp tables (much as a smarter optimizer would do if it could).

The set-based versus procedural-based topic spectrum is also similar to the “normalize / denormalize” topic spectrum that DBAs face with database design.  I always thought the advice of “normalize till it hurts, then denormalize till it works” was reasonable (see http://sqlblog.com/blogs/paul_nielsen/archive/2008/10/03/denormalize-for-performance.aspx) – and only denormalizing if and when needed after it is proven that the normalized solution won’t yield the desired objectives (performance, etc.).

Other spectrums of choices that have required similar favored positions include:

-  Managed versus unmanaged code for performance / compatibility / etc.

-  Even further back: 4GLs versus 3GLs versus assembly versus machine code

I have seen the set-based versus procedural-based spectrum also called:

-  Declarative-based (instead of procedural-based)

-  Specification-based (versus set-based)

-  Set-at-a-time versus row-at-a-time

The words differ, but the song remains the same.

In any case, my point was in favor of a simpler, more compact solution for getting the desired results by specifying more of “what is to be done” and less of “how it is to be done” where possible (with the understanding that it still needs to be done efficiently), without being so cryptic as to make it hard to understand or to change / maintain later.  A delicate balance, no doubt!

Your earlier interest in comparing Perl solutions to similar PowerShell solutions brings up the point that set-based considerations are not limited to RDBMS data sources.  PowerShell cmdlets provide .NET object resultsets from a variety of non-relational data sources (folders, files, processes, WMI data sources, etc.) and a basic pipeline to foster composite component solutions.  SQL Server Integration Services (SSIS) also provides similar pipeline-based, component solutions that leverage declarative approaches and can also use procedural approaches when needed.

Jamie Thomson describes it better than I could in his recent post comparing the “data integration” similarities and differences between PowerShell and SSIS pipelines:

http://blogs.conchango.com/jamiethomson/archive/2008/12/09/talking-pipelines-ssis-and-powershell.aspx

One item that Jamie covers is that when the PowerShell pipeline automatically “wires” the data content flows from one step to the next, it looks ahead at the data type required for a given field in the next step and automatically casts the corresponding field being passed from the previous step to that needed data type (in contrast to SSIS where pipeline data flows and data type change casting are manually configured).  Pretty neat!

I’m not a PowerShell bigot, or a VBScript bigot, or a Perl bigot.  I’m a “what works best for the least effort and is the easiest to maintain” bigot, which probably means using a mixed collection of tools with proven application models in sensible ways.

Scott R.

December 13, 2008 6:41 PM
 

steve dassin said:

This is a response I elicited from MS that paints the query optimization picture from their point of view. My point was that developers should develop and not be forced to become database engineers having to guess and twiddle about their database. From the thread:

microsoft.public.sqlserver.programming

Dec 1 2006

"MS, You have got to be kidding!"

http://tinyurl.com/62mtmu

The MS response is from: Roger Wolter[MSFT]

"I could spend the next six months blue-printing the engine in my Toyota and get quite a bit more power and speed out of it.  Does that mean Toyota builds junk or does it mean not everybody needs the ultimate in power and speed. Optimization is an extremely processor-intensive activity.  Doing a complete optimization of even a moderately complex query can take significantly longer than executing the query.  For this reason query optimizers have to

compromise.  There are generally two things all cost-based optimizers do. They go for a "good enough" solution rather than a perfect one.  Good enough means as good a solution as is available in a reasonable optimization time. If you as a DBA aren't satisfied with the result you can use hints to help the optimizer chose a better plan.  If you can live with a sub-optimal plan in a few cases then you let the optimizer do its thing.  There are hundreds

of thousands of people running SQL Server databases who have no idea what a hint is and get completely satisfactory results - just as there are thousands of Toyotas running with stock engines that manage to make it to

the shopping center.

The other thing that optimizers do to get better plans without excessive compilation time is they remember the optimization for a particular query so when they see the same query again they can use the plan they have already found.  The more cached queries the optimizer can find, the fewer optimization is has to do which in turn means it can spend more time doing each optimization because the effort is amortized over more executions of the query.  This is one of the main things SQL Server 2005 did to improve optimizer performance.  It used a variety of methods to increase the number

of plans that are reused.  One of the ways this can be done is through better parameterization.  For example a query plan hat computes the average housing price in Washington can also be used to compute the average housing price in Oregon if the state is treated as a parameter and not part of the query.  SQL Server 2005 does a better job of this than previous versions. This has a couple consequences.  First, the initial optimization pass may take a little longer because the optimization is more thorough and second,

if the value of the parameter used for the initial optimization isn't typical, it can lead to sub-optimal plans for later queries that reuse the plan.  For example, if the first execution was done for Wyoming where there aren't a lot of houses, the plan chosen may not work well for California or New York where there are quite a few.  If this is an issue, you can either run the query for New York after you bring the database up or use a hint to tell the optimizer to optimize for New York no mater which state the query

specifies - or tell the optimizer to optimize every time instead of using a cached plan.  The fact that the optimizer allows you to compensate for conditions that would otherwise lead it to make a bad decision seems like a good thing to me.

The other thing to remember is that in the vast majority of queries, a "good enough" plan will provide adequate performance so unless you have a huge database with very complex queries against very non-uniform data distributions, you don't ever have to worry about this.  I'm not an ace mechanic but my Toyota gets me to work every day.  If I decide to become a street racer I may need to learn more about engine rebuilding but until then my current ignorance doesn't affect my driving experience."

"LINQ - like pretty much ORDB is intended to expose data in a way that makes sense to a propeller-head developer who doesn't know which end of a database to hold.  The assumption is that if you know enough about SQL to want to use hints you'll use SQL. If a DBA wants to affect the query plans used by a LINQ application he can

use Plan Guides which were developed in SQL 2005 to support optimizer hints in environments (like LINQ) that don't support hints - hard for you to believe I'm sure but occasionally MS does know what they're doing."
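
For reference, the two mechanisms described in the quoted response map to the OPTIMIZE FOR and RECOMPILE query hints. A minimal sketch, using a made-up Houses table:

DECLARE @State char(2);
SET @State = 'WY';

-- Pin the plan to a representative value, no matter what @State holds at run time...
SELECT AVG(Price) FROM dbo.Houses WHERE State = @State
OPTION (OPTIMIZE FOR (@State = 'NY'));

-- ...or ask for a fresh optimization on every execution instead of reusing a cached plan.
SELECT AVG(Price) FROM dbo.Houses WHERE State = @State
OPTION (RECOMPILE);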

Me, I tend toward the propeller crowd. But that puts me in the "good enough" category. If I want to crawl around inside something I can think of better things than sql server. And how would you like a doctor telling you the treatment you're getting is good enough for YOU:) But MS must feel their "good enough" is "Good'n Plenty":)

December 13, 2008 11:41 PM
 

ALZDBA said:

In many cases, the problems with nested functional views can "simply" be solved by rewriting the query to perform the essence of the task it needs to do.

The SQL engine will resolve every view on its own; it does not know that, e.g., parts of the functional view are useless but need to be covered once in the query, ...

Indeed, overusing views will lead to run-time materialization of data (tempdb).

The whole point of rewriting nested views is to obey the first rule of any RDBMS: tell the system what you know.

December 14, 2008 3:32 PM
 

RussellH said:

Linchi or anyone,

Can someone please provide an example (with tables and scripts) that show an obvious case where the optimizer consistently gets a bad plan that can easily be fixed by either hints, procedural code, or procedural code with temporary tables?  A complete example with scripts to generate tables and data and the offending SQL would be great.  I know various database settings will also matter, but there could be examples that apply regardless of many settings.

I think most of us have seen queries like this in the field, but it would be good to see at least one generic case that anyone could reproduce.  Most of the examples that I have seen were quite complicated, and even after I "fixed" the performance of a particular query using procedural code, I always had the nagging suspicion that just rearranging the original SQL might have provided a better plan.

December 16, 2008 12:15 PM
 

Linchi Shea said:

> an obvious case where the optimizer consistently gets a bad plan

> that can easily be fixed by either hints, procedural code, or

> procedural code with temporary tables

If it's obvious and consistently wrong, they probably would already have fixed it. None of the cases I've seen is obvious. Well, they are probably obvious in that the queries are complex.

I don't have an obvious one that can be easily covered in a blog post. But I would think if you keep nesting views on top of views, you probably will hit one sooner or later.

December 16, 2008 3:39 PM
 

Jiang said:

SQL is set-based, period. Any procedural code is the result of a bad design/coding somewhere in your solution.  

In the early years of my SQL programming, I found myself defending the procedural type of SQL programming because it seemed to be easier to follow.  Now, as I get deeper into the game, I know there is no excuse for doing bad-as* procedural SQL programming.  Stop excusing yourself and start learning SQL as it is.

December 16, 2008 11:55 PM
 

Armando Prato said:

I code in SETs about 99% of the time and I rarely use any cursor logic except under specific circumstances.  I think the main problem lies in that a lot of SQL, A LOT, is written by non-db programmers who way overuse looping constructs.  When I started at my present position, I was tasked to optimize an upgrade script that ran in - get this - 3 DAYS.  It was a giant series of loops within loops.  Using SETs with the odd indexed temp table here and there, it was cut down to 3 minutes.  I think the point should be driven home that you should always look to solve your problem using SETs first before considering other methods.
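
For illustration, the kind of rewrite being described, reduced to a toy example (table and column names are made up):

-- Row-at-a-time: a loop that touches one row per iteration.
DECLARE @ID int;
SELECT @ID = MIN(ProductID) FROM dbo.Products;
WHILE @ID IS NOT NULL
BEGIN
    UPDATE dbo.Products SET ListPrice = ListPrice * 1.10 WHERE ProductID = @ID;
    SELECT @ID = MIN(ProductID) FROM dbo.Products WHERE ProductID > @ID;
END

-- The equivalent set-based statement: one pass over the table.
UPDATE dbo.Products SET ListPrice = ListPrice * 1.10;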

December 17, 2008 12:46 PM
 

richard said:

Hi,

I'm not sure who the audience is for this but I'm a DBA/query optimizer at an internet company who works with C#/web programmers. Most of our procedural programmers don't even know what a set solution is and they're programming all DB stuff procedurally; that really hurts our efficiency. The difference between a query that runs in 500ms vs. 1500ms makes a huge difference to us.

I would say we are not even close to approaching our set based solution limits. We DBA's have a lot of education to do. But I commend you for opening up the topic for discussion.

Thank you,

Richard

December 17, 2008 1:17 PM
 

Peter said:

I liked the article, and IMHO it is right in pointing out that there is more to databases than making sets out of everything. That said, there is no question in my mind that set-based solutions currently run best on the current implementations of SQL Server. Keep this in mind when I make my following comments.

However... a lot of SQL wizardry to make code faster replaces loops with not-so-obvious and often hard-to-read set-based code that highly depends on built-in SCALAR functions to make it work.

Thus, if you agree that SQL is about WHAT you want to get done and what the result should look like... then nearly all set-based code is in fact also procedural in nature. It just operates on or loops over sets of data (very simplified).

Time for a mind game:

1. Take the built-in functions charindex() and substring()...

2. Then build a user-defined, schema-bound, deterministic scalar function based on substring() that implements the same logic as charindex() does.

3. Next, replace charindex() in one of your very fast "set based" solutions with the new function. Theoretically, an optimal system would execute it just as fast. It does the same job, in the same way, and with identical invocations and limitations.

But you will find that the user-defined function (UDF) slows things down to a crawl. But remember, there really is no difference at all... so why is performance suddenly horrible? It is still just as set-based as before!
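
A minimal sketch of such a function (illustrative only; the comparison table and column are made up):

-- A schema-bound, deterministic scalar UDF that re-implements CHARINDEX-style
-- searching on top of SUBSTRING.
CREATE FUNCTION dbo.ufn_CharIndex (@find varchar(100), @source varchar(8000))
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @i int, @len int;
    SET @i = 1;
    SET @len = LEN(@find);
    WHILE @i <= LEN(@source) - @len + 1
    BEGIN
        IF SUBSTRING(@source, @i, @len) = @find
            RETURN @i;          -- found: return the 1-based position
        SET @i = @i + 1;
    END
    RETURN 0;                   -- not found
END
GO

-- Swap it into an otherwise identical "set-based" query and compare timings:
SELECT COUNT(*) FROM dbo.SomeTable WHERE CHARINDEX('abc', SomeColumn) > 0;
SELECT COUNT(*) FROM dbo.SomeTable WHERE dbo.ufn_CharIndex('abc', SomeColumn) > 0;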

There are two conclusions I would make:

1. It is all about the implementation!!!

There are quite a few things that Microsoft could do to make the procedural code presented in my mind game run at close to half the speed (maybe even better) of the pre-compiled and optimized built-in functions. And since we already established it has nothing to do with sets, we can rule out the algebra from the optimizer as the cause of things in this particular case.

For our particular mind game that leaves us with:

* The function's code is not recognised and compiled as procedural code (despite being deterministic and not using sets); it is executed as multiple statements that do scalar operations, each time with overhead that, spread over large sets, would be irrelevant.

* Local variable management is hopelessly outdated.

* Code is not really compiled to optimized executable code, but is interpreted instead.

2. "Set-based" is very much a misunderstood term!

Most often by those advocating it as the best thing since sliced bread. Their claims are based on their experience with SQL, without understanding the many more layers of execution involved beneath that SQL surface.

An analogy:

Just because you live in Iceland and it's always cold there in your experience, that does not mean that Iceland is incapable of warm weather. To make the strong claim that Iceland cannot be warm, you must understand the fundamentals that give rise to warm and cold. In the end... there is nothing magic about Iceland, and it can be warm there.

Now back to SQL Server... I just wanted to make an easy-to-understand analogy first without stepping on anyone's toes here!

In the end, set-based solutions are executed instruction by instruction, processing sets of data; there is nothing "set based" about the CPU it all runs on.

At a higher level the optimizer can use algebra magic to come up with ways to process data with less IO, but obviously that does not apply to the mind game we just did before.

Set-based solutions just declare their operations on whole sets of data. That's it: the performance rests purely on the implementation, where algebra and optimized IO are used to perform as well as possible.

Final thoughts:

I have done my share of set-based wizardry to optimize code for SQL Server. Most of the time this results in insanely fast performance compared to procedural code, but...

Also, having programmed assembler and C/C++ etc., I know that even SQL Server is DOG SLOW on many simple tasks that involve little or no I/O. Tasks higher-level SQL code heavily depends on! For others to claim that it is set-based code that makes one solution naturally superior to procedural solutions is laughable from this point of view.

For my result to materialize in SQL Server I have to do all kinds of irrational things and workarounds, just to keep within the set-optimized parts of the current implementation and hope the optimizer will get the idea of what I want to have done.

In fact, most developers would look in horror at the resulting code, as it is very hard to spot what the real intention of it all is behind clouds of "optimizations".

In my view a good optimizer translates straightforward, human-readable, easy-to-understand requests into optimal execution times with correct results. Procedural solutions aren't conceptually ruled out by this, but they are neglected way too much in the current implementation to be competitive.

Some might not appreciate the things I write here, but I think I have shown to anyone who understands the mind game that there is much more to it than set-based vs. procedural code. You just have to have faith in me when I say that optimized compiled code can be much faster than what you experience today in SQL Server.

December 18, 2008 1:17 PM
 

Russell Fields said:

An article from 2002 discusses the problem of finding the row in an ordered set where some limiting value is reached.  For example, the sum of the current row and all earlier rows <= 10000.  This article offers "set-based", "procedural", and "hybrid" solutions.  Guess which was fastest (at the time).

http://www.sqlmag.com/Article/ArticleID/24146/Algorithms_Still_Matter.html

December 22, 2008 2:38 PM
 

RussellH said:

@Russell Fields

In Oracle or DB2, the set based solution is the fastest.  You can do running totals like this:

select ename, sal, sum(sal) over (order by sal, empno) as running_tot
from emp
order by sal

I don't know if SQL Server 2008 can do this.  If not, I hope it can soon.
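
ORDER BY inside OVER() for aggregates isn't available in SQL Server 2008, but the same running total can still be expressed set-based there with a correlated subquery (a sketch against the same emp table):

select e.ename,
       e.sal,
       (select sum(e2.sal)
        from emp as e2
        where e2.sal < e.sal
           or (e2.sal = e.sal and e2.empno <= e.empno)) as running_tot
from emp as e
order by e.sal, e.empno;

This is the triangular-join form: it re-reads the earlier rows for every row, which is exactly the kind of case where a hybrid or procedural approach can win on large sets.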

December 22, 2008 4:05 PM
 

Gustav Clark said:

It all depends on what you need to do, there's no one answer.

Firstly, as Peter says, the code that gets executed is procedural, no matter what it looks like in SQL, so it's a matter of what's the best way to get that written.  For straightforward work the optimiser will always win.  It should know the schema better than I do, and it can specify actions that Transact-SQL has never heard of.  At a minimum, if it can exploit parallelism I haven't a hope of beating it, so whenever possible I think in sets and trust the optimiser. Especially if I've got a one-off report to do, any delay in running straight SQL is nothing compared to the extra coding time needed for cursor-controlled loops.

There is another situation though: those big production runs with complex requirements.  In these situations I can take my time, and the extra work of coding a procedural solution may pay off.  I've just come to the end of a long struggle with an Oracle database.  A straightforward update on a large dataset was slow, taking up to 5 hours.  I chopped it about, added hints and extra indices, split it with a cursor, even transformed it into a join plus insert statements, but nothing improved it to better than 2 hours, and the resultant SQL was very messy.  Eventually I gave in and accepted a strictly procedural option - the data is now merged offline in Perl on another machine, taking 20 min, prior to a 30-min insert which would have to happen anyway.  The cost is some tricky coding which depends too much on my knowledge of the data, i.e., maintainability.

To me this solution leaves both sides clean.  I need that procedural stage, and because it's running in a proper procedural environment it delivers.  Once I'm past that bottleneck I know the database engine will do a much better job than me, so I can concentrate on the business requirements, which for me are always set-based.  Horses for courses.

December 22, 2008 5:39 PM
 

steve dassin said:

I would say for the industry in general for every one set based solution there are ten cursor ones. The setbacks clearly outnumber the setsquats. Sqlers have the situation backwards. It's really a question of is there a limit to procedural solutions! And you can see that very few actually hit that limit. From the point of view of developers, to say that 'sql server is optimized for set based processing' is at the very least irrelevant and at the very most untrue. From their view point the db is doing quite nicely thank you very much. As for developers going to sql presentations or being shamed into a set based approach I don't think so. This is more a setsquat myth. The reality is using a procedural approach is not seen as a setback:)

December 22, 2008 10:16 PM
 

Joe Celko said:

There is a formal proof that primitive recursive functions (i.e. declarative stuff) are equivalent to an automaton with a push-down stack (i.e. procedural stuff).  Big fat hairy deal.  

The long-term advantage of declarative programming is that as the engine (SQL, F#, or any other declarative language) improves, the old code will probably get better rather than stay the same.  

December 23, 2008 10:03 PM
 

