THE SQL Server Blog Spot on the Web
Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Adam Machanic

Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.

Scalar functions, inlining, and performance: An entertaining title for a boring post

Scalar.  Function.

Wow.

Could any other combination of words evoke the same feeling of encapsulation, information hiding, and simplification of client code?  After years spent developing software in the procedural and OO worlds, it can be difficult--perhaps, even impossible--to migrate over to working with SQL Server and not consider how to architect your data access logic using some of the same techniques you'd use in the application tier.

In short: Why would you ever write the same piece of logic more than once?  Answer: You wouldn't (damn it!).  And so Microsoft bestowed upon the SQL Server community, in SQL Server 2000, the ability to write scalar user-defined functions.  And they could have been such beautiful things...

But alas, reality can be painful, and as developers tried these new tools they were struck with a strange feeling of sadness as their applications buckled under the weight of what otherwise would have been a wonderful idea. As it turned out, putting all but the simplest of logic into these scalar functions was a recipe for disaster. Why?  Because they're essentially cursors waiting to happen (but they don't look like cursors, so you may not know... until it's too late.)

The central problem is that when you wrap logic in a multistatement UDF, the query optimizer just can't unwrap it too easily. And so there's really only one way to evaluate a scalar UDF: call it once per row. And that is really nothing more than a cursor.

Seeing this behavior in action is easy enough; consider the following scalar function that some poor sap DBA working for AdventureWorks might be compelled to create:

CREATE FUNCTION GetMaxProductQty_Scalar
(
@ProductId INT
)
RETURNS INT
AS
BEGIN
DECLARE @maxQty INT

SELECT @maxQty = MAX(sod.OrderQty)
FROM Sales.SalesOrderDetail sod
WHERE sod.ProductId = @ProductId

RETURN (@maxQty)
END

Simple enough, right?  Let's pretend that AdventureWorks has a bunch of reports, each of which requires maximum quantity sold per product.  So the DBA, thinking he can save himself some time and keep everything centralized (and that is a good idea), puts all of the logic into a scalar UDF.  Now, when he needs this logic, he can just call the UDF.  And if the logic has a bug, or needs to be changed, he can change it in exactly one place.  And so life is great... Or is it?

Let's take a look at a sample query:

SELECT
ProductId,
dbo.GetMaxProductQty_Scalar(ProductId)
FROM Production.Product
ORDER BY ProductId

This query does nothing more than get the max quantity sold for each product in the Productin.Product table. And a look at the execution plan or the STATISTICS IO output might indicate that there's nothing too interesting going on here: The execution plan shows an index scan (to be expected, with no WHERE clause), followed by a compute scalar operation (the call to the UDF). And STATISTICS IO shows a mere 16 reads.

So why is this query so problematic? Because the real issue is hiding just beneath the surface.  The execution plan and STATISTICS IO didn't consider any of the code evaluated within the UDF! To see what's really going on, fire up SQL Server Profiler, turn on the SQL:BatchCompleted event, and make sure you're showing the Reads column. Now run the query again and you'll see that this seemingly-innocent block of T-SQL is, in fact, using 365,247 logical reads. Quite a difference!

Each of those "compute scalar" operations is really a call to the UDF, and each of the calls to the UDF is really a new query.  And all of those queries (all 504 of them -- the number of products in the Product table) add up to massive I/O.  Clearly not a good idea in a production environment.

But luckily, we're not done here yet (or this would be a very boring post). Because while the performance penalty is a major turnoff, I really do love the encapsulation afforded by scalar UDFs.  I want them (or a similar tool) in my toolbox... And so I got to thinking.

The answer to my dilemma, as it turns out, is to not use scalar UDFs at all, but rather to use inline table-valued UDFs and treat them like scalars. This means that queries get slightly more complex than with scalar UDFs, but because the funtions are inlined (treated like macros) they're optimized along with the rest of the query. Which means, no more under-the-cover cursors.

Following is a modified version of the scalar UDF posted above:

CREATE FUNCTION GetMaxProductQty_Inline
(
@ProductId INT
)
RETURNS TABLE
AS
RETURN
(
SELECT MAX(sod.OrderQty) AS maxqty
FROM Sales.SalesOrderDetail sod
WHERE sod.ProductId = @ProductId
)

This function is no longer actually scalar--in fact, it now returns a table. It just so happens that the table has exactly one column and exactly one row, and uses the same logic as the scalar UDF shown above. So it's still scalar enough for my purposes.

The query shown above, used to retrieve the maximum quantity sold for each product, will not quite work with this UDF as-is. Trying to substitute in the new UDF will result in nothing more than a variant on an "object not found" error.  Instead, you need actually treat this function like  it returns a table (due to the fact that it does).  And that means, in this case, a subquery:

SELECT
ProductId,
(
SELECT MaxQty
FROM dbo.GetMaxProductQty_Inline(ProductId)
) MaxQty
FROM Production.Product
ORDER BY ProductId

So there it is. We're now treating the table-valued UDF more or less just like a scalar UDF.  And the difference in I/O results is really quite astounding: 1267 logical reads in this case. Meaning that the scalar UDF solution is around 288 times more I/O intensive!

The question being, of course, was it worth it? The whole thing could have been written as one query, without the need for any UDFs at all. And the final query in this case is quite a bit more complex than the previous version, in addition to the fact that the encapsulation breaks down to some degree by forcing the caller to have some knowledge of how the UDF actually works. But I do feel that this sacrifice is warranted in some cases. Although the "greatest quantity sold" example shown here is simplistic, imagine other situations in which the same code fragments or logic are used over and over, due to lack of a good way of standardizing and centralizing them.  I know I've seen that a lot in my work, and some examples I can think of have included complex logic that might very well have been easier to maintain in a UDF.

This technique may not be perfect for every case, and it certainly has its tradeoffs. But it may be a useful trick to keep in the back of your mind for a rainy day in the data center when someone's scalar UDF solution starts breaking down and you need a fix that doesn't require a massive code rewrite.


Published Friday, August 04, 2006 12:07 AM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Greg_Linwood said:

Hey Adam - nicely written post

Another issue that complicates measuring IO from UDFs is that the Profiler Reads output contains Proc Cache lookups - which are also 8K IOs, whereas statistics io doesn't. In SQL 2000 UDFs, the Proc Cache was looked up per-invocation of the UDF (per row), potentially skewing the values returned by Reads depending on the size of the cache. I've seen scenarios where the same statement (involving a UDF) recorded signficantly variant Reads during peak processing periods (when the cache is bloated) vs during average processing periods (when the cache has reduced to a smaller size).

In SQL 2005, this has been sensibly scaled back so that the proc cache is only looked up for the first invocation in a rowset, improving UDF performance somewhat by eliminating the repetitive cache lookups performed in SQL 2000..

Cheers
Greg
August 8, 2006 1:15 AM
 

glenntoy said:

Nice one.

When I started with SQL 2000 I was fascinated with its UDF feature, I used to do this way but my boss told to change it due to performance reason. As much as possible I do avoid performing a query inside the function, but in cases where it can't be avoided the best way is to optimized the query inside the function.
September 4, 2006 1:53 AM
 

Meher said:

Great Post Adam. Fascinating details of UDF. I always wondered how a Scalar UDF be converted to a Inline table valued UDF and this post actually shows that.

February 9, 2007 1:44 PM
 

Darren said:

great article, confirming what I've been seeing...really a surprise to me.  Good suggestions for handling as table functions...will try that.

December 13, 2007 2:02 PM
 

SJC said:

Thanks for this post Adam; I'm experiencing this problem at the moment and the article gives me great pointers on how to solve it to everyone's satisfaction.

February 1, 2008 11:16 AM
 

Why are Subqueries Dangerous? « Home of the Scary DBA said:

June 13, 2008 1:52 PM
 

Matt Miller said:

Adam -

Since you're using a TVF, you could "improve readability" with a CROSS APPLY statement. Since it's a single value, the perf penalties that come in from CROSS APPLY don't seem to pop up...

So it would look like

SELECT    ProductId,

         MaxQty

FROM Production.Product

CROSS APPLY dbo.GetMaxProductQty_Inline(ProductId) MaxQT

ORDER BY ProductId

June 27, 2008 10:30 AM
 

gene said:

Adam, this is a great post. I only want to add one practical detail to your post which support it.

Most of the time ( or in my experience of 15yr - all the time) programmers are not trained and do not care about performance tunning. However, application usually should be maintained for 3-5 years of it's life span.

This is usually overlooked. Finally, dba - person who did not develop the application should start tunning it and here is a big problem having UDF's and views.

When you say that UDF may bring benefits, did you take in count the fact that code which uses UDF's and views is almoust impossible to analize for performance improvements?

Try to traverse a code which use UDF which goes aganist view, which uses another 2 views or so!

This is a real price for 'sleek' UDF use. That's the real difference between experienced seasoned programmer who had met and learned this by his own experience and talanted 2-3 yr. programmer who measures his work by complexity of his queries and use of funcy new features like UDF.

I bet real good code is measured by it's ability to be troubleshouted easily too. And UDF's and views are sure means not to achieve that.

June 29, 2008 4:27 PM
 

sql server scalar function call said:

July 7, 2008 6:12 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has been involved in dozens of SQL Server implementations for both high-availability OLTP and large-scale data warehouse applications, and has optimized data access layer performance for several data-intensive applications. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "Expert SQL Server 2005 Development" (Apress, 2007) and "Inside SQL Server 2005: Query Tuning and Optimization" (Microsoft Press, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server and a Microsoft Certified IT Professional (MCITP).
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement