THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Rob Farley

- Owner/Principal with LobsterPot Solutions (a MS Gold Partner consulting firm), Microsoft Certified Master, Microsoft MVP (SQL Server), APS/PDW trainer and leader of the SQL User Group in Adelaide, Australia. Rob is a former director of PASS, and runs training courses around the world in SQL Server and BI topics.

When is a SQL function not a function?

Should SQL Server even have functions? (Oh yeah – this is a T-SQL Tuesday post, hosted this month by Brad Schulz) TSQL2sDay150x150

Functions serve an important part of programming, in almost any language. A function is a piece of code that is designed to return something, as opposed to a piece of code which isn’t designed to return anything (which is known as a procedure). SQL Server is no different. You can call stored procedures, even from within other stored procedures, and you can call functions and use these in other queries. Stored procedures might query something, and therefore ‘return data’, but a function in SQL is considered to have the type of the thing returned, and can be used accordingly in queries. Consider the internal GETDATE() function.

SELECT GETDATE(), SomeDatetimeColumn
FROM dbo.SomeTable;

There’s no logical difference between the field that is being returned by the function and the field that’s being returned by the table column. Both are the datetime field – if you didn’t have inside knowledge, you wouldn’t necessarily be able to tell which was which. And so as developers, we find ourselves wanting to create functions that return all kinds of things – functions which look up values based on codes, functions which do string manipulation, and so on.

But it’s rubbish.

Ok, it’s not all rubbish, but it mostly is. And this isn’t even considering the SARGability impact. It’s far more significant than that. (When I say the SARGability aspect, I mean “because you’re unlikely to have an index on the result of some function that’s applied to a column, so try to invert the function and query the column in an unchanged manner”)

I’m going to consider the three main types of user-defined functions in SQL Server:

  • Scalar
  • Inline Table-Valued
  • Multi-statement Table-Valued

I could also look at user-defined CLR functions, including aggregate functions, but not today. I figure that most people don’t tend to get around to doing CLR functions, and I’m going to focus on the T-SQL-based user-defined functions.

Most people split these types of function up into two types. So do I. Except that most people pick them based on ‘scalar or table-valued’. I’d rather go with ‘inline or not’.

If it’s not inline, it’s rubbish. It really is.

Let’s start by considering the two kinds of table-valued function, and compare them.

These functions are going to return the sales for a particular salesperson in a particular year, from the AdventureWorks database.

CREATE FUNCTION dbo.FetchSales_inline(@salespersonid int, @orderyear int)
RETURNS TABLE AS 
RETURN (
    SELECT e.LoginID as EmployeeLogin, o.OrderDate, o.SalesOrderID
    FROM Sales.SalesOrderHeader AS o
    LEFT JOIN HumanResources.Employee AS e
    ON e.EmployeeID = o.SalesPersonID
    WHERE o.SalesPersonID = @salespersonid
    AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101')
    AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101')
)
;
GO

CREATE FUNCTION dbo.FetchSales_multi(@salespersonid int, @orderyear int)
RETURNS @results TABLE (
    EmployeeLogin nvarchar(512),
    OrderDate datetime,
    SalesOrderID int
    )
AS
BEGIN
    INSERT @results (EmployeeLogin, OrderDate, SalesOrderID)
    SELECT e.LoginID, o.OrderDate, o.SalesOrderID
    FROM Sales.SalesOrderHeader AS o
    LEFT JOIN HumanResources.Employee AS e
    ON e.EmployeeID = o.SalesPersonID
    WHERE o.SalesPersonID = @salespersonid
    AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101')
    AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101')
    ;
    RETURN
END
;
GO

You’ll notice that I’m being nice and responsible with the use of the DATEADD function, so that I have SARGability on the OrderDate filter.

Regular readers will be hoping I’ll show what’s going on in the execution plans here. Here I’ve run two SELECT * queries with the “Show Actual Execution Plan” option turned on.

image

Notice that the ‘Query cost’ of the multi-statement version is just 2% of the ‘Batch cost’. But also notice there’s trickery going on. And it’s nothing to do with that extra index that I have on the OrderDate column.

Trickery.

Look at it – clearly, the first plan is showing us what’s going on inside the function, but the second one isn’t. The second one is blindly running the function, and then scanning the results. There’s a Sequence operator which is calling the TVF operator, and then calling a Table Scan to get the results of that function for the SELECT operator. But surely it still has to do all the work that the first one is doing...

To see what’s actually going on, let’s look at the Estimated plan.

image

Now, we see the same plans (almost) that we saw in the Actuals, but we have an extra one – the one that was used for the TVF. Here’s where we see the inner workings of it. You’ll probably recognise the right-hand side of the TVF’s plan as looking very similar to the first plan – but it’s now being called by a stack of other operators, including an INSERT statement to be able to populate the table variable that the multi-statement TVF requires.

And the cost of the TVF is 57% of the batch!

But it gets worse.

Let’s consider what happens if we don’t need all the columns. We’ll leave out the EmployeeLogin column.

image

Here, we see that the inline function call has been simplified down. It doesn’t need the Employee table. The join is redundant and has been eliminated from the plan, making it even cheaper. But the multi-statement plan runs the whole thing as before, only removing the extra column when the Table Scan is performed.

A multi-statement function is a lot more powerful than an inline one. An inline function can only be the result of a single sub-query. It’s essentially the same as a parameterised view, because views demonstrate this same behaviour of extracting the definition of the view and using it in the outer query. A multi-statement function is clearly more powerful because it can contain far more complex logic.

But a multi-statement function isn’t really a function at all. It’s a stored procedure. It’s wrapped up like a function, but behaves like a stored procedure. It would be completely unreasonable to expect that a stored procedure could be simplified down to recognise that not all the columns might be needed, but yet this is part of the pain associated with this procedural function situation.

The biggest clue that a multi-statement function is more like a stored procedure than a function is the “BEGIN” and “END” statements that surround the code. If you try to create a multi-statement function without these statements, you’ll get an error – they are very much required. When I used to present on this kind of thing, I even used to call it “The Dangers of BEGIN and END”, and yes, I’ve written about this type of thing before in a similarly-named post over at my old blog.

Now how about scalar functions...

Suppose we wanted a scalar function to return the count of these.

CREATE FUNCTION dbo.FetchSales_scalar(@salespersonid int, @orderyear int)
RETURNS int
AS
BEGIN
    RETURN (
       
SELECT COUNT(*)
        FROM Sales.SalesOrderHeader AS o
        LEFT JOIN HumanResources.Employee AS e
        ON e.EmployeeID = o.SalesPersonID
        WHERE o.SalesPersonID = @salespersonid
        AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101')
        AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101')
    );
END
;
GO

Notice the evil words? They’re required. Try to remove them, you just get an error.

That’s right – any scalar function is procedural, despite the fact that you wrap up a sub-query inside that RETURN statement. It’s as ugly as anything. Hopefully this will change in future versions.

Let’s have a look at how this is reflected in an execution plan. Here’s a query, its Actual plan, and its Estimated plan:

SELECT e.LoginID, y.year, dbo.FetchSales_scalar(p.SalesPersonID, y.year) AS NumSales
FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year)
CROSS JOIN Sales.SalesPerson AS p
LEFT JOIN HumanResources.Employee AS e
ON e.EmployeeID = p.SalesPersonID;

image

image

We see here that the cost of the scalar function is about twice that of the outer query. Nicely, the query optimizer has worked out that it doesn’t need the Employee table, but that’s a bit of a red herring here. There’s actually something way more significant going on.

If I look at the properties of that UDF operator, it tells me that the Estimated Subtree Cost is 0.337999.

image

If I just run the query SELECT dbo.FetchSales_scalar(281,2003); we see that the UDF cost is still unchanged.

image

You see, this 0.0337999 is the cost of running the scalar function ONCE.

But when we ran that query with the CROSS JOIN in it, we returned quite a few rows. 68 in fact. Could’ve been a lot more, if we’d had more salespeople or more years.

image

And so we come to the biggest problem. This procedure (I don’t want to call it a function) is getting called 68 times – each one between twice as expensive as the outer query. And because it’s calling it in a separate context, there is even more overhead that I haven’t considered here.

The cheek of it, to say that the Compute Scalar operator here costs 0%! I know a number of IT projects that could’ve used that kind of costing method, but that’s another story that I’m not going to go into here.

Let’s look at a better way.

Suppose our scalar function had been implemented as an inline one. Then it could have been expanded out like a sub-query. It could’ve run something like this:

SELECT e.LoginID, y.year,
(SELECT COUNT(*)
    FROM Sales.SalesOrderHeader AS o
    LEFT JOIN HumanResources.Employee AS e
    ON e.EmployeeID = o.SalesPersonID
    WHERE o.SalesPersonID = p.SalesPersonID
    AND o.OrderDate >= DATEADD(year,y.year-2000,'20000101')
    AND o.OrderDate < DATEADD(year,y.year-2000+1,'20000101')
    ) AS NumSales
FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year)
CROSS JOIN Sales.SalesPerson AS p
LEFT JOIN HumanResources.Employee AS e
ON e.EmployeeID = p.SalesPersonID;

image

Don’t worry too much about the Scan of the SalesOrderHeader underneath a Nested Loop. If you remember from plenty of other posts on the matter, execution plans don’t push the data through. That Scan only runs once. The Index Spool sucks the data out of it and populates a structure that is used to feed the Stream Aggregate. The Index Spool operator gets called 68 times, but the Scan only once (the Number of Executions property demonstrates this).

Here, the Query Optimizer has a full picture of what’s being asked, and can make the appropriate decision about how it accesses the data. It can simplify it down properly.

To get this kind of behaviour from a function, we need it to be inline. But without inline scalar functions, we need to make our function be table-valued. Luckily, that’s ok.

CREATE FUNCTION dbo.FetchSales_inline2(@salespersonid int, @orderyear int)
RETURNS table
AS
RETURN (SELECT COUNT(*) as NumSales
    FROM Sales.SalesOrderHeader AS o
    LEFT JOIN HumanResources.Employee AS e
    ON e.EmployeeID = o.SalesPersonID
    WHERE o.SalesPersonID = @salespersonid
    AND o.OrderDate >= DATEADD(year,@orderyear-2000,'20000101')
    AND o.OrderDate < DATEADD(year,@orderyear-2000+1,'20000101')
);
GO

But we can’t use this as a scalar. Instead, we need to use it with the APPLY operator.

SELECT e.LoginID, y.year, n.NumSales
FROM (VALUES (2001),(2002),(2003),(2004)) AS y (year)
CROSS JOIN Sales.SalesPerson AS p
LEFT JOIN HumanResources.Employee AS e
ON e.EmployeeID = p.SalesPersonID
OUTER APPLY dbo.FetchSales_inline2(p.SalesPersonID, y.year) AS n;

And now, we get the plan that we want for this query.

image

All we’ve done is tell the function that it’s returning a table instead of a single value, and removed the BEGIN and END statements. We’ve had to name the column being returned, but what we’ve gained is an actual inline simplifiable function. And if we wanted it to return multiple columns, it could do that too. I really consider this function to be superior to the scalar function in every way.

It does need to be handled differently in the outer query, but in many ways it’s a more elegant method there too. The function calls can be put amongst the FROM clause, where they can then be used in the WHERE or GROUP BY clauses without fear of calling the function multiple times (another horrible side effect of functions).

So please. If you see BEGIN and END in a function, remember it’s not really a function, it’s a procedure. And then fix it.

@rob_farley

Published Tuesday, November 08, 2011 11:01 AM by Rob Farley

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

wilfred van dijk said:

Impressive under-the-hood information about the dark side of functions!

November 8, 2011 2:53 AM
 

John Hennesey said:

Very nice post!  I have had to explain this numerous times to folks when their query takes > 30 mins to run and by inlining things it goes down to 5-10 seconds.  Now I can stop explaining and refer them to this post. :)  thank you very much!

November 8, 2011 8:49 AM
 

Joe Celko said:

Nice article. I alwasy felt that the UDF was added to make proceural programmers feel good and to help them avoid learning declarative programming :)

November 8, 2011 10:12 AM
 

ALZDBA said:

Great post ! Very informative.

A ref to keep at hand.

[i]It will be a heck of a job to put that to a guitar tune :-)[/i]

November 8, 2011 2:56 PM
 

Rob Farley said:

Thanks everyone. I'm going to write a follow-up to this about the use of stored procedures in SSRS. It'll be a little controversial. But I'm moving house this week, so it might not be for a few days.

And ALZDBA - it could be okay, perhaps something along the lines of "You're not what you appear to be, you shouldn't run procedurally" :)

November 9, 2011 12:53 AM
 

Cade Roux said:

I always thought inline TVFs should have been named parametrized views and grouped with them instead.  Their treatment by the optimizer and performance profile is much more closely aligned with that.

I very much hope one day we will get inline scalar functions.  It is a shame that very simple utility scalar functions are so much more easily outperformed by inline code.  For maintainability, sometimes some simple scalar functions can be very useful - it would be nice if you didn't have to take such a performance hit to use them.

November 14, 2011 11:17 AM
 

Rob Farley said:

Little Bobby Tables’ mother says you should always sanitise your data input . Except that I think she’s

November 21, 2011 10:57 PM
 

Paul White: Page Free Space said:

The humble Compute Scalar is one of the least well-understood of the execution plan operators, and usually

September 4, 2012 6:22 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement