THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

T-SQL Tuesday #002: Is it XML, or Not?!?

The query optimizer is a finicky thing, and sometimes it doesn't understand exactly what you're trying to do until you give it a bit more information. The situation I'm going to describe in this post is one such case. By providing the optimizer with ever-so-slightly more data, it's possible to make some XML processing over 300 times faster.

Here's the situation: The XML I was working with was stored in a table, but not typed as XML. Rather, it had been typed as VARCHAR(MAX). This presents an interesting conundrum for the optimizer: should the query be optimized as though operations are being done on a string, or on XML?

If you would like to follow along with the examples below, here's some DDL and an INSERT statement to populate a test table using AdventureWorks data:

CREATE TABLE #myXML
(
    x VARCHAR(MAX)
)
GO

INSERT #myXML
(
    x
)
SELECT
    x
FROM
(
    SELECT
        *
    FROM AdventureWorks.Production.Product
    FOR XML PATH ('Product')
) AS y (x)
GO

Running this code will put one row into the temp table, with an XML document containing all of the product data from the AdventureWorks Production.Product table. And now, just for kicks, what if we want to pull all of the ProductIDs out of that document? Simple enough...

WITH
theXML
(
    x
) AS
(
    SELECT
        CONVERT(XML, x)
    FROM #myXML
)
SELECT
    node.value('ProductID[1]', 'INT')
FROM theXML
CROSS APPLY x.nodes('/Product') AS nodes(node)

This code isn't especially interesting or puzzling in and of itself. It converts the document(s) in the table to XML, runs them through the .nodes() function to produce one row per product node, and pulls out the first ProductID attribute found. If you've actually run the code on your end at this point, you know why I was puzzled: This code takes a full 20 seconds to run on my end. Which is a bit extreme, considering that there are only 504 products in the AdventureWorks database. In the real situation, the documents were several times bigger, and 20 seconds was over an hour in some cases. And that just wouldn't do.

And so much head-scratching ensued. And teeth gnashing. And cursing of the SQL Server programmability team. You know, a typical day at the office.

I pulled apart my code, put it back together again, and considered writing a CLR UDF to do the processing. But then I tried something on a whim:

DECLARE @x XML =
    (
        SELECT TOP(1)
            x
        FROM #myXML
    )

SELECT
    node.value('ProductID[1]', 'INT')
FROM @x.nodes('/Product') AS nodes(node)

And--shocker--this code returns all 504 ProductIDs seemingly instantly. (Actually, it takes around 28 milliseconds on my end.)

So was a cursor and document-by-document processing the answer? At first, it seemed so. But after further messing around I noticed something: adding TOP(1), so that only a single row was processed, wasn't taking too long. Could it be that the query processor was doing a lot more work than necessary, like converting the text to XML 504 times?

The TYPE directive can be used with FOR XML to make your query return the XML document typed as XML rather than typed as a string. Perhaps it would work here?

WITH
theXML
(
    x
) AS
(
    SELECT
        CONVERT(XML, x)
    FROM #myXML
    FOR XML PATH(''), TYPE
)
SELECT
    node.value('ProductID[1]', 'INT')
FROM theXML
CROSS APPLY x.nodes('/Product') AS nodes(node)

The empty path expression is needed because the TYPE directive only works in conjunction with a valid FOR XML mode. Using an empty path has zero net effect on the actual XML produced in this case. But using the TYPE directive has quite a huge effect: A reduction in query time to around 58 milliseconds on my end. The 30,000% speedup I promised you earlier.

So why does this work? A quick peek at the plans indicates that I was correct. Here's the first part of the plan for the first version of the query:

  |--Compute Scalar(DEFINE:([Expr1023]=[Expr1022]))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004], XML Reader with XPath filter.[id]))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1004]))
            |    |--Compute Scalar(DEFINE:([Expr1004]=CONVERT(xml,[tempdb].[dbo].[#myXML].[x],0)))
            |    |    |--Table Scan(OBJECT:([tempdb].[dbo].[#myXML]))
            |    |--Filter(WHERE:(STARTUP EXPR([Expr1004] IS NOT NULL)))
            |         |--Table-valued function

The second version is almost exactly the same, but for one additional iterator:

  |--Compute Scalar(DEFINE:([Expr1024]=[Expr1023]))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1005], XML Reader with XPath filter.[id]))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1005]))
            |    |--UDX(([Expr1004]))
            |    |    |--Compute Scalar(DEFINE:([Expr1004]=CONVERT(xml,[tempdb].[dbo].[#myXML].[x],0)))
            |    |         |--Table Scan(OBJECT:([tempdb].[dbo].[#myXML]))
            |    |--Filter(WHERE:(STARTUP EXPR([Expr1005] IS NOT NULL)))
            |         |--Table-valued function

Notice the "UDX" iterator? That's an XML iterator that handles the conversion to typed XML. And in the first case, we don't get one, even though we've "technically" converted the string to XML at that point.

This story was puzzling, and somewhat arcane, but it has a moral that stretches far beyond this simple example: Only by giving the query optimizer much more information than any rational person might think necessary did we get a plan that does the right thing. And that is quite often the case when working with SQL Server. CHECK constraints, foreign keys, UNIQUE constraints, the DISTINCT keyword, GROUP BY, APPLY, and various other constructs are more than just ways to define your requirements or the output you're looking for. They can be used to provide information to the query optimizer to help it determine the best way to process your data. Information that can make your query return in a second instead of an hour. Information that will make your users happy and your project a success.

The secret to writing high performance T-SQL? Step out of your human mind. Un-puzzle. Be the optimizer. And until next month, thank you for reading this entry in T-SQL Tuesday #002!

Published Tuesday, January 12, 2010 5:19 PM by Adam Machanic

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Brad Schulz said:

Hi Adam...

I wrote about this very thing about a month ago and again last week:

http://bradsruminations.blogspot.com/2009/12/delimited-string-tennis-anyone.html

http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html

In the first link above, I gave an explanation as to WHY the query is so slow.

I like your FOR XML PATH(''),TYPE approach... it's similar to the .query('.') approach I mentioned in the second link above.

--Brad

January 12, 2010 5:55 PM
 

Jonathan Kehayias said:

Very interesting Adam.  Thanks for sharing such a critical performance trick as a part of this weeks challenge.  I have some code I need to revisit now to see whether this applies.

January 12, 2010 6:27 PM
 

Rob Farley said:

Gotta love arming the QO with sufficient information, to be able to get proper plans.

Nice post!

January 12, 2010 10:43 PM
 

Adam Machanic said:

According to T-SQL Tuesday rules as ratified by me in the first and second T-SQL Tuesday posts, the T-SQL

February 8, 2010 2:16 PM
 

Adam Machanic said:

According to T-SQL Tuesday rules as ratified by me in the first and second T-SQL Tuesday posts, the T-SQL

February 8, 2010 2:16 PM
 

Dan said:

I actually have a question, rather, I need help figuring out why the code below runs 300% faster in the dev environment than in the QA env.

Some one suggested that replace the following construct with the one below, however, the runtime did not change.  

@request.value('(/searchRequest/mlsNumber)[1]'

--

@request.value('(/searchRequest/mlsNumber/text())[1]'

--

Please help.

Thanks,

Dan

DECLARE @request xml

DECLARE @latitudeMax float

DECLARE @latitudeMin float

DECLARE @longitudeMax float

DECLARE @longitudeMin float

DECLARE @agentUsername varchar(30)

DECLARE @firmCode varchar(30)

DECLARE @areas table

(

AreaID int

, Name varchar(200)

)

--

--

select @Request = '<searchRequest><header><requestType>ListingSearch</requestType></header><zipCode>91342</zipCode><propertyType><type>0</type></propertyType><listingStatus><status>5</status></listingStatus></searchRequest>'

--

--INSERT INTO ListingSearchIndex

-- (RequestID, SearchKey)

SELECT 470 , svo.MlsNum

FROM

mls_unified_svo_tbl svo (nolock)

WHERE

svo.LoadOnInternet = 1

AND svo.std IS NOT NULL

AND svo.snd IS NOT NULL

AND (svo.Status IN (5, 30, 45)

OR

(svo.Status IN (10, 20)

AND DATEDIFF(day, svo.StatusDate, GETDATE()) < 365

)

)

AND

CASE WHEN @request.exist('/searchRequest/mlsNumber') = 0

THEN 1

ELSE

CASE WHEN svo.MlsNum = @request.value('(/searchRequest/mlsNumber)[1]', 'varchar(20)') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/zipCode') = 0

THEN 1

ELSE

CASE WHEN svo.zip = @request.value('(/searchRequest/zipCode)[1]', 'varchar(500)') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/city') = 0

THEN 1

ELSE

CASE WHEN svo.city = @request.value('(/searchRequest/city)[1]', 'varchar(50)') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/address') = 0

THEN 1

ELSE

CASE WHEN SUBSTRING(svo.address, 1, LEN(@request.value('(/searchRequest/address)[1]', 'varchar(100)'))) = @request.value('(/searchRequest/address)[1]', 'varchar(100)') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/listingArea') = 0

THEN 1

ELSE

CASE WHEN svo.ar IN (SELECT AreaID FROM @areas) OR  svo.city IN (SELECT Name FROM @areas) THEN 1

ELSE 0

END

END = 1

AND

CASE

WHEN @latitudeMin IS NULL OR @latitudeMax IS NULL OR @longitudeMin IS NULL OR @longitudeMax IS NULL THEN 1

ELSE

CASE WHEN svo.Latitude BETWEEN @latitudeMin AND @latitudeMax

AND svo.Longitude BETWEEN @longitudeMin AND @longitudeMax THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/price/low') = 0

THEN 1

ELSE

CASE WHEN svo.lp >= @request.value('(/searchRequest/price/low)[1]', 'bigint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/price/high') = 0

THEN 1

ELSE

CASE WHEN svo.lp <= @request.value('(/searchRequest/price/high)[1]', 'bigint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/yearBuilt/low') = 0

THEN 1

ELSE

CASE WHEN svo.yb >= @request.value('(/searchRequest/yearBuilt/low)[1]', 'smallint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/yearBuilt/high') = 0

THEN 1

ELSE

CASE WHEN svo.yb <= @request.value('(/searchRequest/yearBuilt/high)[1]', 'smallint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/sqFootage/low') = 0

THEN 1

ELSE

CASE WHEN svo.sf >= @request.value('(/searchRequest/sqFootage/low)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/sqFootage/high') = 0

THEN 1

ELSE

CASE WHEN svo.sf <= @request.value('(/searchRequest/sqFootage/high)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/bed/low') = 0

THEN 1

ELSE

CASE WHEN svo.br >= @request.value('(/searchRequest/bed/low)[1]', 'smallint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/bed/high') = 0

THEN 1

ELSE

CASE WHEN svo.br <= @request.value('(/searchRequest/bed/high)[1]', 'smallint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/bath/low') = 0

THEN 1

ELSE

CASE WHEN svo.ba >= @request.value('(/searchRequest/bath/low)[1]', 'decimal') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/bath/high') = 0

THEN 1

ELSE

CASE WHEN svo.ba <= @request.value('(/searchRequest/bath/high)[1]', 'decimal') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/parking/low') = 0

THEN 1

ELSE

CASE WHEN svo.pkg_num >= @request.value('(/searchRequest/parking/low)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/parking/high') = 0

THEN 1

ELSE

CASE WHEN svo.pkg_num <= @request.value('(/searchRequest/parking/high)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/lotSize/low') = 0

THEN 1

ELSE

CASE WHEN svo.lsz >= @request.value('(/searchRequest/lotSize/low)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/lotSize/high') = 0

THEN 1

ELSE

CASE WHEN svo.lsz <= @request.value('(/searchRequest/lotSize/high)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/soldPrice/low') = 0

THEN 1

ELSE

CASE WHEN svo.sp >= @request.value('(/searchRequest/soldPrice/low)[1]', 'bigint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/soldPrice/high') = 0

THEN 1

ELSE

CASE WHEN svo.sp <= @request.value('(/searchRequest/soldPrice/high)[1]', 'bigint') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/daysInStatus/low') = 0

THEN 1

ELSE

CASE WHEN DATEDIFF(d, svo.statusdate, GETDATE()) >= @request.value('(/searchRequest/daysInStatus/low)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/daysInStatus/high') = 0

THEN 1

ELSE

CASE WHEN DATEDIFF(d, svo.statusdate, GETDATE()) <= @request.value('(/searchRequest/daysInStatus/high)[1]', 'int') THEN 1

ELSE 0

END

END = 1

AND

CASE WHEN @request.exist('/searchRequest/propertyType') = 0

THEN 1

ELSE

CASE WHEN svo.property_type IN (SELECT Request.PropertyType.value('.', 'int')

FROM @request.nodes('/searchRequest/propertyType/type') Request(PropertyType)) THEN 1

ELSE 0

END

END = 1

AND

CASE

WHEN @firmCode IS NULL THEN 1

ELSE

CASE WHEN @firmCode IN (svo.ListBrokerCode1, svo.ListBrokerCode2, svo.ListBrokerCode3) THEN 1

ELSE 0

END

END = 1

AND

CASE

WHEN @agentUsername IS NULL THEN 1

ELSE

CASE WHEN @agentUserName IN (svo.ListAgentID1, svo.ListAgentID2, svo.ListAgentID3) THEN 1

ELSE 0

END

END = 1

AND ((@request.exist('/searchRequest/listingStatus') = 1

AND svo.status IN (SELECT Request.Status.value('.', 'int') FROM @request.nodes('/searchRequest/listingStatus/status') Request(Status))

)

OR

(@request.exist('/searchRequest/listingStatus') = 0

AND svo.status NOT IN (40,25,15,35)

)

)

AND (@request.exist('/searchRequest/openhouse') = 0

OR EXISTS (SELECT ohl.MlsNum

FROM open_house_list ohl (nolock)

WHERE

ohl.MlsNum = svo.MlsNum

AND

ohl.public_yn = 'yes'

AND

ohl.openhouse_type IN ('new', 'review')

AND (@request.exist('/searchRequest/openhouse/low') = 0

OR ohl.openhouse_date >= @request.value('(/searchRequest/openhouse/low)[1]', 'datetime')

)

AND (@request.exist('/searchRequest/openhouse/high') = 0

OR ohl.openhouse_date <= @request.value('(/searchRequest/openhouse/high)[1]', 'datetime')

)

)

)

AND (@request.exist('/searchRequest/saleType') = 0

OR EXISTS (SELECT lst.MlsNum

FROM ListingSaleType lst (nolock)

WHERE

lst.MlsNum = svo.MlsNum

AND

(@request.exist('/searchRequest/saleType/auction') = 0

OR lst.IsAuction = 1

)

AND

(@request.exist('/searchRequest/saleType/foreclosure') = 0

OR lst.IsForeclosure = 1

)

AND

(@request.exist('/searchRequest/saleType/default') = 0

OR lst.IsDefault = 1

)

AND

(@request.exist('/searchRequest/saleType/reo') = 0

OR lst.IsREO = 1

)

AND

(@request.exist('/searchRequest/saleType/short') = 0

OR lst.IsShortPay = 1

)

AND

(@request.exist('/searchRequest/saleType/standard') = 0

OR lst.IsStandard = 1

)

AND

(@request.exist('/searchRequest/saleType/probate') = 0

OR lst.IsProbate = 1

)

)

)

October 6, 2011 7:53 PM
 

Adam Machanic said:

Hi Dan,

Sorry, but I can't debug something like that remotely. If you'd like e-mail me using the "Email" link on the upper righthand corner of the page and we'll set up a consulting arrangement so that I can get into your server and check things out.

--Adam

October 6, 2011 8:37 PM
 

dan said:

Hi, Adam

Sorry, I just saw your reply. Thank you for responding.

It turned out that the problem was environment specific, i.e., the two DBs, DEV Vs QA were different in size, etc.

Thank you much!

Dan

October 10, 2011 12:26 PM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement