THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Ben Nevarez

  • Fooling the Query Optimizer

    Did you ever wanted to know which execution plans the Query Optimizer would generate for your queries should your tables have millions of records? You can actually generate those plans by using the undocumented ROWCOUNT and PAGECOUNT options of the UPDATE STATISTICS statement. These options can be used on small or empty tables and can be helpful for testing in some scenarios where you may not want to spent time or disk space creating big tables.

     

    By using this method you are tricking the Query Optimizer as it will generate execution plans using cardinality estimations as if the table really had millions of records. Note that this option, available since SQL Server 2005, only helps in creating the execution plan for your queries. Actually running the query will use the real data and of course will execute faster than a table with millions of records.

     

    UPDATE STATISTICS WITH ROWCOUNT, PAGECOUNT does not change the table statistics, only the counter of number of rows and pages of a table. But the Query Optimizer uses this information to estimate the cardinality of queries as I will show later. Also keep in mind that these are undocumented and unsupported options and should not be used in any production environment.

     

    Let us see an example. Run the following query to create a new table on the AdventureWorks database

     

    select * into dbo.Address

    from Person.Address

    Inspect the number of rows by running the following queries. It must show 19,614 rows.

    select * from sys.partitions

    where object_id = object_id('dbo.Address')

     

    select * from sys.dm_db_partition_stats

    where object_id = object_id('dbo.Address')

     

    Run the following query

     

    select * from dbo.Address

    where city = 'London'

     

    Running this query will create new statistics for the city column and will show the following plan. Note that the estimated number of rows is 434 and it is using a simple Table Scan operator

     

     clip_image002

     

    We can see where the Query Optimizer is getting the estimated number of rows by inspecting the statistics object. Run this query to see the name of the statistics object

     

    select * from sys.stats

    where object_id = object_id('dbo.Address')

     

    Then use the displayed statistics object name in the following statement (the name may be different in your case)

     

    dbcc show_statistics('dbo.Address', _WA_Sys_00000004_46136164)

     

    By looking at the histogram you can find the value 434 on EQ_ROWS for the RANGE_HI_KEY value ‘London’ (Statistics and histograms are explained on previous posts in this blog)

     

    clip_image004

     

    Now run the UPDATE STATISTICS WITH ROWCOUNT, PAGECOUNT (you can specify any other value for rowcount and pagecount)

     

    update statistics dbo.Address with rowcount = 1000000, pagecount = 100000

     

    If you inspect the number of rows again from sys.partitions or sys.dm_db_partition_stats, as shown previously, it will now show 1,000,000 rows. sys.dm_db_partition_stats also shows the new number of pages. Clear the plan cache and run the query again

     

    dbcc freeproccache

    go

    select * from dbo.Address

    where city = 'London'

     

    Note that the estimated number of rows has changed from 434 to 22,127.1 and that a different plan was generated using this new cardinality estimation. The Query Optimizer decided to parallelize this plan. But this is a very simple query, more dramatic plan changes can happen with more complex queries.

     

     

    clip_image006

     

    After execution the actual number of rows obviously is still is 434 but the Query Optimizer is not able to see this value.

     

    If you look at the statistics object again, using DBCC SHOW_STATISTICS as shown before, the histogram has not changed. One way to obtain the estimated number of rows shown in the new execution plan is calculating the percentage or fraction of rows for the value ‘London’ from the statistics sample, which in this case is 19,614, as shown on the header of the statistics object. So the fraction is 434 / 19,614 or 022127052. Then obtain the same percentage from the new “current” number of rows which is 1,000,000 calculated as 1,000,000 * 0.022127052 and we get 22,127.1 which is the estimated number of rows displayed in the plan.

     

    Finally, drop the table you just created

     

    drop table dbo.Address

     


  • Insecure Metadata

    I was looking at some metadata access today and this reminded me of when I talked about this topic two years ago at the PASS Summit in Denver.

     

    Back in SQL Server 2000 most of the existing metadata was available to members of the public role, that is, to any login that can connect to the SQL Server instance. SQL Server 2005 promises that a user can only see the metadata of an object the user owns or has access to. This was a huge security improvement. However, some items were still left available to any user, which in some cases could be a security concern.

     

    Let us test some of it. Create a new login called insecure without granting any permission to it. Connect as that user and see what kind of information from the server you are able to obtain. Try this first

     

    select * from sys.databases

     

    This new user is able to see the list of all the databases on the server, including the ones he does not have access to. Try

     

    select * from sys.configurations

     

    This time you can see some configuration information of your instance including some specific security information like if the ‘c2 audit mode’ is enabled.

     

    Some other data available to any SQL Server user is information about the location of the files of the current database, linked servers or backup devices. Try these

     

    select * from sys.database_files

    select * from sys.servers

    select * from sys.backup_devices

     

    Here I am using catalog views only but the same applies if you use the SQL Server 2000 compatibility views (like sysdatabases). You can also execute some stored procedures like these to get similar information

     

    sp_helpdb 'master'

    exec sp_configure

    exec sp_helpfile

    exec sp_helpserver

    exec sp_helpdevice

     

    Now let us look at the msdb database. By looking at the following system tables (yes, these are called system tables) you can obtain all the information regarding backup and restore operations performed on your instance. You can obtain information like database names, location of databases files, locations of database backup files and names of users who performed these backups.

     

    use msdb

    go

    select * from backupfile

    select * from backupmediafamily

    select * from backupset

    select * from restorefile

    select * from restorehistory

     

    You can also run some extended stored procedures or functions like the following

     

    exec master..xp_msver

    select serverproperty('ProductVersion')

    select serverproperty('IsIntegratedSecurityOnly')

     

    For example, the first two display the version (build) of SQL Server, which can show if the database administrator has installed the latest service pack or security updates. This can reveal some known security vulnerability of your instance.

     

    Perhaps you can see now that this information, like security configuration, database names, database file names, location of database backups, user names, etc. could be used by an attacker to get additional permissions or data.

     

    Certainly these are the defaults and many of them can be changed to be more secure, but many times this is not practical as some tools and applications depend on these defaults. For example, you can revoke the VIEW ANY DATABASE permission (executed as member of sysadmin) to secure the information displayed on sys.databases.

     

    revoke view any database to public

     

    But now only database owners can see their databases listed on sys.databases. Of course, this is a problem for a valid user who is not a database owner, who will only see the master and tempdb databases on this list. To test, grant db_datareader permissions to the insecure user to any user database. Test this again connected as the insecure login

     

    select * from sys.databases

     

    Only the master and tempdb databases are listed this time. One known issue of this configuration is that you can not use ODBC Data Source Administrator to create a new data source to connect to a database you have access to (unless you are the database owner)

     

    clip_image002

     

     

    Even Management Studio does not list all the databases you have access to (but you can use the USE statement)

     

     

    clip_image004

     

     

    But what is still ironic is that, even when you can not use ODBC Data Source Administrator or list your databases in Management Studio, you can still use db_name() to list all the database names anyway like in

     

    select db_name(5)

     

    Finally, note that a command introduced with SQL Server 2005, VIEW DEFINITION, could also help you to have more control on the access to SQL Server metadata.


  • Database Engine Tuning Advisor and the Query Optimizer

    Did you know that the Database Engine Tuning Advisor (DTA) uses the Query Optimizer to help you to create indexes, indexed views, and partitions for your databases? The DTA uses the Query Optimizer to estimate the cost of queries so it can select the choices with the lowest estimated cost. But, how can the Query Optimizer estimate the cost of a query using, for example, an index that does not exist yet?

     

    Creating indexes on a DTA session could be very expensive and can create some other performance problems in your database. In addition to that, when the Query Optimizer uses indexes to estimate the cost of a query, it uses only the index statistics; it does not need to access the index data.

     

    So, to avoid creating real indexes during a DTA session, SQL Server has a special kind of indexes called hypothetical indexes. Hypothetical indexes are not real indexes, they only contain statistics and can be created with the undocumented command CREATE INDEX WITH STATISTICS_ONLY. This command only creates the statistics for the index.

     

    You may not be able to see these indexes during a DTA session because they are dropped automatically. But you can see the CREATE INDEX WITH STATISTICS_ONLY and DROP INDEX commands if you run Profiler to see what the DTA is doing. You can also create these indexes manually as I will show you later.

     

    Hypothetical index have been available in previous versions of SQL Server where they were used by the DTA predecessor, the Index Tuning Wizard.

     

    Let us take a quick tour to some of these concepts here. Create a new table on the AdventureWorks database

     

    select *

    into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    Copy the following query and save it to a file

     

    select * from dbo.SalesOrderDetail

    where ProductID = 897

     

    Open a new DTA session. You can optionally run a Profiler session if you want to inspect what the DTA is doing. On workload file select the file containing the SQL statement that you just created. Specify AdventureWorks both for the database to tune and for the database for workload analysis. Click the Start Analysis button.

     

    When the DTA analysis finishes run this query to inspect the contents of the msdb..DTA_reports_query table

     

    select * from msdb..DTA_reports_query

     

    clip_image002

     

    Notice that the table contains some information like the query that was tuned and the current and recommended cost. The current cost, 1.2434, is easy to obtain by directly requesting an estimated execution plan for the query

     

     

    clip_image004

     

     

    Since the DTA analysis was completed, the needed hypothetical indexes were already dropped. In the next statement I will create the index recommended by the DTA, but instead of a regular index I will create it as a hypothetical index by adding WITH STATISTICS_ONLY.

     

    create clustered index cix_ProductID on dbo.SalesOrderDetail (ProductID)

    with statistics_only

     

    You can validate that a hypothetical index and statistics were created by running this (notice that the index is defined as hypothetical on the index_description field)

     

    sp_helpindex 'dbo.SalesOrderDetail'

     

    select * from sys.stats

    where object_id = object_id('dbo.SalesOrderDetail')

     

    However, at the moment I am not aware of a way outside the DTA to ask the Query Optimizer to consider these hypothetical indexes on an estimated execution plan. So I am not able to see where the previous recommended cost is coming from.

     

    Remove the hypothetical index by running this

     

    drop index dbo.SalesOrderDetail.cix_ProductID

     

    Implement the DTA recommendation this time as a regular clustered index

     

    create clustered index cix_ProductID on dbo.SalesOrderDetail (ProductID)

     

    After implementing the recommendation and running the query, the clustered index is in fact being used by the Query Optimizer and this time the estimated cost I got was 0.0033652, very close to the recommended cost listed before on the msdb..DTA_reports_query.

     

    Finally, drop the dbo.SalesOrderDetail table you just created.


  • Are You Using Scalable Shared Databases?

    Did you know that you can share read-only databases between several instances of SQL Server?

    Scalable Shared Databases is a very interesting SQL Server feature that many of us seem to almost have forgotten about it. Available for the first time in SQL Server 2005 and originally described on the Microsoft KB article 910378, it was later fully documented on Books Online. This Enterprise edition-only feature allows a read-only database to be accessed at the same time by two or more SQL Server instances (maximum recommended is 8). This configuration offers some performance benefits by allowing each of these instances to use its own resources like memory, CPU, and tempdb database. Scalable Shared Databases are used as reporting databases.

    The concept behind Scalable Shared Databases is very simple: you copy a detached database to a volume, configure that volume as read-only, and then you can attach this database by several SQL Server instances. A shared database will look like a regular read-only database in Management Studio.

    clip_image002

    By the way, it would be interesting to know if Microsoft plans to add additional functionality to this scalability feature or to extend it to read-write databases, perhaps something similar to what Oracle RAC is already doing.

    For more details, especially on restrictions and configuration, see the ‘Deploying a Scalable Shared Database’ entry on Books Online.


  • Presenting at PASS: How the Query Optimizer Works

    I am excited that this is going to be my second time speaking at the PASS Summit. I have been attending PASS every year since 2003: a few times in Seattle, but also in Orlando (just after hurricane Jeanne), Dallas (just after hurricane Rita), and Denver.

    My session, How the Query Optimizer Works, is scheduled for Wednesday, November 4, 1:30pm - 2:45pm, in room 201.

    http://summit2009.sqlpass.org/Agenda/ProgramSessions/HowtheQueryOptimizerWorks.aspx

    So, if you are attending PASS, I hope you can stop by my session or say hello at any of the PASS events.

    This is the description of my session:

    “The Query Optimizer is the component of SQL Server that attempts to determine the best way to execute a query by finding an efficient execution plan. This session will show you how a better understanding on how the Query Optimizer works and what information it needs to generate better execution plans, can help you to improve the performance of your databases. Learn about the high level structures of the Query Optimizer and some important factors that are considered during the optimization phase of query processing. See how you can provide SQL Server with appropriate statistics and indexes so it can perform better cardinality estimation and produce an efficient execution plan. Since the SQL Server Query Optimizer is a cost-based optimizer, this information will help it to better estimate the execution plan cost. Finally, although the Query Optimizer almost always selects a good enough execution plan for a query, it may not create an efficient plan for all the possible scenarios. See how you can troubleshoot these issues and how you can use other alternatives like hints or plan guides to force the query optimizer to produce a better execution plan.”

    Hope to see you there!

    Ben

    clip_image002


  • The Complexity of the Query Optimization Process

    Something that surprised some of the database professionals attending my presentation regarding the Query Optimizer, both at the Los Angeles and Orange County user groups, was to learn about the high level of complexity of the query optimization process. So, on this post I am going to talk a little bit about it.

     

    Query optimization is an inherently complex problem and of course this applies not only to SQL Server but also to any other relational database system. Reading some academic and research papers regarding the topic quickly reminds me of the kind of problems I was learning about at the Algorithms class for my Master’s degree in Computer Science, problems like the optimization of finding the least cost route through all nodes of a weighted graph, also known as the traveling salesman problem. These are the kind of problems for whom the time required to solve them increases very quickly as the size of the problem grows (By the way, some of these papers also mention that the SQL Server Query Optimizer is based on the Cascades Framework as defined by Goetz Graefe on ‘The Cascades Framework for Query Optimization’).

     

    Theoretically, in order to find the optimum execution plan for a query the entire search space must be used. But some complex queries may have thousands of possible execution plans. The reality is that even when the Query Optimizer typically can consider a large number of candidate execution plans, it can not perform an exhaustive search of all the possible plans for every query. The query optimization time for these queries would be unacceptable long, so the Query Optimizer must balance between optimization time and plan quality. Sophisticated algorithms and heuristics are used to find an efficient plan quickly, which may be the optimal plan or close to it.

     

    The SQL Server Query Optimization is a cost-based optimizer: it estimates the cost of each candidate plan and selects the one with the lowest cost. Accurate cost estimation is another challenging problem for query optimizers. Cost estimation uses a mathematical model that relies on cardinality estimation. To estimate this cardinality SQL Server creates and maintains statistics.

     

    Regarding the execution plan quality, Cesar Galindo-Legaria, Manager of the SQL Server Query Optimizer Team, describes in ‘SQL Server 2005 Practical Troubleshooting’ that in summary, “you can expect the best results in terms of plan quality on smaller queries; good reliability on larger join-only queries; and less reliability and more dependence on the original query syntax on large queries that also involve non-join operators”.


  • Rebuilding Indexes vs. Updating Statistics

    One of the questions I was asked recently while speaking at user groups, was regarding the order that jobs like rebuilding indexes or updating statistics should be performed as part of the database maintenance activities. Then I started writing this post about this topic on the weekend but was interrupted several times, including one of them to watch the premiere on VH1 of the movie Anvil: The Story of Anvil.

     

    In general, the order should not matter, at least if you carefully consider these important points:

     

    1) By default, the UPDATE STATISTICS statement uses only a sample of records of the table. Using UPDATE STATISTICS WITH FULLSCAN will scan the entire table.

     

    2) By default, the UPDATE STATISTICS statement updates both index and column statistics. Using the COLUMNS option will update column statistics only. Using the INDEX option will update index statistics only.

     

    3) Rebuilding an index, for example by using ALTER INDEX … REBUILD, will also update index statistics with the equivalent of using WITH FULLSCAN. Rebuilding indexes does not update column statistics.

     

    4) Reorganizing an index, for example using ALTER INDEX … REORGANIZE, does not update any statistics.

     

    So depending on your maintenance jobs and scripts several scenarios can exist.

     

    The simplest scenario is if you want to rebuild all the indexes and update all the statistics. As mentioned before, if you rebuild all your indexes then all the index statistics will also be updated by scanning all the rows on the table. Then you just need to update your column statistics by running UPDATE STATISTICS WITH FULLSCAN, COLUMNS. Since the first job only updates index statistics and the second one only updates column statistics, it does not matter which one you execute first.

     

    Some other more complicated scenarios include when you have a job which rebuilds your indexes depending on their fragmentation level. In these cases perhaps you want to update only those index statistics that were not touched by the index rebuild job, plus all the column statistics.

     

    Of course, the worst case scenario would be if you first rebuild your indexes, which also updates the index statistics by scanning the entire table, and later you run UPDATE STATISTICS using the default values, which again updates the index statistics but this time with a default sample. Not only are you updating your index statistics twice but you are overwriting the better of the two choices.

     

    Let me show you how these commands work with some examples using the AdventureWorks database. Create a new table dbo.SalesOrderDetail

     

    select * into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    The next query uses the sys.stats catalog view and shows that there are no statistics objects for the new table.

     

    select name, auto_created, stats_date(object_id, stats_id) as update_date from sys.stats

    where object_id = object_id('dbo.SalesOrderDetail')

     

    Use this query again to inspect the status of the statistics after each of the following commands. Now run the following query

     

    select * from dbo.SalesOrderDetail

    where SalesOrderID = 43670 and OrderQty = 1

     

    Use the previous sys.stats query to verify that two statistics objects were created, one for the SalesOrderID column and another one for the OrderQty column (they both have names starting with _WA_Sys as shown in the next figure). Now create the following index and again run the query to verify that a new statistics object for the ProductID column has been created. Notice the value of the auto_created column which tells if the statistics were created by the query optimizer.

     

    create index ix_ProductID on dbo.SalesOrderDetail(ProductID)

     

    clip_image002

     

    Run the next command to update the column statistics only. You can validate that only the column statistics were updated by looking at the update_date column which uses the STATS_DATE function to display the last date the statistics were updated.

     

    update statistics dbo.SalesOrderDetail with fullscan, columns

     

    clip_image004

     

     

    This command will do the same for the index statistics

     

    update statistics dbo.SalesOrderDetail with fullscan, index

     

    These commands will update both index and column statistics

     

    update statistics dbo.SalesOrderDetail with fullscan

    update statistics dbo.SalesOrderDetail with fullscan, all

     

    See how an index rebuild only updates index statistics

     

    alter index ix_ProductID on dbo.SalesOrderDetail rebuild

     

    Here you can verify that reorganizing an index does not update statistics

     

    alter index ix_ProductID  on dbo.SalesOrderDetail reorganize

     

    Finally, remove the table you have just created

     

    drop table dbo.SalesOrderDetail


  • Speaking at the Orange County SQL Server Professionals User Group

     

    I will be speaking at the Orange County SQL Server Professionals User Group this Thursday, October 1st, 2009. The topic is “How the Query Optimizer Works”. So if you are in the Orange County or Los Angeles area please stop by and say hello.

    The meeting starts at 6:30 PM. More details and directions can be found here

    Orange County SQL Server Professionals User Group

    http://www.sqloc.com


  • The Missing Indexes Feature

    Since I will be speaking about the Query Optimizer at the coming PASS Summit, I have been preparing my presentation and at the same time blogging about it. This time I will describe the Missing Indexes feature, seen from the point of view of the Query Optimizer.

     

    We know that it is the job of the Query Optimizer to find an efficient execution plan for a query. But we rarely see the Query Optimizer directly giving us indications about what it needs to produce a better execution plan. One of these cases is the Missing Indexes feature, which was introduced with SQL Server 2005.

     

    The Query Optimizer defines what the best indexes for a query are, and if these indexes do not exist, it will make this information available in the XML plan and the sys.dm_db_missing_index DMVs. And of course, by showing this information the Query Optimizer is also warning you that it might not be selecting an efficient plan. This information shows which indexes may be helpful to improve the performance of your query. You can even use SQL Server 2008 Management Studio to display the CREATE INDEX commands needed to create these indexes, as shown later.

     

    However, although this information about missing indexes is very helpful, this feature should not be used as a tuning tool and should not replace your own index analysis. Database administrators and developers should be aware of its limitations, as described on the Books Online entry ‘Limitations of the Missing Indexes Feature’.

     

    So let us take a quick look to see how this feature works. Create a dbo.SalesOrderDetail table on the AdventureWorks database with the following command

     

    select *

    into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    Run this query and ask for a graphical or XML execution plan

     

    select *

    from dbo.SalesOrderDetail

    where SalesOrderID = 43670

     

    This query can benefit from an index on the SalesOrderID column but no missing indexes information is shown this time. One limitation of the Missing Indexes feature is that it does not work on a trivial plan optimization, like in this case. You can verify that this is a trivial plan by looking at the graphical plan properties (Optimization Level shows as TRIVIAL) or by looking at the XML plan (StatementOptmLevel="TRIVIAL).

     

    You can avoid the trivial plan optimization by using more complex features. In our case we are just going to create a non related index

     

     

    create index ix_ProductID on dbo.SalesOrderDetail(ProductID)

     

    Note that the index created will not be used by our previous query but the query will no longer qualify for a trivial plan. Run the query again. This time the XML plan will contain something like this

     

    <MissingIndexes>

       <MissingIndexGroup Impact="99.703">

          <MissingIndex Database="[AdventureWorks]" Schema="[dbo]" Table="[SalesOrderDetail]">

             <ColumnGroup Usage="EQUALITY">

                <Column Name="[SalesOrderID]" ColumnId="1" />

             </ColumnGroup>

          </MissingIndex>

       </MissingIndexGroup>

    </MissingIndexes>

     

    And if you look at the graphical plan (only SQL Server 2008 Management Studio) you will see a Missing Index warning and a CREATE INDEX command

     

    clip_image002

     

    You can right-click on the graphical plan and select Missing Index Details to see the CREATE INDEX command that can be used to create this index

     

    /*

    Missing Index Details from SQLQuery1.sql

    The Query Processor estimates that implementing the following index could improve the query cost by 99.703%.

    */

     

    /*

    USE [AdventureWorks]

    GO

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]

    ON [dbo].[SalesOrderDetail] ([SalesOrderID])

     

    GO

    */

     

    Create the recommended index after you provide a name to it. This time if you run the same query again and look at the execution plan you will see that an Index Seek operator is using the index you have just created and both the Missing Index warning and the MissingIndex element of the XML plan are gone.

     

    Finally, remove the dbo.SalesOrderDetail table you have just created.

     

    drop table dbo.SalesOrderDetail


  • Speaking at Los Angeles SQL Server Professionals Group

    I will be speaking at Los Angeles SQL Server Professionals Group this Thursday, September 17, 2009. The topic is “How the Query Optimizer Works”. So if you are in the area please stop by and say hello.

     

    The meeting starts at 6:30 PM. More details and directions can be found here

     

    Los Angeles SQL Server Professionals Group

    http://sql.la


  • How the Query Optimizer Uses Statistics – Part II

    I got a question from a reader of my post How the Query Optimizer Uses Statistics (http://sqlblog.com/blogs/ben_nevarez/archive/2009/09/03/how-the-query-optimizer-uses-statistics.aspx) and I thought that it would be a good idea to post my answer here. Basically the request was to expand the previous example for a predicate with two columns.

     

    First, a reminder that the histogram only shows the values of the first column of the statistics object. So, how the Query Optimizer does this?

     

    One way to find the answer to this is, well, just to run an example and see what the Query Optimizer does. Let us run a query, see which statistics are automatically created and inspect those statistics.

     

    This is the code sent by the reader. Run this to create a table and populate it with some data.

     

    create table MyStatsTest (

          id int identity(1, 1),

          ProductGroupID int,

          ProductID int

    )

     

    declare @i int

    set nocount on

    set @i = 1

    while @i < 100000

    begin

    insert MyStatsTest (ProductGroupID, ProductID)

    select datepart(millisecond, getdate()) % 2, datepart(millisecond, getdate())

          set @i = @i + 1

    end

     

    Make sure you start with no statistics objects in the table. You can run this to verify that there are no statistics

     

    select * from sys.stats

    where object_id = object_id('MyStatsTest')

     

    If there are some statistics, perhaps after running a query, you can drop them using a command like this (the name of your statistics objects may be different)

     

    drop statistics MyStatsTest._WA_Sys_00000002_7D78A4E7

     

    Run the first query to see the statistics created automatically by the Query Optimizer.

     

    select * from MyStatsTest

    where ProductId < 17 and ProductGroupId = 1

     

    Now you can run the previous query again and notice that two statistics objects were created. Use DBCC SHOW_STATISTICS to display both histograms like in the next example

     

    dbcc show_statistics('MyStatsTest', _WA_Sys_00000003_014935CB)

     

    You can use the first histogram to estimate the number of records for the ProductId < 17 predicate using the method I described in my previous post. I got 1,744 rows which is the sum of the values 392, 302, 313, 419 and 318 (Your table has different data so you will get a different value).

     

    clip_image002

     

    Using the second histogram you can notice that 60.0316003% of the records have ProductGroupid = 1 (This is 60,031 divided by the total number of records 99,999).

     

     

    clip_image004

     

     

    Since this is using the AND operator you need to obtain the 60.0316003% of 1,744. This is 1,046.95 which is the estimated number of rows shown on the execution plan of the query.

     

     

    clip_image006

    You can do the same calculation for ProductGroupId = 0 and you will get an estimated number of rows of 697.049 which are also shown on the execution plan. Notice that you may need to run DBCC FREEPROCCACHE between tests to clear the plan cache, otherwise it may show a cached execution plan.

     

    Finally, let us change the query to use OR operator (instead of AND)

     

    select * from MyStatsTest

    where ProductId < 17 or ProductGroupId = 1

     

    The plan shows 60,728 estimated rows. Since this is using an OR operator this value could be obtained as the union of the following two queries

     

    select * from MyStatsTest

    where ProductId < 17   

     

    and

     

    select * from MyStatsTest

    where ProductId >= 17 and ProductGroupId = 1

     

    The first query estimates 1,744 rows and the second one 58,984, for a total of 60,720 rows. You can use the previous method to estimate these values.

     

    The same value could also be obtained as the union of these two queries

     

    select * from MyStatsTest

    where ProductId < 17 and ProductGroupId = 0

     

    and

     

    select * from MyStatsTest

    where ProductGroupId = 1

     

    The first one shows 697.049 estimated rows and the second one 60,031, for a total of 60,728.049 estimated rows.


  • Statistics Used by the Query Optimizer in SQL Server White Paper

    In case you are not aware, the excellent white paper “Statistics Used by the Query Optimizer in SQL Server” has been updated for SQL Server 2008.

     

    You can find it here

     

    Statistics Used by the Query Optimizer in SQL Server 2008

    Writer: Eric N. Hanson and Yavor Angelov

    Contributor: Lubor Kollar

    http://msdn.microsoft.com/en-us/library/dd535534.aspx


  • How the Query Optimizer Uses Statistics

     

    This post shows how the Query Optimizer uses statistics to estimate the selectivity of expressions during query optimization.

     

    You can also use this as a second part of my last post, The Query Optimizer and Parameter Sniffing. Here I will show you the advantage of the use of statistics when the Query Optimizer can “sniff” the parameter values compared to just guessing the selectivity of expressions when local variables are used. So I will be using the same query as in that previous post.

     

    Notice that there are many other details that can not be covered here so this post will focus on a very simple example to show how the Query Optimizer creates and uses statistics.

     

    To start open your AdventureWorks database and run this to display the current statistics on the ProductID column of the Sales.SalesOrderDetail table

     

    dbcc show_statistics('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)

    This will display the header, density vector and histogram of the statistics object.

     

    1) Understanding the Histogram

     

    First I will explain the meaning of the values of a histogram’s steps. Let us take a look at step 86, shown here

     

    clip_image002

     

    RANGE_HI_KEY is the upper boundary of a histogram step. The value 826 is the upper boundary for step 85 and 831 is the upper boundary for step 86. This means that step 86 may contain only values from 827 to 831.

     

    Run the following query to obtain the real number of records for ProductIDs 827 to 831 to compare them against the histogram

     

    select ProductID, COUNT(*) as Total

    from Sales.SalesOrderDetail

    where ProductID between 827 and 831

    group by ProductID

    This produces the following result

    clip_image004

    Going back to the histogram, EQ_ROWS is the estimated number of rows whose column value equals RANGE_HI_KEY. In our example RANGE_HI_KEY is 831 and the number of records with ProductID 831 is 198. The same value is shown on the histogram.

     

    RANGE_ROWS is the estimated number of rows whose column value falls inside the range of the step, excluding the upper boundary. In our example, this is the number of records with values from 827 to 830 (831, the upper boundary, is excluded). The histogram shows 110 records and we could obtain the same value by getting the sum of 31 records for 827, 46 records for 828, 0 records for 829 and 33 records for 830.

     

    DISTINCT_RANGE_ROWS is the estimated number of rows with a distinct column value inside this range, excluding the upper bound. In our example we have records for three distinct values: 827, 828, and 830, so DISTINCT_RANGE_ROWS is 3. There are no records for ProductID 829 and 831 is again excluded.

     

    Finally, AVG_RANGE_ROWS is the average number of rows per distinct value and it is calculated as RANGE_ROWS / DISTINCT_RANGE_ROWS. In our example, we have a total of 110 records for 3 DISTINCT_RANGE_ROWS, so that gives, 110 / 3 = 36.6667 also shown on the histogram for step 86.

     

    Now let us see how the statistics are used to estimate the selectivity of the queries.

     

    2) When the Query Optimizer knows the value

     

    Let us see the first query

     

    select * from Sales.SalesOrderDetail

    where ProductID = 831

     

    clip_image006

     

    Since 831 is a RANGE_HI_KEY on step 86, the Query Optimizer will use the EQ_ROWS value and the estimated number of rows will be 198.

     

    Now run the same query with the value 828

    clip_image008

    This time the value is inside the range of step 86 but it is not a RANGE_HI_KEY so the Query Optimizer uses the value calculated before as AVG_RANGE_ROWS. Actually, we get the same estimated number of rows for any of the other values in the range (except RANGE_HI_KEY). This also includes 829, even when there are no records for this ProductID.

     

    Let us try now a nonequality operator and try to find the number of records less than 714. For these we need to calculate the sum of the values of both RANGE_ROWS and EQ_ROWS for steps 1 thru 7, which give us a total of 13,223 rows.

     

    clip_image010

     

    This is the query and the estimated number of rows is shown on the execution plan

     

    select * from Sales.SalesOrderDetail

    where ProductID < 714

     

    clip_image012

    3) When the Query Optimizer does not know the value

     

    In the case when the Query Optimizer does not know the value used in the expression, like when local variables are used, the Query Optimizer can not use the histogram so it will use some other information including the statistics density information or it will try to guess the selectivity using some fixed percentages. First, let us try using the equality operator.

     

    declare @pid int = 897

    select * from Sales.SalesOrderDetail

    where ProductID = @pid

     

    The Query Optimizer is not able to see the value 897 in this query. As explained in my previous post, the Query Optimizer does not know the value of the @pid local variable at optimization time. So it will use the density value of the ProductID column, 0.003759399, as listed on the second section, density vector, of the DBCC SHOW_STATISTICS output. If we multiply this value by the total number of records, 121,317, we will get 456.079 which will be shown in the next execution plan.

    clip_image014

    Finally, run this query with a nonequality operator

     

    declare @pid int = 897

    select * from Sales.SalesOrderDetail

    where ProductID < @pid

     

    Same as before, the value 897 does not matter; any other value will give you the same estimated number of rows and execution plan. The estimated number of rows is always 30% of the total number of records for a nonequality operator. In this case the 30% of 121,317 is 36,395.1 as shown next.

    clip_image016


  • The Query Optimizer and Parameter Sniffing

     

    As we all know, the SQL Server Query Optimizer uses statistics to estimate the cardinality and selectivity of predicates of a query to produce an optimal execution plan. The Query Optimizer accomplishes this by first inspecting the values of the query parameters. This behavior is called parameter sniffing and it is a good thing. Getting an execution plan tailored to the current parameters of a query improves the performance of your applications.

     

    We also know that the plan cache can store these execution plans so they can be reused the next time the same query needs to be executed again. This saves optimization time and CPU resources as the query does not need to be compiled again.

     

    However, although the Query Optimizer and the plan cache work fine together most of the times, occasionally some performance problems can appear. Since the Query Optimizer may produce different execution plans for the same query, depending on its parameters, caching and reusing only one of these plans may be a performance issue for some other instance of this query that could benefit from a better plan.

     

    This is a known problem with queries using explicit parameterization like in, for example, stored procedures. So let us see an example of what the problem is and a few recommendations on how to fix it.

     

    1) The problem

     

    Let us write a simple stored procedure using the Sales.SalesOrderDetail table on the AdventureWorks database

     

    create procedure test (@pid int)

    as

    select * from Sales.SalesOrderDetail

    where ProductID = @pid

    Run this to display the amount of disk activity generated by the query

     

    set statistics io on

    and execute the stored procedure

     

    exec test @pid = 897

    The Query Optimizer estimates that only a few records will be returned by this query and decides to use an existing index, so if you look at the actual execution plan you will see an Index Seek and a Key Lookup operators. The I/O information will be similar to this

     

    logical reads 10, physical reads 0, read-ahead reads 0

     

    Now clear the plan cache to remove this plan and run the stored procedure again using a new parameter (Note: Be careful not to clear the plan cache of a production environment)

     

    dbcc freeproccache

    exec test @pid = 870

    This time you will get 4,688 rows and the execution plan will show a Clustered Index Scan. The I/O information will be similar to this

     

    logical reads 1240, physical reads 0, read-ahead reads 0

     

    In this example each execution created its own optimal execution plan. Now see what happen when the plan cache is not cleared before the second execution so they both use the same cached plan

     

    dbcc freeproccache

    exec test @pid = 870

    exec test @pid = 897

    This time the Query Optimizer will compile the first execution of the stored procedure and will create an optimal execution plan for the value 870. This will use a Clustered Index Scan and around 1,240 logical reads. Since this plan is cached, it will also be used for the second execution, using the value 897 and it will also show a Clustered Index Scan and around 1,240 logical reads. This second execution is using 124 times more reads than its optimal plan, as shown previously.

     

    You can try the other combination as well, clearing the plan cache, and running these two executions of the stored procedure, but this time using the value 897 on the first one and 870 on the second one.

     

    2) Optimize for a typical parameter

     

    There might be cases when most of the executions of a stored procedure use the same execution plan and/or you want to avoid the optimization cost. For these cases you can use the OPTIMIZE FOR hint. Use this hint when an optimal plan is generated for the majority of values used for the parameter. Only the few executions using an atypical parameter will not have an optimal plan.

     

    Suppose that almost all of the executions of our stored procedure would benefit from the previous plan using an Index Seek and a Key Lookup.  You could write the stored procedure this way

     

    alter procedure test (@pid int)

    as

    select * from Sales.SalesOrderDetail

    where ProductID = @pid

    option (optimize for (@pid = 897))

    When you run the stored procedure it will be optimized for the value 897, no matter what parameter value was specified for the execution. Test the following case

     

    exec test @pid = 870

    You can find this in the XML plan

     

    <ParameterList>

        <ColumnReference Column="@pid" ParameterCompiledValue="(897)" ParameterRuntimeValue="(870)" />

    </ParameterList>

    This clearly shows which value was used during optimization and which one during execution. In this case the stored procedure is optimized only once, and the plan is stored on the plan cache and reused as many times as needed. The benefit of using this hint is that you have total control on which plan is stored on the plan cache.

     

    3) Optimize on every execution

     

    If you want the best performance for every query the solution might be to optimize for every execution. You will get an optimal plan on every execution but will pay for the optimization cost. To do this use the RECOMPILE hint as shown here.

     

    alter procedure test (@pid int)

    as

    select * from Sales.SalesOrderDetail

    where ProductID = @pid

    option (recompile)

    The XML plan for this execution

     

    exec test @pid = 897

    will show

     

    <ParameterList>

        <ColumnReference Column="@pid" ParameterCompiledValue="(897)" ParameterRuntimeValue="(897)" />

    </ParameterList>

    Some other solution that has been traditionally implemented before has been using local variables but usually this is not a good idea. By doing this you are not only disabling parameter sniffing but also disabling the choice of the Query Optimizer to use the statistics histogram to find an optimal plan for the query. This solution will use the same execution plan for all the executions but may not be the optimal plan for any of them.


  • The Phases of Query Optimization

    One of the most interesting tools that you can use to gain additional knowledge on how the Query Optimizer works is the sys.dm_exec_query_optimizer_info DMV. This view contains cumulative query optimizer statistics since the SQL Server instance was started and it can also be used to get optimization information for a specific query or workload.

     

    In this post I will show you how you can use this DMV to get information regarding the phases of query optimization used by SQL Server. Unfortunately, all the optimizer events shown in this section are undocumented and marked as “Internal only” in Books Online.

     

    As shown in the SQL Server documentation, this view has three fields: counter, which is the name of the optimizer event; occurrence, which is the number of occurrences of the optimization event for this counter; and value, which is the average property value per event occurrence.

     

    To obtain the optimization information for a specific query you can take snapshots of this DMV before and after the query is executed and compare them to find the events that have changed. Keep in mind that if you execute a query that it is already on the plan cache, it may not cause a new optimization and may not be shown in this view. This DMV may also capture some other optimization events happening on the SQL Server instance at the same time that your query is executing.

     

    To start, run the following code to create three tables

     

    create table table1 (a int)

    create table table2 (a int)

    create table table3 (a int)

     

    Trivial Plan

     

    The SQL Server query optimizer is a cost-based optimizer but this cost-based optimization has an expensive startup cost. To avoid this cost for the simplest queries where cost-based optimization is not needed, SQL Server uses the trivial plan optimization. The next example shows a query that takes benefit of a trivial plan. The DMV output shows one trivial plan optimization of a query accessing one table with a maximum DOP of 1.

     

    select * from table1

     

    clip_image002

    Of course, you can also find out if a trivial plan was used during optimization by looking at the properties of the graphical plan, shown as Optimization Level TRIVIAL, or by looking at the XML plan, shown as StatementOptmLevel="TRIVIAL". If a query does not qualify for a trivial plan both of these properties will be shown as FULL instead.

     

    If a trivial plan is not found, the Query Optimizer will start the cost-based optimization.

     

    Many SQL Server users believe that it is the job of the Query Optimizer to search for all the possible plans for a query and to finally select the most efficient one. Because some queries may have a huge number of possible query plans, this may not be possible or may take too long to complete. Instead, the Query Optimizer uses three search phases and the optimization process can finish if a good enough plan is found at the end of any of these phases. If at the end of a phase the best plan is still very expensive the Query Optimizer will run the next phase. These phases are shown as search 0, search 1 and search 2 on the sys.dm_exec_query_optimizer_info DMV.

     

    Phase 0 – Transaction Processing

     

    The first phase is called the transaction processing phase and it is used for small queries typically found on transaction processing systems. The following example shows an optimization on phase 0, using 233 tasks for a query accessing 3 tables.

     

    select * from table1

    join table2 on (table1.a = table2.a)

    join table3 on (table1.a = table3.a)

     

    clip_image004

    Phase 1 – Quick Plan

     

    The next phase is called Quick Plan and it is appropriate for more complex queries. This phase may also consider parallelism. Note that, as shown in the next example, not every query qualifies for phase 0, so depending on the number of tables some queries may start directly on phase 1.

     

    select * from table1

    join table2 on (table1.a = table2.a)

     

    clip_image006

    Phase 2 – Full Optimization

     

    The last phase, called Full Optimization, is used for complex to very complex queries. This phase applies more sophisticated transformations than the previous ones.

     

    Timeout

     

    The DMV can also show a timeout event. When a timeout is found, the Query Optimizer stops the optimization process and returns the least expensive plan it has found so far. This timeout event is also shown on the properties of a graphical plan as Reason For Early Termination of Statement Optimization or on an XML plan as StatementOptmEarlyAbortReason.

     

    For example, the following output shows a timeout in phase 0, after 1,616 tasks on a query joining 12 tables.

    clip_image008

    To keep this post simple I have provided very small queries only, but you can experiment yourself with more complex and interesting queries. By the way, in Chapter 2 of Inside SQL Server 2005: T-SQL Querying, Lubor Kollar provides an excellent script to automatically extract the optimization information for a specific query from the sys.dm_exec_query_optimizer_info DMV.


More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement