
Benjamin Nevarez

  • Writing a Book, and Moving my Blog

    I started blogging about SQL Server here at SQLblog back in July 2009, and I enjoyed it a lot. Later, after a series of blog posts about the Query Optimizer, I was invited to write an entire book on that same topic. But after a few months I realized that it was going to be hard to keep blogging while also writing chapters for a book, on top of my regular day job, so I decided to stop blogging for a little while.

     

    Now that I have finished the last chapter of the book and I am working on the final chapter reviews, I decided to start blogging again. This time I am moving my blog to

     

    http://www.benjaminnevarez.com

     

    As with my previous posts, I plan to write about my topics of interest, like the relational engine and basically anything related to SQL Server. Hopefully you will find my new blog interesting and useful.

     

    Finally, I would like to thank Adam for allowing me to blog here.

  • Speaking at SQLSaturday #44 in Huntington Beach, CA (Los Angeles Area)

     

    I'll be presenting a session at SQLSaturday #44 in Huntington Beach, the first SQLSaturday in Southern California. The event takes place on Saturday, April 24 at Golden West College, 15744 Goldenwest St, Huntington Beach, CA 92647.


    For more information, visit the following link:

     

    http://sqlsaturday.com/44/eventhome.aspx

     

    My session is “How the Query Optimizer Works”. I hope to see you there.


  • The Query Optimizer and Cost Estimation

    Last year, when I presented my session about the Query Optimizer at the PASS Summit and at a couple of local user groups, I was asked how the estimated CPU and I/O costs in an execution plan are calculated. Since Microsoft does not publish how these costs are calculated, all I could say was that each operator's cost depends on the operator's algorithm and on the cardinality estimation.

     

    This time I am working on a related project and I thought that perhaps I could look into this again and show an example. But since there are dozens of operators, I decided to start with a simple one: the Clustered Index Scan operator. So I captured dozens of XML plans, used XQuery to extract their cost information, and after some analysis I was able to obtain a basic formula for this specific operator.
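
    If you want to try something similar yourself, a query along the following lines pulls the estimated costs of a Clustered Index Scan out of the cached XML plans. Take it only as a sketch: the attribute names come from the showplan XML schema, and the filter on the query text is just an example.

    with xmlnamespaces (default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    select
          qp.query_plan.value('(//RelOp[@PhysicalOp="Clustered Index Scan"]/@EstimateCPU)[1]', 'float') as EstimateCPU,
          qp.query_plan.value('(//RelOp[@PhysicalOp="Clustered Index Scan"]/@EstimateIO)[1]', 'float') as EstimateIO
    from sys.dm_exec_cached_plans cp
    cross apply sys.dm_exec_query_plan(cp.plan_handle) qp
    cross apply sys.dm_exec_sql_text(cp.plan_handle) st
    where st.text like '%Sales.SalesOrderDetail%'   -- the filter is just an example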

     

    To show this with an example, let us look at the largest table in AdventureWorks. Run the following query and look at the estimated CPU and I/O costs of the Clustered Index Scan operator:

     

    select * from Sales.SalesOrderDetail

    where LineTotal = 35

     

    [Figure: estimated CPU and I/O costs of the Clustered Index Scan operator]

     

    For a Clustered Index Scan operator I observed that the CPU cost is 0.0001581 for the first record plus 0.0000011 for each additional record. In this specific case the estimated number of records is 121,317, so we can use 0.0001581 + 0.0000011 * (121317 – 1), or 0.133606, which is the value shown as Estimated CPU Cost. In a similar way, I noticed that the minimum I/O cost is 0.003125 for the first database page and that it then grows in increments of 0.00074074 for every additional page. Since this operator scans the entire table, I can use the following query to find the number of database pages, which returns 1,234.

     

    select in_row_data_page_count, row_count

    from sys.dm_db_partition_stats

    where object_id = object_id('Sales.SalesOrderDetail')

    and index_id = 1

     

    In this case I have 0.003125 + 0.00074074 * (1234 – 1), or 0.916458, which is the value shown as Estimated I/O Cost.
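
    If you want to double-check the arithmetic, something like this reproduces both formulas. Remember that these constants are only my observations for this operator; they are not documented anywhere.

    declare @rows float, @pages float
    set @rows = 121317   -- estimated number of rows
    set @pages = 1234    -- number of data pages
    select 0.0001581 + 0.0000011 * (@rows - 1) as EstimatedCpuCost,    -- 0.133606
           0.003125 + 0.00074074 * (@pages - 1) as EstimatedIoCost     -- 0.916458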

     

    Finally, we add both costs, 0.133606 + 0.916458, to get 1.05006, which is the total estimated cost of the operator. In the same way, adding the cost of all the operators gives the total cost of the plan. In this case, the cost of the Clustered Index Scan, 1.05006, plus the cost of the first Compute Scalar operator, 0.01214, the second Compute Scalar operator, 0.01213, and the Filter operator, 0.0582322, gives the total cost of the plan, 1.13256, as shown next.

     

    [Figure: execution plan showing a total plan cost of 1.13256]

     

    Finally, you can combine this with my previous post, Fooling the Query Optimizer at http://sqlblog.com/blogs/ben_nevarez/archive/2010/01/13/fooling-the-query-optimizer.aspx, to experiment with “bigger” tables by changing the recorded number of rows and pages of a table for testing purposes.

  • Fooling the Query Optimizer

    Did you ever want to know which execution plans the Query Optimizer would generate for your queries if your tables had millions of records? You can actually generate those plans by using the undocumented ROWCOUNT and PAGECOUNT options of the UPDATE STATISTICS statement. These options can be used on small or empty tables and can be helpful for testing in scenarios where you may not want to spend the time or disk space creating big tables.

     

    By using this method you are tricking the Query Optimizer, as it will generate execution plans using cardinality estimations as if the table really had millions of records. Note that this option, available since SQL Server 2005, only helps in creating the execution plan for your queries. Actually running the query will use the real data and, of course, will execute faster than it would against a table with millions of records.

     

    UPDATE STATISTICS WITH ROWCOUNT, PAGECOUNT does not change the table statistics, only the counters for the number of rows and pages of a table. But the Query Optimizer uses this information to estimate the cardinality of queries, as I will show later. Also keep in mind that these are undocumented and unsupported options and should not be used in any production environment.

     

    Let us see an example. Run the following query to create a new table in the AdventureWorks database:

     

    select * into dbo.Address

    from Person.Address

    Inspect the number of rows by running the following queries; they should show 19,614 rows.

    select * from sys.partitions

    where object_id = object_id('dbo.Address')

     

    select * from sys.dm_db_partition_stats

    where object_id = object_id('dbo.Address')

     

    Run the following query

     

    select * from dbo.Address

    where city = 'London'

     

    Running this query will create new statistics for the City column and will show the following plan. Note that the estimated number of rows is 434 and that a simple Table Scan operator is used.

     

    [Figure: execution plan with a Table Scan and an estimated 434 rows]

     

    We can see where the Query Optimizer gets the estimated number of rows by inspecting the statistics object. Run this query to see the name of the statistics object:

     

    select * from sys.stats

    where object_id = object_id('dbo.Address')

     

    Then use the displayed statistics object name in the following statement (the name may be different in your case)

     

    dbcc show_statistics('dbo.Address', _WA_Sys_00000004_46136164)

     

    By looking at the histogram you can find the value 434 in EQ_ROWS for the RANGE_HI_KEY value ‘London’ (statistics and histograms are explained in previous posts on this blog).

     

    [Figure: histogram showing EQ_ROWS of 434 for the RANGE_HI_KEY value ‘London’]

     

    Now run UPDATE STATISTICS WITH ROWCOUNT, PAGECOUNT (you can specify any other values for rowcount and pagecount):

     

    update statistics dbo.Address with rowcount = 1000000, pagecount = 100000

     

    If you inspect the number of rows again using sys.partitions or sys.dm_db_partition_stats, as shown previously, they will now show 1,000,000 rows. sys.dm_db_partition_stats also shows the new number of pages. Clear the plan cache and run the query again:

     

    dbcc freeproccache

    go

    select * from dbo.Address

    where city = 'London'

     

    Note that the estimated number of rows has changed from 434 to 22,127.1 and that a different plan was generated using this new cardinality estimation; the Query Optimizer decided to parallelize this plan. This is a very simple query, but more dramatic plan changes can happen with more complex queries.

     

     

    [Figure: parallel execution plan with an estimated 22,127.1 rows]

     

    After execution, the actual number of rows is obviously still 434, but the Query Optimizer is not able to see this value.

     

    If you look at the statistics object again, using DBCC SHOW_STATISTICS as shown before, the histogram has not changed. One way to obtain the estimated number of rows shown in the new execution plan is to calculate the fraction of rows for the value ‘London’ in the statistics sample, which in this case is 19,614 rows, as shown in the header of the statistics object. So the fraction is 434 / 19,614, or 0.022127052. Then apply the same fraction to the new “current” number of rows: 1,000,000 * 0.022127052 gives 22,127.1, which is the estimated number of rows displayed in the plan.
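
    A quick way to check the numbers:

    select 434.0 / 19614 as Fraction,                   -- 0.022127052
           1000000 * 434.0 / 19614 as EstimatedRows     -- 22,127.05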

     

    Finally, drop the table you just created

     

    drop table dbo.Address

     

  • Insecure Metadata

    I was looking at some metadata access today and this reminded me of when I talked about this topic two years ago at the PASS Summit in Denver.

     

    Back in SQL Server 2000, most of the existing metadata was available to members of the public role, that is, to any login that can connect to the SQL Server instance. SQL Server 2005 changed this so that a user can see only the metadata of objects the user owns or has been granted access to, which was a huge security improvement. However, some items were still left visible to any user, which in some cases could be a security concern.

     

    Let us test some of this. Create a new login called insecure without granting it any permissions, then connect as that login and see what kind of information about the server you are able to obtain.
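
    A minimal way to create the test login (the password is just a placeholder):

    create login insecure with password = 'In$ecure_Pwd1'   -- placeholder password

    Now, connected as the insecure login, try this first: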

     

    select * from sys.databases

     

    This new user is able to see the list of all the databases on the server, including the ones it does not have access to. Now try:

     

    select * from sys.configurations

     

    This time you can see some configuration information about your instance, including security-related settings such as whether ‘c2 audit mode’ is enabled.

     

    Some other data available to any SQL Server user includes the location of the files of the current database, linked servers, and backup devices. Try these:

     

    select * from sys.database_files

    select * from sys.servers

    select * from sys.backup_devices

     

    Here I am using catalog views only, but the same applies if you use the SQL Server 2000 compatibility views (like sysdatabases). You can also execute some stored procedures like these to get similar information:

     

    sp_helpdb 'master'

    exec sp_configure

    exec sp_helpfile

    exec sp_helpserver

    exec sp_helpdevice

     

    Now let us look at the msdb database. By looking at the following system tables (yes, these are called system tables) you can obtain all the information regarding backup and restore operations performed on your instance: database names, the location of database files, the location of database backup files, and the names of the users who performed the backups.

     

    use msdb

    go

    select * from backupfile

    select * from backupmediafamily

    select * from backupset

    select * from restorefile

    select * from restorehistory

     

    You can also run some extended stored procedures or functions like the following

     

    exec master..xp_msver

    select serverproperty('ProductVersion')

    select serverproperty('IsIntegratedSecurityOnly')

     

    For example, the first two display the version (build) of SQL Server, which can show whether the database administrator has installed the latest service pack or security updates, and can therefore reveal known security vulnerabilities of your instance.

     

    Perhaps you can see now that this information (security configuration, database names, database file names, the location of database backups, user names, and so on) could be used by an attacker to gain additional permissions or data.

     

    Certainly these are the defaults, and many of them can be changed to be more secure, but often this is not practical because some tools and applications depend on these defaults. For example, you can revoke the VIEW ANY DATABASE permission (executed as a member of the sysadmin role) to restrict the information displayed in sys.databases.

     

    revoke view any database to public

     

    But now only database owners can see their databases listed in sys.databases. Of course, this is a problem for a valid user who is not a database owner, who will now see only the master and tempdb databases in this list. To test it, grant the insecure user read access to one of your user databases and then query sys.databases again while connected as the insecure login.
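
    A minimal sketch of granting that access (using AdventureWorks as an example database):

    use AdventureWorks
    go
    -- map the insecure login to a database user and add it to the db_datareader role
    create user insecure for login insecure
    exec sp_addrolemember 'db_datareader', 'insecure'

    Then run this again as the insecure login: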

     

    select * from sys.databases

     

    Only the master and tempdb databases are listed this time. One known issue of this configuration is that you cannot use the ODBC Data Source Administrator to create a new data source connecting to a database you have access to (unless you are the database owner).

     

    [Figure: ODBC Data Source Administrator not listing the database]

     

     

    Even Management Studio does not list all the databases you have access to (although you can still use the USE statement).

     

     

    [Figure: Management Studio Object Explorer not listing the database]

     

     

    But what is still ironic is that, even though you cannot use the ODBC Data Source Administrator or list your databases in Management Studio, you can still use db_name() to enumerate all the database names anyway, as in:

     

    select db_name(5)
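
    In fact, a simple loop over database IDs lists every database name, since db_name() returns NULL for IDs that do not exist:

    declare @id int
    set @id = 1
    while @id <= 100   -- 100 is just an arbitrary upper bound
    begin
          if db_name(@id) is not null print db_name(@id)
          set @id = @id + 1
    end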

     

    Finally, note that a permission introduced with SQL Server 2005, VIEW DEFINITION, can also help you gain more control over access to SQL Server metadata.
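
    For example, you could grant (or deny) metadata visibility at the object level; the object name here is just an example:

    grant view definition on object::dbo.MyTable to insecure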

  • Database Engine Tuning Advisor and the Query Optimizer

    Did you know that the Database Engine Tuning Advisor (DTA) uses the Query Optimizer to help you create indexes, indexed views, and partitions for your databases? The DTA uses the Query Optimizer to estimate the cost of queries so it can select the choices with the lowest estimated cost. But how can the Query Optimizer estimate the cost of a query using, for example, an index that does not exist yet?

     

    Creating real indexes during a DTA session could be very expensive and could cause other performance problems in your database. In addition, when the Query Optimizer uses an index to estimate the cost of a query, it uses only the index statistics; it does not need to access the index data.

     

    So, to avoid creating real indexes during a DTA session, SQL Server has a special kind of index called a hypothetical index. Hypothetical indexes are not real indexes; they contain only statistics and can be created with the undocumented option CREATE INDEX WITH STATISTICS_ONLY. This command creates only the statistics for the index.

     

    You may not be able to see these indexes during a DTA session because they are dropped automatically. But you can see the CREATE INDEX WITH STATISTICS_ONLY and DROP INDEX commands if you run Profiler to see what the DTA is doing. You can also create these indexes manually as I will show you later.

     

    Hypothetical indexes have been available in previous versions of SQL Server, where they were used by the DTA’s predecessor, the Index Tuning Wizard.

     

    Let us take a quick tour of some of these concepts. Create a new table in the AdventureWorks database:

     

    select *

    into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    Copy the following query and save it to a file

     

    select * from dbo.SalesOrderDetail

    where ProductID = 897

     

    Open a new DTA session. You can optionally run a Profiler session if you want to inspect what the DTA is doing. For the workload file, select the file containing the SQL statement that you just created. Specify AdventureWorks both as the database to tune and as the database for workload analysis. Click the Start Analysis button.

     

    When the DTA analysis finishes, run this query to inspect the contents of the msdb..DTA_reports_query table:

     

    select * from msdb..DTA_reports_query

     

    [Figure: contents of msdb..DTA_reports_query showing the tuned query with the current and recommended costs]

     

    Notice that the table contains information like the query that was tuned and the current and recommended costs. The current cost, 1.2434, is easy to verify by directly requesting an estimated execution plan for the query, as shown next.

     

     

    [Figure: estimated execution plan with a cost of 1.2434]

     

     

    Since the DTA analysis has completed, the hypothetical indexes it needed have already been dropped. In the next statement I will create the index recommended by the DTA, but instead of a regular index I will create it as a hypothetical index by adding WITH STATISTICS_ONLY.

     

    create clustered index cix_ProductID on dbo.SalesOrderDetail (ProductID)

    with statistics_only

     

    You can validate that a hypothetical index and its statistics were created by running the following (notice that the index is described as hypothetical in the index_description field):

     

    sp_helpindex 'dbo.SalesOrderDetail'

     

    select * from sys.stats

    where object_id = object_id('dbo.SalesOrderDetail')
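
    You can also check the is_hypothetical flag in sys.indexes:

    select name, type_desc, is_hypothetical
    from sys.indexes
    where object_id = object_id('dbo.SalesOrderDetail')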

     

    However, at the moment I am not aware of a way, outside the DTA, to ask the Query Optimizer to consider these hypothetical indexes in an estimated execution plan, so I am not able to see where the recommended cost shown earlier is coming from.

     

    Remove the hypothetical index by running this

     

    drop index dbo.SalesOrderDetail.cix_ProductID

     

    Implement the DTA recommendation this time as a regular clustered index

     

    create clustered index cix_ProductID on dbo.SalesOrderDetail (ProductID)

     

    After implementing the recommendation and running the query, the clustered index is in fact used by the Query Optimizer, and this time the estimated cost I got was 0.0033652, very close to the recommended cost listed before in msdb..DTA_reports_query.

     

    Finally, drop the dbo.SalesOrderDetail table you just created.

  • Are You Using Scalable Shared Databases?

    Did you know that you can share read-only databases between several instances of SQL Server?

    Scalable Shared Databases is a very interesting SQL Server feature that many of us seem to have almost forgotten about. Available for the first time in SQL Server 2005 and originally described in Microsoft KB article 910378, it was later fully documented in Books Online. This Enterprise edition-only feature allows a read-only database to be accessed at the same time by two or more SQL Server instances (the recommended maximum is 8). This configuration offers some performance benefits by allowing each of these instances to use its own resources, such as memory, CPU, and the tempdb database. Scalable shared databases are typically used as reporting databases.

    The concept behind Scalable Shared Databases is very simple: you copy a detached database to a volume, configure that volume as read-only, and then attach the database from several SQL Server instances. A shared database will look like a regular read-only database in Management Studio.

    [Figure: a scalable shared database shown as a read-only database in Management Studio]
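
    For example, once the read-only volume is mounted on a reporting instance, the attach step mentioned above is just a regular attach; the database name and file paths here are only placeholders:

    create database SalesArchive
    on (filename = 'R:\ReadOnlyData\SalesArchive.mdf'),
       (filename = 'R:\ReadOnlyData\SalesArchive_log.ldf')
    for attach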

    By the way, it would be interesting to know if Microsoft plans to add additional functionality to this scalability feature or to extend it to read-write databases, perhaps something similar to what Oracle RAC is already doing.

    For more details, especially on restrictions and configuration, see the ‘Deploying a Scalable Shared Database’ entry on Books Online.

  • Presenting at PASS: How the Query Optimizer Works

    I am excited that this is going to be my second time speaking at the PASS Summit. I have been attending PASS every year since 2003: a few times in Seattle, but also in Orlando (just after hurricane Jeanne), Dallas (just after hurricane Rita), and Denver.

    My session, How the Query Optimizer Works, is scheduled for Wednesday, November 4, 1:30pm - 2:45pm, in room 201.

    http://summit2009.sqlpass.org/Agenda/ProgramSessions/HowtheQueryOptimizerWorks.aspx

    So, if you are attending PASS, I hope you can stop by my session or say hello at any of the PASS events.

    This is the description of my session:

    “The Query Optimizer is the component of SQL Server that attempts to determine the best way to execute a query by finding an efficient execution plan. This session will show you how a better understanding of how the Query Optimizer works, and of what information it needs to generate better execution plans, can help you improve the performance of your databases. Learn about the high-level structures of the Query Optimizer and some important factors that are considered during the optimization phase of query processing. See how you can provide SQL Server with appropriate statistics and indexes so it can perform better cardinality estimation and produce an efficient execution plan. Since the SQL Server Query Optimizer is a cost-based optimizer, this information will help it to better estimate the execution plan cost. Finally, although the Query Optimizer almost always selects a good enough execution plan for a query, it may not create an efficient plan for all possible scenarios. See how you can troubleshoot these issues and how you can use other alternatives like hints or plan guides to force the Query Optimizer to produce a better execution plan.”

    Hope to see you there!

    Ben


  • Rebuilding Indexes vs. Updating Statistics

    One of the questions I was asked recently while speaking at user groups was about the order in which jobs like rebuilding indexes or updating statistics should be performed as part of database maintenance. I started writing this post about the topic over the weekend but was interrupted several times, including once to watch the VH1 premiere of the movie Anvil: The Story of Anvil.

     

    In general, the order should not matter, at least if you carefully consider these important points:

     

    1) By default, the UPDATE STATISTICS statement uses only a sample of records of the table. Using UPDATE STATISTICS WITH FULLSCAN will scan the entire table.

     

    2) By default, the UPDATE STATISTICS statement updates both index and column statistics. Using the COLUMNS option will update column statistics only. Using the INDEX option will update index statistics only.

     

    3) Rebuilding an index, for example by using ALTER INDEX … REBUILD, will also update index statistics with the equivalent of using WITH FULLSCAN. Rebuilding indexes does not update column statistics.

     

    4) Reorganizing an index, for example using ALTER INDEX … REORGANIZE, does not update any statistics.

     

    So, depending on your maintenance jobs and scripts, several scenarios can exist.

     

    The simplest scenario is if you want to rebuild all the indexes and update all the statistics. As mentioned before, if you rebuild all your indexes then all the index statistics will also be updated by scanning all the rows on the table. Then you just need to update your column statistics by running UPDATE STATISTICS WITH FULLSCAN, COLUMNS. Since the first job only updates index statistics and the second one only updates column statistics, it does not matter which one you execute first.
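
    A minimal sketch of that simple scenario could look like this (the table name is just an example; you would normally do this for every table or database-wide):

    alter index all on dbo.SalesOrderDetail rebuild
    update statistics dbo.SalesOrderDetail with fullscan, columns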

     

    A more complicated scenario is when you have a job that rebuilds indexes only when they exceed some fragmentation level. In these cases, perhaps you want to update only those index statistics that were not touched by the index rebuild job, plus all the column statistics.
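
    You can base that decision on sys.dm_db_index_physical_stats; here is a rough sketch, where the 30% threshold is just the commonly used rule of thumb and not a recommendation from this post:

    select object_name(ips.object_id) as table_name, i.name as index_name,
           ips.avg_fragmentation_in_percent
    from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') ips
    join sys.indexes i on ips.object_id = i.object_id and ips.index_id = i.index_id
    where ips.avg_fragmentation_in_percent > 30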

     

    Of course, the worst case scenario would be if you first rebuild your indexes, which also updates the index statistics by scanning the entire table, and later you run UPDATE STATISTICS using the default values, which again updates the index statistics but this time with a default sample. Not only are you updating your index statistics twice but you are overwriting the better of the two choices.

     

    Let me show you how these commands work with some examples using the AdventureWorks database. Create a new table, dbo.SalesOrderDetail:

     

    select * into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    The next query uses the sys.stats catalog view and shows that there are no statistics objects for the new table.

     

    select name, auto_created, stats_date(object_id, stats_id) as update_date from sys.stats

    where object_id = object_id('dbo.SalesOrderDetail')

     

    Use this query again to inspect the status of the statistics after each of the following commands. Now run the following query

     

    select * from dbo.SalesOrderDetail

    where SalesOrderID = 43670 and OrderQty = 1

     

    Use the previous sys.stats query to verify that two statistics objects were created, one for the SalesOrderID column and another for the OrderQty column (they both have names starting with _WA_Sys, as shown in the next figure). Now create the following index and run the query again to verify that a new statistics object for the ProductID column has been created. Notice the value of the auto_created column, which tells whether the statistics were created automatically by the Query Optimizer.

     

    create index ix_ProductID on dbo.SalesOrderDetail(ProductID)

     

    [Figure: sys.stats output showing the _WA_Sys column statistics and the ix_ProductID index statistics]

     

    Run the next command to update the column statistics only. You can validate that only the column statistics were updated by looking at the update_date column, which uses the STATS_DATE function to display the last date the statistics were updated.

     

    update statistics dbo.SalesOrderDetail with fullscan, columns

     

    [Figure: sys.stats output with new update dates for the column statistics only]

     

     

    This command will do the same for the index statistics

     

    update statistics dbo.SalesOrderDetail with fullscan, index

     

    These commands will update both index and column statistics

     

    update statistics dbo.SalesOrderDetail with fullscan

    update statistics dbo.SalesOrderDetail with fullscan, all

     

    See how an index rebuild only updates index statistics

     

    alter index ix_ProductID on dbo.SalesOrderDetail rebuild

     

    Here you can verify that reorganizing an index does not update statistics

     

    alter index ix_ProductID  on dbo.SalesOrderDetail reorganize

     

    Finally, remove the table you have just created

     

    drop table dbo.SalesOrderDetail

  • Speaking at the Orange County SQL Server Professionals User Group

     

    I will be speaking at the Orange County SQL Server Professionals User Group this Thursday, October 1st, 2009. The topic is “How the Query Optimizer Works”. So if you are in the Orange County or Los Angeles area please stop by and say hello.

    The meeting starts at 6:30 PM. More details and directions can be found here

    Orange County SQL Server Professionals User Group

    http://www.sqloc.com

  • The Missing Indexes Feature

    Since I will be speaking about the Query Optimizer at the upcoming PASS Summit, I have been preparing my presentation and at the same time blogging about it. This time I will describe the Missing Indexes feature, seen from the point of view of the Query Optimizer.

     

    We know that it is the job of the Query Optimizer to find an efficient execution plan for a query. But we rarely see the Query Optimizer directly giving us indications about what it needs to produce a better execution plan. One of these cases is the Missing Indexes feature, which was introduced with SQL Server 2005.

     

    The Query Optimizer determines what the best indexes for a query are and, if these indexes do not exist, it makes this information available in the XML plan and in the sys.dm_db_missing_index_* DMVs. Of course, by showing this information the Query Optimizer is also warning you that it might not be selecting an efficient plan. This information shows which indexes may be helpful to improve the performance of your query. You can even use SQL Server 2008 Management Studio to display the CREATE INDEX commands needed to create these indexes, as shown later.
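
    You can also query these DMVs directly; the following is a simplified example that shows only a few of the available columns:

    select d.statement as table_name, d.equality_columns, d.inequality_columns,
           d.included_columns, s.avg_user_impact, s.user_seeks
    from sys.dm_db_missing_index_details d
    join sys.dm_db_missing_index_groups g on d.index_handle = g.index_handle
    join sys.dm_db_missing_index_group_stats s on g.index_group_handle = s.group_handle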

     

    However, although this information about missing indexes is very helpful, the feature should not be used as a tuning tool and should not replace your own index analysis. Database administrators and developers should be aware of its limitations, as described in the Books Online entry ‘Limitations of the Missing Indexes Feature’.

     

    So let us take a quick look at how this feature works. Create a dbo.SalesOrderDetail table in the AdventureWorks database with the following command:

     

    select *

    into dbo.SalesOrderDetail

    from sales.SalesOrderDetail

     

    Run this query and ask for a graphical or XML execution plan

     

    select *

    from dbo.SalesOrderDetail

    where SalesOrderID = 43670

     

    This query can benefit from an index on the SalesOrderID column, but no missing index information is shown this time. One limitation of the Missing Indexes feature is that it does not work with a trivial plan optimization, as in this case. You can verify that this is a trivial plan by looking at the graphical plan properties (Optimization Level shows as TRIVIAL) or by looking at the XML plan (StatementOptmLevel="TRIVIAL").

     

    You can avoid the trivial plan optimization by using more complex features. In our case we are just going to create an unrelated index:

     

     

    create index ix_ProductID on dbo.SalesOrderDetail(ProductID)

     

    Note that the index created will not be used by our previous query, but the query will no longer qualify for a trivial plan. Run the query again; this time the XML plan will contain something like this:

     

    <MissingIndexes>

       <MissingIndexGroup Impact="99.703">

          <MissingIndex Database="[AdventureWorks]" Schema="[dbo]" Table="[SalesOrderDetail]">

             <ColumnGroup Usage="EQUALITY">

                <Column Name="[SalesOrderID]" ColumnId="1" />

             </ColumnGroup>

          </MissingIndex>

       </MissingIndexGroup>

    </MissingIndexes>

     

    And if you look at the graphical plan (only in SQL Server 2008 Management Studio) you will see a Missing Index warning and a CREATE INDEX command:

     

    [Figure: graphical plan showing the Missing Index warning and the suggested CREATE INDEX command]

     

    You can right-click the graphical plan and select Missing Index Details to see the CREATE INDEX command that can be used to create this index:

     

    /*

    Missing Index Details from SQLQuery1.sql

    The Query Processor estimates that implementing the following index could improve the query cost by 99.703%.

    */

     

    /*

    USE [AdventureWorks]

    GO

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]

    ON [dbo].[SalesOrderDetail] ([SalesOrderID])

     

    GO

    */

     

    Create the recommended index after providing a name for it. If you now run the same query again and look at the execution plan, you will see that an Index Seek operator is using the index you just created, and that both the Missing Index warning and the MissingIndexes element of the XML plan are gone.

     

    Finally, remove the dbo.SalesOrderDetail table you have just created.

     

    drop table dbo.SalesOrderDetail

  • Speaking at Los Angeles SQL Server Professionals Group

    I will be speaking at Los Angeles SQL Server Professionals Group this Thursday, September 17, 2009. The topic is “How the Query Optimizer Works”. So if you are in the area please stop by and say hello.

     

    The meeting starts at 6:30 PM. More details and directions can be found here

     

    Los Angeles SQL Server Professionals Group

    http://sql.la

  • How the Query Optimizer Uses Statistics – Part II

    I got a question from a reader of my post How the Query Optimizer Uses Statistics (http://sqlblog.com/blogs/ben_nevarez/archive/2009/09/03/how-the-query-optimizer-uses-statistics.aspx) and I thought it would be a good idea to post my answer here. Basically, the request was to expand the previous example to a predicate on two columns.

     

    First, a reminder that the histogram only shows the values of the first column of the statistics object. So how does the Query Optimizer handle a predicate on two columns?

     

    One way to find the answer is simply to run an example and see what the Query Optimizer does. Let us run a query, see which statistics are automatically created, and inspect those statistics.

     

    This is the code sent by the reader. Run this to create a table and populate it with some data.

     

    create table MyStatsTest (

          id int identity(1, 1),

          ProductGroupID int,

          ProductID int

    )

     

    declare @i int

    set nocount on

    set @i = 1

    while @i < 100000

    begin

    insert MyStatsTest (ProductGroupID, ProductID)

    select datepart(millisecond, getdate()) % 2, datepart(millisecond, getdate())

          set @i = @i + 1

    end

     

    Make sure you start with no statistics objects in the table. You can run this to verify that there are no statistics

     

    select * from sys.stats

    where object_id = object_id('MyStatsTest')

     

    If there are some statistics, perhaps after running a query, you can drop them using a command like this (the name of your statistics objects may be different)

     

    drop statistics MyStatsTest._WA_Sys_00000002_7D78A4E7

     

    Run the first query to see the statistics created automatically by the Query Optimizer.

     

    select * from MyStatsTest

    where ProductId < 17 and ProductGroupId = 1

     

    Now run the sys.stats query again and notice that two statistics objects were created. Use DBCC SHOW_STATISTICS to display both histograms, as in the next example:

     

    dbcc show_statistics('MyStatsTest', _WA_Sys_00000003_014935CB)

     

    You can use the first histogram to estimate the number of records for the ProductId < 17 predicate using the method I described in my previous post. I got 1,744 rows, which is the sum of the values 392, 302, 313, 419, and 318 (your table has different data, so you will get a different value).

     

    [Figure: histogram for the ProductID column]

     

    Using the second histogram you can see that 60.0316003% of the records have ProductGroupID = 1 (this is 60,031 divided by the total number of records, 99,999).

     

     

    [Figure: histogram for the ProductGroupID column]

     

     

    Since the query uses the AND operator, you need to take 60.0316003% of 1,744. This gives 1,046.95, which is the estimated number of rows shown on the execution plan of the query.

     

     

    [Figure: execution plan with an estimated 1,046.95 rows]

    You can do the same calculation for ProductGroupId = 0 and you will get an estimated number of rows of 697.049, which is also shown on the execution plan. Notice that you may need to run DBCC FREEPROCCACHE between tests to clear the plan cache; otherwise a cached execution plan may be shown.
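
    Just to check the arithmetic (the counts come from my histograms, and 39,968 is simply 99,999 minus 60,031; your numbers will differ):

    select 1744 * (60031.0 / 99999) as EstimatedRowsGroupId1,   -- ~1,046.95
           1744 * (39968.0 / 99999) as EstimatedRowsGroupId0    -- ~697.05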

     

    Finally, let us change the query to use the OR operator (instead of AND):

     

    select * from MyStatsTest

    where ProductId < 17 or ProductGroupId = 1

     

    The plan shows 60,728 estimated rows. Since this query uses an OR operator, this value can be obtained as the union of the following two queries:

     

    select * from MyStatsTest

    where ProductId < 17   

     

    and

     

    select * from MyStatsTest

    where ProductId >= 17 and ProductGroupId = 1

     

    The first query estimates 1,744 rows and the second one 58,984, for a total of 60,728 rows. You can use the previous method to estimate these values.

     

    The same value could also be obtained as the union of these two queries

     

    select * from MyStatsTest

    where ProductId < 17 and ProductGroupId = 0

     

    and

     

    select * from MyStatsTest

    where ProductGroupId = 1

     

    The first one shows 697.049 estimated rows and the second one 60,031, for a total of 60,728.049 estimated rows.
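
    Both decompositions add up to (approximately) the same estimate:

    select 1744 + 58984 as FirstDecomposition,        -- 60,728
           697.049 + 60031 as SecondDecomposition     -- 60,728.049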

  • Statistics Used by the Query Optimizer in SQL Server White Paper

    In case you are not aware, the excellent white paper “Statistics Used by the Query Optimizer in SQL Server” has been updated for SQL Server 2008.

     

    You can find it here

     

    Statistics Used by the Query Optimizer in SQL Server 2008

    Writers: Eric N. Hanson and Yavor Angelov

    Contributor: Lubor Kollar

    http://msdn.microsoft.com/en-us/library/dd535534.aspx

  • How the Query Optimizer Uses Statistics

     

    This post shows how the Query Optimizer uses statistics to estimate the selectivity of expressions during query optimization.

     

    You can also read this as a second part of my last post, The Query Optimizer and Parameter Sniffing. Here I will show you the advantage of using statistics when the Query Optimizer can “sniff” the parameter values, compared to just guessing the selectivity of expressions when local variables are used. So I will be using the same query as in that previous post.

     

    Notice that there are many other details that cannot be covered here, so this post will focus on a very simple example to show how the Query Optimizer creates and uses statistics.

     

    To start, open your AdventureWorks database and run this to display the current statistics on the ProductID column of the Sales.SalesOrderDetail table:

     

    dbcc show_statistics('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID)

    This will display the header, density vector and histogram of the statistics object.

     

    1) Understanding the Histogram

     

    First I will explain the meaning of the values of a histogram’s steps. Let us take a look at step 86, shown here

     

    [Figure: histogram steps 85 and 86]

     

    RANGE_HI_KEY is the upper boundary of a histogram step. The value 826 is the upper boundary for step 85 and 831 is the upper boundary for step 86. This means that step 86 may contain only values from 827 to 831.

     

    Run the following query to obtain the real number of records for ProductIDs 827 to 831 to compare them against the histogram

     

    select ProductID, COUNT(*) as Total

    from Sales.SalesOrderDetail

    where ProductID between 827 and 831

    group by ProductID

    This produces the following result

    [Figure: actual row counts for ProductIDs 827 to 831]

    Going back to the histogram, EQ_ROWS is the estimated number of rows whose column value equals RANGE_HI_KEY. In our example RANGE_HI_KEY is 831 and the number of records with ProductID 831 is 198. The same value is shown on the histogram.

     

    RANGE_ROWS is the estimated number of rows whose column value falls inside the range of the step, excluding the upper boundary. In our example, this is the number of records with values from 827 to 830 (831, the upper boundary, is excluded). The histogram shows 110 records and we could obtain the same value by getting the sum of 31 records for 827, 46 records for 828, 0 records for 829 and 33 records for 830.

     

    DISTINCT_RANGE_ROWS is the estimated number of distinct column values inside this range, excluding the upper boundary. In our example we have records for three distinct values: 827, 828, and 830, so DISTINCT_RANGE_ROWS is 3. There are no records for ProductID 829, and 831 is again excluded.

     

    Finally, AVG_RANGE_ROWS is the average number of rows per distinct value and is calculated as RANGE_ROWS / DISTINCT_RANGE_ROWS. In our example we have a total of 110 records for 3 DISTINCT_RANGE_ROWS, which gives 110 / 3 = 36.6667, also shown in the histogram for step 86.

     

    Now let us see how the statistics are used to estimate the selectivity of the queries.

     

    2) When the Query Optimizer knows the value

     

    Let us see the first query

     

    select * from Sales.SalesOrderDetail

    where ProductID = 831

     

    [Figure: execution plan with an estimated 198 rows for ProductID = 831]

     

    Since 831 is a RANGE_HI_KEY on step 86, the Query Optimizer will use the EQ_ROWS value and the estimated number of rows will be 198.

     

    Now run the same query with the value 828

    [Figure: execution plan with an estimated 36.6667 rows for ProductID = 828]

    This time the value is inside the range of step 86 but it is not a RANGE_HI_KEY, so the Query Optimizer uses the value calculated before as AVG_RANGE_ROWS. In fact, we get the same estimated number of rows for any of the other values in the range (except the RANGE_HI_KEY). This even includes 829, for which there are no records at all.

     

    Let us now try a nonequality operator and find the number of records with ProductID less than 714. For this we need to calculate the sum of both RANGE_ROWS and EQ_ROWS for steps 1 through 7, which gives us a total of 13,223 rows.

     

    [Figure: histogram steps 1 through 7]

     

    This is the query, and the estimated number of rows is shown on the execution plan:

     

    select * from Sales.SalesOrderDetail

    where ProductID < 714

     

    [Figure: execution plan with an estimated 13,223 rows for ProductID < 714]
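
    If you want to let SQL Server do the adding for you, you can capture the histogram into a temporary table and sum the steps yourself. This is only a sketch; the WHERE clause is a rough way of picking the steps below 714.

    create table #histogram (
          RANGE_HI_KEY sql_variant, RANGE_ROWS float, EQ_ROWS float,
          DISTINCT_RANGE_ROWS bigint, AVG_RANGE_ROWS float)

    insert into #histogram
    exec ('dbcc show_statistics(''Sales.SalesOrderDetail'', IX_SalesOrderDetail_ProductID) with histogram')

    select sum(RANGE_ROWS + EQ_ROWS) as EstimatedRows   -- 13,223 in my copy of AdventureWorks
    from #histogram
    where convert(int, RANGE_HI_KEY) < 714

    drop table #histogram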

    3) When the Query Optimizer does not know the value

     

    When the Query Optimizer does not know the value used in the expression, as when local variables are used, it cannot use the histogram. Instead, it uses other information, such as the density information of the statistics object, or it tries to guess the selectivity using some fixed percentages. First, let us try the equality operator.

     

    declare @pid int = 897

    select * from Sales.SalesOrderDetail

    where ProductID = @pid

     

    The Query Optimizer is not able to see the value 897 in this query. As explained in my previous post, the Query Optimizer does not know the value of the @pid local variable at optimization time, so it will use the density value of the ProductID column, 0.003759399, as listed in the second section, the density vector, of the DBCC SHOW_STATISTICS output. If we multiply this value by the total number of records, 121,317, we get 456.079, which is shown in the next execution plan.

    [Figure: execution plan with an estimated 456.079 rows]
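
    You can see that density value on its own with the DENSITY_VECTOR option, and the estimate is just the density times the table cardinality:

    dbcc show_statistics('Sales.SalesOrderDetail', IX_SalesOrderDetail_ProductID) with density_vector

    select 0.003759399 * 121317 as EstimatedRows   -- 456.079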

    Finally, run this query with a nonequality operator

     

    declare @pid int = 897

    select * from Sales.SalesOrderDetail

    where ProductID < @pid

     

    Same as before, the value 897 does not matter; any other value will give you the same estimated number of rows and execution plan. For a nonequality operator the estimate is always 30% of the total number of records. In this case, 30% of 121,317 is 36,395.1, as shown next.

    [Figure: execution plan with an estimated 36,395.1 rows]
