THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

John Paul Cook

Script to determine cardinality for index analysis

When reviewing a client's database, I suspected there might be indexes on low cardinality (i.e., low "uniqueness") columns. Typing COUNT(DISTINCT colName)) became tedious quickly. Time for a script. Oops, wait, in deference to proponents of test driven development, time for a test case and then time for a script. Here's the test case:

CREATE TABLE dbo.CardinalTest(

      a bigint NULL,

      b binary(50) NULL,

      c bit NULL,

      d char(10) NULL,

      e date NULL,

      f datetime NULL,

      g datetime2(7) NULL,

      h datetimeoffset(7) NULL,

      i decimal(18, 0) NULL,

      j float NULL,

      k geography NULL,

      l hierarchyid NULL,

      m image NULL,

      n int NULL,

      o money NULL,

      p nchar(10) NULL,

      q ntext NULL,

      r numeric(18, 0) NULL,

      s nvarchar(50) NULL,

      t nvarchar(max) NULL,

      u real NULL,

      v smalldatetime NULL,

      w smallint NULL,

      x smallmoney NULL,

      y sql_variant NULL,

      z text NULL,

      aa time(7) NULL,

      bb timestamp NULL,

      cc tinyint NULL,

      dd uniqueidentifier NULL,

      ee varbinary(50) NULL,

      ff varbinary(max) NULL,

      gg varchar(50) NULL,

      hh varchar(max) NULL,

      ii xml NULL

);

The test case has a column for every data type in SQL Server 2008.

A cursor loop is need to iterate through all of the tables. Notice that instead of iterating through INFORMATION_SCHEMA.TABLES, a different approach was used in case you want to limit the analysis to the largest n tables. If you want to do this, add a TOP n clause. If you have a lot of columns, you can exceed the 4000 character limit of the @TSQL variable. If you don't use Unicode characters for table and column names, you can change @TSQL to VARCHAR(8000).

Now for the script:

DECLARE @TABLE_SCHEMA NVARCHAR(128);

DECLARE @TABLE_NAME NVARCHAR(128);

DECLARE @COLUMN_NAME NVARCHAR(128);

DECLARE @TSQL NVARCHAR(MAX);

 

DECLARE tableCursor CURSOR FAST_FORWARD FOR

/* The following query with a TOP n is useful by itself to show you the

   largest tables in your database. */

/* This query works for SQL Server 2005 and above. */

SELECT OBJECT_SCHEMA_NAME(object_id) AS schemaName  -- maybe add a TOP n 

     , OBJECT_NAME(object_id) AS tableName

FROM sys.dm_db_partition_stats

WHERE index_id < 2

AND OBJECTPROPERTY(object_id,'IsUserTable') = 1

ORDER BY sys.dm_db_partition_stats.row_count DESC;

/* This query works for SQL Server 2000. */

--SELECT TABLE_SCHEMA, TABLE_NAME

--FROM INFORMATION_SCHEMA.TABLES t

--WHERE T.TABLE_TYPE = 'BASE TABLE';

 

OPEN tableCursor;

 

SET ARITHABORT OFF     -- prevent divide by zero errors if table has no rows

SET ANSI_WARNINGS OFF  -- or if all values are NULLs

 

-- Perform the first fetch.

FETCH NEXT FROM tableCursor INTO @TABLE_SCHEMA, @TABLE_NAME;

-- Check @@FETCH_STATUS to see if there are any more rows to fetch.

WHILE @@FETCH_STATUS = 0

BEGIN

    DECLARE columnCursor CURSOR FAST_FORWARD FOR

    SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS

    WHERE TABLE_SCHEMA = @TABLE_SCHEMA

    AND TABLE_NAME = @TABLE_NAME

    AND DATA_TYPE NOT IN ('geography','geometry','image','ntext','text','xml');

 

    OPEN columnCursor;

 

    SET @tsql = 'select ''' + @TABLE_SCHEMA + ''' as ''schema'', ''' + @TABLE_NAME + ''' as ''table'', count(*) as nrows';

 

    -- Perform the first fetch.

    FETCH NEXT FROM columnCursor INTO @COLUMN_NAME;

    -- Check @@FETCH_STATUS to see if there are any more rows to fetch.

    WHILE @@FETCH_STATUS = 0

    BEGIN

       SET @tsql = @tsql + ', cast(cast(count(distinct(' + QUOTENAME(@COLUMN_NAME) + ')) as decimal(15,3))/cast(count(*) as decimal(15,3)) as decimal(4,3)) as ' + QUOTENAME(@COLUMN_NAME);

       -- This is executed as long as the previous fetch succeeds.

       FETCH NEXT FROM columnCursor INTO @COLUMN_NAME;

    END;

    SET @tsql = @tsql + ' FROM ' + QUOTENAME(@TABLE_SCHEMA) + '.' + QUOTENAME(@TABLE_NAME);

    

    CLOSE columnCursor;

    DEALLOCATE columnCursor;

 

    EXEC sp_executesql @tsql;

 

    FETCH NEXT FROM tableCursor INTO @TABLE_SCHEMA, @TABLE_NAME;

END;

 

CLOSE tableCursor;

DEALLOCATE tableCursor;

Here's some sample output:

schema table            nrows BusinessEntityID CreditCardID ModifiedDate

------ ---------------- ----- ---------------- ------------ ------------

Sales  PersonCreditCard 19118 1.000            1.000        0.059

As you can see, both BusinessEntityID and CreditCardID have the highest possible cardinality (i.e., every value is unique) and are suitable for indexing. ModifiedDate values have low cardinality (low selectively), so it wouldn't make sense to create an index on the ModifiedDate column.

Cardinality is not a term that's common in SQL Server literature, but it is common in Oracle literature. I suspect that Oracle people don't actually know what cardinality is. They do know that whenever somebody says "the indexes were created based on cardinality" that they must knowingly nod their heads in agreement. But internally they feel like they're back in college hoping that the professor doesn't call on them. So, the next time some Oracle people talk smack about SQL Server, nonchalantly throw out cardinality in your conversation just to make them squirm.

The script was written for use at a SQL Server 2005 customer. If SQL Server 2008 is available, take a look at this SSIS feature described here: http://sqlblog.com/blogs/eric_johnson/archive/2009/09/11/ssis-2008-data-profiling-task.aspx

 

 

Published Thursday, November 19, 2009 8:32 AM by John Paul Cook

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

cbolyard said:

Great post; thanks for sharing!

November 19, 2009 11:11 AM
 

sql_noob said:

nice

one time i created an index on a bit column, but it didn't seem to help performance :)

November 19, 2009 1:13 PM
 

ScottWhigham said:

Hi John Paul,

Have you considered using the Data Profiling Task in SSIS to offer additional support? It does something similar.

November 19, 2009 1:29 PM
 

John Paul Cook said:

Scott, the script was written for an all SQL Server 2005 shop. I added a link to a post describing the SSIS solution you proposed.

November 19, 2009 2:25 PM
 

ArnieRowland said:

John,

Very nice. A great expnansion would be to enable saving the results -perhaps as XML, or write out to a file.

I would even consider filtering out columns that fell below a user determined cardinality threshold level.

Arnie

November 20, 2009 10:13 AM
 

Ranga Narasimhan said:

Great script! Does SQL Server use the logic in this script in their missing index details DMVs ? If not they should implement this so highly selective columns come first in the index they recomend.

January 9, 2012 2:28 PM

Leave a Comment

(required) 
(required) 
Submit

About John Paul Cook

John Paul Cook is a Technology Solutions Professional for Microsoft's data platform and works out of Microsoft's Houston office. Prior to joining Microsoft, he was a Microsoft SQL Server MVP. He is experienced in Microsoft SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in oil and gas, financial, manufacturing, and healthcare industries. John is also a Registered Nurse who graduated from Vanderbilt University with a Master of Science in Nursing Informatics and is an active member of the Sigma Theta Tau nursing honor society. He volunteers as a nurse at safety net clinics. Contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement