When reviewing a client's database, I suspected there might be indexes on low cardinality (i.e., low "uniqueness") columns. Typing COUNT(DISTINCT colName)) became tedious quickly. Time for a script. Oops, wait, in deference to proponents of test driven development, time for a test case and then time for a script. Here's the test case:
CREATE TABLE dbo.CardinalTest(
a bigint NULL,
b binary(50) NULL,
c bit NULL,
d char(10) NULL,
e date NULL,
f datetime NULL,
g datetime2(7) NULL,
h datetimeoffset(7) NULL,
i decimal(18, 0) NULL,
j float NULL,
k geography NULL,
l hierarchyid NULL,
m image NULL,
n int NULL,
o money NULL,
p nchar(10) NULL,
q ntext NULL,
r numeric(18, 0) NULL,
s nvarchar(50) NULL,
t nvarchar(max) NULL,
u real NULL,
v smalldatetime NULL,
w smallint NULL,
x smallmoney NULL,
y sql_variant NULL,
z text NULL,
aa time(7) NULL,
bb timestamp NULL,
cc tinyint NULL,
dd uniqueidentifier NULL,
ee varbinary(50) NULL,
ff varbinary(max) NULL,
gg varchar(50) NULL,
hh varchar(max) NULL,
ii xml NULL
);
The test case has a column for every data type in SQL Server 2008.
A cursor loop is need to iterate through all of the tables. Notice that instead of iterating through INFORMATION_SCHEMA.TABLES, a different approach was used in case you want to limit the analysis to the largest n tables. If you want to do this, add a TOP n clause. If you have a lot of columns, you can exceed the 4000 character limit of the @TSQL variable. If you don't use Unicode characters for table and column names, you can change @TSQL to VARCHAR(8000).
Now for the script:
DECLARE @TABLE_SCHEMA NVARCHAR(128);
DECLARE @TABLE_NAME NVARCHAR(128);
DECLARE @COLUMN_NAME NVARCHAR(128);
DECLARE @TSQL NVARCHAR(MAX);
DECLARE tableCursor CURSOR FAST_FORWARD FOR
/* The following query with a TOP n is useful by itself to show you the
largest tables in your database. */
/* This query works for SQL Server 2005 and above. */
SELECT OBJECT_SCHEMA_NAME(object_id) AS schemaName -- maybe add a TOP n
, OBJECT_NAME(object_id) AS tableName
FROM sys.dm_db_partition_stats
WHERE index_id < 2
AND OBJECTPROPERTY(object_id,'IsUserTable') = 1
ORDER BY sys.dm_db_partition_stats.row_count DESC;
/* This query works for SQL Server 2000. */
--SELECT TABLE_SCHEMA, TABLE_NAME
--FROM INFORMATION_SCHEMA.TABLES t
--WHERE T.TABLE_TYPE = 'BASE TABLE';
OPEN tableCursor;
SET ARITHABORT OFF -- prevent divide by zero errors if table has no rows
SET ANSI_WARNINGS OFF -- or if all values are NULLs
-- Perform the first fetch.
FETCH NEXT FROM tableCursor INTO @TABLE_SCHEMA, @TABLE_NAME;
-- Check @@FETCH_STATUS to see if there are any more rows to fetch.
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE columnCursor CURSOR FAST_FORWARD FOR
SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = @TABLE_SCHEMA
AND TABLE_NAME = @TABLE_NAME
AND DATA_TYPE NOT IN ('geography','geometry','image','ntext','text','xml');
OPEN columnCursor;
SET @tsql = 'select ''' + @TABLE_SCHEMA + ''' as ''schema'', ''' + @TABLE_NAME + ''' as ''table'', count(*) as nrows';
-- Perform the first fetch.
FETCH NEXT FROM columnCursor INTO @COLUMN_NAME;
-- Check @@FETCH_STATUS to see if there are any more rows to fetch.
WHILE @@FETCH_STATUS = 0
BEGIN
SET @tsql = @tsql + ', cast(cast(count(distinct(' + QUOTENAME(@COLUMN_NAME) + ')) as decimal(15,3))/cast(count(*) as decimal(15,3)) as decimal(4,3)) as ' + QUOTENAME(@COLUMN_NAME);
-- This is executed as long as the previous fetch succeeds.
FETCH NEXT FROM columnCursor INTO @COLUMN_NAME;
END;
SET @tsql = @tsql + ' FROM ' + QUOTENAME(@TABLE_SCHEMA) + '.' + QUOTENAME(@TABLE_NAME);
CLOSE columnCursor;
DEALLOCATE columnCursor;
EXEC sp_executesql @tsql;
FETCH NEXT FROM tableCursor INTO @TABLE_SCHEMA, @TABLE_NAME;
END;
CLOSE tableCursor;
DEALLOCATE tableCursor;
Here's some sample output:
schema table nrows BusinessEntityID CreditCardID ModifiedDate
------ ---------------- ----- ---------------- ------------ ------------
Sales PersonCreditCard 19118 1.000 1.000 0.059
As you can see, both BusinessEntityID and CreditCardID have the highest possible cardinality (i.e., every value is unique) and are suitable for indexing. ModifiedDate values have low cardinality (low selectively), so it wouldn't make sense to create an index on the ModifiedDate column.
Cardinality is not a term that's common in SQL Server literature, but it is common in Oracle literature. I suspect that Oracle people don't actually know what cardinality is. They do know that whenever somebody says "the indexes were created based on cardinality" that they must knowingly nod their heads in agreement. But internally they feel like they're back in college hoping that the professor doesn't call on them. So, the next time some Oracle people talk smack about SQL Server, nonchalantly throw out cardinality in your conversation just to make them squirm.
The script was written for use at a SQL Server 2005 customer. If SQL Server 2008 is available, take a look at this SSIS feature described here: http://sqlblog.com/blogs/eric_johnson/archive/2009/09/11/ssis-2008-data-profiling-task.aspx