THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Aaron Bertrand

Aaron is a Product Evangelist for SentryOne, makers of Plan Explorer and a performance monitoring and event management platform for the Microsoft Data Platform and VMware. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features, and also blogs at blogs.sentryone.com and SQLPerformance.com; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at major conferences, user group meetings, and SQL Saturday events worldwide.

Determining distribution of NULL values

Today on the twitter hash tag #sqlhelp, @leenux_tux asked:

How can I figure out the percentage of fields that don't have data ?

After further clarification, it turns out he is after what proportion of columns are NULL.

Some folks suggested using a data profiling task in SSIS. There may be some validity to that, but I'm still a fan of sticking to T-SQL when I can, so here is how I would approach it:

  1. Create a #temp table or @table variable to store the results.
  2. Create a cursor that loops through all of the columns for all tables, including total row count for the table, and whether or not the column is in fact nullable.
  3. Within the cursor, for columns that are nullable, get a count where columns are NULL, and insert the data for each column into the #temp table or @table variable.
        - I think it's ok here to use WITH (NOLOCK) to obtain the count, since you are likely only looking for estimates
  4. Run queries against the results, depending on exactly what you are after - percentage for certain columns, percentage across the database, what have you.

Some sample code, though depending on your tables and data distribution I am not going to vouch for how long this will take - may be a good idea to restore a backup somewhere and run this kind of thing there, as opposed to your production system:

DECLARE @all_columns TABLE
(
  
[table]    NVARCHAR(512),
  
[column]   SYSNAME,
  
nullable   INT,
  
row_count  BIGINT,
  
null_count BIGINT
);

DECLARE
  
@sql        NVARCHAR(MAX),
  
@table      NVARCHAR(512),
  
@column     SYSNAME,
  
@nullable   BIT,
  
@row_count  INT,
  
@null_count INT;

DECLARE c CURSOR
LOCAL STATIC FORWARD_ONLY READ_ONLY
   FOR
   SELECT
      
QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))
        +
'.' + QUOTENAME(OBJECT_NAME(c.[object_id])),
      
c.name, c.is_nullable, p.row_count
  
FROM sys.columns AS c
  
INNER JOIN
  
(
      
SELECT [object_id], row_count = SUM(row_count)
      
FROM sys.dm_db_partition_stats
      
WHERE index_id IN (0,1)
      
GROUP BY [object_id]
  
) AS p
  
ON c.[object_id] = p.[object_id]
  
WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0;

OPEN c;

FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;

WHILE
@@FETCH_STATUS = 0
BEGIN
   SET
@null_count = 0;

  
IF @nullable = 1 AND @row_count > 0
  
BEGIN
       SET
@sql = N'SELECT @null_count = COUNT(*)
FROM '
+ @table + ' WITH (NOLOCK)
WHERE '
+ QUOTENAME(@column) + ' IS NULL;';

      
EXEC sp_executesql
          
@query      = @sql,
          
@params     = N'@null_count INT OUTPUT',
          
@null_count = @null_count OUTPUT;
  
END

   INSERT
@all_columns([table], [column], nullable, row_count, null_count)
  
SELECT @table, @column, @nullable, @row_count, @null_count;

  
FETCH NEXT FROM c INTO @table, @column, @nullable, @row_count;
END

CLOSE
c;
DEALLOCATE c;

SELECT
  
[table],
  
[column],
  
nullable,
  
null_count,
  
row_count,
  
null_perc = CONVERT(DECIMAL(5,2), 100.0*null_count/NULLIF(row_count,0))
FROM @all_columns
ORDER BY [table], [column];

Hope that helps. Even though it uses NOLOCK, dynamic SQL, and a cursor. I'm such a naughty database guy.
 

Published Thursday, November 17, 2011 11:21 PM by AaronBertrand

Comments

 

Stephen Mandeville said:

SELECT @table, @column, @nullable, @total_rows, @null_count;

@total_rows is declared where?

November 18, 2011 8:07 AM
 

AaronBertrand said:

Sorry, swapped out names last minute. Missed one. Fixed.

November 18, 2011 8:14 AM
 

Greg Faulk said:

The script failed with two different errors on my development system. Our programmers sometimes choose reserved words for column names (I have no control over this) so I added quotename() for the column name dynamic SQL:

SET @sql = N'SELECT @null_count = COUNT(*)

FROM ' + @table + ' WITH (NOLOCK)

WHERE ' + QUOTENAME(@column) + ' IS NULL;';

Also the final select statement produces a divide by zero error when reporting a table with zero rows so I added a CASE statement:

SELECT

  [table],

  [column],

  nullable,

  null_count,

  row_count,

  null_perc = case when row_count = 0 then NULL else CONVERT(DECIMAL(5,2), 100.0*null_count/row_count) end

FROM @all_columns

where row_count != 0

ORDER BY [table], [column];

November 18, 2011 11:39 AM
 

AaronBertrand said:

Thanks Greg, I'm usually quite meticulous about QUOTENAME() but I guess you get what you pay for. I put the script together pretty quickly last night and perhaps should have held off until today. I've incorporated your fixes.

Cheers,

Aaron

November 18, 2011 2:39 PM
 

Shy Engelberg said:

If the column has an index on it, or column statistics defined, then using the DBCC SHOW_STATISTICS statement can be usefull.

you can itterate through the nullable columns using a Cursor and run DBCC SHOW_STATISTICS using dynamic SQL.

The downside is that it requires statistics to be defined on the column (by using an index or explicitly creating the statistics) and that the statistics will be updated. (WITH FULLSCAN)

November 20, 2011 8:35 AM
 

Uri Dimant said:

Hi Aaron

Great script, I have my two cents to be added :-)

Is that  possible to filter out for is_nullable and row_count before starting a cursor with CTE --WHERE row_count>0 and etc. in terms of performance? What do you think?

WITH cte

AS

(

SELECT

      QUOTENAME(OBJECT_SCHEMA_NAME(c.[object_id]))

       + '.' + QUOTENAME(OBJECT_NAME(c.[object_id])) tblname,

      c.name, c.is_nullable, p.row_count

  FROM sys.columns AS c

  INNER JOIN

  (

      SELECT [object_id], row_count = SUM(row_count)

      FROM sys.dm_db_partition_stats

      WHERE index_id IN (0,1)

      GROUP BY [object_id]

  ) AS p

  ON c.[object_id] = p.[object_id]

  WHERE OBJECTPROPERTY(c.[object_id], 'IsMsShipped') = 0

) SELECT * FROM cte WHERE is_nullable>0 AND row_count>0

November 20, 2011 9:22 AM
 

AaronBertrand said:

Uri, I left those columns in because I suspect most people would still want to see the full report. I don't think performance is a primary concern here because this is a one-off task that you're not running 50 times a day...

Shy, yes you can look at statistics where they're available, but as I told Uri performance isn't the number one priority here, and the code is much simpler if I just do a count instead of checking if there are stats, checking if they're updated, updating them if they're not, and still having to code the count method for columns that don't have stats.

November 20, 2011 10:14 AM
New Comments to this post are disabled

About AaronBertrand

...about me...

This Blog

Syndication

Archives

Privacy Statement