THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Kalen Delaney

T-SQL Tuesday #002: Non-aggregated columns in a GROUP BY query

It's time for the second T-SQL Tuesday, once again managed by Adam Machanic. Since my first T-SQL Tuesday post dealt with my long years with the SQL Server product, I decided to continue in that vein.  Adam asked for us to write something about a puzzle we had to deal with, involving TSQL. I decided to write about one of the most puzzling behaviors of the TSQL language, that fortunately went away many versions ago….

When I started working with Sybase, over 22 years ago, I was hired as a Technical Support Analyst. The company was new and small, and didn't have a lot of internal training procedures in place, so I was basically just given the manuals to read for a couple of weeks before being put on the phones with customers.

My only knowledge of SQL and relational database systems was from a one semester course I had taken at UC Berkeley, taught by Michael Stonebraker. But, at least that was something, better than nothing.  I remember reading the TSQL manual, practically cover to cover, and encountering a very strange feature of queries using GROUP BY. I didn't completely understand it, but it seemed very strange. But I just kept reading, because I would have to face my first customer call… very soon. I remember that very first call, because I actually was able to provide some support. The caller was puzzled by the behavior he was getting from a certain TSQL query, the problem description sounded very much like something like the mysterious GROUP BY behavior I had read about just a day or two before.  And I was able to solve his problem that day, on my very first call.

As good as my memory for SQL Server trivia is, I don't remember the exact query that the customer was calling about. (I do remember the customer though, as we went on to work together on many more interesting problems, and I have recently reconnected with him on LinkedIn.) Although I don't remember the exact query, I remember the behavior. It had to do with a TSQL 'extension' to the standard, that allowed you to use columns in a SELECT list containing aggregates, that were not aggregates and not part of the GROUP BY. Whew…

So let's see what I am talking about. I'm going to use the pubs sample database, which is the only one we had 22 years ago, and all sample queries in all the documentation were written to use it. I know that data very well.  There is a table called titles, and each book in the titles table has an attribute called type, as well as title, price, and a few others. The following query is typical of the use of aggregation and GROUP BY. It takes all the rows with the same value for type, and produces one row of output for each distinct value of type. The average is the average per set of rows, one value per group of rows with the same type.

USE pubs;
GO
SELECT type, AVG(price) as avg_price
FROM titles
GROUP BY type;

RESULTS:
type         avg_price
------------ ---------------------
business     13.73
mod_cook     11.49
popular_comp 21.475
psychology   13.504
trad_cook    15.9633
UNDECIDED    NULL

Way back in SQL Server 3 and 4 (and 4.2 and 4.21) we could add additional columns to the SELECT,  so I could add the title column, or the first 10 characters of the title, to the query.  We would get the results shown after the query.

SELECT
  SUBSTRING(title, 1, 10) as name,
    type, avg(price) as avg_price,
    count(*) as number
FROM titles
WHERE  type like '%cook%'
GROUP BY type

RESULTS:
name       type         avg_price   number        
---------- ------------ ----------  ------
Silicon Va mod_cook     11.49       2             
The Gourme mod_cook     11.49       2              
Onions, Le trad_cook    15.96       3              
Fifty Year trad_cook    15.96       3              
Sushi, Any trad_cook    15.96       3

So first, you might wonder what it means to add the extra column. Once the rows are grouped by type, adding some additional column basically undoes the grouping. Now I get individual data values instead of aggregates (summaries).  Except that the avg_price and number do seem to be summaries. It might almost start to make sense, until we add an additional WHERE filter:

SELECT
SUBSTRING(title, 1, 10) as name,
type, avg(price) as avg_price,
count(*) as number
FROM titles
WHERE  type like '%cook%'
  AND title LIKE 'S%'
GROUP BY type

RESULTS:
name       type         avg_price   number     
---------- ------------ ----------- -----------
Silicon Va mod_cook     19.99       1          
The Gourme mod_cook     19.99       1          
Onions, Le trad_cook    14.99       1          
Fifty Year trad_cook    14.99       1          
Sushi, Any trad_cook    14.99       1
  

Both the average and the count changed when we added the additional condition, but the data that comes back does not all satisfy the condition… there are titles that do NOT start with 'S'! So what's going on?

It turned out that once we broke the ANSI rules, and added a column to the SELECT list that was not part of an aggregate function and was not contained in the GROUP BY, then our WHERE clause changes its meaning. In this case, the WHERE was only used to determine what rows were used to compute the aggregates. This is why the values for avg_price and number changed with the addition of the second condition. If we wanted to control which rows come back, we needed to specify the condition in a HAVING clause.

SELECT
SUBSTRING(title, 1, 10) as name,
type, avg(price) as avg_price,
count(*) as number
FROM titles
WHERE  type like '%cook%'
GROUP BY type
HAVING title LIKE 'S%'

RESULTS:
name       type         avg_price   number          
---------- ------------ ------------ -------
Silicon Va mod_cook     11.49        2          
Sushi, Any trad_cook    15.96        3

This sort of seems like the results I might expect. If this behavior is puzzling to you, imagine trying to explain it to customers and students new to SQL. It was actually my good friend Tibor Karaszi who ran all these queries for me on a SQL Server 4.21 that he had available, so I could verify the results. After being totally baffled by the behavior, he said: “I’m glad we don’t have to handle such topics in the newsgroups anymore!”

Even though we no longer can add unaggregated, ungrouped columns to the SELECT list of an aggregate query, that doesn't mean the puzzles are all gone. One problem that I continued to encounter when teaching beginning TSQL classes, was WHY this query isn't legal:

SELECT title, type, count(*) as number
FROM titles;

SQL Server 2008 gives you a somewhat helpful error message:

Msg 8120, Level 16, State 1, Line 1
Column 'titles.title' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Just a few versions ago, the message only reported that the titles.title column was invalid, with no explanation. In Books Online, the explanation was found in the section on GROUP BY. But if someone worte the preceding query, they didn't even use a GROUP BY and may have no idea there was any relationship of this query to grouping operations. It made for a very puzzling time, to try to figure out this query didn't work as expected.

Hopefully now there are fewer puzzling behaviors… but we'll just have to wait and see what everyone else blogs about.

Have fun (and THANKS Tibor!)

~Kalen

(posted Monday evening on the US West Coast, but already Tuesday UTC time.)

Published Monday, January 11, 2010 5:00 PM by Kalen Delaney

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Rob Farley said:

Love it! I often find big misunderstandings about grouping, but we're definitely better off than we were in those days. My post for today at http://msmvps.com/blogs/robfarley/archive/2010/01/12/t-sql-tuesday-t-sql-puzzle-with-having.aspx discusses a different nuance of the HAVING clause.

January 11, 2010 10:21 PM
 

mjswart said:

Great article. "History" lessons are fascinating :-)

Thanks Kalen.

January 12, 2010 9:26 AM
 

Jon Fife said:

Amazingly, some engines STILL allow selecting non-aggregates in a group by query without grouping them...

January 14, 2010 2:23 PM
 

John Dunleavy said:

Very interesting, Kalen.  I really enjoyed reading your post.  I started in SQL Server 6.5 myself, so it's great to hear stories about the older versions of SQL and the issues that were facing the administrators during those times.  The lessons learned along the way are eternally helpful and an appreciation for the improvements can really only be understood after you have gone through those things yourself.  Thanks for sharing.

January 20, 2010 11:23 AM
 

Anil Pillai said:

Fantastic history there, Kalen. Thank you.

January 28, 2010 10:05 AM
 

Adam Machanic said:

According to T-SQL Tuesday rules as ratified by me in the first and second T-SQL Tuesday posts, the T-SQL

February 8, 2010 2:14 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Favorite Non-technical Sites or Blogs

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement