THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

Grouped String Concatenation: ... The Winner Is ...

After weeks of putting it off, I finally found the time and spent the last day and a half judging the Grouped String Concatenation Challenge. I would like to congratulate the winner, Peter Larsson, who submitted a great query and walks away with a shiny new MSDN Premium subscription.

For those who are interested, following is a breakdown of the judging process, along with some commentary:

 

Submission Process

To begin with, e-mails. As I mentioned in the first post, I ignored all e-mails that didn't follow the directions. Luckily this was only a few submissions. I felt it rather odd that people would spend a not insignificant amount of time working up a solution only to not bother to read the guidelines thoroughly. But that's human nature, I suppose.

 

Round 1

Once I collected all of the queries that followed the e-mail rules (all of which are included in the attached ZIP file), I began testing against an expanded version of AdventureWorks (the script for that is also included). I decided to eliminate any queries that did not produce the correct output data based on my sample set, or which took longer than 30 seconds to complete. The majority of queries did complete in a reasonable amount of time, and many were eliminated because the output simply wasn't correct. The biggest issue was ordering of the elements in the comma-delimited sets. I also deducted points from one person's entry because of invalid column names, but I decided to let the entry ride to the next round.

An important side note is that I created this competition with the sole intention of discovering new and different ways to do grouped string concatenation, and my hope was that someone would come up with a clever, fast solution. Unfortunately, that didn't happen, and every submission that used any technique except FOR XML PATH was eliminated in the first round of testing. I received some extremely creative solutions from a couple of people and I would like to mention them here:

  • Alejandro Mesa's submissions made use of various XQuery techniques, and are very interesting to look at, although fairly slow
  • Dean Cochrane's submission used an interesting idea of doing a MAX(CASE ...) pivot for the lists. Alas, the product names lists were not correct, so the submission didn't make it to the stress testing phase
  • Scott Coleman tried a similar technique, actually using the PIVOT keyword. Unfortunately, this ran for over 200 seconds, so it was eliminated

Also interesting to note is that a few people tried recursive CTE solutions. These were all cancelled at the 300 second mark. Recursive CTEs, as mentioned before on here on SQLblog, simply do not scale in their current implementation.

 

Round 2

After tabulating the Round 1 results I was left with 18 queries, and some obvious contenders.  I ran each query through a SQLQueryStress session with 10 threads running 5 iterations each. In this phase the queries were separated into fairly distinct groups: Those that ran for around 5 minutes, those that ran for around 7-8 minutes, and those that ran longer. These groups were based, not surprisingly, on how much attention was paid by the query writers to the little details. For example, Peter Larsson's winning query cut down on logical reads dramatically by doing some of the grouping in a derived table, rather than in the outermost query as some of the other submissions did.

Lesson learned: When doing aggregations, especially when joining a lot of tables, think about what you're really aggregating, and do the aggregation as early as possible. For example, if you need to aggregate sales per customer and get customer names, do the aggregation of the sales numbers first, then join out to get the customer names. Otherwise the query processor is forced to do more work than it has to do, and your query won't be as fast. Peter and a few other contestants understood this distinction and wrote queries that were much faster as a result.

 

Round 3

Round 2 eliminated 4 queries, leaving me with 14 to judge based on query style. In order to judge consistently, I came up with 10 factors. A query was allotted 500 points to start, and failure to meet each factor resulted in a 50 point penalty. These factors were:

  • Consistent Indentation
    • Does the query use the same rules for indentation in all parts? This is huge for readability and helps people understand where each section of the query starts and ends.
  • Consistent Capitalization
    • Does the query use the same rules for capitalization throughout? For example, keywords should be either all capitalized, or all lowercase.
  • Capitalize Keywords
    • I like to see keywords capitalized.
  • Use AS for Alias Names
    • AS is optional, and I've left it out in many queries I've written. But the more of other peoples' code I read, the more I realize that it really does help on the readability front. Use it. Always.
  • Follow Capitalization of Base Tables/Columns
    • If the base table is called OrderHeader, I want to see it used as OrderHeader when referenced in your query, rather than orderheader. A trainer I know found this out the hard way, when he reinstalled SQL Server on his laptop shortly before a training session, and used a case-sensitive collation rather than his previously-installed case-insensitive collation. He had been careless in adhering to capitalization for his training materials, and discovered the issue in front of the class. Oops.
  • Long Horizontal Lists
    • I don't like horizontal scrolling, and I find long lists difficult to read.
  • Consistent Vertical Lists
    • Put either a comma after each element or before each element, not both. Indent your lists the same way throughout. If you indent some items below the SELECT, don't put other items on the same line as the SELECT (or GROUP BY, or ORDER BY, etc)
  • Alignment of Delimiters
    • I follow a .NET-inspired style where I put delimiters on their own lines, and line them up vertically. This gives my code what I feel is an airy, easy-to-read feel. When reading others' code I look for some kind of alignment. Failure to align delimiters makes it very difficult to understand, again, where one section begins and another ends. By the way, common delimiters for this challenge included both parens and CASE...END.
  • Comments
    • Does the query have comments? Are the comments useful in understanding the logic?
  • Aesthetics
    • This is perhaps the most subjective. My general feeling on how I enjoyed reading the code.

All in all, the queries were pretty good. I would like to call out Rick Halliday, who had the highest score in this round with some very well formatted and highly readable code.

Round 4

After judging Round 3 I tallied all of the scores and was left with a tie for top 3:

  • Rick Halliday
  • Leonid Koyfman
  • Peter Larsson's query #4

All three of these queries were well thought out, but only one could win, so I took another pass through each. Rick's query, though extremely well written and readable, was eliminetad first due to the fact that it performed worse than the other two. This left Leonid and Peter. It was a tough choice, but I had to give the prize to Peter for taking the time to really think through the problem and figure out exactly how best to do the aggregations. Leonid was a very, very close second, and I really wish I had a consolation prize for him.


The End

And that's that. Thank you to everyone who participated in the challenge. I hope it was as much a learning experience for you as it was for me. Congratulations again to Peter. All of the materials are attached in the ZIP file; please let me know if you have any questions, comments, etc.

 

Published Sunday, May 31, 2009 5:07 PM by Adam Machanic

Attachment(s): GroupedString.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Roman Nowak said:

Frankly, it's a little unclear to me why you didn't choose Leonid as winner. Logical IO is considered to be a poor metric (http://sqlblog.com/blogs/joe_chang/archive/2008/09/10/why-logical-io-is-a-poor-performance-metric.aspx). Don't you agree? After all it is Leonid solution which proved to be faster.

What's more having done a qick dirty check on i7 (no HT) and Peter's solution is  about 70% slower on this config (SQL 2005) - 2:21:3906 vs 1:17:6093 - that's a huge difference; so testbed seems to have a big influance here.

I hope it's not rude to have such a contrary opinion - your rules, you judge - but it just an opinion you asked for.

May 31, 2009 10:14 PM
 

Dave Ballantyne said:

Not that I wish to nit pick but , how were my entries deemed to have "order numbers not properly sorted" ?.  In both I ordered by OrderId,  where as Peter's winning entry order by OrderNumber.  The end result is the same though as OrderNumber is a Computer column based upon OrderId !.  Or can you point me to a customer id where there is a different order. A quick eyeball of the top 10 rows shows no difference.  Either way quite happy with the performance of my queries though.

June 1, 2009 6:47 AM
 

Adam Machanic said:

Roman: Yup, my rules, my verdict.  I had to choose one, so I did, based on one of the available metrics.

Dave: Yes, it's a computed column, but not explicitly sorting on the right thing seems to mess with the results.  Try running my expansion script and you can see this for yourself.  Why did you decide to sort on the ID column rather than the column you actually wanted the output sorted by?

June 1, 2009 8:32 AM
 

Alejandro Mesa said:

Congrats to Peter Larsson.

Thanks Adam; I enjoyed the fun.

AMB

June 1, 2009 10:25 AM
 

Adam Machanic said:

Update on the sorting issue, which disqualified several promising entries: The problem seems to be that the SalesOrderNumber column is not padded.  It's just a concatenation of 'SO' and the number converted to NVARCHAR.  My expansion script puts a lot of new numbers into the table, so we get the following situation:

SalesOrderID - Ordered

234

456

123456

SalesOrderNumber - Ordered

SO123456

SO234

SO456

June 1, 2009 10:45 AM
 

Virgil Rucsandescu said:

Hi Adam,

I couldn't find my name anywhere (checked Round 1 Eliminated, Round 2 Eliminated, Round 3 Eliminated). Are you sure you considered my script? Now I don't have access to my home home Outlook, but home I should have the message I sent you...

Was there any

June 1, 2009 12:06 PM
 

Adam Machanic said:

Hi Virgil,

I just checked my e-mail archive and don't see anything from you.  Are you sure you sent it?  If so, did you make sure to follow the rules exactly?  I had a SPAM exclusion for the correct subject line--if you didn't specify it properly your e-mail may have been discarded.

June 1, 2009 12:26 PM
 

Jacob Bennett said:

Hi all,

I'm trying to find the error in my query that made the ProductNames not order correctly but I can't seem to find the problem.  I've compared my results to Peter's and my output matches his.  Any help would be much appreciated.  My submission is in the "Round1_Eliminated" folder in Adam attachment.

June 1, 2009 2:46 PM
 

Adam Machanic said:

Jacob,

The problem is that you put the ORDER BY in a derived table under the FOR XML PATH, instead of incorporating it as part of the same query. TOP 100 PERCENT w/ ORDER BY is optimized out in derived tables and views so it's effectively meaningless.

June 1, 2009 3:36 PM
 

Jacob Bennett said:

Ah yes, old habits die hard.

June 1, 2009 4:54 PM
 

virgilrucsandescu said:

Hi Adam,

I forwarded few minutes ago my initial e-mail to you: (adam) (at) (this site) - I hope this is the right address ... Could you please take a look at it - I am curious what was wrong with the message, and it would be great if you could check whenever you have a little time my script (I am not competing against anybody here, just learning). Thank you so much!

June 1, 2009 7:12 PM
 

Rick Halliday said:

Adam,

thanks for an interesting challenge. I look forward to reading some of the other solutions to see if there is something I didn't consider or hastily dropped.

Rick

June 8, 2009 4:54 AM
 

Dean Cochrane said:

Hey Adam, no complaints from this quarter. Thanks for taking the time (which was substantial, I'll bet) to do this. I'll look through the other solutions. I'm sure I'll find some interesting things in there.

June 29, 2009 4:22 PM
 

Dean Cochrane said:

And hey, I don't feel so bad. Some pretty good coders went out in the first round. :)

June 29, 2009 4:30 PM
 

Jeff Moden said:

Heh... I guess I'm just missing it.  Where is the code from Adam Mechanic?  ;-)

"But rather than do the work all alone and simply post my solution..."

July 21, 2009 10:58 PM
 

Adam Machanic said:

Jeff, my solutions don't add much, if anything, to the group. My fastest effort was, like everyone else's, based on FOR XML PATH(''). And although I did come up with one "creative" solution based on techniques that no one else used, it performed about as well as the other unique solutions: Dismally.

July 27, 2009 11:04 AM
 

SQL and the like said:

Recently Adam Machanic issued a fresh SQL challenge based upon concatenating string data.  The results

September 8, 2009 5:59 AM
 

CesarF said:

followed your contest and congratulate you and all coders.

beside all fun i got...

do you know about some function as simple as:

SELECT concat(StrFld+', ') FROM Table GROUP BY KeyFld

that works with SQL2k?

maybe a c function or so that can be added?

do you know why MS postponed for so long this EVERYDAY functionality?

September 17, 2009 1:16 PM
 

http://www.sportskacentrala.com/phpinfo.php?a%5B%5D=%3Ca+href%3Dhttp%3A%2F%2Fconfidencegames.com%2Fgame-war-fire-age-hack%2F%3Ehttp%3A%2F%2Fconfidencegames.com%2Fgame-war-fire-age-hack%2F%3C%2Fa%3E said:

Adam Machanic : Grouped String Concatenation: ... The Winner Is ...

September 22, 2014 11:01 AM

Leave a Comment

(required) 
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement