SQLblog.com - The SQL Server blog spot on the web

Linchi Shea

Checking out SQL Server via empirical data points

Parallelism beyond T-SQL

People can do amazing things with T-SQL alone even when it may not be the best language to use. For instance, T-SQL was never designed or intended as a language to manage running multiple queries concurrently or running queries asynchronously. But I have seen many stored procedures written to start SQL Agent jobs and then manage the termination of these jobs by constantly polling the job status. In effect, these stored procedures leverage SQL Agent jobs to execute T-SQL code in multiple threads in parallel or execute T-SQL code asynchronously.

 

For some tasks, this approach works fine and is manageable. But in general, if you need to run multiple queries in parallel, either inside the same SQL Server instance or across multiple SQL Server instances, T-SQL is not the best choice of language.

 

Let’s look at the case of a typical data archiving task where you have multiple tables to archive. For each of these tables, you may need to (1) determine what must be archived, (2) copy the data to be archived from the table into an archive table, and (3) delete the data from the table. If you write everything in T-SQL, typically you’d loop through the tables and perform the above logic for each table in turn. This is obviously ‘single-threaded’. Although any of the queries may be processed with multiple threads inside the SQL Server engine because of the query parallelism support, the overall process is single threaded and synchronous in that step N+1 can only start when step N is finished. No doubt there is great potential to reduce the total batch processing time, if we can run some of these steps in parallel.
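The single-threaded shape of that loop can be made concrete with a small sketch. This is not the post's actual T-SQL; it is an illustrative Python rendition with simulated table names and step functions, showing how step N+1 for a table cannot start until step N finishes, and how the tables are processed strictly one after another:

```python
# Hypothetical sketch of the single-threaded archive loop described above.
# Table names and the three step functions are illustrative, not real queries.
tables = ["Orders", "OrderDetails", "Invoices"]

def find_rows_to_archive(table):
    # Step 1: determine what must be archived (simulated).
    return [f"{table}-row-{i}" for i in range(3)]

def copy_to_archive(table, rows):
    # Step 2: copy the data to be archived into an archive table (simulated).
    return len(rows)

def delete_archived(table, rows):
    # Step 3: delete the archived data from the source table (simulated).
    return len(rows)

def archive_table(table):
    rows = find_rows_to_archive(table)       # step N+1 waits for step N...
    copied = copy_to_archive(table, rows)
    deleted = delete_archived(table, rows)
    return copied, deleted

# ...and each table waits for the previous table: fully sequential.
results = {t: archive_table(t) for t in tables}
```

The total batch time is the sum of every step for every table, which is exactly the potential saving the rest of the post is after.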

 

You could use the approach described earlier by creating multiple SQL Agent jobs, perhaps one for each table, and all your main procedure needs to do is start these jobs and wait for their completion. In such a case, the data archive task is performed in multiple threads in parallel, one for each table. Even if the tables are scattered across multiple servers, you could still use this approach. You just have to configure some linked servers and start the SQL Agent jobs on their corresponding servers.

 

Is there any problem with this approach? As mentioned, for some tasks this is a perfect approach. Plus, the beauty is that everything is done in T-SQL only, and that’s right in your comfort zone. But then, that’s also where the problem is for some other tasks where more granular control is a must and interacting with non-SQL Server components may be necessary.

 

Let me continue with the data archiving task. More granular control can come in two areas: (1) concurrency or parallelism throttle, and (2) exception handling.

 

Except in the case where the number of tables is very small and does not change, you would most likely want to throttle the degree of concurrency dynamically instead of simply running one SQL Agent job per table. Though doable, it would be very ugly to dynamically create SQL Agent jobs and assign each job the tables to process. Moreover, for certain tasks, the overhead of creating, executing, and removing SQL Agent jobs just isn't justified. A case in point: you have a large number of tables scattered across half a dozen servers, and you want to count, for each table, how many rows are to be archived.
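In a general-purpose language, that concurrency throttle is a one-line setting rather than a pile of dynamically created jobs. Here is a minimal sketch (again in Python for brevity; the table list and the per-table work function are placeholders for your real archive queries) where a thread pool's worker count is the throttle:

```python
# A hypothetical throttled controller: archive_table stands in for the real
# per-table archive work (e.g., the queries against whichever server hosts it).
from concurrent.futures import ThreadPoolExecutor

def archive_table(table):
    # Placeholder for the real archive steps for one table.
    return f"{table}: archived"

tables = [f"Table{i}" for i in range(20)]

# Throttle: at most 4 tables are archived concurrently, no matter how many
# tables there are or how the list changes between runs.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(archive_table, tables))
```

Changing the degree of concurrency is just changing `max_workers`; no jobs to create, track, or clean up.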

 

For exception handling, T-SQL offers very little support when you rely on SQL Agent jobs to achieve parallelism. Other than polling the job status and checking some status tables that the job is expected to update, your main T-SQL procedure has little control over a SQL Agent job once the job has started running. If the task can be partitioned such that each SQL Agent job handles its own errors independently, you are fine. Otherwise, this approach can become quite clumsy.
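By contrast, a controller written in a general-purpose language sees each worker's failure directly and can decide to retry, skip, or abort the batch. A sketch of that per-task exception flow (the simulated failure and table names are illustrative only):

```python
# Hypothetical per-task exception handling: each worker's exception surfaces
# in the controller through Future.result(), instead of being buried in a job.
from concurrent.futures import ThreadPoolExecutor

def archive_table(table):
    if table == "BadTable":
        raise RuntimeError(f"archive failed for {table}")  # simulated failure
    return f"{table}: ok"

tables = ["GoodTable", "BadTable", "OtherTable"]
succeeded, failed = [], {}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(archive_table, t): t for t in tables}
    for future, table in futures.items():
        try:
            succeeded.append(future.result())  # re-raises any worker exception
        except RuntimeError as exc:
            failed[table] = str(exc)           # controller decides what's next
```

One table's failure does not silently take down or stall the others, and no polling of status tables is involved.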

 

So what’s the point? The point is that as good as it is to stay within one’s comfort zone that is T-SQL, for some tasks T-SQL is not the best language even if you can manage to accomplish the tasks with T-SQL alone. Tasks that require parallelism or that can benefit from parallelism (beyond SQL Server engine’s query parallelism) may be better accomplished if you venture beyond T-SQL into a robust programming language, such as C#, that truly supports multi-threading, and use it (instead of T-SQL) to construct the main controlling mechanism for your tasks.

 

Granted that when you only have a wrench, you may not have a choice but to use it as a hammer. But I submit that you’d be far better off if you actually have a hammer in your toolbox! Then again, if you live in a wrench friendly but hammer hostile environment, you could be in quite a quandary.

Published Monday, August 02, 2010 10:28 AM by Linchi Shea


Comments

 

Adam Machanic said:

Agreed 100%, Linchi. Stay tuned :-)

August 2, 2010 10:14 AM
 

Davide Mauri said:

For parallelism, using Service Broker can be a good balance between simplicity and effectiveness, don't you think?

(I mean, creating a multi-threaded solution in C# is anything but simple!) :)

August 2, 2010 12:16 PM
 

Remus Rusanu said:

You can leverage the SSB self-balancing capabilities of Activation. Have a look at http://rusanu.com/2009/08/05/asynchronous-procedure-execution/ and http://rusanu.com/2009/08/18/passing-parameters-to-a-background-procedure/. For exception handling, I wish I'd have an easy answer...

August 2, 2010 12:24 PM
 

Linchi Shea said:

I'm not advocating writing a multi-threaded app in C# as the only approach. The main point is that T-SQL is quite limited in this regard, and it's good to have alternative tools available to you. Of course, there are many alternatives. Exactly which one to use depends on many factors, not least what you are already familiar with and the nature of the specific task at hand.

August 2, 2010 12:36 PM
 

Chad Miller said:

+1 -- Agree 100%. I understand you're not proposing a one-size solution of using C#, but since you mentioned it...

I would suggest looking at PowerShell. There are several ways to parallelize a process in PowerShell (background jobs, launching multiple processes, remoting, System.Threading). Some parallel PowerShell solutions are just as hokey as using T-SQL; however, the following PowerShell script looks interesting and is similar to a C# System.Threading implementation:

http://www.nivot.org/2009/01/22/CTP3TheRunspaceFactoryAndPowerShellAccelerators.aspx

August 2, 2010 2:22 PM
 

Michael Zilberstein said:

Agree 100%. The least ugly near-T-SQL solution I've thought of is writing multi-threaded C# code and executing it inside SQL Server as a CLR stored procedure. Haven't tried it yet.

August 2, 2010 3:43 PM
 

Phani said:

A consolidated list of these options would be a great output from this blog; it would benefit folks like me who would give them a try. Please help.

August 3, 2010 12:59 AM
 

DM Unseen said:

I have some serious second thoughts about this. I think the issue here lies solely in the limitations of (T-)SQL and not in relational languages per se, since multiple assignment (see C. J. Date's articles, http://portal.acm.org/citation.cfm?id=1808085) solves these kinds of issues quite neatly and consistently, far better than any C# implementation would, since the updates would also be transactionally consistent with each other as well as (when possible) executed in parallel.

August 3, 2010 3:51 AM
 

GrumpyOldDBA said:

Mind you this can typically be more tricky when there are foreign key constraints in place. I agree with you completely, the issue of archiving data from a production database can be quite painful, especially as the volumes to be archived grow and even more so when your system is 24 x 7.

August 4, 2010 3:16 AM
 

HowardW said:

I'm surprised to see no mention of SSIS here. It would seem the natural choice given the example and has excellent support for running tasks in parallel...

August 4, 2010 9:06 AM
 

Niels Grove-Rasmussen said:

I do agree - T-SQL is not a golden hammer. Especially for system administration there are more suitable tools like PowerShell, C#, or C++. Please keep up the brilliant work, Linchi.

August 5, 2010 3:47 AM
 

Mike Honey said:

I'm with Howard - why reinvent the wheel?

SSIS gives complex control flow including parallelism, dependency constraints, loops (even parallelism within loops), logging, and configurable cross-server deployment, all with a low learning curve.

For the requirement you described above, you need only add some "Execute SQL" tasks, copy and paste your SQL, then connect the arrows to arrange your control flow for any dependencies. The SSIS engine will adjust the degree of parallelism at runtime based on available CPUs and manage all the dependencies and parallelism for you.

BTW, SSIS has been a SQL Server component for oh, 5 years now ...

August 5, 2010 8:40 PM

