Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Parallel Foreach loops, one reason to use Powershell Workflow instead of SSIS

Lately I’ve been using Azure Automation, which is a service on Azure for running Powershell Workflow scripts, and as such I’ve been diving deep into Powershell Workflow. What’s Powershell Workflow? I’m glad you asked:

A workflow is a sequence of programmed, connected steps that perform long-running tasks or require the coordination of multiple steps across multiple devices or managed nodes. Windows PowerShell Workflow lets IT pros and developers author sequences of multi-device management activities, or single tasks within a workflow, as workflows. By design, workflows can be long-running, repeatable, frequent, parallelizable, interruptible, stoppable, and restartable. They can be suspended and resumed; they can also continue after an unexpected interruption, such as a network outage or computer restart.
Getting Started with Windows PowerShell Workflow

So Powershell Workflow does exactly what it says on the tin: it’s a workflow engine. Those of you that, like me, have been knee-deep in SQL Server Integration Services (SSIS) for most of the past decade will know that SSIS too has its own workflow engine – it’s got a different name, Control Flow, but it’s still a workflow engine. One frequent ask of SSIS’s Control Flow is a Parallel ForEach Loop, but there seems little hope that we’re going to get one (raised on Connect in March 2014, closed as Won’t Fix 5 months later) without going to third-party solutions like Cozyroc’s Parallel Loop Task.

As the title of this post has already told you, Powershell Workflow has a big advantage over SSIS: it includes a Parallel Foreach Loop out-of-the-box, and you can read all about it at about_Foreach-Parallel. Or type “help about_Foreach-Parallel” into your nearest Powershell window. From there:

The Parallel parameter of the ForEach keyword runs the commands in a ForEach script block once for each item in a specified collection.
The items in the collection, such as a disk in a collection of disks, are processed in parallel. The commands in the script block run sequentially on each item in the collection.

That’s fairly self-explanatory. If you’re interested in knowing more, I’ve put together a cool little Powershell Workflow demo that makes calls to a URI, first serially (using a conventional Foreach Loop) then parallelly (which isn’t, I’m pretty sure, a real word but I’m using it anyway), and hosted it as a GitHub gist: Parallel-vs-Serial.ps1. You can simply copy-paste the script into Powershell ISE, hit F5 and you’re away. On my machine the serial calls completed in 19 seconds, the parallel calls in 13 seconds. No-brainer!
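If you just want to see the shape of the thing without grabbing the gist, here is a minimal sketch of the same idea. To be clear, this is not the gist itself: the workflow names are made up and Start-Sleep stands in for the calls to a URI, but the serial-versus-parallel contrast is the same.

workflow Invoke-SerialDemo {
    foreach ($i in 1..5) {
        Start-Sleep -Seconds 2   # stand-in for a slow call to a URI
        "Serial: item $i done"
    }
}

workflow Invoke-ParallelDemo {
    foreach -parallel ($i in 1..5) {
        Start-Sleep -Seconds 2   # same stand-in call; iterations run in parallel
        "Parallel: item $i done"
    }
}

# Time both from a normal console/ISE session
(Measure-Command { Invoke-SerialDemo }).TotalSeconds
(Measure-Command { Invoke-ParallelDemo }).TotalSeconds

The serial workflow should take roughly the sum of the sleeps, the parallel one roughly a single sleep plus workflow overhead, which is the same effect the gist shows with real web calls.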

Take a look if you have 5 minutes. This is good stuff.

@Jamiet 


Published Tuesday, December 9, 2014 4:59 PM by jamiet


Comments

 

Frank Szendzielarz said:

Isn't Powershell Workflow basically WWF? I think I even wrote to you on this blog, way back when, that I saw Control Flow being supplanted by WWF. I would like to see WWF come with a library of Data Flow custom activities.

However, there is an even more performant alternative that is less familiar, even amongst the C# crew: TPL Dataflow http://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx

Some tests put it at higher performance than SSIS with native data access. Don't have links to hand at the mo.

December 9, 2014 7:20 PM
 

jamiet said:

Hi Frank,

Yes, Powershell Workflow is an abstraction over WF (i.e. Workflow Foundation). There was talk in the SQL Server 2008 timeframe of SSIS's Control Flow being replaced by WF but it never happened, and probably never will.

I always thought TPL Dataflow was analogous to SSIS's Dataflow rather than Control Flow (and not just because of the name). I followed the link you posted and saw these quotes:

"The TPL Dataflow Library consists of dataflow blocks, which are data structures that buffer and process data."

"The TPL defines three kinds of dataflow blocks: source blocks, target blocks, and propagator blocks. A source block acts as a source of data and can be read from. A target block acts as a receiver of data and can be written to. A propagator block acts as both a source block and a target block, and can be read from and written to."

Sounds very much like a data flow to me, not Control Flow. Please do correct me if I'm wrong.

TPL Dataflow has been on my list of stuff to look at for years; unfortunately, that's a long list.

Thanks for the comment.

JT

December 10, 2014 2:54 AM
 

Frank Szendzielarz said:

Hi Jamie

Yes, you are right, TPL Dataflow is more like the SSIS Data Flow. I think that sometimes a decision-making process of 'SSIS or C#' happens and this technology could tip designers in favour of the latter, and then you could have a fully WF or fully C# scenario. But yeah, it doesn't have much to do with Azure Automation I suppose.

I worked with WF for quite a while and, if I remember rightly, ParallelForEach did not actually run the child activities in parallel on separate threads; I think it just differed from ForEach in that the next activity was sent to the scheduling engine if the current activity went idle. I guess this ForEach -parallel is a new WF activity specifically for PowerShell?

December 10, 2014 6:32 AM
 

jamiet said:

My understanding is that ForEach -Parallel in PowerShell Workflow is just an abstraction over WF's ParallelForEach, but I might be mistaken. I would hope that *is* what it is, otherwise there'd be some wheel reinventing going on within PowerShell Workflow and I'm not a fan of that.

Either way, my demo above proves that ForEach -Parallel can be quicker, so I'm happy to go on using it regardless of the underlying tech.

December 10, 2014 6:40 AM
 

Frank Szendzielarz said:

Thanks. If I get some time I will research how/why this works and post results. Perhaps the improvement in speed depends on the nature of the scheduled commands - if ping or web requests are issued then perhaps the activity wrapping those commands is aware that the task is idle until the network response arrives.

December 10, 2014 7:09 AM
 

Frank said:

For what it's worth, I had a dig around to understand 'why' it works, and the answer seems to be that:

-  All the Powershell WF activities derive from PSActivity in the Microsoft.PowerShell.Activities namespace (Microsoft.Powershell.Workflow.ServiceCore.dll), which is implemented as an asynchronous NativeActivity with its CanInduceIdle property set to true. The derived activities all use that base class to execute, differing only in their properties/parameters and the string name of the underlying Powershell command.

-  ParallelForEach in WF loops over child activities sequentially but continues as soon as each one goes idle or completes; because the child activities in Powershell are executed asynchronously, WF bookmarks are created and ParallelForEach does create new threads for them.

-  Timeouts are implemented behind the scenes using a WF Delay activity that causes a cancellation on completion.

December 10, 2014 9:23 AM
 

jamiet said:

Wow, awesome detective work Frank, and certainly very interesting indeed. I really appreciate you taking the time to do this and comment here with your findings.

December 10, 2014 9:29 AM
 

jeff_yao said:

Really enjoy reading all these thoughtful comments/conversations. A good supplement to the blog post itself.

December 10, 2014 1:44 PM
 

Frank said:

My pleasure. I am wracking my brains trying to think how this information might be relevant, and while I can't think of specific examples, my intuition and experience with WF says "expect gotchas".

For example, even though it has nothing to do with the above, I was working on one WF "control flow" project where incoming messages triggered workflows, and each workflow instance resulted (eventually) in a database update. This was OK until the number of incoming messages per second got so high that it exceeded the maximum connections per connection pool allowed by default in ADO.NET. It took me ages to figure out why I was getting SQL connection exceptions on an incoming MQ message bound to a WF service. In order to make this work, I had to set a throttling limit on the ParallelForEach in WF, and also increase the max connections in the connection string.

Another problem, which you will run into, is with max outbound TCP connections on some Windows servers being set to 10 by default, which exists to prevent network viruses/worms replicating too quickly. You have to run an update on the OS itself just to get some WF parallelism to work.

This is just some vague rambling from me, really, but I guess if you want to do some WF control-flow-type stuff, and your scripts are responding to incoming messages over ServiceBus or MQ, you have to watch out for some interesting things in the field of parallelism that you wouldn't normally have to deal with.
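If you hit that kind of fan-out problem in PowerShell Workflow rather than raw WF, foreach -parallel takes a ThrottleLimit parameter (from PowerShell 4.0 onwards, if I remember rightly). A minimal sketch, where the workflow name, the item list and the work inside the loop are just placeholders:

workflow Invoke-ThrottledDemo {
    param([string[]] $Items)

    # At most 4 loop iterations run at once, so downstream resources
    # (connection pools, outbound TCP connections) don't get swamped.
    foreach -Parallel -ThrottleLimit 4 ($item in $Items) {
        Start-Sleep -Seconds 1   # placeholder for the real work
        "Processed $item"
    }
}

Invoke-ThrottledDemo -Items @('a','b','c','d','e','f','g','h')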

December 10, 2014 6:43 PM
