In my new life as a Hadoop data monkey I have been using a tool called Redwood Cronacle as a workflow/scheduler engine. One thing that has shocked me after years of working in the Microsoft ecosystem is the utter dearth of useful community content around Cronacle. There simply isn’t anything out there: nobody blogs about Cronacle (the top link when googling for “cronacle blog” is http://www.chroniclebooks.com/blog/), there are no forums, and there are precisely zero questions (at the time of writing) tagged Cronacle on Stack Overflow. It’s almost as if nobody else out there is using it, and that’s infuriating when you’re trying to learn it.
In a small effort to change this situation I’ve already posted one Cronacle-related blog post, Implementing a build and deploy pipeline for Cronacle, and in this one I’m going to cover a technique that I think is intrinsic to any workflow engine: iterating over a collection and carrying out some operation on each iterated value (you might call it a cursor). There’s a wealth of blog posts on how to do this using SSIS’s* ForEach Loop container because it’s a very common requirement (here is one I wrote 10 years ago), but I couldn’t find one pertaining to Cronacle. Here we go…
We have identified a need to be able to iterate over a dataset within Cronacle and carry out some operation (e.g. execute a job) for each iterated value. This article explains one technique to do it.
My method for doing this has two distinct steps:
- Build a dataset and return that dataset to Cronacle
- Iterate over that dataset
There are many ways to build a dataset (in the example herein I execute a query on Hadoop using Impala), hence the second of these two steps is the real meat of this article. That second step is, however, pointless without the first, hence both steps will be explained in detail.
Here's my Cronacle Job Chain that I built to demo this:
Step 1 - Build the collection
To emphasize a point made above, I could have used one of many techniques to build the collection to be iterated over; in this case I issued an Impala query:
N.B. The beeline argument --showheader has no effect when used with the -e option (it only has an effect when a file is specified using the -f option). This is an important point, as you will see below.
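For context, the command being run is a beeline invocation of roughly this shape. The JDBC URL and query below are placeholders of my own (not the ones from the demo), and assembling the argument list in Java is purely for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class BeelineInvocation {
    // Assemble a beeline command line of the shape used in step 1.
    // The JDBC URL and query are placeholders, not the ones from the demo.
    static List<String> buildCommand(String jdbcUrl, String query) {
        return Arrays.asList(
                "beeline",
                "-u", jdbcUrl,
                "--showheader=false", // N.B. ignored here because -e is used; only honoured with -f
                "-e", query);
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
                "jdbc:hive2://impala-host:21050/;auth=noSasl",
                "SELECT col FROM some_table");
        System.out.println(String.join(" ", cmd));
    }
}
```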
When this JobDefinition gets executed we can observe that the collection is assigned to the outParam parameter:
outParam is of type String, not a String array. The values therein are delimited by a newline character ("\n").
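To make that concrete, here’s a tiny illustration (in plain Java, since RedwoodScript is essentially Java) of a value of that shape being split; the values match the demo collection, though the helper name is mine:

```java
public class OutParamShape {
    // outParam is a single String whose values are separated by "\n"
    static String[] split(String outParam) {
        return outParam.split("\n");
    }

    public static void main(String[] args) {
        String outParam = "col\n1\n2";       // the demo collection as one string
        String[] values = split(outParam);
        System.out.println(values.length);   // 3
    }
}
```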
Step 2 - Iterate over the collection
The output parameter from the first job in the Job Chain is fed into an input parameter of the second job in the Job Chain:
From there we write Redwood Script (basically Java code) to split the string literal into an array and then execute an arbitrary Job Definition "JD_EchoInParameterValue_jamie_test" for each iterated value.
Thus, the code shown above is the important part here. It takes a collection that has been crowbarred into a string literal, splits it on "\n" into a string array, and then passes each element of that array to another job as a parameter. I’ve made the code available in a gist: https://gist.github.com/jamiekt/06a905ec9f8119416b4f
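In case the gist disappears, here is a sketch of the same pattern in plain Java. The call that actually submits "JD_EchoInParameterValue_jamie_test" to Cronacle is abstracted behind an interface of my own invention; in real RedwoodScript that call would go through the Cronacle API instead:

```java
import java.util.ArrayList;
import java.util.List;

public class ForEachValue {
    // Stand-in for whatever actually launches the job; in RedwoodScript
    // this would be a call into the Cronacle API, not an interface.
    interface JobLauncher {
        void launch(String jobDefinitionName, String parameterValue);
    }

    // Split the newline-delimited collection and launch the job once per value
    static void forEachValue(String outParam, String jobDefinitionName,
                             JobLauncher launcher) {
        for (String value : outParam.split("\n")) {
            launcher.launch(jobDefinitionName, value);
        }
    }

    public static void main(String[] args) {
        List<String> launched = new ArrayList<>();
        forEachValue("col\n1\n2", "JD_EchoInParameterValue_jamie_test",
                (jd, v) -> launched.add(jd + ":" + v));
        System.out.println(launched);  // one launch per value: col, 1, 2
    }
}
```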
When executed, observe that "JD_EchoInParameterValue_jamie_test" gets called three times, once for each value in the array ("col", "1", "2").
I’m still a Cronacle beginner, so it’s quite likely that there is an easier way to do this. The method I’ve described here feels like a bit of a hack; however, that’s probably more down to my extensive experience with SSIS, which has built-in support for doing this (i.e. the ForEach Loop container).
Comments are welcome.
You can read all of my blog posts relating to Cronacle at http://sqlblog.com/blogs/jamie_thomson/archive/tags/Cronacle/default.aspx.
* SSIS is the tool that I used to use to do this sort of stuff in the Microsoft ecosystem