THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Fetch once, write many times

As I’ve said before these days I’m spending time working on Hadoop rather than SQL Server and one of my colleagues today alerted me to a rather cool feature Of Hive (in a nutshell Hive is the technology that makes enables us to write SQL statements against Hadoop). Check this out:

SNAGHTML16b7b5a7

Pretty cool, no? From a single tablescan I can summarise and insert into two destinations. Its the capability of a SSIS dataflow, but in SQL. If I ever make it back to the SQL Server world I’ll be requesting this feature in T-SQL pronto.

@Jamiet

Published Tuesday, May 12, 2015 10:21 PM by jamiet
Filed under: , ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

mbourgon said:

May 12, 2015 5:44 PM
 

jamiet said:

Hi mike,

Hmmm, well, not really (IMO). Composable DML is definitely useful but it doesn't allow as many scenarios as the feature I mention above. For example, as I talk about at http://sqlblog.com/blogs/jamie_thomson/archive/2009/08/31/exploring-composable-dml.aspx, you can't do a GROUP BY over the data in the output clause, thus it's impossible to achieve the same as what is achieved in the example in my blood post above.

The biggest benefit that I see of the feature above is to aggregate your data to different granularities while only making one pass of the source data. You can't do that with Composable DML.

Thoughts?

May 12, 2015 6:04 PM
 

jamiet said:

Oh, and Composable DML allows insertion into a maximum of two tables. Looking at the syntax of this hive feature I see no reason why it wouldn't support more than that (though I'd have to check it out)

May 12, 2015 6:12 PM
 

mbourgon said:

Doh!  I hadn't seen that post of yours.  That'll teach me!  : )

May 12, 2015 6:17 PM
 

Adam Machanic said:

You should be able to achieve this with grouping sets. Though multiple passes over the data will effectively be made, by virtue of a spool. How does Hadoop actually process the data in this case? Is it truly doing only a single pass?

May 13, 2015 4:19 AM
 

Adam Machanic said:

Actually perhaps I misunderstood the syntax. You have two inserts, but are mike1 and mike2 different tables? In that case grouping sets is not directly the answer. But I think it still might be doable with some combination of grouping sets, composable DML, the MERGE statement (which allows special OUTPUT clause games to be played), and some creativity. The result would be ugly...but it's an interesting challenge.

May 13, 2015 4:22 AM
 

jamiet said:

Yep, took the words right of my mouth. The two destinations are different tables so grouping sets won't do the same job.

I'd be interested if the same could be achieved using the stuff you rattled off there. Also note what i said to Mike, i suspect this Hive syntax might support more than two destinations, I doubt that'd be possible via creative use of OUTPUT.

"How does Hadoop actually process the data in this case? Is it truly doing only a single pass?"

I believe so. Thought I guess it depends on your meaning of "a single pass". To my mind it would only need to pull the data off disk once but clearly there would need to be multiple passes of that extracted data in order to achieve multiple grains of aggregation (akin to the spool that you mentioned). The real answer is "I don't know right now". Gonna be fun looking into this in more detail. If nothing else its a nice, declarative, self-explanatory syntax.

May 13, 2015 4:52 AM
 

SK said:

By this time, you might have got answer.  

I think, since each INSERT create separate data file, it can run multiple INSERT INTO can execute simultaneously without filename conflict. to read and aggregate at different grain, its map-reduce job under the hood which can have multiple mappers and reducers running in parallel reading from same data source.

December 8, 2015 1:11 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Privacy Statement