THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Dumb querying, Impala vs Hive

I alluded yesterday that I wouldn’t be posting much on this blog for the foreseeable future but I’ve come across something today that was quite interesting so I’m already back.

I’m using Hadoop and using Hive (version 0.12) to query the contents of some tables. Here’s the output from issuing “SELECT COUNT(*) FROM tablename”

2015-01-07_10-19-04

Your eyes don’t deceive you, that’s 223seconds to determine that this table has 0 rows in it. That staggered me. A colleague suggested I try the same query using Impala instead:

2015-01-07_10-26-07

Same result, but this time it came back in <1second.

What’s the moral here? I’m not really sure. I won’t go out on a limb and say “Impala is quicker” because all I’ve proved is that its quicker to count the number of rows in an empty table which is a fairly pointless exercise. Hive is designed for running operations in a batch manner over massive data sets, nonetheless I think its interesting and I suspect some of my buddies in the SQL server community (whom I suspect will be the only ones reading this) will find it interesting also. Note that Hive v0.12 is old (came out in October 2013) and also that the Stinger initiative is in full flow which promises much much faster response times to Hive queries like these:

image

@Jamiet

Published Wednesday, January 7, 2015 3:31 PM by jamiet
Filed under: , ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

joe said:

I suspect/guess that's because even though it's empty a job is still launched along with a map task and the time you see is due to the task launching and completing. Impala doesn't convert your query into an MR job and so it doesn't incur the same overhead.

January 7, 2015 2:55 PM
 

jamiet said:

Yep, spot on Joe. Even the exec of the MR part took 23s. This shouldn't be a surprise of course, just interesting to see the difference exhibited so obviously.

January 7, 2015 3:06 PM
 

Shrikant said:

I just ran a count(*) on a 220k row table and it ran in ~18 seconds.

http://i.imgur.com/PdnZeLv.png

Looks like Stinger delivered in v0.13!

February 20, 2015 9:18 AM
 

jamiet said:

Yup. We got 0.13 just this morning in fact. Running count(*) on a ~9000 row table, 35s. Much better. Still not in Impala territory though.

February 20, 2015 9:24 AM
 

jamiet said:

Slightly more impressively it just took 20seconds to tell me one of our tables has got 321842374 in it (that's ~321m). Not bad.

Impala took slightly more than a second.

February 20, 2015 10:00 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Privacy Statement