THE SQL Server Blog Spot on the Web


Piotr Rodak

if datepart(dw, getdate()) in (6, 7)
    use pubs;

Hadoop growing pains

This post is not going to be about SQL Server. Recently I have been reading more and more about “Big Data”, a very catchy term that describes the untamed growth of the data mankind produces each day and the struggle to capture the meaning of that data. Ten years ago, and perhaps even three years ago, this need was not so widely recognized. The growing number of smartphones, and the discernible trend of mainstream Internet traffic shifting to smartphone-generated traffic, mean that there is a bigger and bigger stream of information that has to be stored, transformed, analysed and perhaps monetized. The nature of this traffic makes it very difficult to fit within the boundaries of relational database engines. The amount of data makes it nearly impossible to process in relational databases within a reasonable time. This is where ‘cloud’ technologies come into play.

I just read a good article about the growing pains of Hadoop, which has become one of the leading players in the distributed processing arena within the last year or two. In it, Toby Baer concludes that the lack of enterprise-ready toolsets hinders Hadoop’s adoption in the enterprise world. While this is true, something else drew my attention. According to the article, there are already about half a dozen commercially supported distributions of Hadoop. For me, as someone who has not been involved in the intricacies of the open-source world, this is quite an interesting observation. On one hand, it is good that there is competition, as in the end it benefits the customer. On the other hand, the customer is faced with the difficulty of choosing the right distribution. In the future, when Hadoop distributions fork even more, this choice will be even harder. The distributions will have overlapping sets of features, yet will be quite incompatible with each other. I suppose it will take a few years until leaders emerge and the market begins to resemble what we see in the Linux world. There are myriads of distributions, but only few are acknowledged by the industry as enterprise standard. Others are honed by bearded individuals with too much time to spend.

In any case, the third thing I can’t help but notice about the proliferation of Hadoop distributions is that IT professionals will have jobs.


Published Thursday, June 21, 2012 10:45 PM by Piotr Rodak



Geoff said:

Relational databases work fine. You just need something like Vertica, Netezza, etc.

Older database technologies like Oracle and SQL Server are not up to the task but others are. (And that's OK.)

June 22, 2012 12:36 PM

Alexander Kuznetsov said:

I frequently use open source, as well as roll out my own tools whenever it makes sense. As such, my perspective is different.

Of course, the "difficulty of choosing the right distribution" is a good point. Yet the alternatives might not be any easier. If what I need is not exactly standard, then getting things done with closed source systems may be slow, painful, and inefficient.

Consider, for example, how much time and effort Itzik Ben-Gan has spent on promoting OLAP functions: writing blog posts, articles and such. I guess it would be many times easier to just develop those functions.

This is why we are using our own tools to unit test stored procedures, and for ETL, and for many other tasks. Surely I could spend a lot of time creating Connect items and such, but for me it was so very much easier to just do it the way I need, get tools that work the way I want, and enjoy my work, instead of struggling with the tool that does not do what I want.

Regarding "There are myriads of distributions, but only few are acknowledged by the industry as enterprise standard" - why should I care if it is standard or not? If it works really well for me, saving me a lot of time every day, then I can just concentrate on doing productive and interesting work.

So I am with you on "individuals with too much time to spend" - if you have good tools you do have more time. If you do not have to worry how your changes affect others, you may have your tools fit your needs perfectly well, and just enjoy.

Conclusion: we do not have to always agree on standards. Faced with different challenges, we may choose different solutions. It may be dramatically cheaper to roll out your own solution than to struggle with the one that is not a good fit for your problem.

June 22, 2012 5:00 PM

Alexander Kuznetsov said:

I wanted to provide one more example of a typical open-sourcey approach. Ayende used to struggle getting things done with SSIS, and wrote a good description of his pain points:

Instead, he developed his own library: Rhino ETL

He wrote the following: "Rhino ETL was born out of a need." Yes, Rhino ETL was not born because someone had too much time to spend.

June 22, 2012 5:34 PM

Piotr Rodak said:

Thanks guys for your comments.

First of all, Alex, my post wasn't intended to complain about open source software. My general point was that Hadoop is in its infancy and there will be a lot of streams of its development in the future. I remember when a friend of mine showed me Linux in the early nineties. Who would have known then that Linux would gain such a strong position in the world of commercial enterprise software? I remember I watched my friend typing obscure command lines and I was mostly impressed by the ability to switch between command consoles :).

I think the same will happen to Hadoop, just much quicker - the pace of life seems to have accelerated tremendously since the early nineties. We will see distributions rise and fall, along with the companies that promote them. But eventually we will see a few distributions emerge from the 'chaos' that will provide the value the business is looking for. I agree that you might not care if something is standard or not, but from an enterprise's point of view it matters more. Software should have, I would say, 'representation', so that support can be secured, SLAs agreed and the development path acknowledged. Industry standards help to reduce risk, and this is what enterprises are interested in.

Geoff, I do know that there are solutions in the RDBMS world that handle huge amounts of data very well. In fact, I think that from a very high-level point of view, the data appliance architectures (shared nothing) and the Hadoop architecture are similar in principle. Both distribute the workload amongst multiple nodes and then collect the results of the calculations. And that's OK. I believe that both approaches, relational and non-relational, have and will continue to have their applications, although at the moment I see that more and more problems are considered 'non-relational', which in my opinion is not always correct.
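
To illustrate the principle the two architectures share, here is a minimal sketch of the scatter/gather idea. It is not how Hadoop or any data appliance is actually implemented: the function names (partition, local_count, scatter_gather) are made up for this example, the word-count workload is arbitrary, and threads merely stand in for separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Split rows round-robin into node_count shards; no row is shared."""
    return [rows[i::node_count] for i in range(node_count)]


def local_count(shard):
    """Work done on one 'node': count words using only that node's shard."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def scatter_gather(rows, node_count=3):
    """Scatter shards to workers, then gather and merge the partial results."""
    shards = partition(rows, node_count)
    # Assumption: threads simulate independent nodes for illustration only.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_count, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step performed by the coordinator
    return total


rows = [
    "hadoop stores big data",
    "relational engines store data",
    "big data needs many nodes",
]
totals = scatter_gather(rows)
print(totals["data"])  # 3
print(totals["big"])   # 2
```

The merged result does not depend on how many nodes the rows were spread across, which is what lets both kinds of system add nodes to handle more data.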

June 25, 2012 8:09 PM