THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Analysing SQLBlog using Power Query

In December 2011 I wrote a blog post entitled Querying RSS feed subscriber count on Google Reader using Data Explorer‏ in which I used a product called Data Explorer, that was in beta at the time, to find out how many people subscribed on Google Reader to various bloggers on http://sqlblog.com. Here were the results:

It was a nice demo of what was possible using Data Explorer however it did have some limitations, mainly that I had to type in the list of bloggers and their URLs. It would have been much more elegant to have Data Explorer request a list of bloggers from http://sqlblog.com and iterate over that list, querying for information about each in turn.

Times move on and today, 3 and a bit years later, Google Reader is defunct and Data Explorer has morphed into Power Query. It looks different and has lots of new features which make it much more conducive to garnering information from the web. I thought it might be time well spent (and kinda fun) to revisit my previous blog post and see if Power Query can collate more useful information regarding http://sqlblog.com than it could back then.


tl;dr

If you can’t be bothered reading the detail of this post then just take a look at this screenshot that I shared on Twitter earlier today of a Power BI dashboard that shows information pertaining to http://sqlblog.com:

image

This dashboard shows:

That screenshot is taken from http://powerbi.com which is a place where Power BI reports (that are based on Power Query queries) can be stored and shared. Unfortunately they can only be shared with people whose email domain is the same as yours so I can’t share a link directly to the page which is a real shame because Power BI reports are interactive and the benefit is much greater when one is able to interact with them. If you would like this situation to change (and you should) then go and vote for these requests:

The real work here goes on in the Power Query query that retrieves the data that underpins these reports. If you’d like to know more about that, read on!

The Power Query part

Power Query is, to all intents and purpose, a draggy-droppy-pointy-clicky UI over a functional language called M. The M queries that I built to obtain the data depicted above I have shared in this github gist: https://gist.github.com/jamiekt/bac21f6169a9853dcef5 or you could download as a Power BI Designer file: sqlblog.pbix. Whether you copy-and-paste the M code or you download the .pbix you will need to download and install Power BI Designer in order to run the queries and see the resultant data.

One caveat to running the queries, ensure you have created a Data Source setting on your machine for http://sqlblog.com, it should look just like this:

image

Retrieving data from a website using Power Query is essentially a web scraping exercise. Here are the steps the M query basically goes through to retrieve this information:

  SNAGHTML34123979

    • For each link in the list of monthly archives
      • retrieve the monthly archive
      • for each blog post summary in the monthly archive
        • retrieve the title and number of comments
  image

That’s pretty much it. The code looks a bit gnarly at first glance but all its essentially doing is navigating the DOM and following a few hyperlinks.

The reports

The reports are Power View reports. Power View is fairly intuitive so I don’t think its worth talking about it here too much. I do however think it is worth showing some examples of the insights that Power View can provide on top of this data.

How many blog posts and how many comments on SQLBlog

image image

Slicing by blogger

Charting all of the blog posts and ordering by number of comments is useful, that’s how we see that Marco Russo’s post DateTool dimension: an alternative Time Intelligence implementation is the most popular:

image

However if we introduce a slicer we can analyse a single blogger’s blog posts, let’s take Aaron Bertrand for example seeing as he has posted the most:

image

Aaron’s most popular blog post (going by the number of comments) is his ditty from October 2010 Fun with software : uninstalling SQL Server 2008 R2 Evaluation Edition.

How many bloggers ARE there on http://sqlblog.com

image

62!

Taking it further

There is a heck of a lot of other data available on http://sqlblog.com should one want to go and fetch it and Power Query makes it really easy to do that. It would be interesting, for example, to chart the tally of blog posts per month and see when “peak blog month” occurred. I’ll leave that as an exercise for the reader.

@Jamiet

Published Wednesday, March 25, 2015 10:05 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Andy said:

Quality blog post (as always).  I had to chuckle about the Data Explorer preview as I remember having a play around.

One of the features I really like about Power Query is the Data Catalog integration, I like the idea of certifying data sources and making them a searchable asset.  Wonder if making the Data Catalog a OneDrive feature (an exposable API?) to share these queries?

March 25, 2015 4:51 PM
 

Jason Kohlhoff said:

Good stuff, Jamie. I've done some tickering with PowerQuery, and I REALLY hope Microsoft finds a way to bake it (M) in to the next version of SSIS.

March 25, 2015 5:51 PM
 

jamiet said:

Hi Jason,

There have been a lot of discussions about that on twitter over the past 12 hours:

https://twitter.com/jamiet/timelines/580998007597858816

March 26, 2015 3:48 AM
 

jamiet said:

Hi Andy,

yeah, I'm interesed to see where they go with the Data Catalog.

March 26, 2015 3:48 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Privacy Statement