In December 2011 I wrote a blog post entitled Querying RSS feed subscriber count on Google Reader using Data Explorer in which I used a product called Data Explorer, that was in beta at the time, to find out how many people subscribed on Google Reader to various bloggers on http://sqlblog.com. Here were the results:
It was a nice demo of what was possible using Data Explorer however it did have some limitations, mainly that I had to type in the list of bloggers and their URLs. It would have been much more elegant to have Data Explorer request a list of bloggers from http://sqlblog.com and iterate over that list, querying for information about each in turn.
Times move on and today, 3 and a bit years later, Google Reader is defunct and Data Explorer has morphed into Power Query. It looks different and has lots of new features which make it much more conducive to garnering information from the web. I thought it might be time well spent (and kinda fun) to revisit my previous blog post and see if Power Query can collate more useful information regarding http://sqlblog.com than it could back then.
If you can’t be bothered reading the detail of this post then just take a look at this screenshot that I shared on Twitter earlier today of a Power BI dashboard that shows information pertaining to http://sqlblog.com:
This dashboard shows:
That screenshot is taken from http://powerbi.com which is a place where Power BI reports (that are based on Power Query queries) can be stored and shared. Unfortunately they can only be shared with people whose email domain is the same as yours so I can’t share a link directly to the page which is a real shame because Power BI reports are interactive and the benefit is much greater when one is able to interact with them. If you would like this situation to change (and you should) then go and vote for these requests:
The real work here goes on in the Power Query query that retrieves the data that underpins these reports. If you’d like to know more about that, read on!
The Power Query part
Power Query is, to all intents and purpose, a draggy-droppy-pointy-clicky UI over a functional language called M. The M queries that I built to obtain the data depicted above I have shared in this github gist: https://gist.github.com/jamiekt/bac21f6169a9853dcef5 or you could download as a Power BI Designer file: sqlblog.pbix. Whether you copy-and-paste the M code or you download the .pbix you will need to download and install Power BI Designer in order to run the queries and see the resultant data.
One caveat to running the queries, ensure you have created a Data Source setting on your machine for http://sqlblog.com, it should look just like this:
Retrieving data from a website using Power Query is essentially a web scraping exercise. Here are the steps the M query basically goes through to retrieve this information:
- For each link in the list of monthly archives
- retrieve the monthly archive
- for each blog post summary in the monthly archive
- retrieve the title and number of comments
That’s pretty much it. The code looks a bit gnarly at first glance but all its essentially doing is navigating the DOM and following a few hyperlinks.
The reports are Power View reports. Power View is fairly intuitive so I don’t think its worth talking about it here too much. I do however think it is worth showing some examples of the insights that Power View can provide on top of this data.
How many blog posts and how many comments on SQLBlog
Slicing by blogger
Charting all of the blog posts and ordering by number of comments is useful, that’s how we see that Marco Russo’s post DateTool dimension: an alternative Time Intelligence implementation is the most popular:
However if we introduce a slicer we can analyse a single blogger’s blog posts, let’s take Aaron Bertrand for example seeing as he has posted the most:
Aaron’s most popular blog post (going by the number of comments) is his ditty from October 2010 Fun with software : uninstalling SQL Server 2008 R2 Evaluation Edition.
Taking it further
There is a heck of a lot of other data available on http://sqlblog.com should one want to go and fetch it and Power Query makes it really easy to do that. It would be interesting, for example, to chart the tally of blog posts per month and see when “peak blog month” occurred. I’ll leave that as an exercise for the reader.