THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Exploring earnings data for the UK [Open Data]

I have a burgeoning interest in the world of Open Data which wikipedia describes as:

Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open data movement are similar to those of other "Open" movements such as open source, open hardware, open content, and open access. The philosophy behind open data has been long established, but the term "open data" itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.
http://en.wikipedia.org/wiki/Open_data

To that end I follow @LondonDatastore on Twitter as they are actively publishing Open Data pertaining to the city in which I live, London. Four days ago they announced that a new dataset had been released that provided earnings information, not just for London but for the whole country and going back many years too. The provided link, http://data.london.gov.uk/datastore/package/earnings-workplace-borough, brings up a page from where one can download an Excel workbook containing some data. Unfortunately the data in that workbook is not, in my opinion, provided in such a manner that makes it easily explorable (which I have complained about before on this blog); the data is spread over multiple worksheets:

image

Worse, the data is already aggregated and pivoted:

image

In other words the workbook does not contain the raw row-level data from which this pivoted data is produced. Thankfully a link (https://www.nomisweb.co.uk/query/construct/summary.asp?mode=construct&version=0&dataset=99) is provided from where the raw data can be downloaded. I found the interface there to be slightly clunky but that’s a minor quibble – that raw data is available should be considered a major boon. I downloaded earnings data for:

  • Local Authority
  • Gender
  • Full Time or Part Time
  • Year (1999-2012)

and have made that raw data available in a publicly accessible Excel workbook. You can view that workbook online (only a browser required) at AnnualPay_by_YearGenderLocalAuthorityPartorfulltime.xlsx (from there you can also download for your own analysis).

image

The workbook also provides summaries over the raw data by way of pivot tables and charts. Arguably its clear to see, for example, that a gender imbalance exists although perhaps the gap may be lessening somewhat

image

Its interesting to note that the average salary for Males dropped off in 2009/2010. Perhaps the economic events of 2008 are the cause for that, checking out data for only City of London (the traditional financial hub of the UK) suggests that may well be the case:

image

There will be many more nuggets of information available in this data, all we need to do is set the data free so that people can find it for themselves. That is the aim of this blog post so hit the link: AnnualPay_by_YearGenderLocalAuthorityPartorfulltime.xlsx and see what other nuggets you can find! If dataviz is an area that interests you then this is a cracking dataset to explore!


One other point I want to make is that the raw data is provided as a mean average which means that each aggregated figure is a mean average of some mean averages. This isn’t good as it distorts the data as I demonstrate with this simple example:

image

We have an average salary for both male and female (5000 & 9000 respectively) and the average of those two averages is 7000. However if we take the total “total salary” / total “tally of people asked” (127000 / 23) then the average is quite different – 5521.73.

That’s a large discrepancy even for only two rows of data and it highlights the problem of providing averages rather than the figures that created those averages. To me this is a data quality issue – the raw source data does not provide the requisite level of detail to enable accurate analysis. Quality of data is paramount.

@Jamiet

Published Monday, February 11, 2013 2:34 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Tony Rogerson said:

Have you thought about the cause of this gender imbalance? Could it not be that the gender split is different job types thus a serious data skew takes place; if would be better if we had a job title to try and add that dimension to the data so we know that apples are being compared to apples.

Nice work though.

T

February 11, 2013 9:01 AM
 

jamiet said:

Hiya Tony,

Yeah, it would be great to analyse by profession, unfortunately that data is not available (not in what I was looking at anyway).

JT

February 11, 2013 10:14 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement