THE SQL Server Blog Spot on the Web


Andy Leonard

Andy Leonard is an author and engineer who enjoys building and automating data integration solutions. Andy is co-host of the Data Driven podcast. Andy is no longer updating this blog. His current blog is

Big Data vs. Sampling


Merriam-Webster defines sampling as:

  1. the act, process, or technique of selecting a suitable sample; specifically :  the act, process, or technique of selecting a representative part of a population for the purpose of determining parameters or characteristics of the whole population

  2. a small part selected as a sample for inspection or analysis (example: “ask a sampling of people which candidate they favor”)

In statistics, sampling is the practice of viewing or polling a representative subset of a population. Sampling introduces statistical error, so results are often published as some value +/- some error range, e.g. 47% +/- 5%. It’s possible to produce accurate but useless statistical results in this manner. For example, if the result of a political poll in a race between two candidates is 47% +/- 5%, the candidate’s true support could be anywhere from 42% to 52%. That range straddles 50%, so one interpretation is “it’s too close to call.” On second thought, that’s not useless. But it may be less useful for a candidate who wants to spend large sums of money preparing for a victory celebration.
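To make the error range concrete, here is a minimal sketch of how a margin of error for a polled proportion is commonly approximated. It assumes a simple random sample, the normal approximation, and a 95% confidence level (z = 1.96). The poll size of 400 and the function names are hypothetical, chosen only for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_for_margin(margin, p_hat=0.5, z=1.96):
    """Smallest sample size whose margin of error is at most `margin`.

    p_hat=0.5 is the worst case (largest variance) for a proportion.
    """
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

# Hypothetical poll: 47% support among 400 respondents.
moe = margin_of_error(0.47, 400)
low, high = 0.47 - moe, 0.47 + moe
print(f"47% +/- {moe:.1%} -> between {low:.1%} and {high:.1%}")

# Respondents needed for a +/- 5% margin in the worst case.
print(sample_size_for_margin(0.05))
```

Under these assumptions, a few hundred respondents are enough for a margin of about 5%, regardless of how large the population is. That is the sense in which a sample can be “good enough.”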

The Vs. Part…

Sampling gives analysts and data scientists an approximation along with a range of potential error. Sampling says we don’t need to poll every individual to achieve a result that is “good enough.”

“Big data” attempts to poll every individual.

The Problem I am Trying To Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (@nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”
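Taleb’s point about too many variables and too little data per variable can be illustrated with a small simulation. This is only a sketch, not Taleb’s own method: it generates columns of pure random noise with just 10 observations each, then counts how many pairs of columns look strongly correlated by chance. The threshold of 0.6, the observation count, and the function names are assumptions chosen for illustration.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def count_spurious(num_vars, num_obs=10, threshold=0.6, seed=42):
    """Count pairs of independent noise columns with |r| above threshold.

    Every column is independent random noise, so any strong
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(num_obs)]
        for _ in range(num_vars)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )

for num_vars in (5, 20, 80):
    pairs = num_vars * (num_vars - 1) // 2
    found = count_spurious(num_vars)
    print(f"{num_vars} variables, {pairs} pairs, {found} spurious correlations")
```

The number of variable pairs grows roughly with the square of the number of variables, so the opportunities for a chance correlation pile up much faster than the data does.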

The Solution

The solution to bad data has always been (and remains) data quality. “Just eliminate the inaccurate data” sounds simple, but it’s not an easy problem to solve. In data science, data quality is the next-to-longest long pole. (Data integration is the longest long pole.) The solution for the two longest poles in data science is the same: automation.

At Enterprise Data & Analytics, we’re automating data quality. I mention it not by way of advertisement (although a geek’s gotta eat), but to inform you of another research focus area in our enterprise. Are we the only people trying to solve this problem? Goodness no. (That would be tragic!) We are focused on automating data science. You can learn more about our solutions for automating data wrangling at DILM Suite.


Learn More:
Biml in the Enterprise Data Integration Lifecycle (Password: BimlRocks)
From Zero to Biml - 19-22 Jun 2017, London 
IESSIS1: Immersion Event on Learning SQL Server Integration Services – 2-6 Oct 2017, Chicago

SSIS Framework Community Edition
Biml Express Metadata Framework
SSIS Catalog Compare
DILM Suite

Published Saturday, April 8, 2017 10:29 AM by andyleonard
