THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

Fuzzy Logic and Regex come to T-SQL in SQL Server 2008 R2 – available now!

In May 2008 I wrote a blog post entitled Fuzzyness where I put forward an argument for having the fuzzy lookup and fuzzy grouping functionality from SQL Server Integration Services (SSIS) in T-SQL. An excerpt from that blog post:

Take the following list of values:

Brian Smith
Bryan Smith
Brian Smyth
Bryan Smyth
Brian Smythe

Now as far as SQL Server 2008 is concerned those five names are all different, that's just how SQL works. Something either is true or it isn't. Yay or nay. Zero or one. Clearly though we as humans can see there is some sort of similarity between those five names and that is important information to know; its not hard to imagine a class of applications that could make use of this ability to know that things are similar as opposed to being the same.

Those of you that have used SSIS 2005 will know that Fuzzy Querying exists in that product as provided by the Fuzzy Lookup and Fuzzy Grouping components however I personally don't want to have to call out to a SSIS package to process data that is already in my database.

 

My good friend Rob Farley posted a request on Connect asking for fuzzy functionality to be introduced into T-SQL and today I’m delighted to be able to tell you that this functionality is coming to SQL Server 2008 R2 in the shape of Master Data Services (MDS). The MDS team are introducing the following into SQL Server:

 

 

I’ve just sat through a whirlwind tour of these new features from MDS team member Grant Dickinson and was positively wowed by what’s coming up.

  • The Regex functions wrap the innate Regex functionality in .Net and Grant pointed out that they do do some expression caching under the covers to offer some slight performance improvements when using these functions in T-SQL. Functions include mdq.RegexExtract, mdq.RegexIsMatch, mdq.RegexIsValid, mdq.RegexMask, mdq.RegexReplace & mdq.RegexSplit.
  • The similarity scalar function mdq.Similarity(..) returns a score indicating the similarity of two values as defined by one of the aforementioned similarity algorithms
  • The FL3 functions are the next evolution of the fuzzy logic algorithms currently built into SSIS and are now coming to the database engine. They offer indexing of attribute groups (an MDS construct that is exactly what it says it is – groups of attributes of an entity) to enable fast lookups against large datasets. The example that Grant showed was, I guess, the canonical fuzzy lookup demonstration showing how incoming name and address combinations could be matched against an existing list of known names and addresses and then providing a best match plus similarity score. The FL3 functions are wrapped in slightly more user-friendly functions like mdqFuzzyLookup_Build (which builds your fuzzy lookup index for you), mdq.FuzzyLookup_Lookupmdq.FuzzyLookup_Match.
  • A new function mdq.Split that provides yet another way of handling arrays in T-SQL. (Erland will have to update his article!)

There was a lot more information that Grant provided however we were a bit short on time so we didn’t go into quite as much detail as we might otherwise have done. Never mind though, the documentation is up and available at http://msdn.microsoft.com/en-us/library/ee633712(SQL.105).aspx so go and read about these new features for yourself.

image

I wish I could tell you more right now although I plan to do exactly that over the coming few weeks. SQL Server 2008 R2 CTP3 is available as of today from http://www.microsoft.com/sqlserver/2008/en/us/R2.aspx so get downloading and have a play with the new features in MDS. So much to learn, so little time!!!

@Jamiet

Published Monday, November 09, 2009 10:49 PM by jamiet

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Adam Machanic said:

Wow, that's excellent. Thanks for reporting!

November 9, 2009 5:40 PM
 

Andy Clark said:

I'm sure this is a good thing, and is important for comparing data that is not people's names such as addresses or telephone numbers.

Soundex was developped for the US Census and hence is orientated towards people's names and the english pronciation of words.

The key issue I've seen with the Soundex is that it relies on the first letter being the same. It's also been commented that Soundex is not good in comparing non english names, however I've never seen that myself.

So for example are John Citson and John F. Kitson the same person?

November 10, 2009 3:56 AM
 

dan said:

Very interesting, potentially lots of applications in anti money laundering etc, Master Data Services seems a good place to locate it.

November 10, 2009 5:28 AM
 

Rob Farley said:

I can't believe I hadn't seen this. I didn't even get a Google Alert on my name. Hmm...

Thanks Jamie - now I just need to find the extra hours to get into all this (I feel another presentation's going to be prepared...)

November 10, 2009 5:50 PM
 

Simon Sabin said:

This isn't all that it seems. These haven't been added to the language. Rather they are contained within a MDM database which you have to deploy.

I would be interested to know what redist there is on the database and the libraries.

November 11, 2009 10:13 PM
 

Tonci Korsano said:

Hi there,

I got a google alert on this web page.

My keyword was ""fuzzy lookup".

I have other alerts like "fuzzy sql", "fuzzy grouping", etc.

Thank you for the heads-up.

Keep up this good work.

Best regards,

Tonci.

November 18, 2009 11:40 AM
 

SSIS Junkie said:

I’m at SQLBits 5 today and I’ve just come out of Ian Marriott’s session where he talked about the forthcoming

November 21, 2009 11:47 AM
 

wBob said:

See here for additional SQL 2008 R2 hidden function mdq.XmlTransform allowing you to use XSLT against XML in your database:

http://beyondrelational.com/blogs/wbob/archive/2010/05/05/sql-2008-r2-mds-easter-egg.aspx

June 16, 2010 11:26 AM
 

Joe Wilkerson said:

I need fuzzy logic to find matches on distance (latitude, longitude) and dateTime stamp ... any thing within distance < x meters and with y seconds (of another entry) might be a match.  For my situation this is a cross join on 100s of billions of rows in even the simplest cases!

I can code it easily with where clause functions as filters but I'm still trying to figure out how many 100s of hours it will run!

Ultimately, I assume I will have to recode in c# for efficiency, but trying to get a working prototype done first in sql only.

Fuzzy joins have many uses!

August 22, 2010 3:24 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement