In May 2008 I wrote a blog post entitled Fuzzyness where I put forward an argument for having the fuzzy lookup and fuzzy grouping functionality from SQL Server Integration Services (SSIS) in T-SQL. An excerpt from that blog post:
Take the following list of values:
|Brian Smith |
|Bryan Smith |
|Brian Smyth |
|Bryan Smyth |
|Brian Smythe |
Now as far as SQL Server 2008 is concerned those five names are all different, that's just how SQL works. Something either is true or it isn't. Yay or nay. Zero or one. Clearly though we as humans can see there is some sort of similarity between those five names and that is important information to know; its not hard to imagine a class of applications that could make use of this ability to know that things are similar as opposed to being the same.
Those of you that have used SSIS 2005 will know that Fuzzy Querying exists in that product as provided by the Fuzzy Lookup and Fuzzy Grouping components however I personally don't want to have to call out to a SSIS package to process data that is already in my database.
My good friend Rob Farley posted a request on Connect asking for fuzzy functionality to be introduced into T-SQL and today I’m delighted to be able to tell you that this functionality is coming to SQL Server 2008 R2 in the shape of Master Data Services (MDS). The MDS team are introducing the following into SQL Server:
I’ve just sat through a whirlwind tour of these new features from MDS team member Grant Dickinson and was positively wowed by what’s coming up.
- The Regex functions wrap the innate Regex functionality in .Net and Grant pointed out that they do do some expression caching under the covers to offer some slight performance improvements when using these functions in T-SQL. Functions include mdq.RegexExtract, mdq.RegexIsMatch, mdq.RegexIsValid, mdq.RegexMask, mdq.RegexReplace & mdq.RegexSplit.
- The similarity scalar function mdq.Similarity(..) returns a score indicating the similarity of two values as defined by one of the aforementioned similarity algorithms
- The FL3 functions are the next evolution of the fuzzy logic algorithms currently built into SSIS and are now coming to the database engine. They offer indexing of attribute groups (an MDS construct that is exactly what it says it is – groups of attributes of an entity) to enable fast lookups against large datasets. The example that Grant showed was, I guess, the canonical fuzzy lookup demonstration showing how incoming name and address combinations could be matched against an existing list of known names and addresses and then providing a best match plus similarity score. The FL3 functions are wrapped in slightly more user-friendly functions like mdqFuzzyLookup_Build (which builds your fuzzy lookup index for you), mdq.FuzzyLookup_Lookup & mdq.FuzzyLookup_Match.
- A new function mdq.Split that provides yet another way of handling arrays in T-SQL. (Erland will have to update his article!)
There was a lot more information that Grant provided however we were a bit short on time so we didn’t go into quite as much detail as we might otherwise have done. Never mind though, the documentation is up and available at http://msdn.microsoft.com/en-us/library/ee633712(SQL.105).aspx so go and read about these new features for yourself.
I wish I could tell you more right now although I plan to do exactly that over the coming few weeks. SQL Server 2008 R2 CTP3 is available as of today from http://www.microsoft.com/sqlserver/2008/en/us/R2.aspx so get downloading and have a play with the new features in MDS. So much to learn, so little time!!!