THE SQL Server Blog Spot on the Web


SSIS Junkie

Freelance SQL Server developer in London

  • Suggested Best Practises and naming conventions

    Once upon a time I blogged at http://consultingblogs.emc.com/jamiethomson but that ended in August 2009 when I left EMC. There is a lot of (arguably) valuable content over there; however, certain events in the past leave me concerned that the content is not well cared for, and I have no confidence that it will still exist in the long term. Hence, I have decided to re-publish some of that content here at SQLBlog, so over the coming weeks and months you may find re-published content popping up here from time to time.

    This is the third such blog post, in which I suggest some best practices and naming conventions that you may choose to employ. It was originally published here (I have changed it slightly since then – spotter's badge if you can find the differences!). The first post in this series can be found at [SSIS] OnPipelineRowsSent and the second at Dataflow mechanics.


    I thought it would be worth publishing a list of guidelines that I see as SSIS development best practices. These are my own opinions and are based upon my experience of using SSIS over the past 18 months. I am not saying you should take them as gospel but these are generally tried and tested methods and if nothing else should serve as a basis for you developing your own SSIS best practices.

    One thing I really would like to see getting adopted is a common naming convention for tasks and components and to that end I have published some suggestions at the bottom of this post.

    This list will get added to over time so if you find this useful keep checking back here to see updates!

    1. If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output (and SortKeyPosition on the sorted output columns). This may save unnecessary SORTs later in the pipeline, which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data is sorted.
    2. Rename all Name and Description properties from the default. This will help when debugging particularly if the person doing the debugging is not the person that built the package.
    3. Only select columns that you need in the pipeline. This reduces buffer size and reduces OnWarning events at execution time.
    4. Following on from the previous bullet point, always use a SQL statement in an OLE DB Source component or LOOKUP component rather than just selecting a table. Selecting a table is akin to "SELECT *..." which is universally recognised as bad practice. (http://www.sqljunkies.com/WebLog/simons/archive/2006/01/20/17865.aspx). In certain scenarios the approach of using a SQL statement can result in much improved performance as well (http://blogs.conchango.com/jamiethomson/archive/2006/02/21/2930.aspx).
    5. I used to recommend using SQL Server Destination rather than OLE DB Destination wherever possible for quicker insertions, but I've changed my mind. Experience from around the community suggests that the difference in performance between SQL Server Destination and OLE DB Destination is negligible and hence, given the flexibility of packages that use OLE DB Destinations, it may be better to go for the latter. It's an "it depends" consideration, so decide which you prefer based on your own testing.
    6. Use Sequence containers to organise package structure into logical units of work. This makes it easier to identify what the package does and also helps to control distributed transactions if they are being implemented.
    7. Where possible, use expressions on the SqlStatementSource property of the Execute SQL Task instead of parameterised SQL statements. This removes ambiguity when different OLE DB providers are being used. It is also easier. (UPDATE: There is a caveat here. Results of expressions are limited to 4000 characters so be wary of this if using expressions).
    8. Use caching in your LOOKUP components where possible. It makes them quicker. Watch that you are not grabbing too many resources when you do this though.
    9. LOOKUP components will generally work quicker than MERGE JOIN components where the 2 can be used for the same task (http://blogs.conchango.com/jamiethomson/archive/2005/10/21/2289.aspx). Note that this is not always the case so test and measure, test and measure, test and measure!
    10. Always use DTExec to perf test your packages. This is not the same as executing without debugging from SSIS Designer (http://www.sqlis.com/default.aspx?84).
    11. Use naming conventions for your tasks and components. I suggest using acronyms at the start of the name and there are some suggestions for these acronyms at the end of this article. This approach does not help a great deal at design-time where the tasks and components are easily identifiable but can be invaluable at debug-time and run-time.  e.g. My suggested acronym for a Data Flow Task is DFT so the name of a data flow task that populates a table called MyTable could be "DFT Load MyTable".
    12. If you want to conditionally execute a task at runtime use expressions on your precedence constraints. Do not use an expression on the "Disable" property of the task.
    13. Don't pull all configurations into a single XML configuration file. Instead, put each configuration into a separate XML configuration file. This is a more modular approach and means that configuration files can be reused by different packages more easily.
    14. If you need a dynamic SQL statement in an OLE DB Source component, set AccessMode="SQL Command from variable" and build the SQL statement in a variable that has EvaluateAsExpression=TRUE. (http://blogs.conchango.com/jamiethomson/archive/2005/12/09/2480.aspx)
    15. When using checkpoints, use an expression to populate the CheckpointFilename property which will allow you to include the value returned from System::PackageName in the checkpoint filename. This will allow you to easily identify which package a checkpoint file is to be used by.
    16. When using raw files and your Raw File Source Component and Raw File Destination Component are in the same package, configure your Raw File Source and Raw File Destination to get the name of the raw file from a variable. This will avoid hardcoding the name of the raw file into the two separate components and running the risk that one may change and not the other.
    17. Variables that contain the name of a raw file should be set using an expression. This will allow you to include the value returned from System::PackageName in the raw file name. This will allow you to easily identify which package a raw file is to be used by. N.B. This approach will only work if the Raw File Source Component and Raw File Destination Component are in the same package.
    18. Use a common folder structure (http://blogs.conchango.com/jamiethomson/archive/2006/01/05/2559.aspx)
    19. Use variables to store your expressions (http://blogs.conchango.com/jamiethomson/archive/2005/12/05/2462.aspx). This allows them to be shared by different objects and also means you can view the values in them at debug-time using the Watch window.
    20. Keep your packages in the dark (http://www.windowsitpro.com/SQLServer/Article/ArticleID/47688/SQLServer_47688.html). In summary, this means that you should make your packages location unaware. This makes it easier to move them across environments.
    21. If you can, filter your data in the Source Adapter rather than filter the data using a Conditional Split transform component. This will make your data flow perform quicker.
    22. When storing information about an OLE DB Connection Manager in a configuration, don't store the individual properties such as Initial Catalog, Username, Password etc... just store the ConnectionString property.
    23. Your variables should only be scoped to the containers in which they are used. Do not scope all your variables to the package container if they don't need to be.
    24. Employ namespaces for your packages
    25. Make log file names dynamic so that you get a new logfile for each execution.
    26. Use ProtectionLevel=DontSaveSensitive. Other developers will not be restricted from opening your packages and you will be forced to use configurations (which is another recommended best practice)
    27. Use annotations wherever possible. At the very least each data-flow should contain an annotation.
    28. Always log to a text file, even if you are logging elsewhere as well. Logging to a text file has less reliance on external factors and is therefore most likely to contain all information required for debugging.
    29. Create a new solution folder in Visual Studio Solution Explorer in order to store your configuration files. Or, store them in the 'miscellaneous files' section of a project.
    30. Always use template packages to standardise on logging, event handling and configuration.
    31. If your template package contains variables put them in a dedicated namespace called "template" in order to differentiate them from variables that are added later.
    32. Break out all tasks requiring the Jet engine (Excel or Access data sources) into their own packages that do nothing but that data flow task. Load the data into staging tables if necessary. This will ensure that solutions can be migrated to 64-bit with no rework.
    33. Don't include connection-specific info (such as server names, database names or file locations) in the names of your connection managers. For example, "OrderHistorySystem" is a better name than "Svr123ABC\OrderHist.dbo".
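
    Points 1, 3, 4 and 21 in the list above can all be addressed in the SQL statement typed into an OLE DB Source. What follows is only a minimal sketch: dbo.SalesOrder, its columns and the date filter are hypothetical names chosen for illustration, not objects from any real system.

    ```sql
    -- Hypothetical source query for an OLE DB Source component.
    -- dbo.SalesOrder and its columns are invented for illustration.
    SELECT  so.SalesOrderId                  -- name only the columns the data flow needs (points 3 and 4)
    ,       so.CustomerId
    ,       so.OrderDate
    ,       so.TotalDue
    FROM    dbo.SalesOrder AS so
    WHERE   so.OrderDate >= '20060101'       -- filter here rather than in a Conditional Split (point 21)
    ORDER BY so.CustomerId;                  -- data arrives sorted, so IsSorted=TRUE is truthful (point 1)
    ```

    With a query like this, IsSorted=TRUE on the source output (with SortKeyPosition=1 on CustomerId) tells the pipeline about the ordering. SSIS does not check the claim, so the ORDER BY and the sort properties have to be kept in step by hand.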

    The acronyms below can be used at the beginning of the names of tasks to identify what type of task it is.

    Task                                Prefix
    For Loop Container                  FLC
    Foreach Loop Container              FELC
    Sequence Container                  SEQC
    ActiveX Script                      AXS
    Analysis Services Execute DDL       ASE
    Analysis Services Processing        ASP
    Bulk Insert                         BLK
    Data Flow                           DFT
    Data Mining Query                   DMQ
    Execute DTS 2000 Package            EDPT
    Execute Package                     EPT
    Execute Process                     EPR
    Execute SQL                         SQL
    File System                         FSYS
    FTP                                 FTP
    Message Queue                       MSMQ
    Script                              SCR
    Send Mail                           SMT
    Transfer Database                   TDB
    Transfer Error Messages             TEM
    Transfer Jobs                       TJT
    Transfer Logins                     TLT
    Transfer Master Stored Procedures   TSP
    Transfer SQL Server Objects         TSO
    Web Service                         WST
    WMI Data Reader                     WMID
    WMI Event Watcher                   WMIE
    XML                                 XML

     

    These acronyms should be used at the beginning of the names of components to identify what type of component it is.

     

    Component                       Prefix
    DataReader Source               DR_SRC
    Excel Source                    EX_SRC
    Flat File Source                FF_SRC
    OLE DB Source                   OLE_SRC
    Raw File Source                 RF_SRC
    XML Source                      XML_SRC
    Aggregate                       AGG
    Audit                           AUD
    Character Map                   CHM
    Conditional Split               CSPL
    Copy Column                     CPYC
    Data Conversion                 DCNV
    Data Mining Query               DMQ
    Derived Column                  DER
    Export Column                   EXPC
    Fuzzy Grouping                  FZG
    Fuzzy Lookup                    FZL
    Import Column                   IMPC
    Lookup                          LKP
    Merge                           MRG
    Merge Join                      MRGJ
    Multicast                       MLT
    OLE DB Command                  CMD
    Percentage Sampling             PSMP
    Pivot                           PVT
    Row Count                       CNT
    Row Sampling                    RSMP
    Script Component                SCR
    Slowly Changing Dimension       SCD
    Sort                            SRT
    Term Extraction                 TEX
    Term Lookup                     TEL
    Union All                       ALL
    Unpivot                         UPVT
    Data Mining Model Training      DMMT_DST
    DataReader Destination          DR_DST
    Dimension Processing            DP_DST
    Excel Destination               EX_DST
    Flat File Destination           FF_DST
    OLE DB Destination              OLE_DST
    Partition Processing            PP_DST
    Raw File Destination            RF_DST
    Recordset Destination           RS_DST
    SQL Server Destination          SS_DST
    SQL Server Mobile Destination   SSM_DST

  • Thoughts on Test Driven Database Development

    Test-Driven Development (TDD) is a software development practice that has been around for a few years. Wikipedia describes it as:

    Test-driven development (TDD) is a software development process that relies on the repetition of a very short development cycle: first the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards. Kent Beck, who is credited with having developed or 'rediscovered' the technique, stated in 2003 that TDD encourages simple designs and inspires confidence.

    http://en.wikipedia.org/wiki/Test-driven_development

    Since 2003 TDD practices have seen refinements such as Behavior-Driven Development and Uncle Bob's Three Rules of TDD, and all the while TDD has pretty much become an accepted way of developing quality software. Accepted, that is, everywhere outside of the database development arena, and that is the arena in which I spend my working life. TDD simply has not, in my opinion, caught on with database developers like it has with our appdev brethren, and I was reminded of this yesterday when Atul Thakor asked on Twitter:

    anyone done TDD for database development and would they recommend it?

    https://twitter.com/#!/atulthakor/status/161886007929733120

    To which my answer was an emphatic:

    (1) yes & (2) absolutely, yes

    https://twitter.com/#!/jamiet/status/161894215217987585

    I'll use this blog post to expand on that outside of 140 characters.

     

    In October 2010 I undertook a mini-project for the client I was working for at the time (a bank) where a colleague and I were tasked with building the database portion of a system that would support reconciliation of our ETL processes. It was a nice piece of work in that it was small, well-scoped, time-bound, greenfield, did not have any external dependencies and had a technically savvy product owner. We sat down at the start and decided that this was an ideal opportunity to trial TDD as a method of developing a database; I would write the failing tests and my colleague would make the tests pass. We came up with some guiding principles and, although we didn't know it at the time, they were pretty close to Uncle Bob's three rules.

    I used Visual Studio 2010's database unit testing framework [1] to write my tests and have them run as part of our Continuous Integration (CI) build (see Setting up database unit testing as part of a Continuous Integration build process). I would write the tests and check in, the CI build would fail, and my colleague would "get latest" in order to see what code he had to write to stop the build from failing. To cut a long story short, the use of TDD was considered to be a great success; we shipped a working system on time and on budget and, moreover, even though I didn't write a scrap of code that went into production, I have never had more confidence that a system I was involved in building worked as intended. That's quite a statement. My confidence stemmed from the fact that as the test author I was ultimately responsible for ensuring that the system did what it was supposed to; I could qualify my confidence by pointing at our CI build and highlighting the number of tests that were passing and how that number had steadily increased as the project progressed.
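
    For a flavour of what such a test can look like: a database unit test is essentially a T-SQL script that exercises an object and fails loudly if the outcome is wrong. The following is a minimal sketch only. The table dbo.ReconciliationResult and the procedure dbo.RecordReconciliation (with its two parameters) are hypothetical names, not objects from the project described here, and a plain RAISERROR stands in for the test conditions that Visual Studio would normally evaluate.

    ```sql
    -- Hypothetical test: dbo.RecordReconciliation should write exactly one row
    -- to dbo.ReconciliationResult. Both object names are invented for illustration.
    BEGIN TRANSACTION;

    DECLARE @before INT, @after INT;

    SELECT @before = COUNT(*) FROM dbo.ReconciliationResult;

    EXEC dbo.RecordReconciliation @SourceRowCount = 100, @TargetRowCount = 100;

    SELECT @after = COUNT(*) FROM dbo.ReconciliationResult;

    -- Fail the test unless exactly one row was added.
    IF @after - @before <> 1
        RAISERROR('Expected exactly one new reconciliation row.', 16, 1);

    -- Undo the changes so the test can be re-run against the same database.
    ROLLBACK TRANSACTION;
    ```

    Written before the procedure exists, a script like this fails, which is the starting point that TDD asks for; the production code is then written until it passes.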

    By the time the project had finished the database consisted of (if memory serves me correctly) 6 tables and about 10 stored procedures or functions (so yes, very small). To test that we had roughly 70 tests that were getting run up to 20 times a day. The project had taken about two months from start-up to final delivery. You can form your own opinion as to whether that is prompt or tardy, but our product owner was happy and that's pretty much all that counted as far as I was concerned.

    Since that project I have moved onto other clients and at each one I have always extolled the use of database unit testing; we haven't always practised TDD but at each one we have been writing database unit tests and in the future I suspect that a client's willingness (or lack thereof) to use database unit testing will be a major factor in influencing whether we end up working together or not.

    Are you a database developer doing database unit testing or perhaps even TDD? Let me know in the comments, I'd love to hear about others' experiences.

    @jamiet

    [1] Yes, that linked-to article from 7 years ago is the best one I could find to describe what Visual Studio's Database Unit testing Framework actually is - sort it out Microsoft!

    UPDATE: I have just remembered that Jamie Laflen has written an excellent whitepaper entitled Apply Test-Driven Development to your Database Projects that goes into much more detail about how to achieve database TDD using Visual Studio than I have here. Well worth a read.

  • Whatever happened to Twitter Annotations?

    In April 2010 Twitter announced a new feature that they would soon be introducing - Twitter Annotations. Put simply Twitter Annotations can be described as the ability to attach metadata to a tweet; think hashtags on steroids. Lots of people were quite excited about it:

    I love to sit on the beach.  One of the coolest things about the beach is the number of layers of visual depth.  Look at the sand and it's beautiful, but zoom your eyes in closer and you'll see a whole layer of life running around on the sand that you didn't see before.  Look even closer and you can see individual grains of sand, water and light dancing between them.  Look closer still and you see that each grain of sand is a unique object with its own texture.  If your eyes are strong enough, or you have a machine to help you, you can see even more layers by looking closer still. That's what Twitter is going to be like with the launch of Twitter Annotations this Summer. It's a beautiful vision, with huge potential

    What Twitter Annotations mean by Marshall Kirkpatrick

     

    Today at the Twitter Chirp Hack Day I talked with a ton of developers and the new feature they were most interested in. Adam Jackson echoed everyone I’ve heard today when he tweeted “Twitter Annotations is what I’ve been wanting FOREVER.”

    Developers: how will we all get along with Twitter’s annotation feature? by Robert Scoble

     

    Twitter announced a series of new features at its Chirp conference in April, ...the one that has the most potential to change the way the social network functions in fundamental ways is Annotations, which Twitter said would be rolled out in the second quarter of the year ... Annotations would allow developers (and Twitter itself, of course) to add additional information to a tweet — such as a string of text, a URL, a location tag or bits of data — without affecting its character count. In other words, such information would be metadata about the tweet or the user who posted it, and would be carried along as an additional payload as it traveled through the Twitter network. Apps and services could then collect that information and filter it or make sense of it. In some ways, Annotations are like Facebook’s open graph protocol, which also adds metadata to the behavior of users on certain sites when they’re logged in

    Twitter Annotations Are Coming — What Do They Mean For Twitter and the Web? by Matthew Ingram

     

    What <others> did not mention is what I think is potentially the most fascinating use of opening up annotations. Google’s success today is built on their page rank algorithm that measures the validity of a web page by the number of incoming links to it and the page rank of the sites containing those links – its a system built on reputation. Twitter annotations could open up a new paradigm however – let’s call it People rank- where reputation can be measured by the metadata that people choose to apply to links and the websites containing those links. Its not hard to see why Google and Microsoft have paid big bucks to get access to the Twitter firehose!

    Interesting things – Twitter annotations and your phone as a web server by Jamie Thomson (i.e. me!)

     

    Twitter themselves said in May 2010:

    We will continue to move as quickly as we can to deliver the Annotations capability to the market so that developers everywhere can create innovative new business solutions on the growing Twitter platform.

    The Twitter Platform

     

    That was 20 months ago. The question I now ask is....where are they? Evidently I'm not the only one asking that question because in a thread entitled How can I try the Annotations API? in November 2011 Gustavo Frederico asked:

    How can I try the Annotations API? I'm looking forward to trying it out.

     To which Taylor Singletary (a Twitter employee) replied:

    Annotations is still more concept then reality. Maybe some day we'll have more to say about them.

     

    Hmmm...in 18 months the situation has gone from "We will continue to move as quickly as we can to deliver the Annotations capability to the market" & "rolled out in the second quarter of the year" to "Annotations is still more concept then reality". That's quite a climb-down if you ask me. I have strong hopes that Twitter Annotations will be with us eventually but the deafening silence isn't particularly encouraging.

     

    You might ask why I am bothered, I am only a SQL Server developer after all. That is true but I still consider that my job can loosely be defined as extracting value from data and from that perspective the onslaught of data (nay, structured data) that Twitter Annotations would bring should be of interest to both myself and my clients.

    I am also fascinated as to how Twitter Annotations could work with Schema.org which is heavily backed by Google and Microsoft and which Microsoft are pushing as the backbone of Contracts in Windows 8 (Schema.org is mentioned in this video from the Build conference).

    So, I ask again, whatever happened to Twitter Annotations? Does anyone know?

    @jamiet


  • Use VALUES clause to get the maximum value from some columns [SQL Server, T-SQL]

    My ex-colleague Paul Mcmillan pointed me at a thread on Stack Overflow that demonstrated a neat T-SQL trick to get the maximum value from a collection of columns in a row. Paul had never seen it before and neither had I so I figure one or two of you out there might learn something from it too.

    In short you can use the VALUES clause to effectively union the values into a dataset and get the MAX from that dataset. Better demonstrated with code:

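    -- Three sample rows; the goal is the largest of a, b and c in each row.
    DECLARE @t TABLE(a INT,b INT,c INT);
    INSERT @t VALUES(1,2,3),(9,8,7),(4,6,5);

    SELECT *
    ,      (   SELECT  MAX(val)
               -- VALUES turns the three columns of the current row into three rows
               FROM    (VALUES (a)
                           ,   (b)
                           ,   (c)
                       ) AS value(val)
           ) AS MaxVal
    FROM @t;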

     

    I'm sure many of you knew this already but if you didn't, well, you too have learnt something today. See more uses for the VALUES clause at Interesting enhancements to the VALUES Clause in SQL Server 2008

    @jamiet

    P.S. Oh, this only works in SQL Server 2008 and beyond.
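
    For anyone stuck on an earlier version, the same idea can be sketched without row constructors by unioning the columns with SELECT ... UNION ALL inside the subquery. This is an assumption-laden sketch rather than something taken from the Stack Overflow thread: the aliases t and v are arbitrary, and it has not been verified here against a SQL Server 2005 instance.

    ```sql
    -- Sketch for versions before SQL Server 2008: no VALUES row constructors,
    -- so both the sample data and the per-row set are built with UNION ALL.
    DECLARE @t TABLE (a INT, b INT, c INT);

    INSERT @t
    SELECT 1, 2, 3 UNION ALL
    SELECT 9, 8, 7 UNION ALL
    SELECT 4, 6, 5;

    SELECT  t.*
    ,       (   SELECT  MAX(v.val)
                -- the derived table turns the current row's columns into rows
                FROM    (   SELECT t.a AS val
                            UNION ALL SELECT t.b
                            UNION ALL SELECT t.c
                        ) AS v
            ) AS MaxVal
    FROM    @t AS t;
    ```

    The VALUES form is shorter and easier to read, so this variant is only worth reaching for when the newer syntax is not available.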

  • Get the SQLBits agenda in your phone's calendar

    For SQLBits 8 in April 2011 I published a calendar containing all of the sessions from the conference; anyone could subscribe to that calendar on their phone or calendar service (e.g. Hotmail or Google Calendar).

    For the upcoming SQLBits X conference I have done the same again by adding all of the sessions to that same calendar. If you are already subscribed to that calendar from SQLBits 8 then you have nothing to do - all the SQLBits X sessions will automatically flow to your phone/Hotmail calendar/Google calendar (go take a look now - they should already be there).

    If you want to get this SQLBits calendar onto your smartphone then the easiest way to do it is to add my calendar to whichever calendar service (e.g. Hotmail or Google) you have synced to your phone and let technology do its thing.

    I will keep the calendar updated with any changes to the agenda so, assuming you have subscribed, changes will just propagate to you without you having to do anything. Remember, to save yourself work in the future make sure you subscribe to the calendar as opposed to importing it.

    Hope this is useful

    @jamiet

    UPDATE: I have just discovered an even easier way to subscribe to this SQLBits calendar using the Google Calendar service - simply click this button:

    [Google Calendar subscribe button]

    If Google Calendar reports that you do not have permission (as it seems to be doing for some people) then follow the instructions that I provided above. I promise you, the calendar *is* publicly available so if this button doesn't work it's Google that is doing something wrong.

  • Implementing SQL Server solutions using Visual Studio 2010 Database Projects – a compendium of project experiences

    Over the past eighteen months I have worked on four separate projects for customers that wanted to make use of Visual Studio 2010 Database projects to manage their database schema.* All the while I have been trying to take lots of notes in order that I can write the blog post that you are reading right now – a compendium of experiences and tips from those eighteen months. I should note that this blog post should not necessarily be taken as a recommendation to actually use database projects in Visual Studio 2010 – it is intended to be useful for those of you that have already made the decision to use them; having said that, I do make recommendations as to actions I think you should take if you have made that decision.

    First let’s be clear what we’re talking about here. Visual Studio Database Projects have been known by a few different names down the years, some of which you may be familiar with. If you have ever heard the terms datadude, DBPro, teamdata, TSData or Visual Studio Team System for Database Professionals then just know that all of these terms refer to the same thing, namely the project type highlighted below when starting a new project in Visual Studio:

    image

    From here onwards I am going to refer to Visual Studio Database projects and all the features therein simply as datadude because that’s a popular colloquial name (and is also much quicker to type). Know also that at the time of writing the features that I am talking about here are currently undergoing some changes ahead of the next release of SQL Server (i.e. SQL Server 2012) in which these features are mooted to be delivered under a new moniker - SQL Server Developer Tools (SSDT).

    OK, with all those preliminaries out of the way let’s dig in.

    Continuous Integration

    Continuous Integration (CI) is a development practice that has existed for many years but in my experience has not been wholly embraced by the database community. The idea behind CI for databases is that every time a developer checks in a piece of code, be it a stored procedure, a table definition or whatever, the entire database project is built and then deployed to a database instance. Microsoft provide a useful article, An Overview of Database Build and Deployment, that goes some way to explaining how to set up your CI deployment.

    CI is one of the fundamental tenets that underpins a lot of the things I talk about later in this blog post and hence gives rise to my first recommendation when using datadude:

    Recommendation #1: Use Source Control and implement a Continuous Integration deployment

    In an earlier draft of this blog post I outlined in detail the CI configuration from one of the aforementioned projects. It's not suitable for inclusion at this point in the current draft but I still think there is some useful information to be gleaned so I have included it below in “Appendix – An example CI configuration”.

    Composite Projects

    Each of the four aforementioned projects were brownfield projects meaning that each already encompassed some established, deployed, databases and they wanted to bring those databases under the control of datadude. Each project had thousands of objects across multiple databases and in this situation it is very likely that some of the stored procedures, views or functions will refer to objects in one of the other databases. The way to resolve those references is to use database references however once you have applied all of your database references it is still very possible that you will run into a situation where code in database A refers to an object in database B while at the same time database B refers to an object in database A. This is depicted in the following figure:

    image

    Here we have a view [DB1]..[View2] that selects data from [DB2]..[Table1] and a view [DB2]..[View1] that selects data from [DB1]..[Table1]. Datadude does not allow a database reference from [DB2] to [DB1] if there is already a database reference from [DB1] to [DB2] and hence will return an error akin to:

    • SQL03006: View: [dbo].[View1] has an unresolved reference to object [DB1].[dbo].[Table1].

    image
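
    The circular dependency described above can be sketched in plain T-SQL. This is illustrative only: it assumes two existing databases named DB1 and DB2, and the single Id column is an invented placeholder.

    ```sql
    -- Illustrative only: assumes databases DB1 and DB2 already exist.
    -- Both tables are created first so that both views can then be created.
    USE DB1;
    GO
    CREATE TABLE dbo.Table1 (Id INT NOT NULL);
    GO

    USE DB2;
    GO
    CREATE TABLE dbo.Table1 (Id INT NOT NULL);
    GO

    -- [DB1]..[View2] depends on an object in DB2 ...
    USE DB1;
    GO
    CREATE VIEW dbo.View2 AS
        SELECT Id FROM DB2.dbo.Table1;
    GO

    -- ... while [DB2]..[View1] depends on an object in DB1.
    USE DB2;
    GO
    CREATE VIEW dbo.View1 AS
        SELECT Id FROM DB1.dbo.Table1;
    GO
    ```

    The script only runs because it hops between the two databases, creating both tables before either view. Neither database can be built completely on its own, which is the situation that a one-project-per-database layout cannot express.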

    We have the proverbial chicken-and-egg problem, [DB1] can’t be created before [DB2] and vice versa. This problem is solved by using Composite Projects (not to be confused with Partial Projects) which allow you to split objects that are intended to be in the same database over multiple datadude projects. I could go over how you set one of these things up but there’s really no need because there is a rather excellent walkthrough on MSDN at Walkthrough: Partition a Database Project by Using Composite Projects; the reason for me mentioning it here is to make you aware that composite projects exist and of the problem that they solve. If you are introducing datadude into a brownfield project then it is highly likely that you are going to require composite projects so learn them and learn them good.

    One important last note about composite projects is to answer the question “How does the tool know that the multiple projects refer to the same database?” The answer is given at the walkthrough that I linked to above; namely, it says:

    Do not specify server variables and values or database variables and values when you define references in a composite project. Because no variables are defined, the referenced project is assumed to share the target server and database of the current project.


    So now you know! To put it another way, if you reference one project from another and don’t tell datadude that the two projects refer to different databases then it assumes they refer to the same database.

    Code Analysis

    Datadude provides the ability to analyse your code projects for code in stored procedures and functions that it considers to be inferior and highlight it – this feature is called Code Analysis. Note that Code Analysis will not highlight code that is syntactically incorrect (datadude does that already, which may well be considered its core feature), it highlights code that is syntactically correct but may be considered defective when executed. Specifically Code Analysis will highlight the following perceived code defects (click through on the links for explanations of why these are considered code defects):

    In my opinion the best aspect of Code Analysis is that it can be run as part of your Continuous Integration deployment meaning that if anyone checks in some deficient code, BOOM, your CI deployment fails and the developer is left red-faced. Nothing else has increased the code quality on my projects quite like running Code Analysis as part of a CI deployment.
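
    To make the distinction between invalid code and valid-but-questionable code concrete, here is a small sketch. The table [dbo].[Customer] and both procedure names are hypothetical, made up purely for illustration. As far as I know the first procedure would trip rule SR0001 (avoid SELECT * in stored procedures, views and table-valued functions), though which rules actually fire will depend on how Code Analysis is configured in your project.

    ```sql
    -- Hypothetical example: parses and deploys fine, but SELECT * means that callers
    -- are silently affected whenever columns are added to or reordered in the table.
    CREATE PROCEDURE [dbo].[GetCustomers_Flagged]
    AS
    BEGIN
        SELECT * FROM [dbo].[Customer];
    END
    GO

    -- The same procedure with an explicit column list, which should not be flagged.
    CREATE PROCEDURE [dbo].[GetCustomers]
    AS
    BEGIN
        SELECT [CustomerId], [CustomerName]
        FROM   [dbo].[Customer];
    END
    GO
    ```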

    Hopefully I have convinced you that turning on Code Analysis is a good idea. If you agree then head to the project properties and check the box labelled Enable Code Analysis on Build. I also recommend checking the Treat warnings as errors boxes otherwise you’ll find that undisciplined developers will simply ignore the warnings.

    Enable datadude code analysis

    N.B. Incidentally if you have time I highly recommend that you go and read the blog post I linked to there – Discipline Makes Strong Developers by Jeff Atwood. I’ve read many thousands of blog posts in my time and that is the one that has influenced me more than any other.

    Turning on Code Analysis on a greenfield project is a no-brainer. On a brownfield project it’s not quite so easy – on a recent engagement I moved a single database into datadude and turned on Code Analysis which immediately found over two thousand perceived code defects. I generally abhor the use of that famous maxim “if it ain’t broke, don’t fix it” in our industry but on occasions like this you may be well advised to heed that advice and leave well alone for fear of breaking code that does what it is supposed to (no matter how inefficiently it does it). Instead you do have the option to suppress Code Analysis warnings/errors:

    Suppress datadude code analysis

    I advise using Code Analysis suppression sparingly. Recently I discovered that one of the developers on my team had decided it was OK to simply suppress every error that was thrown by Code Analysis without first investigating the cause. I was not amused!

    Recommendation #2: Turn on Code Analysis

    Realising the value of idempotency

    An operation is considered idempotent if it produces the same result no matter how many times that operation is applied; for example, multiplication by a factor of one is an idempotent operation – no matter how many times you multiply a number by one the result will always be the same.

    Idempotency is a vital facet of database deployment using datadude. Datadude tries to ensure that no matter how many times you deploy the same project the state of your database should be the same after each deployment. The implication here is that during a deployment datadude will examine the target database to see what changes (if any) need to be made rather than simply attempting to create lots of objects; if all the objects already exist nothing will be done. In my opinion this is the single biggest benefit of using datadude – you don’t have to determine what needs to be done to change your database schema to the desired state, datadude does it for you.

    If I have convinced you about the value of idempotency within datadude then you should also realise that the same rigour should be applied to data as well. Datadude provides Post-Deployment scripts that allow you to deploy data to your schema however there is no inbuilt magic here – datadude will simply go and run those scripts as-is, it will not try and comprehend the contents of those scripts. What this means is that you, the developer, are responsible for making your Post-Deployment scripts idempotent and the easiest way to do that is to employ the T-SQL MERGE statement.

    T-SQL’s INSERT is not sufficient as it will work once and thereafter fail as it will be attempting to insert already inserted data; this gives rise to my third recommendation:

    Recommendation #3: When running your deployment in a test environment, run it more than once.
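
    As a minimal sketch of what an idempotent Post-Deployment script might look like, the following uses MERGE to populate a reference table. The table [dbo].[Currency], its columns and its rows are all hypothetical; the point is only that running the script a second time leaves the table in exactly the same state as running it once.

    ```sql
    -- Hypothetical reference table [dbo].[Currency]; names and values are illustrative only.
    MERGE [dbo].[Currency] AS tgt
    USING (VALUES
            (N'GBP', N'Pound Sterling'),
            (N'USD', N'US Dollar'),
            (N'EUR', N'Euro')
          ) AS src ([CurrencyCode], [CurrencyName])
       ON tgt.[CurrencyCode] = src.[CurrencyCode]
    -- Row exists but has drifted from the desired value: bring it back into line
    WHEN MATCHED AND tgt.[CurrencyName] <> src.[CurrencyName] THEN
        UPDATE SET tgt.[CurrencyName] = src.[CurrencyName]
    -- Row is missing from the target: add it
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([CurrencyCode], [CurrencyName])
        VALUES (src.[CurrencyCode], src.[CurrencyName])
    -- Row exists in the target but is no longer wanted: remove it
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;
    PRINT CAST(@@ROWCOUNT AS NVARCHAR(10)) + N' rows affected in [dbo].[Currency]';
    ```

    On the first run the PRINT should report three rows; on every subsequent run with unchanged source values it should report zero, which is a cheap way of confirming that the script really is idempotent. Drop the final WHEN NOT MATCHED BY SOURCE clause if rows added by other means need to survive a deployment.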

    No-brainer Recommendations

    I consider Code Analysis and Idempotency to be so important that I called them out as dedicated headlines. In this section I’ll outline some additional simple measures that you can undertake and which will, if employed correctly, have a profound effect on the success of your datadude projects.

    Putting a build number into the DB

    I find it is very useful to maintain a log of deployments that have been made to a database and my chosen method is to use a Post-Deployment script to insert a value into some table. Here’s the definition of the table I use for this:

    CREATE TABLE [dbo].[DeployLog]
    (
        [BuildId]           NVARCHAR(50),
        [DeployDatetime]    SMALLDATETIME,
        CONSTRAINT  PK_dboDeployLog PRIMARY KEY ([DeployDatetime])
    )

    In my Post-Deployment script I will use:

    INSERT [dbo].[DeployLog]([BuildId],[DeployDatetime])
    VALUES ('$(BuildId)',GETDATE());

    to insert a row into that table during every deployment. $(BuildId) is a variable defined in the .sqlcmdvars file of my project:

    image

    Here is what we see inside that file:

    image

    The $(BuildId) variable has been defined with a default value of UNKNOWN and hence subsequent deployments from Visual Studio will result in the following:

    image

    At first glance that might not seem particularly useful however it comes into its own if you are doing CI deployments (see recommendation #1) because each build in a CI environment will result in a new build identifier. The following command-line call to vsdbcmd.exe is how deployments are generally done using datadude; note the presence of the /p:BuildId switch:

    ..\Tools\VSDBCMD\vsdbcmd.exe /Action:Deploy /ConnectionString:"Data Source=.;Integrated Security=True;Pooling=False" /p:BuildId="some-value" /DeployToDatabase:+ /ManifestFile:.\FinanceDB\sql\release\FinanceDB.deploymanifest

    Your CI tool should be able to replace “some-value” with an identifier for the current build (that’s outside the scope of this blog post but any CI tool worth its salt will be able to do this) – when the deployment executes that value will then make its way into your [dbo].[DeployLog] table and you will have a self-maintaining history of all the deployments (datetime & build identifier) that have been made to your database.

    Recommendation #4: Maintain an automated history of your deployments
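
    Reading that history back is then a one-liner. This sketch uses only the [dbo].[DeployLog] table defined above; the TOP (10) limit is an arbitrary choice of mine.

    ```sql
    -- Most recent deployments first
    SELECT TOP (10)
           [BuildId],
           [DeployDatetime]
    FROM   [dbo].[DeployLog]
    ORDER BY [DeployDatetime] DESC;
    ```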

    Use Schema View

    It is natural to navigate through all of the objects in your database project using Solution Explorer however datadude provides a better mechanism for doing just that – the Schema View window.

    image

    Schema View provides a logical view of all the objects defined in your database project regardless of which file they may be defined in. That is very useful for many reasons, not least because it makes it easy to locate whichever object you are after – that’s advantageous if multiple objects are defined in the same file. Moreover if some files have property BuildAction=”Not In Build” (see later) they won’t show up in Schema View (this is a good thing by the way). Schema View is also the place that operations such as refactoring and dependency analysis are launched from.

    Some people think that it is important that the name of each file in a datadude project should accurately reflect the object defined within. I disagree; object renames mean that maintaining the filenames becomes laborious and having the Schema View means you never have to use the filenames to navigate your project anyway.

    One final reason to use Schema View is the External Elements button:

    image

    Toggling this button on means that objects in referenced projects show up in the project that they are referenced from (this is particularly useful if you are using Composite Projects). Note in the following screenshot how the object [dbo].[t1] in project Database2 appears in the [dbo] schema of Database3 – that’s because Database3 has a reference to Database2.

    image

    For those reasons my fifth recommendation is:

    Recommendation #5: Use Schema View in preference to Solution Explorer

    You will still need Solution Explorer to navigate files that do not contain database objects (e.g. Post-Deployment scripts) but ordinarily you should spend most of your time interacting with Schema View.

    Make liberal use of PRINT statements in Pre/Post-Deployment Scripts

    When you deploy a datadude project, datadude will take care of telling you what it is up to. For example, the following screenshot shows the output from deploying the already discussed [dbo].[DeployLog]:

    image

    Of course it only does this for objects that it knows about and that doesn’t include anything in your Pre- or Post-Deployment scripts, so you need to take responsibility for outputting pertinent information from those scripts. Here I have amended the script that inserts into [dbo].[DeployLog]:

    SET NOCOUNT ON;
    INSERT [dbo].[DeployLog]([BuildId],[DeployDatetime])
    VALUES ('$(BuildId)',GETDATE());
    PRINT CAST(@@ROWCOUNT AS NVARCHAR(5)) + N' rows inserted into [dbo].[DeployLog], BuildId=$(BuildId)';

    This gives us much more useful output:

    image

    Adding PRINT statements to your Pre & Post Deployment scripts is so easy it really is a no-brainer and hence my next recommendation is:

    Recommendation #6: Any action in a Pre or Post-Deployment Script should use PRINT to state what has been done

    Output variable values in your Pre-Deployment script

    This is in the same vein as the previous bullet-point – output as much information as is possible. In this case we’re talking about outputting the values of all variables that are stored in the .sqlcmdvars file; first, a reminder of what’s in that file:

    image

    Here are the contents of my amended Pre-Deployment Script:

    PRINT 'DefaultDataPath=$(DefaultDataPath)';
    PRINT 'DatabaseName=$(DatabaseName)';
    PRINT 'DefaultLogPath=$(DefaultLogPath)';
    PRINT 'BuildId=$(BuildId)';

    And the resultant output:

    image

    This is the sort of simple amendment that will pay off in spades later in your project (especially if you are supplying many values from the command-line) and again, it’s so easy to do that there really is no reason not to. Just remember to update your Pre-Deployment script whenever you add new variables to .sqlcmdvars.

    Recommendation #7: Output the value of all variables in your Pre-Deployment script

    One Object Per File

    Datadude doesn’t restrict what can go in a file, for example the following file, “t.table.sql”, defines three objects; a table, a primary key and a view:

    image

    Even though they’re all defined in the same file they show up in Schema View separately (one of the aforementioned benefits of using Schema View):

    image

    That said, just because you can doesn’t mean that you should. I prefer to go for one object per file for the simple reason that it’s easier to track the history of an object via Source Control. Moreover, if an object is no longer required then it is a simple change to just remove the file containing that object from the build (see “Don’t delete anything from your project” later) as opposed to editing a file to remove all traces of an object.

    Recommendation #8: Each database object should be defined in a dedicated file

    Time your Pre and Post Deployment Scripts

    It’s always useful to know where time is spent when doing deployments; in my experience the majority of time spent is in the Post-Deployment script (your mileage may vary of course). An easy win is to output the time taken to run your Pre and Post Deployment scripts. Adapt your Pre-Deployment script so that it looks something like this:

    DECLARE @vPreDeploymentStartTime DATETIME = GETDATE();
    PRINT '****************Begin Pre-Deployment script at ' +CONVERT(VARCHAR(30),GETDATE(),120) + '***********************';

    /*Call other scripts from here using SQLCMD's :r syntax
    Example:      :r .\myfile.sql                          
    */

    PRINT 'Pre-Deployment duration = ' + CONVERT(VARCHAR(5),DATEDIFF(ss,@vPreDeploymentStartTime,GETDATE())) + ' seconds';
    PRINT '****************End Pre-Deployment script at ' +CONVERT(VARCHAR(30),GETDATE(),120) + '***********************';

    then do similar for your Post-Deployment script. When you deploy your output will include the following:

    image

    Note the lines:

        ****************Begin Pre-Deployment script at 2011-12-31 20:00:34***********************
        Pre-Deployment duration = 0 seconds
        ****************End Pre-Deployment script at 2011-12-31 20:00:34***********************
       

        ****************Begin Post-Deployment script at 2011-12-31 20:00:34***********************
        Post-Deployment duration = 0 seconds
        ****************End Post-Deployment script at 2011-12-31 20:00:34***********************

    In this particular case it’s not all that useful to know that the deployment took 0 seconds but if and when your deployments snowball to many minutes it will be useful to know how long your scripts are taking, at which point you can investigate further by timing each individual step in your Pre and Post Deployment scripts.

    Recommendation #9: Time your deployments and output the timings
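
    Timing an individual step follows the same pattern as timing the whole script. The following is only a sketch: the step name and the file referenced in the :r comment are hypothetical placeholders for whatever your own script does.

    ```sql
    -- Start the clock for this step
    DECLARE @vStepStartTime DATETIME = GETDATE();

    /* The work for the step goes here, for example (hypothetical file name):
       :r .\PopulateReferenceData.sql
    */

    PRINT 'Step [PopulateReferenceData] duration = '
        + CONVERT(VARCHAR(10), DATEDIFF(ss, @vStepStartTime, GETDATE())) + ' seconds';

    -- Reset the clock before the next step
    SET @vStepStartTime = GETDATE();
    ```

    Bear in mind that a local variable does not survive a GO batch separator, so if a step contains GO the variable needs to be declared again in the batch that prints the duration.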

    Use sqlcmdvars and the command-line as much as possible

    Hardcoding any value into a piece of code is a fraught practice; you should assume that values previously thought to be constant may not be so in the future. You can protect yourself from future changes by storing all literal values as variables in your .sqlcmdvars file. Sure, you can supply default values for those variables but you have the added advantage that they can be overridden from the command-line when deploying using vsdbcmd.exe. Moreover, if you have values that are hardcoded in multiple places in your code then specifying those values in .sqlcmdvars ensures that your code adheres to the principle of DRY (Don’t Repeat Yourself). Lastly, if values are stored in the .sqlcmdvars file then you can output them at deploy time (see recommendation #7).

    Recommendation #10: All literal values should be stored in your .sqlcmdvars file
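
    By way of illustration, here is a sketch of a Post-Deployment fragment that takes its value from a variable rather than a literal. Both the $(SupportEmailAddress) variable and the [dbo].[Setting] table are hypothetical; the variable would need to be declared in .sqlcmdvars in the same way as $(BuildId).

    ```sql
    -- $(SupportEmailAddress) is a hypothetical variable declared in .sqlcmdvars;
    -- [dbo].[Setting] is a hypothetical name/value configuration table.
    PRINT 'SupportEmailAddress=$(SupportEmailAddress)';

    UPDATE [dbo].[Setting]
    SET    [SettingValue] = N'$(SupportEmailAddress)'
    WHERE  [SettingName]  = N'SupportEmailAddress';
    PRINT CAST(@@ROWCOUNT AS NVARCHAR(10)) + N' rows updated in [dbo].[Setting]';
    ```

    The default from .sqlcmdvars is used when deploying from Visual Studio, and a different value can be supplied per environment from the command-line using the same /p: switch that was shown earlier for BuildId.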

    Every developer gets their own development database

    In most SQL Server development shops that I’ve worked in, all developers work against a single centralised development database. To me this is an antiquated way of working because it’s possible that work one person is doing can conflict with that of someone else. I find it much better for every developer to work in isolation and then use the CI deployment to check that one’s code is not in conflict with anyone else’s. Datadude supports (nay encourages) this way of working with the notion of an Isolated Development Environment:

    image

    Every developer should configure their isolated development environment which, typically, would be their local instance. And so to my next recommendation:

    Recommendation #11: Every developer should use the Isolated Dev Environment settings in order to author their code

    Incidentally, if every developer has their own development database and you are following my earlier recommendation to use a [DeployLog] table then you can track how often a developer is bothering to deploy and test their code. On a recent project we used this evidence in a (ahem) discussion with a developer who tried to convince us that he was testing his code sufficiently even though he was repeatedly causing the CI deployment to fail.

    Don’t delete anything from your project

    When objects are no longer required in your database then intuitively it makes sense to remove the file containing that object from the datadude project. I would however like to suggest a different approach: rather than removing a file, just change its Build Action property to “Not in Build”:

    image

    This has the advantage that your project maintains some semblance of history of what objects have been removed from your database – that can be useful to anyone inheriting your code in the future.

    Recommendation #12: Use “Not in Build” to remove an object from your database

    Build and Deploy your datadude projects outside of Visual Studio

    Building and deploying your datadude projects within Visual Studio can become a real time hog; in my experience it’s not unusual for deployments to take many minutes and your Visual Studio environment will be unavailable for further development work during that time. For that reason I recommend investing some time in writing some msbuild scripts that will build and deploy your project(s) from the command-line. Here are some examples that you can adapt for your own use, firstly a script to build a solution:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Execute using:
    msbuild SolutionBuild.proj
    -->
    <Project  xmlns="http://schemas.microsoft.com/developer/msbuild/2003"
              DefaultTargets="Build">
      <!-- Notes:
          When doing .net development Visual Studio Configurations are particularly useful because they can affect
          how the code is executed (i.e. under the Debug configuration debug symbols can be used to step through
          the code; something like that anyway, I don't know too much about that stuff).
          In DBPro, Configurations are less relevant because there is no such thing as debugging symbols. Nonetheless, they
          can still be useful for times when you want to do different things (e.g. you might want to run Code Analysis in
          a debug situation but not in a release situation). There is a useful thread on this here:
          "Debug vs Release" http://social.msdn.microsoft.com/Forums/en-US/vstsdb/thread/a0ec0dc0-a907-45ba-a2ea-d2f0175261a7
        
          Note that Visual Studio Configurations should not be used to maintain different settings per environment.
          The correct way to do that is to maintain separate .sqlcmdvars files per environment and then choose which
          one to use at deployment time when using vsdbcmd.exe
          (use syntax "/p:SqlCommandVariablesFile=$(ProjectName)_$(Environment).sqlcmdvars")
      -->
      <ItemGroup>
        <!-- List all the configurations here that you want to build -->
        <Config Include="Debug" />
        <Config Include="Release" />
      </ItemGroup>
      <Target Name="Build">
        <Message Text="Building %(Config.Identity) configuration..."/>
        <MSBuild Projects=".\Lloyds.UKTax.DB.sln" Properties="Configuration=%(Config.Identity)" />
      </Target>
    </Project>

    and secondly a script that will deploy a datadude project:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Execute using:
    msbuild SolutionDeploy.proj /Target:Deploy
    -->
    <Project  xmlns="http://schemas.microsoft.com/developer/msbuild/2003"
              DefaultTargets="Build;Deploy">
      <PropertyGroup>
        <!-- At time of writing I don't see a reason for anything else to be used but that may change in the future hence why this
          is a property and hence can be overridden. -->
        <Configuration>Debug</Configuration>
        <DevServer>Data Source=GBS0039182\GLDDEV01;Integrated Security=True;Pooling=False</DevServer>
      </PropertyGroup>
     
      <ItemGroup>
        <ProjectToBuild Include="SolutionBuild.proj" />
      </ItemGroup>

      <!-- Notes:
            Add a <DbProj> item for every database project (.dbproj) that needs to be deployed. They will get deployed in the
            order that they are listed, thus it is your responsibility to make sure they are listed in the correct
            order (respecting dependency order).
            %Identity is a metadata reference. It refers to the name of the item (i.e. Include="The bit that goes here is the
              identity")
            Note also that whatever you put for Include is important. Include="dev_thomsonj" means the project will only get
          deployed if the deployment is being executed by username=dev_thomsonj -->
      <ItemGroup>
        <DbProj Include="username">
          <DbName>MyDB</DbName>
          <ProjectName>MySoln.MyDB</ProjectName>
          <OutputPath>\%(ProjectName)\sql\$(Configuration)\</OutputPath>
          <DeployConnStr>Data Source=localhost;Integrated Security=True;Pooling=False</DeployConnStr>
        </DbProj>
      </ItemGroup>

      <Target Name="Build">
        <MSBuild Projects="@(ProjectToBuild)" />
      </Target>
     
      <Target Name="Deploy">
        <!-- Notes:
              09 is the hex code for TAB, hence all of the %09 references that you can see. See http://asciitable.com/
                for more details.
              Both sides of each Condition are wrapped in single quotes so that the comparison still works
              if the username is empty or contains non-alphanumeric characters.
        -->
        <Message Text="USERNAME=$(USERNAME)" />
        <Message Condition="'%(DbProj.Identity)'=='$(USERNAME)'" Text="Deploying:
         Project%09%09%09:  %(DbProj.ProjectName)  
         DbName%09%09%09:  %(DbProj.DbName)
         From OutputPath%09%09:  %(DbProj.OutputPath)
         To ConnStr%09%09:  %(DbProj.DeployConnStr)
         By%09%09%09:  %(DbProj.Identity)"
         />
        <Exec Condition="'%(DbProj.Identity)'=='$(USERNAME)'" Command="&quot;$(VSINSTALLDIR)\vstsdb\deploy\vsdbcmd.exe&quot;
                /Action:Deploy /ConnectionString:&quot;%(DbProj.DeployConnStr)&quot; /DeployToDatabase+
                /manifest:&quot;.%(DbProj.OutputPath)%(DbProj.ProjectName).deploymanifest&quot; /p:TargetDatabase=%(DbProj.DbName)
                /p:Build=&quot;from cmd line&quot;" />
      </Target>
    </Project>

    Writing these scripts may appear to be laborious but they’ll save you heaps of time in the long run.

    Recommendation #13: Build and deploy to your development sandbox using scripts

    UPDATE: Upon reading this blog post Mordechai Danielov wrote a follow-up in which he published a useful script that builds a series of projects using Powershell. It’s at building your database solutions outside of Visual Studio.

    Useful links

    Over the years I’ve collected some links to MSDN articles that have proved invaluable:

    Datadude bugs

    Like any substantial piece of software datadude is not without bugs. Many of the issues I have found are concerned with the datadude interpreter not correctly parsing T-SQL code, here’s a list of some bugs that I have found down the years:

    Some of these bugs were reported a long time ago and may well have been fixed in later service packs.

    Previous datadude blog posts

    I have blogged on datadude quite a bit in the past:

    Summing up

    This has been an inordinately large blog post so if you’ve read this far – well done. For easy reference, here are all the recommendations that I have made:

    1. Use Source Control and implement a Continuous Integration deployment
    2. Turn on Code Analysis
    3. When running your deployment in a test environment, run it more than once
    4. Maintain an automated history of your deployments
    5. Use Schema View in preference to Solution Explorer
    6. Any action in a Pre or Post-Deployment Script should use PRINT to state what has been done
    7. Output the value of all variables in your Pre-Deployment script
    8. Each database object should be defined in a dedicated file
    9. Time your deployments and output the timings
    10. All literal values should be stored in your .sqlcmdvars file
    11. Every developer should use the Isolated Dev Environment settings in order to author their code
    12. Use “Not in Build” to remove an object from your database
    13. Build and deploy to your development sandbox using scripts

    I really hope this proves useful because it’s taken a good long while to get it published. If you have any feedback then please let me know in the comments.

    Thanks for reading!

    @jamiet

    * When I started writing this blog post the first sentence was “Over the past six months I have worked on two separate projects for customers that wanted to make use of Visual Studio 2010 Database projects to manage their database schema.” as opposed to what it is now: “Over the past eighteen months I have worked on four separate projects…” Yes, that’s how long it’s taken to write it!

     

     

    Appendix – An example CI configuration

    As stated above, an earlier draft of this blog post included full details of the CI configuration from one of the projects that I have worked on. Although it may repeat some of what has already been said I have included that text below.

    ==============================================================

    Introduction

    This project has invested heavily in using a Continuous Integration (CI) approach to development. What that means, succinctly, is that whenever someone checks-in some code to our source control system an automated build process is kicked off that constructs our entire system from scratch on a dedicated server. CI is not a new concept but it is fairly rare that anyone applies the same rigour to their database objects as they do to so-called “application code” (e.g. the stuff written in .Net code) and on this project we have made a conscious decision to properly build our databases as part of the CI build.

    Datadude employs a declarative approach to database development. In other words you define what you want your database schema to look like and datadude will work out what it needs to do to your target in order to turn it into what you have defined. What this means in practice is that you only ever write CREATE … DDL statements rather than the IF <object-exists> THEN ALTER … ELSE CREATE … statements which you may have written in the past.
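
    The contrast can be sketched as follows. The table [dbo].[Customer] is hypothetical, and the commented-out portion is only my illustration of the sort of hand-written script that the declarative approach replaces.

    ```sql
    -- Declarative: the file in the datadude project states only the desired end state.
    CREATE TABLE [dbo].[Customer]
    (
        [CustomerId]   INT           NOT NULL,
        [CustomerName] NVARCHAR(100) NOT NULL,
        CONSTRAINT [PK_dboCustomer] PRIMARY KEY ([CustomerId])
    );

    -- Imperative (shown commented out for contrast): every possible starting state
    -- of the target has to be detected and handled by hand.
    -- IF OBJECT_ID(N'[dbo].[Customer]', N'U') IS NULL
    --     CREATE TABLE [dbo].[Customer] ([CustomerId] INT NOT NULL, [CustomerName] NVARCHAR(100) NOT NULL);
    -- ELSE IF COL_LENGTH(N'[dbo].[Customer]', N'CustomerName') IS NULL
    --     ALTER TABLE [dbo].[Customer] ADD [CustomerName] NVARCHAR(100) NULL;
    ```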

    Here’s our CI environment setup:

    • SubVersion (SVN) is being used for source control
    • Hudson, an open source CI server, is being used to orchestrate our whole CI environment. It basically monitors our SVN repository and when it spots a checked-in file, kicks off the CI build
    • Our CI scripts (the stuff that actually does the work) are written using msbuild
    • We have 2 msbuild scripts:
      • Build.proj which is responsible for:
        • Compiling all our .Net website code
        • Building/Compiling our datadude projects  (every datadude project file is a msbuild-compliant script)
      • Deploy.proj which is responsible for:
        • Restoring latest database backups from our production environment into our CI environment
        • Deploying our built datadude projects on top of those restored backups
        • Building a folder structure to hold all the artefacts that get deployed
        • Creating folder shares
        • Moving SSIS packages into folder structure
        • Deploying SSRS reports to our SSRS server
        • Deploying our Analysis Services cube definitions to our Analysis Server
    • Both Build.proj and Deploy.proj get executed by our CI build

    Building datadude projects

    Datadude makes it very easy to build datadude projects in a CI environment because they are already msbuild-compliant; it’s simply a call to the MSBuild task, passing in the location of the solution file as an argument. We use the Release configuration (although there is no particular reason for you to do the same – purely your choice):

    <Target Name="Database">
      <!--Build database projects and copy output to staging -->
      <Message Text="*****Building database solution" />
      <MSBuild Projects="..\src\SQL\DatabaseSolution.sln" Properties="Configuration=Release" />
    </Target>

    That’s it! The output from a datadude build includes a number of files but the most important one is a .dbschema file which is an XML representation of all the objects in your database.

    Deploying the output from a built datadude project

    This is a little more difficult. We *could* simply use the MSBuild task to call our deployment script as we do for the build script (see above) but the problem with that is that there are many pre-requisites (including datadude itself) and we don’t want to install Visual Studio and all the assorted paraphernalia onto our various environments. Instead we chose to make use of a command-line tool called VSDBCMD.exe to deploy datadude projects. VSDBCMD does basically the same job as what happens if you were to right-click on a datadude project in Visual Studio and select “Deploy”, i.e. it compares the output of a build (A) to the target database (B) and works out what it needs to do to make B look like A. It then produces a .sql script that will actually make those requisite changes, then goes and executes it.

    The difficulty comes in VSDBCMD.exe having its own list of file dependencies, which are listed in the MSDN article How to: Prepare a Database for Deployment From a Command Prompt by Using VSDBCMD.EXE; thankfully it is a much smaller list than if we were using the MSBuild task.

    image

    Some of those files, namely:

    • Sqlceer35en.dll
    • Sqlceme35.dll
    • Sqlceqp35.dll
    • Sqlcese35.dll
    • System.Data.SqlServerCe.dll

    get installed with SQL Server CE. We bundle along the x86 & x64 installers for SQL Server CE along with all the rest of our deployment artefacts and then, as part of Deploy.proj, install them like so:

    <Exec Command='msiexec /passive /l* "$(SetupLogDirectory)\SSCERuntime_x86-ENU.log" /i "$(BuildDir)\Vendor\SSCERuntime_x86-ENU.msi"' />
    <Exec Condition="$(X64)" Command='msiexec /passive /l* "$(SetupLogDirectory)\SSCERuntime_x64-ENU.log" /i "$(BuildDir)\Vendor\SSCERuntime_x64-ENU.msi"' />

    That takes care of some of the dependencies but we still have to take care of:

    • DatabaseSchemaProviders.Extensions.xml
    • Microsoft.Data.Schema.dll
    • Microsoft.Data.Schema.ScriptDom.dll
    • Microsoft.Data.Schema.ScriptDom.Sql.dll
    • Microsoft.Data.Schema.Sql.dll
    • Microsoft.SqlServer.BatchParser.dll

    as well as the actual VSDBCMD.exe file itself. Quite simply we keep those files in SVN and then bundle them along with all our deployment artefacts (I won’t show you how we do that because it’s out of the scope of this post and besides if you’re at all proficient with msbuild then you’ll know how to do that and if you’re not, well, why are you reading this?)

    ==============================================================

  • Querying RSS feed subscriber count on Google Reader using Data Explorer

    I have been fiddling about with Data Explorer some more and have built an interesting little mashup that enables one to discover the number of Google Reader subscribers to each RSS feed on SQLBlog.com. Before I show you how it's built I'll whet your appetite with a screenshot of the output:

    excelscr
    For brevity I have deliberately hidden some of the results however you can see the full dataset for yourself (and see if any of the numbers have changed since I took this screenshot) by visiting https://ws41451459.dataexplorer.sqlazurelabs.com/Published/SQLBlog%20subscriber%20counts%20on%20Google%20Reader and opening the data in the format of your choosing (i.e. Excel, OData or CSV). All the data is publicly available as is the mashup so you won't have any difficulty in accessing it. Opening the data yourself will also illustrate how long it takes Data Explorer to execute the mashup which, when you consider that the mashup queries the Google Reader API for data on each RSS feed in turn, could be quite interesting in itself.
    So, how is it done? It's very simple once you know how. My mashup has four resources:
    resources
    Which we can take a look at in turn.

    Resource: StripOutCommas

    This is simply a function that will remove commas from a string. This is important because the API that we are using returns numbers as text and any numbers greater than 999 get commas inserted. It is defined simply as:

    (value) => Text.Replace(value,",","")

     stripoutcommas

    Resource: List Of Feeds

    Simply a list of RSS feeds for which we are going to get the subscriber count. The first task, "Typed list of feeds" , is our typed-in list of RSS feeds:
    listoffeeds

    The second task, "Convert to table", does exactly what it says on the tin:
    converttotable

    N.B. I have, by the way, complained loudly about the inability to resize columns in the Data Explorer UI.

    Resource: GetNumberOfSubscribersForFeed

    This is where the real work occurs. We are using an API provided by Carter Cole at http://cartercole.com/dev/api/greaderapi.asp that wraps the Google Reader API, thus making it easy to query for the number of subscribers.

    (feed) => StripOutCommas(Json.Document(Web.Contents("http://cartercole.com/dev/api/greaderapi.asp", [Query = [feed = feed]]))[subscribers])

    getnumberofsubscribers

    Here I have essentially defined a function that (1) calls Carter's API, (2) passes it a parameter called feed, (3) converts the result into a JSON document, (4) extracts the "subscribers" value and (5) passes the result to StripOutCommas().
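
    The five steps above can be sketched in Python as follows. This is an approximation under stated assumptions, not a translation that Data Explorer produces: the response shape ({"subscribers": "1,234"}) is inferred from the description in this post, and the fetch argument is a hypothetical stand-in for the HTTP call so that the sketch can be run offline against a canned response.

```python
import json
from typing import Callable
from urllib.parse import urlencode

API_URL = "http://cartercole.com/dev/api/greaderapi.asp"  # URL taken from the post


def get_number_of_subscribers_for_feed(feed: str, fetch: Callable[[str], str]) -> str:
    """Return the subscriber count for one feed as comma-free text.

    `fetch` is an assumed callable that takes a URL and returns the response
    body as text; pass in a real HTTP client or a fake for testing.
    """
    url = API_URL + "?" + urlencode({"feed": feed})  # steps (1) and (2)
    document = json.loads(fetch(url))                # step (3)
    subscribers = document["subscribers"]            # step (4), assumed to be text
    return subscribers.replace(",", "")              # step (5)


if __name__ == "__main__":
    # Canned response standing in for the real API (invented sample data).
    canned = lambda url: '{"subscribers": "1,234"}'
    print(get_number_of_subscribers_for_feed("http://example.com/rss", canned))
```

    Passing the network call in as a parameter is a design choice for the sketch only; in the mashup, Web.Contents() does that work directly.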

    Resource: Number Of Subscribers Per Feed

    This resource is what pulls everything together. Firstly the "Call func on each RSS feed" task calls our "GetNumberOfSubscribersForFeed" resource on each RSS feed in the "List Of Feeds" resource:

    Table.AddColumn(#"List Of Feeds", "NumberOfSubscribers", each GetNumberOfSubscribersForFeed([feed]))
    numberofsubscribersperfeed


    The "Rename column", "Convert Text to Integer" & "Sort Subscribers DESC" tasks are, hopefully, self-explanatory.
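
    In case they are not, here is a rough Python sketch of the four tasks in this resource. The new column name "Subscribers" is an assumption (the post does not say what the column is renamed to), and get_count stands in for the GetNumberOfSubscribersForFeed resource.

```python
from typing import Callable, Dict, List


def subscribers_per_feed(feeds: List[str], get_count: Callable[[str], str]) -> List[Dict]:
    """Add a count column, rename it, convert it to an integer and sort descending."""
    # "Call func on each RSS feed": one row per feed with the raw text count.
    rows = [{"feed": feed, "NumberOfSubscribers": get_count(feed)} for feed in feeds]
    # "Rename column": the target name "Subscribers" is hypothetical.
    rows = [{"feed": r["feed"], "Subscribers": r["NumberOfSubscribers"]} for r in rows]
    # "Convert Text to Integer".
    rows = [{**r, "Subscribers": int(r["Subscribers"])} for r in rows]
    # "Sort Subscribers DESC".
    return sorted(rows, key=lambda r: r["Subscribers"], reverse=True)


if __name__ == "__main__":
    # Invented sample counts, keyed by made-up feed names.
    sample = {"feed-a": "120", "feed-b": "4500", "feed-c": "87"}
    for row in subscribers_per_feed(list(sample), sample.__getitem__):
        print(row)
```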

    Consume

    And that's it! As already stated the mashup has been published at https://ws41451459.dataexplorer.sqlazurelabs.com/Published/SQLBlog%20subscriber%20counts%20on%20Google%20Reader from where you can:

    • Download the mashup output as a CSV file
    • Download the mashup output as an Excel document
    • Download the mashup itself so you can play with it at your leisure
    • View the output in an OData feed
    downloadsite

    If you want to use the mashup to get similar counts for your own set of RSS feeds simply change the "List of Feeds" resource appropriately. Happy data exploring!

    @Jamiet

  • ExcelMashup.com now has a dedicated forum

    Five days ago I published a fairly scathing attack on the folks that built ExcelMashup.com due to their poor support for folks that wanted to provide feedback. Well, credit where credit is due, they have been pretty quick to respond as I have been informed that the site has been updated with a link to the dedicated forum which is hosted on Microsoft Answers. In an offline email exchange I promised them that if they did the right thing then I would update my blog accordingly, hence the post that you are reading right now.

    Note that I do not know whether or not they have started responding to the email alias mentioned at http://www.excelmashup.com/Content/Community.html.

    @jamiet

  • Thoughts on ExcelMashup.com (and a rant)

    Microsoft last week made available a new website called ExcelMashup.com and Chris Webb tweeted me exclaiming:

    "Excel Mashup: http://bit.ly/uvvFxy - so, @jamiet, we've got the Excel Services API but no OData unfortunately...?"
    If you have followed Chris' blog and my own of late you may know that the need to easily extract data out of Excel spreadsheets has become a personal crusade of ours; for evidence check the following:
    In his blog post Chris summarises quite nicely what we would like to see in terms of an API on top of the Excel Web Apps:
    what I’d like to see is the Excel Web App be able to do the following:
    • Consume data from multiple data source types, such as OData, and display that data in a table
    • Expose the data in a range or a table as an OData feed


    My take on it: I simply want to make data [consumable from/producible into] Excel in a manner that is user-agnostic.

    Hence why Chris tweeted me that link - an API for Excel that uses files on SkyDrive (which is what ExcelMashup.com talks about) certainly sounded promising. Unfortunately it seems ExcelMashup.com is not what we hoped it would be - it is a JavaScript API and hence intended for pulling data out of an Excel spreadsheet and displaying it on a website. There's nothing wrong with that of course but it does not cover any of the scenarios that Chris and I are interested in and frankly that is rather disappointing. To sum up, as Mike Levin on the forum thread says:

    A web-based spreadsheet without an easily accessed API amounts to cutting it off from the world of data around it. I do mashups against Google Spreadsheets all the time, and came over here looking to port the work, and make it Web-spreadsheet agnostic.
    I'm disappointed. Just give me a credentials system and a REST API. I see the reference to having this in Sharepoint, but is there a lightweight way to do it? With login credentials and a restful API, we woudn't even need client libraries, Sharepoint, or any such software overhead. Just use the language of your choice, and bang against the spreadsheet.

    ExcelMashup.com simply does not provide what Mike is after, and that is disappointing.

    I decided it would be right and proper to give the above feedback to the team behind ExcelMashup.com so I headed to http://www.excelmashup.com/Content/Forums.html to see this:

    image 

    Very strange, none of the forums have got anything to do with “Excel Mashup”, instead all but one of them seem related to Sharepoint which, in this context, I have no interest in. Undeterred I headed to the Sharepoint 2010 General Questions and Answers forum where I posted the following:

    ExcelMashup.com - Where is the forum?
    Hello,
    I was browsing http://www.excelmashup.com/Content/Forums.html to discover where to go to ask questions about excelmashup.com and the forum that I am posting this to seemed the most appropriate. Its still not a forum dedicated to excelmashup.com though - does such a forum exist?
    I have lots of questions that I would like to ask about excelmashup.com but I'm not going to waste my time firing off questions to non-relevant forum.


    regards
    Jamie

    I thought that was pretty fair - they haven't provided a link to a relevant forum so I wanted to know if such a forum existed. Apparently though a Microsoft employee didn't agree with me because the thread has been moved to the Off-Topic Posts (Do Not Post Here) forum. Say what? I post a forum thread related to ExcelMashup.com on a forum that ExcelMashup.com advises me to post on and that thread gets moved to the "F off and stop bothering us" bucket? Are you fricking kidding me? In addition to that facepalm I also sent an email to docthis@microsoft.com (as advised at http://www.excelmashup.com/Content/Community.html) saying exactly the same as on my forum thread. That was two days ago and I haven't yet received a reply.


    OK, I can accept the fact that ExcelMashup doesn't have any use for me - I have no issue with that. What angers me is that the site has been put out for customers to use and then promptly disregarded. There is no dedicated forum, they're clearly not monitoring the forums that they provide links to, Microsoft support folks clearly have no idea what ExcelMashup is and instead are soft-deleting any forum thread related to it and to top it all the ExcelMashup team aren't bothering to respond to emails sent to the email address that they provide a link to.


    I don't know why they didn't just put a logo showing a big two-fingered salute on their website instead, just to save us all the bother!

    @jamiet

    UPDATE, 19th December 2011:  I have been informed by Cyrielle Simeone in the comments below that ExcelMashup.com has been updated in light of my comments and now has a specific forum for ExcelMashup.com at Microsoft Answers.

  • Data Explorer Feedback part 1

    Earlier today I posted Data Explorer walkthrough – Parsing a Twitter list in which I explained that I have been using Data Explorer for a few weeks now. In that time I have been compiling some feedback for the Data Explorer product team which I was going to email privately but I see no harm in putting these thoughts into the public domain, hence why I am writing this blog post – a hodge podge of thoughts/suggestions/gripes from the past few weeks of using Data Explorer.

    • I’m very impressed that if I change the name of a resource any references to that resource are updated accordingly. After years of working with the SSIS dev tools I’m automatically wired to expect that not to work.
    • I love the composability of tasks and resources (e.g. Define a custom function as a resource and use it in another resource). That idea of composability and reuse is something that is sadly lacking in SSIS (and the SSIS team know that because I’ve moaned about it often enough).
    • If the name of the resource is too large for the box then there is no way to see the full name without editing it:

    image

    • It would be nice to be able to reorder the list of resources
    • We *really* need the ability to widen the columns when viewing the results of a task.
    • Data Explorer seems a bit lacking in its abilities to parse HTML & XML right now. For example, I'm surprised that I can't use XPath to extract the contents of an HTML/XML document. In fact, the lack of such features is the most disappointing aspect of Data Explorer thus far and means that the product is still a long way off being a Kapow competitor, remembering that this is still an early beta of course.
    • We need better ways of visualising the output from a mashup (i.e. something that doesn’t require the user to have Excel installed). How about giving SSRS the ability to consume an OData feed – we could then host such a report on SQL Azure Reporting, perhaps even sell that report via the Azure Marketplace.
    • If I make a mashup publicly available I would like to be able to know how many times it is called. I believe that the ability to INSERT data (e.g. into SQL Azure) using Data Explorer will be coming soon (it's kind of already there with the Snapshot feature) so I'm wondering if there is a way to "trigger" an insertion when someone consumes one of my exposed resources? I suspect the answer is "not right now" so consider that a feature request. I guess you could crystallise that request as "Provide an eventing model within a mashup so that resources can be triggered when some event (e.g. Someone consumes a resource) occurs."
    • We need a scheduler that enables us to run Snapshots at some pre-defined time.
    • I have a mashup published at https://ws41451459.dataexplorer.sqlazurelabs.com/Published/TwitterListDemo. The link for the feed is https://ws41451459.dataexplorer.sqlazurelabs.com/Published/TwitterListDemo/Feed but that doesn’t actually show me any data; it shows me a list of resources that produce the data. In this case, however, there is only one resource (https://ws41451459.dataexplorer.sqlazurelabs.com/Published/TwitterListDemo/Feed/jamiet-dataexplorer-text_user_screen_name). Would it not make sense to link directly to the feed with data in it if there is only actually one external resource in my mashup? It would save a mouse-click at least.
    • I haven’t yet figured out if it’s possible to do recursion. Still working on that one.
    • In my Twitter List example the values “jamiet” & “data-explorer” for the list owner and list slug respectively were hardcoded in the mashup; yes, they were parameters to a function but they were still hardcoded. It would have been much better if I were able to give the consumer of the mashup the ability to define/override those values when they consume it. In other words, a mashup needs to be parameterizable.
    • Overall I am very impressed and the excitement that I displayed in my initial post Thoughts on Data Explorer is justified. Data Explorer coalesces nicely with existing interests of mine such as SQL Server, ETL & web-enabled data and I have high hopes that I will be using this extensively for clients in the years to come.

  • Data Explorer walkthrough – Parsing a Twitter list

    Yesterday the public availability of Data Explorer, a new data mashup tool from the SQL Server team, was announced at Announcing the Labs release of Microsoft Codename “Data Explorer”. Upon seeing the first public demos of Data Explorer at SQL Pass 2011 I published a blog post Thoughts on Data Explorer which must have caught someone’s attention because soon after I was lucky enough to be invited onto an early preview of Data Explorer, hence I have spent the past few weeks familiarising myself with it. In this blog post I am going to demonstrate how one can use Data Explorer to consume and parse a Twitter list.

    I have set up a list specifically for this demo and suitably it is a list of tweeters that tweet about Data Explorer – you can view the list at http://twitter.com/#!/list/jamiet/data-explorer. Note that some of the screenshots in this blog post were taken prior to the public release and many of them have been altered slightly since then; with that in mind, here we go.

    First, browse to https://dataexplorer.sqlazurelabs.com/ and log in

    When logged in select New to create a new mashup

    image

    Give your mashup a suitable name

    image

    You will be shown some options for consuming a source of data. Click on Formula

    image

    We’re going to be good web citizens and use JSON rather than XML to return data from our list. The URI for our Twitter API call is https://api.twitter.com/1/lists/statuses.json?slug=data-explorer&owner_screen_name=jamiet. Note how I have specified the list owner (me) and the name of the list (what they call the slug) “data-explorer” as query parameters. If you go to that URL in your browser then you will be prompted to save a file containing the returned JSON document which, if all you want to do is see the document, isn’t very useful. In debugging my mashups I have found a service called JSON Formatter to be invaluable because it allows us to see the contents of a JSON document by supplying the URI of that document as a parameter like so: http://jsonformatter.curiousconcept.com/#https://api.twitter.com/1/lists/statuses.json?slug=data-explorer&owner_screen_name=jamiet. It might be useful to keep that site open in a separate window as you attempt to build the mashup below.

    I’ve digressed a little, let’s get back to our mashup. We’re going to use a function called Web.Contents() to consume the contents of the Twitter API call and pass the results into another function, Json.Document(), which parses the JSON document for us. The full formula is:

    = Json.Document(Web.Contents("https://api.twitter.com/1/lists/statuses.json?slug=data-explorer&owner_screen_name=jamiet"))

    image

    When you type in that formula and simply hit enter you’re probably going to be faced with this screen:

    image

    It’s asking you how you want to authenticate with the Twitter API. Calls to the https://api.twitter.com/1/lists/statuses.json resource don’t require authentication so anonymous access is fine, just hit continue. When you do you will see something like this:

    image

    The icon (image) essentially indicates a dataset, so each record of these results is in itself another dataset. We’ll come onto how we further parse all of this later on but before we do we should clean up our existing formula so that we’re not hardcoding the values “data-explorer” and “jamiet”.

    The Web.Contents() function possesses the ability to specify named parameters rather than including them in the full URL. Change the formula to:

    = Json.Document(Web.Contents("https://api.twitter.com/1/lists/statuses.json", [Query = [slug="data-explorer", owner_screen_name="jamiet"] ]))

    image

    That will return the same result as before but now we’ve broken out the query parameters {slug, owner_screen_name} into parameters of Web.Contents(). That’s kinda nice but they’re still hardcoded; instead what we want to do is turn the whole formula into a callable function. We do that by specifying a function signature and including the parameters of the signature in the formula like so:

    = (slug,owner_screen_name) => Json.Document(Web.Contents("https://api.twitter.com/1/lists/statuses.json", [Query = [slug=slug, owner_screen_name=owner_screen_name] ]))

    image

    Let’s give our new function a more meaningful name by right-clicking on the resource name which is currently set as “Custom1” and renaming it as “GetTwitterList”:

    image

    image

    We have now defined a new function within our mashup called GetTwitterList(slug, owner_screen_name) that we can call as if it were a built-in function.
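
    If it helps to see the same refactoring outside of Data Explorer, here is a small Python sketch that only covers the URL-building half of the function (no request is made). It is an analogy rather than an equivalent of the formula: the name twitter_list_url is made up, and urlencode plays the role that the Query record plays in Web.Contents().

```python
from urllib.parse import urlencode

BASE_URL = "https://api.twitter.com/1/lists/statuses.json"  # URL taken from the post


def twitter_list_url(slug: str, owner_screen_name: str) -> str:
    """Build the request URI from named parameters instead of a hardcoded string."""
    query = urlencode({"slug": slug, "owner_screen_name": owner_screen_name})
    return BASE_URL + "?" + query


if __name__ == "__main__":
    print(twitter_list_url("data-explorer", "jamiet"))
```

    The benefit is the same as in the mashup: callers supply the list slug and owner, and the base URL lives in exactly one place.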

    image

    Let’s create a new resource as a formula that uses our new custom function and pass it some parameter values:

    = GetTwitterList("data-explorer", "jamiet")

    image

    We still have the same results but now via a nice neat function that abstracts away the complexity of Json.Document( Web.Contents() ).

    As stated earlier each of the records is in itself a dataset each of which, in this case, represents lots of information about a single tweet. We can go a long way to parsing out the information using a function called IntoTable() that takes a dataset and converts it into a table of values:

    image

    Here is the result of applying IntoTable() to the results of GetTwitterList():

    image

    This is much more useful; we can now see lots of information about each tweet. Notice, however, that information about the user who wrote the tweet is wrapped up in yet another nested dataset called “user”.

    All the time note how whatever data we are seeing and whatever we do to that data via the graphical UIs is always reflected in the formula bar; in the screenshot immediately above notice that we are selecting the “user” and “text” columns (the checkbox for “user” is off the screen but is checked).

    We can now parse out the user’s screen_name using a different function – AddColumn(). AddColumn() takes an input and allows us to define a new column (in this case called “user_screen_name”) and specify an expression for that column based on the input. A picture speaks a thousand words so:

    = Table.AddColumn(intoTable, "user_screen_name", each [user][screen_name])

    image

    There we have our new column, user_screen_name, containing the name of the tweeter that tweeted the tweet. At this point let’s take a look at the raw JSON to see where this got parsed out from:

    image

    Notice that the screen_name, UserEd_, is embedded 3 levels deep within the hierarchical JSON document.

    We’re almost there now. The final step is to use the function SelectColumns() to select the subset of columns that we are interested in:

    = Table.SelectColumns(InsertedCustom,{"text", "user_screen_name"})

    image

    Which gives us our final result:

    image
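
    For comparison, the last two steps (pulling screen_name out of the nested “user” record and then keeping only two columns) look roughly like this in Python. The sample tweet is invented purely for illustration and the function names are my own; only the field names text, user and screen_name come from the walkthrough above.

```python
from typing import Dict, List


def add_user_screen_name(rows: List[Dict]) -> List[Dict]:
    """Like Table.AddColumn(..., "user_screen_name", each [user][screen_name])."""
    return [{**row, "user_screen_name": row["user"]["screen_name"]} for row in rows]


def select_columns(rows: List[Dict], columns: List[str]) -> List[Dict]:
    """Like Table.SelectColumns(): keep only the named columns, in the given order."""
    return [{column: row[column] for column in columns} for row in rows]


if __name__ == "__main__":
    # Invented sample data shaped like one parsed tweet.
    tweets = [
        {"id": 1, "text": "Trying out Data Explorer", "user": {"id": 10, "screen_name": "example_user"}},
    ]
    result = select_columns(add_user_screen_name(tweets), ["text", "user_screen_name"])
    print(result)
```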

    At this point hit the Save button:

    image

    OK, so we have a mashup that pulls some data out of Twitter, parses it and then….well…nothing! It doesn’t actually do anything with that data. We have to publish the mashup so that it can be consumed and we do that by heading back to the home page (which is referred to as “My Workspace”) by clicking the My Workspace button near the top of the page:

    image

    Back in My Workspace you can select your newly created mashup (by clicking on it) and options Preview, Snapshot & Publish appear:

    image

    We’ll ignore Preview and Snapshot for now, hit the Publish button instead at which point we are prompted for a name that we will publish the mashup as:

    image

    Hitting Publish will do the necessary and make our data feed available at a public URI:

    image

    Head to that URL (https://ws41451459.dataexplorer.sqlazurelabs.com/Published/TwitterListDemo) and here’s what you see:

    image

    You can download the mashup output as a CSV file or an Excel workbook. You can also download the whole mashup so you can edit it as you see fit and, most importantly, you can access the output of the mashup via an OData feed at https://ws41451459.dataexplorer.sqlazurelabs.com/Published/TwitterListDemo/Feed/jamiet-dataexplorer-text_user_screen_name

    We have used Data Explorer’s JSON parsing and dataset navigation abilities to pull out the data that we are interested in and present it in a neat rectangular data structure that we are familiar with. Moreover we have done it without installing any software and we have made that data accessible via an open protocol; that’s pretty powerful and, in my wholly worthless opinion, very cool indeed.

    Have fun playing with Data Explorer. Feel free to download my Twitter List Demo mashup and mess about with it to your heart’s content.

    @jamiet

  • SQL Server Configuration timeouts - and a workaround [SSIS]

    Ever since I started writing SSIS packages back in 2004 I have opted to store configurations in .dtsConfig (i.e. XML) files rather than in a SQL Server table (aka SQL Server Configurations) however recently I inherited some packages that used SQL Server Configurations and thus had to immerse myself in their murky little world. To all the people that have ever gone onto the SSIS forum and asked questions about ambiguous behaviour of SQL Server Configurations I now say this... I feel your pain!

    The biggest problem I have had was in dealing with the change to the order in which configurations get applied that came about in SSIS 2008. Those changes are detailed on MSDN at SSIS Package Configurations however the pertinent bits are:

    As the utility loads and runs the package, events occur in the following order:

    1. The dtexec utility loads the package.
    2. The utility applies the configurations that were specified in the package at design time and in the order that is specified in the package. (The one exception to this is the Parent Package Variables configurations. The utility applies these configurations only once and later in the process.)
    3. The utility then applies any options that you specified on the command line.
    4. The utility then reloads the configurations that were specified in the package at design time and in the order specified in the package. (Again, the exception to this rule is the Parent Package Variables configurations). The utility uses any command-line options that were specified to reload the configurations. Therefore, different values might be reloaded from a different location.
    5. The utility applies the Parent Package Variable configurations.
    6. The utility runs the package.
    To understand how these steps differ from SSIS 2005 I recommend reading Doug Laudenschlager’s blog post Understand how SSIS package configurations are applied.

    The very nature of SQL Server Configurations means that the Connection String for the database holding the configuration values needs to be supplied from the command-line. Typically then the call to execute your package resembles this:

    dtexec /FILE Package.dtsx /SET "\Package.Connections[SSISConfigurations].Properties[ConnectionString]";"\"Data Source=SomeServer;Initial Catalog=SomeDB;Integrated Security=SSPI;\""

    The problem then is that, as per the steps above, the package will (1) attempt to apply all configurations using the Connection String stored in the package for the "SSISConfigurations" Connection Manager before then (2) applying the Connection String from the command-line and then (3) applying the same configurations all over again. In the packages that I inherited that first attempt to apply the configurations would time out (not unexpected); I had 8 SQL Server Configurations in the package and thus the package was waiting for 2 minutes until all the Configurations timed out (i.e. 15 seconds per Configuration). In a package that only executes for ~8 seconds when it gets to do its actual work, a delay of 2 minutes was simply unacceptable.

    We had three options in how to deal with this:

    1. Get rid of the use of SQL Server configurations and use .dtsConfig files instead
    2. Edit the packages when they get deployed
    3. Change the timeout on the "SSISConfigurations" Connection Manager

    #1 was my preferred choice but, for reasons I explain below*, wasn't an option in this particular instance. #2 was discounted out of hand because it negates the point of using Configurations in the first place. This left us with #3 - change the timeout on the Connection Manager. This is done by going into the properties of the Connection Manager, opening the "All" tab and changing the Connect Timeout property to some suitable value (in the screenshot below I chose 2 seconds).

    connman

    This change meant that the attempts to apply the SQL Server configurations timed out in 16 seconds rather than two minutes; clearly this isn't an optimum solution but it's certainly better than it was.
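
    The arithmetic behind those two figures is simple enough to sketch in a few lines of Python. It assumes, as observed above, that each SQL Server Configuration waits for the full Connect Timeout one after another; the function name is invented for the illustration.

```python
def total_timeout_delay(configuration_count: int, connect_timeout_seconds: int) -> int:
    """Seconds spent waiting if every configuration times out in turn."""
    return configuration_count * connect_timeout_seconds


if __name__ == "__main__":
    # 8 configurations at the original 15-second timeout: the two-minute delay.
    print(total_timeout_delay(8, 15))
    # The same 8 configurations with Connect Timeout lowered to 2 seconds.
    print(total_timeout_delay(8, 2))
```

    The delay scales linearly with the number of configurations, so a package with more SQL Server Configurations benefits proportionally more from the shorter timeout.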

    So there you have it - if you are having problems with SQL Server configuration timeouts within SSIS try changing the timeout of the Connection Manager. Better still - don't bother using SQL Server Configuration in the first place. Even better - install RC0 of SQL Server 2012 to start leveraging SSIS parameters and leave the nasty old world of configurations behind you.

    @Jamiet


    * Basically, we are leveraging a SSIS execution/logging framework in which the client had invested a lot of resources and SQL Server Configurations are an integral part of that.

  • Workaround for datadude deployment bug - NullReferenceException

    I have come across a bug in Visual Studio 2010 Database Projects (aka datadude aka DPro aka Visual Studio Database Development Tools aka Visual Studio Team Edition for Database Professionals aka Juneau aka SQL Server Data Tools) that other people may encounter so, for the purposes of googling, I'm writing this blog post about it. Through my own googling I discovered that a Connect bug had already been raised about it (VS2010 Database project deploy - “SqlDeployTask” task failed unexpectedly, NullReferenceException), and coincidentally enough it was raised by my former colleague Tom Hunter (whom I have mentioned here before as the superhuman Tom Hunter) although it has not (at this time) received a reply from Microsoft. Tom provided a repro, namely that this syntactically valid function definition:

    CREATE FUNCTION [dbo].[Function1]
    ()
    RETURNS TABLE
    AS
    RETURN
    (
        WITH cte AS (
            SELECT 1 AS [c1]
            FROM [$(Database3)].[dbo].[Table1]
        )
        SELECT 1 AS [c1]
        FROM cte
    )

    would produce this nasty unhelpful error upon deployment:

    C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v10.0\TeamData\Microsoft.Data.Schema.TSqlTasks.targets(120,5): Error MSB4018: The "SqlDeployTask" task failed unexpectedly.
    System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.Data.Schema.Sql.SchemaModel.SqlModelComparerBase.VariableSubstitution(SqlScriptProperty propertyValue, IDictionary`2 variables, Boolean& isChanged)
       at Microsoft.Data.Schema.Sql.SchemaModel.SqlModelComparerBase.ArePropertiesEqual(IModelElement source, IModelElement target, ModelPropertyClass propertyClass, ModelComparerConfiguration configuration)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareProperties(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, ModelComparisonChangeDefinition changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithoutCompareName(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, ModelComparisonResult result, ModelComparisonChangeDefinition changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithSameType(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, ModelComparisonResult result, Boolean ignoreComparingName, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, Boolean compareFromRootElement, ModelComparisonChangeDefinition& changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareChildren(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, Boolean parentExplicitlyIncluded, Boolean compareParentElementOnly, ModelComparisonResult result, ModelComparisonChangeDefinition changes, Boolean isComposing)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithoutCompareName(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, ModelComparisonResult result, ModelComparisonChangeDefinition changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithSameType(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, ModelComparisonResult result, Boolean ignoreComparingName, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, Boolean compareFromRootElement, ModelComparisonChangeDefinition& changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareChildren(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, Boolean parentExplicitlyIncluded, Boolean compareParentElementOnly, ModelComparisonResult result, ModelComparisonChangeDefinition changes, Boolean isComposing)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithoutCompareName(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, ModelComparisonResult result, ModelComparisonChangeDefinition changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareElementsWithSameType(IModelElement sourceElement, IModelElement targetElement, ModelComparerConfiguration configuration, ModelComparisonResult result, Boolean ignoreComparingName, Boolean parentExplicitlyIncluded, Boolean compareElementOnly, Boolean compareFromRootElement, ModelComparisonChangeDefinition& changes)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareAllElementsForOneType(ModelElementClass type, ModelComparerConfiguration configuration, ModelComparisonResult result, Boolean compareOrphanedElements)
       at Microsoft.Data.Schema.SchemaModel.ModelComparer.CompareStore(ModelStore source, ModelStore target, ModelComparerConfiguration configuration)
       at Microsoft.Data.Schema.Build.SchemaDeployment.CompareModels()
       at Microsoft.Data.Schema.Build.SchemaDeployment.PrepareBuildPlan()
       at Microsoft.Data.Schema.Build.SchemaDeployment.Execute(Boolean executeDeployment)
       at Microsoft.Data.Schema.Build.SchemaDeployment.Execute()
       at Microsoft.Data.Schema.Tasks.DBDeployTask.Execute()
       at Microsoft.Build.BackEnd.TaskExecutionHost.Microsoft.Build.BackEnd.ITaskExecutionHost.Execute()
       at Microsoft.Build.BackEnd.TaskBuilder.ExecuteInstantiatedTask(ITaskExecutionHost taskExecutionHost, TaskLoggingContext taskLoggingContext, TaskHost taskHost, ItemBucket bucket, TaskExecutionMode howToExecuteTask, Boolean& taskResult)
       Done executing task "SqlDeployTask" -- FAILED.
      Done building target "DspDeploy" in project "Lloyds.UKTax.DB.UKtax.dbproj" -- FAILED.
     Done executing task "CallTarget" -- FAILED.
    Done building target "DBDeploy" in project

    It turns out there is a certain set of circumstances that needs to be met for this error to occur:

    • The object being deployed is an inline function (the bug may also exist for multistatement and scalar functions - I haven't tested that)
    • That object includes SQLCMD variable references
    • The object has already been deployed successfully

    Just to reiterate that last bullet point, the error does not occur when you deploy the function for the first time, only on the subsequent deployment.

     

    Luckily I have a direct line into a guy on the development team, so I fired off an email on Friday evening and today (Monday) I received a reply telling me that there is a simple fix: one simply has to remove the parentheses that wrap the SQL statement. So, in the case of Tom's repro, the function definition simply needs to be changed to:

    CREATE FUNCTION [dbo].[Function1]
    ()
    RETURNS TABLE
    AS
    RETURN
    --(
        WITH cte AS (
            SELECT 1 AS [c1]
            FROM [$(Database3)].[dbo].[Table1]
        )
        SELECT 1 AS [c1]
        FROM cte
    --)

    I have commented out the offending parentheses rather than removing them just to emphasize the point.

    Thereafter the function will deploy fine. I tested this out on my own project this morning and can confirm that this fix does indeed work.

     

    I have been told that the bug CAN be reproduced in the Release Candidate (RC) 0 build of SQL Server Data Tools in SQL Server 2012 so am hoping that a fix makes it in for the Release-To-Manufacturing (RTM) build.

    Hope this helps

    @jamiet

  • Verify a connection before using it [SSIS]

    Just recently I've inherited some SSIS packages that were in dire need of fixing; however, as is often the case, most of my battles were with connection string configurations.

    It always baffles me when I see packages that don't log information that would be useful for debugging purposes, and when it's me that has to debug those packages I tend to get a little irate. Do yourself and the poor soul that inherits your packages a favour by placing a Script Task at the start of your package with the following code in it:

    public void Main()
    {
        bool failure = false;
        bool fireAgain = true;

        // Dts.Connections is a non-generic collection, so the element type is
        // declared explicitly (ConnectionManager lives in Microsoft.SqlServer.Dts.Runtime).
        foreach (ConnectionManager ConnMgr in Dts.Connections)
        {
            // Log the connection string before trying to use it.
            Dts.Events.FireInformation(
                1, "", String.Format("ConnectionManager='{0}', ConnectionString='{1}'",
                    ConnMgr.Name, ConnMgr.ConnectionString),
                "", 0, ref fireAgain);
            try
            {
                ConnMgr.AcquireConnection(null);
                Dts.Events.FireInformation(
                    1, "", String.Format("Connection acquired successfully on '{0}'",
                        ConnMgr.Name),
                    "", 0, ref fireAgain);
            }
            catch (Exception ex)
            {
                // Record the failure but keep going so every connection gets checked.
                Dts.Events.FireError(
                    -1, "", String.Format("Failed to acquire connection to '{0}'. Error Message='{1}'",
                        ConnMgr.Name, ex.Message),
                    "", 0);
                failure = true;
            }
        }

        // Fail the task if any connection could not be acquired.
        if (failure)
            Dts.TaskResult = (int)ScriptResults.Failure;
        else
            Dts.TaskResult = (int)ScriptResults.Success;
    }

    You'll be glad that you did because you'll get your connection strings appearing in your log file:

    SCR Output Connection Strings: ConnectionManager='DB', ConnectionString='Data Source=dev;Initial Catalog=AdventureWorks;Integrated Security=SSPI;'
    SCR Output Connection Strings: Connection acquired successfully on 'DB'

    @jamiet

    P.S. Those of you that have been following my blog long enough may know that I posted this back in 2005; however, I don't think there's any harm in putting it out there again, especially given that:

    1. More people are now using SSIS
    2. The previous code was VB.NET
    3. In the previous post the code was in a JPEG and thus not copy/paste-able (god only knows why I did that)
  • Dataflow mechanics [SSIS]

    Once upon a time I blogged at http://consultingblogs.emc.com/jamiethomson but that ended in August 2009 when I left EMC. There is a lot of (arguably) valuable content over there however certain events in the past leave me concerned that that content is not well cared for and I don't have any confidence that it will still exist in the long term. Hence, I have taken the decision to re-publish some of that content here at SQLBlog so over the coming weeks and months you may find re-published content popping up here from time-to-time.

    This is the second such blog post in which I discuss the internals of the SSIS Dataflow. The first post in this series can be found at [SSIS] OnPipelineRowsSent.


    During my activity on the SSIS forum I've noticed that much of the content is in regard to the dataflow task, and that's not a surprise given that it's the most useful tool in the SSIS box and also the most complex. This post is me brainstorming some of the stuff that I know about the dataflow and hopefully it proves useful to some of you.

    • Buffer Architecture. If I'm ever interviewing you for a job as an SSIS developer you can bet a lot of money that I'll ask you to tell me what a buffer is. Buffers are fundamental to the dataflow - they are what the dataflow uses to move data around. A buffer is essentially an area of memory and by default consists of approximately 10000 rows (usually slightly less than that), which is why, when you execute a dataflow within BIDS, the row counts on the data paths go up in approximate increments of 10000. Part of performance tuning an SSIS dataflow is about manipulating various properties until you find the optimum number of rows in each buffer and you can read more (much more) about that here.
    • Dataflows contain components which are generally categorised into synchronous and asynchronous. The most definitive description of these is that the output from a synchronous component uses the same buffer as the input; asynchronous components create a new buffer for their output. All source adapters are asynchronous components, all destination adapters are synchronous. Synchronous components are generally quicker than asynchronous components.
    • Asynchronous components are further categorised as partially-blocking or fully-blocking. Fully-blocking components require all rows from upstream before they put any data into the output; partially-blocking components will start to output data before they receive all upstream rows.
    • Execution trees. Each asynchronous component creates what is called an execution tree in the dataflow. In SSIS 2005 (but not in later versions) each execution tree uses one execution thread so another part of performance tuning is to fully utilise all processors on your hardware. Read more here.
    • OnPipelineRowsSent. All executables in an SSIS package throw events and one of the events thrown by the dataflow is OnPipelineRowsSent. When a component outputs a buffer of data it throws an OnPipelineRowsSent event, which enables us to know how many rows each component has processed. When you execute a dataflow within the development environment (aka BIDS) these events are consumed and used to update the rowcounts that you see increasing as more rows are processed.
    • Spooling. I said earlier that all buffers are a space in memory but of course memory is finite so if there is more data in the pipeline than can fit in memory then buffers will get spooled to disc. The location on disc is defined by BLOBTempStoragePath & BufferTempStoragePath. Spooling will severely impact dataflow performance so avoid if possible.
    • A lot of people ask if it's possible to remove columns from the dataflow once they have finished using them. For example, if columns called [FirstName] & [LastName] are concatenated together to make [FullName] it's likely that those two columns won't be needed anymore. The simple answer though is no. Once the data is in memory it would be an overhead to remove the data and "squeeze" the buffer up to make it smaller, which is why those columns still appear downstream. This is nothing to be concerned about - it's highly unlikely they are heavily impacting performance. Of course, if an asynchronous component is encountered then a new buffer will be created on the output and the unrequired columns will (probably) be removed. This issue is further discussed here.
    • Following on from the previous point... it's intuitive to think that columns that begin at a component don't exist prior to the data being processed by that component. In fact that's not true. Prior to dataflow execution the execution plan for a dataflow is determined and it is at that point that all columns are defined and thus created (i.e. space is set aside in memory). So, all columns that will be used in a buffer exist even before the buffer gets any data.
    • The datatypes of columns in the dataflow are different from datatypes used for SSIS variables. To this day I don't understand why the SSIS team opted to use different datatypes in the control flow and data flow and I hope this changes one day. (UPDATE: SSIS Development Manager Jeff Bernhardt addresses this issue in A potted SSIS history via Connect.)
    • The stock components (i.e. those provided out-of-the-box) are mostly written in native code (XML Source and the Script Component are exceptions to this rule). SSIS provides a .NET API that enables you and me to build our own components and hence it is tempting to think that these custom components won't work as quickly as stock components. This is probably true but really the difference is negligible. The majority of the work (validation, memory management, buffer editing etc.) is done by native code so you're not going to suffer severe performance problems by implementing custom components.
    • The BLOB data types (i.e. DT_TEXT, DT_NTEXT, DT_IMAGE) can severely impact dataflow performance so try and avoid them if you can.
    • Raw files can be used to pass data from one dataflow to another - even if those dataflows are in different packages. Raw files have a proprietary file format that is essentially a match of the data in memory and hence reading from and writing to them is extremely quick. People often seem reluctant to place data into raw files but I don't hesitate to recommend using them if you need to.
    • There is an important property on each component output called IsSorted. A lot of people think that setting this property to TRUE will cause the data in that output to be sorted. That's not true - this property only informs the dataflow engine that the data is sorted, nothing more. If you set this property to TRUE and the data is not sorted then you will probably be creating problems for yourself later on (for example a downstream Merge Join component will not fail but it won't produce the correct results either).
    • Source and destination adapters maintain external column collections which are used to store the metadata of the external data sinks that those adapters connect to. There are two reasons for this as far as I can determine. Firstly to enable offline development (a big criticism of SSIS's predecessor DTS was that offline development wasn't possible) and secondly to enable the dataflow to validate itself. More information here.
    • Although it appears in BIDS as though the data in a buffer "moves" from one component to another that isn't actually the case. Data in a buffer doesn't actually move about in memory. My fellow MVP Phil Brammer (blog | twitter) once used an analogy of cars travelling on a road to describe this. The buffers are analogous to cars on the road and milestones along the road are analogous to the components. Instead of thinking of the cars moving along the road to reach the milestones, think of the cars as being stationary and the road moving along underneath the cars.
    • Backpressure is an important concept in an SSIS dataflow. Backpressure occurs when a dataflow is producing data to a destination faster than the destination can consume it (a common phenomenon when inserting into a relational database table) - this creates contention further back down the dataflow, hence the term "backpressure". Michael Entin (one of the original developer geniuses that built the dataflow engine) talks more about backpressure at SSIS Backpressure Mechanism.
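
    The backpressure idea in the last bullet can be pictured with a small, standalone C# console sketch. To be clear, this is only an analogy and not the SSIS engine or its API: a bounded queue stands in for the limited number of buffers in flight, and the names (BackpressureSketch, maxBuffersInFlight) and the timings are made up for illustration. The fast producer is forced to wait whenever the slow consumer falls behind.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Conceptual analogy only: NOT the SSIS engine or its API.
public static class BackpressureSketch
{
    public static void Main()
    {
        const int maxBuffersInFlight = 2; // hypothetical limit, chosen for the demo
        const int totalBuffers = 6;

        // A bounded queue: Add() blocks once maxBuffersInFlight items are waiting.
        using (var pipeline = new BlockingCollection<int>(maxBuffersInFlight))
        {
            // Slow "destination": pretends each buffer takes 100 ms to insert.
            Task destination = Task.Run(() =>
            {
                foreach (int bufferId in pipeline.GetConsumingEnumerable())
                {
                    Thread.Sleep(100);
                    Console.WriteLine("destination consumed buffer " + bufferId);
                }
            });

            // Fast "source": it can produce instantly, but Add() makes it wait
            // whenever the queue is full. That waiting is the backpressure.
            for (int bufferId = 1; bufferId <= totalBuffers; bufferId++)
            {
                pipeline.Add(bufferId);
                Console.WriteLine("source produced buffer " + bufferId);
            }

            pipeline.CompleteAdding(); // tell the consumer no more buffers are coming
            destination.Wait();
        }
    }
}
```

    Running it, the source gets a couple of buffers ahead and is then paced by the destination, which is the same contention "further back down the dataflow" described above.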

    I'll probably add to this post over time as new things occur to me. In the meantime if you want a more detailed description of how the dataflow works then Kirk Haselden's book has a whole chapter devoted to it. You can also pose questions in the comments although I'd urge you to post questions to the SSIS forum where more people will be available to answer and where your question may already have been answered.
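
    As a footnote to the IsSorted point in the list above, here is a minimal C# console sketch of why lying about sort order gives wrong results rather than an error. It is not the Merge Join component itself, just a textbook merge join over unique integer keys that, like any merge join, assumes both inputs are ascending and never checks. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: a textbook merge join, NOT the SSIS Merge Join component.
public static class IsSortedSketch
{
    // Inner join on unique integer keys. It trusts that both inputs are
    // sorted ascending, which is all that IsSorted=TRUE promises the engine.
    public static List<int> MergeJoin(IList<int> left, IList<int> right)
    {
        var matches = new List<int>();
        int l = 0, r = 0;
        while (l < left.Count && r < right.Count)
        {
            if (left[l] == right[r]) { matches.Add(left[l]); l++; r++; }
            else if (left[l] < right[r]) { l++; }  // left key is behind: advance left
            else { r++; }                          // right key is behind: advance right
        }
        return matches;
    }

    public static void Main()
    {
        var right = new[] { 1, 2, 3 };

        // Genuinely sorted input: every key matches.
        Console.WriteLine(string.Join(",", MergeJoin(new[] { 1, 2, 3 }, right))); // prints 1,2,3

        // Same keys, unsorted, but treated as if sorted: no error, just missing rows.
        Console.WriteLine(string.Join(",", MergeJoin(new[] { 3, 1, 2 }, right))); // prints 3
    }
}
```

    The second call silently drops keys 1 and 2, which is exactly the kind of problem that is hard to spot later on.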

    @Jamiet
