
Jorg Klein

Microsoft Data & Analytics consultant from the Netherlands

  • Azure Data Lake Analytics U-SQL decryption possibilities

    The following information applies to the situation as of October 2016.

    One of the things that make U-SQL so powerful is C# integration. It gives you the possibility to create your own C# classes and methods and then use them in your U-SQL scripts. Recently we executed a couple of big data projects at Macaw using the Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). During one of these projects we had to decrypt some custom Rijndael (AES) encrypted JSON content in the ADLS and store the result as a CSV. These CSV files would then be encrypted by default by enabling encryption at rest for the entire ADLS.

    We were able to support two decryption scenarios: decrypt the entire file or decrypt a part of a file (particular sensitive string values). I will go into detail about each scenario.

    Scenario 1: Decrypt entire files
    Because we were dealing with JSON files and U-SQL does not offer a native JSON extractor yet, we used the JsonExtractor C# sample that is shared on GitHub by Michal Rys.
    We modified the Extract method of the JsonExtractor class to include a CustomDecryptor that decrypts the input.BaseStream (= the entire file). In addition, we set AtomicFileProcessing to true.

    using System.IO;
    using System.Linq;
    using System.Collections.Generic;
    using Microsoft.Analytics.Interfaces;
    using Newtonsoft.Json.Linq;
    using Custom.Framework.Utilities.Encoding;

    namespace CustomADLALibrary
    {
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class CustomExtractor : IExtractor
        {
            /// <summary/>
            private string rowpath;

            /// <summary/>
            public CustomExtractor(string rowpath = null)
            {
                this.rowpath = rowpath;
            }

            /// <summary/>
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                CustomDecryptor Decryptor = new CustomDecryptor();
                using (var reader = new StreamReader(input.BaseStream))
                {
                    var jsonAsString = reader.ReadToEnd();
                    var decryptedString = Decryptor.Decrypt(jsonAsString);
                    var root = JToken.Parse(decryptedString);

                    foreach (JObject o in SelectChildren(root, this.rowpath))
                    {
                        this.JObjectToRow(o, output);
                        yield return output.AsReadOnly();
                    }
                }
            }

            /// <summary/>
            private static IEnumerable<JObject> SelectChildren(JToken root, string path)
            {
                if (!string.IsNullOrEmpty(path))
                {
                    return root.SelectTokens(path).OfType<JObject>();
                }

                var o = root as JObject;
                if (o != null)
                {
                    return new[] { o };
                }

                return root.Children().OfType<JObject>();
            }

            /// <summary/>
            protected virtual void JObjectToRow(JObject o, IUpdatableRow row)
            {
                foreach (var c in row.Schema)
                {
                    JToken token = null;
                    object value = c.DefaultValue;

                    if (o.TryGetValue(c.Name, out token) && token != null)
                    {
                        value = JsonFunctions.ConvertToken(token, c.Type) ?? c.DefaultValue;
                    }

                    row.Set<object>(c.Name, value);
                }
            }
        }
    }

    The modified Extract method uses a CustomDecryptor object in which the actual decryption takes place:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    public class CustomDecryptor
    {
        private readonly byte[] _key = { -- }; //YourKey
        private readonly byte[] _iv = { -- }; //YourIv

        private readonly ICryptoTransform _encryptor;
        private readonly ICryptoTransform _decryptor;

        public CustomDecryptor()
        {
            var myRijndael = new RijndaelManaged { Key = this._key, IV = this._iv, Padding = PaddingMode.PKCS7 };

            this._encryptor = myRijndael.CreateEncryptor(myRijndael.Key, myRijndael.IV);
            this._decryptor = myRijndael.CreateDecryptor(myRijndael.Key, myRijndael.IV);
        }

        public string Decrypt(string input)
        {
            // Create the streams used for decryption.
            using (MemoryStream msDecrypt = new MemoryStream(Convert.FromBase64String(input)))
            {
                using (CryptoStream csDecrypt = new CryptoStream(msDecrypt, _decryptor, CryptoStreamMode.Read))
                {
                    using (StreamReader srDecrypt = new StreamReader(csDecrypt))
                    {
                        return srDecrypt.ReadToEnd();
                    }
                }
            }
        }
    }

    From U-SQL we can now easily decrypt entire files. Make sure you register the Newtonsoft.Json assembly and your custom assembly (CustomADLALibrary), which contains the modified JsonExtractor, before you try to execute the U-SQL job. See my previous blog post for instructions on how to programmatically register U-SQL assemblies using PowerShell.

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [CustomADLALibrary];

    DECLARE @INPUT_FILE string = @"somePath/encryptedFile.json";
    DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

    @decryptedExtract =
        EXTRACT column1 string,
                column2 string
        FROM @INPUT_FILE
        USING new CustomADLALibrary.CustomExtractor();

    @result =
        SELECT *
        FROM @decryptedExtract;

    OUTPUT @result
    TO @OUTPUT_FILE
    USING Outputters.Csv(quoting : false);

    Scenario 2: Decrypt string values
    In this case the files were not completely encrypted but only particular sensitive string values in the files. Therefore the custom extractor was not needed and the decryption could take place directly in the U-SQL SELECT statement as shown in the example below.



    DECLARE @INPUT_FILE string = @"somePath/stringEncryptedFile.csv";
    DECLARE @OUTPUT_FILE string = @"someOtherPath/decryptedFile.csv";

    @extract =
        EXTRACT column1 string,
                column2 string
        FROM @INPUT_FILE
        USING Extractors.Csv();

    @result =
        SELECT column1 AS unencryptedColumn,
               new WeakObjectEncryptor().Decrypt(column2) AS decryptedColumn
        FROM @extract;

    OUTPUT @result
    TO @OUTPUT_FILE
    USING Outputters.Csv(quoting : false);

    These decryption examples show one of the many possibilities of the powerful new U-SQL language. It’s quite easy to use and understandable for both people with a SQL background and C# programmers. In this case I worked together with my colleague Luuk Gortzak, who helped with the C# scripts. Credits to Luuk for helping me out!

  • Script Azure Data Lake Analytics assembly deployments to U-SQL Catalog

    Registering your custom assemblies using Visual Studio Data Lake Tools is easy, just right click your U-SQL C# Class Library project, click “Register Assembly”, fill in the details and your assembly gets published and created in your U-SQL Catalog.

    But what if you want to script this process to automate your deployments? That is not documented yet, and as I follow the principle that we should be able to rebuild the whole environment at the press of a button, I found out how to do this.

    Step 0 – Prerequisites
    Install the latest version of Azure PowerShell. The PowerShell script below won’t work with an older version.
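
    If you are not sure which version is installed, a quick check like the sketch below can help. This is a minimal sketch that assumes the AzureRM rollup module was installed through PowerShellGet; adjust the module name to your setup.

    #Show the installed AzureRM module version(s)
    Get-Module -ListAvailable -Name AzureRM | Select-Object Name, Version

    #Update to the latest version if needed (requires an elevated prompt)
    Update-Module -Name AzureRM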

    Step 1 – Build your dll from Visual Studio.
    We need the dll of your custom assembly. Create it by right-clicking your U-SQL Class Library project and choosing Build. The output window shows where the dll has been created on disk.

    Step 2 – Create a U-SQL job file that will create the assembly.
    Modify the U-SQL script below and store it in a file somewhere locally on disk. Name it createAssembly.usql

    USE DATABASE yourUsqlCatalog;

    CREATE ASSEMBLY IF NOT EXISTS [yourCSharp]
    FROM @"/Assemblies/yourCSharp.dll";


    Step 3 – Upload the assembly to your ADLS and execute the U-SQL job using PowerShell.
    Enter values for the variables in the PowerShell script below and execute it.

    #Variables; modify 
    $dataLakeStoreName = "yourAdlsAccount"
    $dataLakeAnalyticsName = "yourAdlaAccount"
    $assemblyLocalPath = "c:\yourCSharp.dll" #step 1
    $usqlScriptLocalPath = "c:\createAssembly.usql" #step 2
    $assemblyAdlsPath = "/Assemblies/yourCSharp.dll" #Assemblies folder will be created if it does not exist
    $azureSubscriptionId = "00000000-0000-0000-0000-000000000000"

    #Login (login pop up appears)
    Login-AzureRmAccount

    #Connect to the Azure Subscription in which your ADLA Catalog exists 
    Set-AzureRMContext -SubscriptionId $azureSubscriptionId 

    #Import dll to ADLS
    Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path $assemblyLocalPath -Destination $assemblyAdlsPath

    #Submit new job to ADLA (createAssembly.usql)
    $job = Submit-AzureRmDataLakeAnalyticsJob -Name "Create Assembly" -AccountName $dataLakeAnalyticsName -ScriptPath $usqlScriptLocalPath -DegreeOfParallelism 1

    While (($t = Get-AzureRmDataLakeAnalyticsJob -AccountName $dataLakeAnalyticsName -JobId $job.JobId).State -ne "Ended"){
        Write-Host "Job status: "$t.State"..."
        Start-Sleep -seconds 10
    }

    Get-AzureRmDataLakeAnalyticsJob -AccountName $dataLakeAnalyticsName -JobId $job.JobId


    Step 4 – Validate.
    Validate that your dll has been uploaded to a folder in your ADLS called “Assemblies”, and that your assembly has been created in your U-SQL Catalog, using Visual Studio Server Explorer (Azure). A PowerShell sketch for the same checks is shown below.
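
    As a rough alternative to checking in Server Explorer, something like the sketch below can be used. It assumes the variable values from step 3 and a U-SQL catalog (database) named yourUsqlCatalog; verify the cmdlet parameters against your installed Azure PowerShell version.

    #List the uploaded dll in the ADLS Assemblies folder
    Get-AzureRmDataLakeStoreChildItem -AccountName "yourAdlsAccount" -Path "/Assemblies"

    #List the assemblies that are registered in the U-SQL catalog database
    Get-AzureRmDataLakeAnalyticsCatalogItem -Account "yourAdlaAccount" -ItemType Assembly -Path "yourUsqlCatalog"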

    Step 5 – Reference your new assembly in your U-SQL scripts.
    You can now start using your assembly by referencing it in the first lines of code in your U-SQL script.

    USE DATABASE yourUsqlCatalog;

    REFERENCE ASSEMBLY [yourCSharp];

  • Use AdlCopy to generate U-SQL jobs that copy data between Azure Blob Storage and Azure Data Lake Store

    AdlCopy is a command-line tool (it runs on the user’s machine) that allows you to copy data from Azure Storage Containers or Blobs into Azure Data Lake Store.

    You can use the AdlCopy tool in two ways:

    • Standalone, where the tool uses Data Lake Store resources to perform the task. This can be a cheaper and more ad-hoc/manual option to move data from blob storage to ADLS compared to using Data Factory or ADLA U-SQL jobs. I assume only ADLS transaction costs will be incurred if your blob storage resides in the same region as your ADLS.
    • Using a Data Lake Analytics account, where the units assigned to your Data Lake Analytics account are used to perform the copy operation. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.

    Another advantage is that on execution the AdlCopy tool generates an ADLA job that contains U-SQL code for each source file from your blob storage that will be copied to the ADLS. You can reuse the code of this job to schedule execution of the copy process on a regular basis. Requirement for this scenario is that your blob storage source files have static paths and file names, as the U-SQL code of the generated job contains hardcoded references to the source files locations.

    Download AdlCopy:


    Copy all data from blob storage container to ADLS folder using a generated U-SQL job that can be reused:

    1. Start Command Prompt

    2. Navigate to the directory where AdlCopy is installed:


    3. Run the following command (standalone, copy is performed by ADLS):

    AdlCopy.exe /Source /Dest swebhdfs:// /SourceKey -

    4. Data is copied:



    5. Run the following command to execute the process using the Data Lake Analytics service:

    AdlCopy.exe /Source /Dest swebhdfs:// /SourceKey - /Account yourDataLakeAnalyticsAccount /Units 2

    6. Because the ADLA service has been used, a U-SQL job was generated and executed, which can be seen in the Azure portal:



    7. The U-SQL code shows that for each file in the blob storage, an EXTRACT (from blob storage) and an OUTPUT (to ADLS) statement were created:



    8. You can now choose “Duplicate Script”, delete the DROP ASSEMBLY IF EXISTS and CREATE ASSEMBLY statements which you don’t need again, and save your new job. You can now execute/schedule the copy process again without the need of the AdlCopy tool, for example by resubmitting the saved script with PowerShell as sketched below.
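
    A minimal sketch of such a resubmission, reusing the cmdlets from the assembly deployment post above and assuming the duplicated script was saved locally as c:\copyJob.usql (a hypothetical path):

    $dataLakeAnalyticsName = "yourDataLakeAnalyticsAccount"
    $copyScriptLocalPath = "c:\copyJob.usql" #the saved/duplicated copy job script

    #Submit the generated copy job again, without the AdlCopy tool
    Submit-AzureRmDataLakeAnalyticsJob -Name "Blob to ADLS copy" -AccountName $dataLakeAnalyticsName -ScriptPath $copyScriptLocalPath -DegreeOfParallelism 1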


  • Setup Azure Data Lake Analytics federated U-SQL queries to Azure SQL Database

    One of the major value propositions of U-SQL is that it allows you to query data where it lives. For external systems, such as Microsoft Azure SQL Database, this is achieved with federated queries against data sources.




    In order to query these external data sources, a data source object has to be created and referenced that abstracts the connection information as well as information about its capabilities to execute query expressions passed and translated from U-SQL to its local query engine.




    Prerequisites:

    • An Azure Subscription
    • Azure Data Lake Store (ADLS)
    • Azure Data Lake Analytics (ADLA)
    • Azure SQL Database (ASQLDB) or Azure SQL Data Warehouse (ASQLDWH) with a SQL login/password
    • Visual Studio 2015 (optional, to create and execute U-SQL queries; this can also be done in the Azure portal)
    • Azure Data Lake Tools for Visual Studio 2015
    • Azure PowerShell
    • RX permissions on your ADLS (data level) to be able to create the Catalog Secret:
      • Navigate to the ADLS in the Azure portal
      • Click Data Explorer
      • Click Access
      • Validate that you have Read/Execute permissions
    • An allowed IP range in the ASQLDB server firewall for the ADLA services that fire the U-SQL queries (a PowerShell sketch follows this list):
      • Navigate to the ASQLDB server in the Azure portal
      • Click Settings
      • Click Firewall
      • Create a new rule with the required IP range
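
    The firewall rule can also be created with the AzureRM.Sql cmdlets instead of the portal. The sketch below uses hypothetical resource group and server names and placeholder start/end addresses; replace them with the IP range that applies to the ADLA service.

    #Allow the ADLA IP range on the ASQLDB server (placeholder values)
    New-AzureRmSqlServerFirewallRule -ResourceGroupName "yourResourceGroup" -ServerName "yourasqldbserver" -FirewallRuleName "AllowADLA" -StartIpAddress "0.0.0.0" -EndIpAddress "0.0.0.0"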




    1. Create an ADLA Database using the U-SQL query below:

    CREATE DATABASE IF NOT EXISTS YourADLADatabaseName;

    2. Create a Catalog Secret in the ADLA Database, which contains the password for the SQL login and the connection string for the ASQLDB database, using the Azure PowerShell script below:

    #Login (login pop up appears)
    Login-AzureRmAccount

    #Show your available Azure Subscriptions
    Get-AzureRmSubscription

    #Connect to the Azure Subscription in which your ADLA Database exists 
    Set-AzureRMContext -SubscriptionId 00000000-0000-0000-0000-000000000000


    #$passwd: password for ASQLDB / ASQLDWH which you want to federate

    #-Account: ADLA account name

    #-DatabaseName: Data Lake Analytics Database name

    #-Host: Host of ASQLDB / ASQLDWH 

    $passwd = ConvertTo-SecureString "YourPassword" -AsPlainText -Force

    $mysecret = New-Object System.Management.Automation.PSCredential("YourASQLDB_Secret", $passwd)

    New-AzureRmDataLakeAnalyticsCatalogSecret -Account "youradlaaccount" -DatabaseName "YourADLADatabaseName" -Secret $mysecret -Host "" -Port 1433


    3. Create a CREDENTIAL with an IDENTITY that matches the AzureRmDataLakeAnalyticsCatalogSecret name as used in the PowerShell script (YourASQLDB_Secret) in the ADLA Database, using the U-SQL query below:

    //Connect to ADLA Database
    USE DATABASE YourADLADatabaseName;

    //Create CREDENTIAL
    //USER_NAME: ASQLDB Username
    //IDENTITY: ADLA Catalog Secret, must match name chosen in PowerShell script
    CREATE CREDENTIAL IF NOT EXISTS YourASQLDB_Secret
    WITH USER_NAME = "YourASQLDB_Username",
         IDENTITY = "YourASQLDB_Secret";

    4. Create a Data Source in the ADLA Database with a reference to the ASQLDB, using the U-SQL query below:

    // Create External Data source on AZURESQLDB
    CREATE DATA SOURCE IF NOT EXISTS ASQL_YOURDB FROM AZURESQLDB WITH
           ( PROVIDER_STRING = "Database=YourASQLDB;Trusted_Connection=False;Encrypt=True"
           , CREDENTIAL = YourASQLDB_Secret
           , REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
           );

    5. Create an External Table in the ADLA Database based on the Data Source, using the U-SQL query below:


    // External tables are optional. You can refer to data source tables in queries directly (Lazy Metadata loading):


    CREATE EXTERNAL TABLE someExternalTable (

           [someColumn] string

    ) FROM ASQL_YOURDB LOCATION "dbo.someTable";


    6. Query the federated external ASQLDB table and output the result to a file, using the U-SQL query below:

    @query =

        SELECT someColumn

        FROM someExternalTable;


     OUTPUT @query TO "/Output/file.csv"

     USING Outputters.Csv();

  • See you at the PASS Summit 2015 in Seattle!

    My employer Macaw gave me the opportunity to attend the SQL PASS Summit this year, so I am traveling to Seattle tomorrow! I’m excited to be there this year because there are so many new technologies being pushed in our direction like the Cortana Analytics suite and SQL Server 2016. At the moment I am already involved in some PoC/pilot projects for some of our customers regarding these new technologies, so I can apply my new knowledge in the field the moment I come back to Holland.

    As for the community, I hope to meet some people I have been in contact with on the internet for quite some time, like some of the SQL Server MVPs, but of course it’s also a pleasure to meet new people, so come say “Hi” or send me a message.

    I created a Twitter account today which I will use to share some knowledge during the PASS Summit, so follow me right now. See you there!

  • SharePoint Server 2016 IT Preview from a BI point of view

    SharePoint Online and Power BI have received a lot of attention and focus lately, but what about the on-premises version of SharePoint that’s used in almost all serious BI solutions today? Well, SharePoint Server 2016 IT Preview has just been released by Microsoft, which many say will be the last on-premises version of SharePoint ever. In this blog post I will focus on the BI features of SharePoint Server 2016.

    So what kind of BI features are improved and what’s new? It’s better to start with the question of which BI features will remain available in SharePoint 2016, because a lot of BI features will be deprecated or removed.

    An overview of the main SharePoint BI capabilities and their availability in SharePoint 2016:

    · Excel Services

    Excel Services will no longer be available; this functionality moves completely to Excel Online in Office Online.

    · PowerPivot for SharePoint and Power View add-ins

    These BI features are not available in the current release of the SharePoint 2016 IT Preview. Microsoft states they will implement them later this year, but when and in what form is uncertain.

    · Reporting Services Integrated Mode

    There is no news about Reporting Services Integrated Mode, but as I wrote in my last blog post, Reporting Services gets a big update in SQL Server 2016. One of the new possibilities will be the integration of Reporting Services and Power BI. So why still integrate with SharePoint?

    With the release of the new Power BI Desktop (the former Power BI Designer), a lot of Power Pivot functionality moves to Power BI. For example, you can now import Excel Power BI artifacts (Data Model, Queries, Power View) into a Power BI Desktop file. This new functionality, together with the Reporting Services/Power BI integration that’s coming and the news that Excel Services will no longer be available in SharePoint 2016, raises the question whether BI solutions will continue to need SharePoint in the future. All functionality seems to be moving to Power BI!

    On the other hand, Microsoft recently announced a collaboration with Pyramid Analytics which will offer the possibility to deploy a Power BI Desktop file to an on-premises Pyramid Analytics server, bringing Power BI on-premises. Pyramid Analytics offered SharePoint integration in the past, so maybe they will integrate again with SharePoint 2016, which results in a Power BI and SharePoint 2016 integration, making SharePoint more important than ever for on-premises solutions.

    It’s clear there is a lot of uncertainty about the future of SharePoint as part of the Microsoft BI platform! To be continued…

  • BI on your terms with SQL Server 2016

    For the last few years Microsoft's strategy was all about cloud first (or cloud only?), releasing new BI products and updates to existing products in the cloud at a high pace, with almost no investment in on-premises BI. In 2015 Microsoft seems to be changing course: it now aims more at enabling hybrid scenarios, investing a lot in both the cloud (Power BI/Azure) and on-premises with SQL Server 2016.
    Microsoft’s message regarding BI for 2015/2016 is:  “BI on your terms”.

    BI on your terms means leveraging up-to-date possibilities for one or a combination (hybrid) of the following architectures:

    • Cloud with Azure and Power BI

    • On-Premises with SQL Server 2016

    • Server driven or Self-Service

    To be able to offer quality hybrid architectures, Microsoft invests a lot in the on-premises BI suite with SQL Server 2016 and has announced it will keep investing in it in the coming years. So no longer only cloud first like we have seen in previous years, but more focus on hybrid possibilities and, if you desire, on-premises only.

    For the first time in many years an exciting version of SQL Server is coming in terms of BI. The main topics are:

    • Hybrid BI (Cloud/On-Premises)

    • Modern Reports

    • Enhanced Analysis

    • Mobile BI

    Below is an overview of the new BI related features per SQL Server 2016 service or product. As the length of this list shows, SQL Server 2016 will be a massive BI version!!

    Analysis Services Tabular

    • Enhanced modeling capabilities in the semantic layer

      • Many-to-many relationships

      • Bi-directional cross filtering. This means you can not only filter on the 1 side of a 1-to-many relationship in your tabular model, but also on the many side. For example, two connected tables, Sales → Product:

        • Product: product, product category

        • Sales: sales date, connection to product table

          Now select products sold filtering on sales date (many side) while also filtering on product category (1 side). This is not possible in today’s version of SSAS Tabular.

    • Time intelligence

      • Date/time columns are automatically converted to rich date/time tables starting from the column’s MIN date till the MAX date found

    • New DAX functions

      • A lot of new functions for things that currently require quite complex formulas, like present time, date difference, percentile, product, geomean, median, etc.

    • Performance improvements

      • For end users

        • Query engine optimized

      • For developers

        • Metadata operations; modeling related operations are much faster

      • For data processing

        • Parallel partition processing

    • Expose on-premises tabular models in the cloud (hybrid) → a Power BI feature, already possible today with SQL Server 2012.

    Analysis Services Dimensional

    • Netezza as a Data Source (Netezza Data Warehouse | IBM - NDM Technologies)

    • Performance improvements

      • Unnatural hierarchies

      • Distinct counts

      • Other performance improvements in areas where multidimensional is not performant at the moment

    • DBCC (Database Consistency Checker) support. Checks the logical and physical integrity of objects in the specified database.

    • Expose on-premises multidimensional cubes in the cloud with Power BI (hybrid)

    SQL Server Database Engine

    • Integration of R analytical engine, predictive analytic capabilities via T-SQL queries

    • PolyBase available without the need for PDW; makes it possible to query both structured relational SQL data and unstructured Hadoop data through T-SQL statements

    • Data encryption for stored data and data in motion

    • Row-level security

    • Updates to the in-memory OLTP engine, for example updateable in-memory nonclustered columnstore indexes

    • Parsing and storing native JSON data

    • XEvents-based monitoring in Management Studio

    Reporting Services

    • New look and feel and possibility to apply themes and branding using CSS

    • New visualizations, chart types like tree maps and sun bursts

    • Improved flexible parameter panel with support for:

      • Autocomplete

      • Search

      • Hierarchical tree display

    • Runs in all modern browsers, on both desktops and tablets (any device)

    • Integration of R analytical engine

    • Power Query as a data source

    • Pin on-premises SSRS reports to Power BI Dashboards (hybrid)

    Integration Services

    • High Availability support

    • Power Query integration

    • Azure Data Factory integration (hybrid)

      • Execute on-premises SSIS packages from Azure Data Factory

      • Azure Data Factory data flow task

      • Azure storage connector

      • Azure commandlets

    • OData 4.0 support

    • Hadoop File System (HDFS) support

    • JSON support

    • New Oracle/Teradata connector (4.0)

    • Incremental deployment options

    • Custom logging levels

    • SSIS package templates to reuse ETL code

    Mobile BI

    • In the cloud with Power BI

      • Power BI App for Windows Phone (coming soon) and iOS

    • On-premises with Datazen Server

      • Now available for free for SQL Enterprise Edition customers (2008 or later)

      • All major platforms: Windows Phone, Android, iOS

      • Beautiful interface and data visualizations

      • Optimizable for Phone, Tablet and Laptop

    SharePoint vNext integration

    • Edit Mode of PowerPivot Excel workbooks in browser

    • Support for Excel vNext (Office 2016) DAX functions

    Master Data Services

    • Improved performance for large models

    • Row-level compression per entity

    • Improved user interface

    • Configurable retention settings

    • Enhanced security possibilities for read, write, delete and create operations and support for multiple system administrators with specific permissions

    • Excel Add-in is 15 times faster and is updated to support bulk entity based staging operation

    Visual Studio

    • Database and BI project types merged into one Visual Studio

    • New scripting language for tabular models. Currently tabular models are wrapped into multidimensional constructs and when you deploy it will be reverse engineered to the tabular model. The new native language for tabular will be easy to understand, modify and deploy.

    • SSIS designer supports previous versions of SQL Server

    Of course there is also still a lot of exciting news coming from the cloud side of Microsoft BI; for example, the Azure Data Lake has been announced, following the principles of my blog post about the Relational Data Lake. You can expect a post about the Azure Data Lake on this blog soon!

    P.S. Don’t forget to suggest and vote for feature requests for SQL Server yourself at:


  • Power BI Analysis Services Connector Security

    The Power BI Analysis Services Connector can be used to connect from the Power BI service to your on premises tabular Analysis Services models.

    In this blogpost I will look into the security and authentication possibilities that the SSAS Connector offers. Dynamic row-level security based on the Power BI user name against an on premises SSAS Tabular model is possible; I will show how you can set this up and how it works in the background.

    If you want to know how to install and configure the SSAS Connector follow this excellent support guide.

    The SSAS Connector has some similarities with the BI Semantic Model Connection that’s available for SharePoint. This connection can connect either with Kerberos or by using the EffectiveUserName property, which matches the SSAS Connector:
    - A connection to a tabular model is made with stored credentials that require administrator permissions on the SSAS server.
    - The EffectiveUserName parameter is sent to SSAS to impersonate the current user. Only SSAS administrators have permission to connect using EffectiveUserName.

    To investigate exactly how the connector works I have created a tabular model with a very simple underlying database model. The Sales table will be secured based on dynamic row-level security on the content of the EffectiveUserName column in the Department table.


    The tabular model contains the dataset shown in the Excel pivot below. In an on premises situation my EffectiveUserName would be MACAW\jorgk, but because I log in to Power BI with jorgk[at] I have created entries for both. Let’s see which one will be used later on.


    I have created a SSAS Security Role with read permissions on my Active Directory account:



    Row-level security has been set on the EffectiveUserName column of the Department table:

    The SSAS Connector was installed on the server by my colleague Dave, who has administrator permissions on the SSAS Server (inherited from the local Administrators group). For the purpose of the demo we made sure my domain account was not in the local admin group and not in the SSAS Server admin group. Dave’s credentials will be the stored credentials that are used to connect to the tabular model from the SSAS Connector, passing the EffectiveUserName property just like a BISM connection is able to do, as I explained before.

    Now I logged in to Power BI and created a report, and the security is working: my report is filtered and shows only my data:


    During the creation of my report Dave ran a Profiler trace to see what’s happening in the background. As we expected my connection came in under the account of Dave (blue) but with my account as EffectiveUserName (red):


    It’s interesting to see that my EffectiveUserName is JorgK[at] but the Power BI report I’ve created shows the data of MACAW\jorgk! Why didn’t it show the data of the row that exactly equals my EffectiveUserName JorgK[at]? Well, that’s because SSAS does not authorize the user based on the textual value of the EffectiveUserName property. Instead it calls AD to check whether the Power BI username (based on the mail address) exists in AD; if this is the case, the AD account is returned and the user is authorized with it.

    To wrap up this entire process I’ve created the following diagram:

    SSAS Connector Overview 

    1. From the Power BI tenant I connect to the SSAS Connector using my Power BI Account.
    2. The SSAS Connector connects to the tabular model using the stored credentials, in this case of my colleague Dave who has administrator permissions on the model and therefore the permissions to pass on my Power BI account as EffectiveUserName.
    3. The EffectiveUserName is checked in Active Directory. This will only be possible if DirSync has been set up, otherwise my Power BI account that comes from Windows Azure Active Directory will not be mapped to the on premises Active Directory.
    4. After Active Directory authorizes me, my on premises AD Account is sent to the tabular model.
    5. Row-level security is applied as configured in the SSAS Role.


    Next thing to do was of course to share the Power BI report in a dashboard with another colleague. As a test I shared it with my colleague Martijn (blue), who did not have any permissions on the tabular model at all. Unfortunately it seems the credentials of the user who shares the dashboard are stored in that dashboard/report and will be used to make the connection to the underlying SSAS data source. Martijn’s report was identical to mine, and SQL Profiler showed us that indeed the connection was again made with the admin account of Dave, with my account passed in the EffectiveUserName (red).



    It’s great to see we can finally connect to on premises models from Power BI using row-level security, which was not possible with the Data Management Gateway. This makes a hybrid Cloud/On Premises architecture a feasible option to implement. Unfortunately we are not there yet, because it isn’t possible to impersonate a user after sharing a dashboard, which is clearly not the functionality that’s desired. Let’s hope Microsoft will fix this in the coming months! Until then, be sure to advise your customers about this sharing security issue.

  • Relational Data Lake

    What is a Data Lake?
    Pentaho CTO James Dixon is credited with coining the term "Data Lake". As he describes it in his blog entry, "If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

    These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed and unmodeled data which is available near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs without the help of the IT department. This calls for a different approach when it comes to offering data to these users.

    BI data consumers also demand a very short time-to-market of new data, they don’t want to wait for a few months until data is made available by a BI team, they want it today. The raw uncleansed form of data in a Data Lake can be loaded very quickly because it’s suitable for generated data loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and have acquired enough insights that they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.

    Furthermore, there is a rise in the presence of unstructured and/or semi-structured data and the need to have “big data” available for ad-hoc analyses. To store and analyze these forms of data, new technologies and data structures are required.

    When the Data Lake is in place, a lot of data streams from sources into the “lake” without knowing up front whether it is eligible for answering business questions. The data can’t be modeled yet, because it’s not clear how it will be used later on. Data consumers will get the possibility to discover data and find answers before they are even defined. This differs fundamentally from the concept of a Data Warehouse, in which the data is delivered through predefined data structures, based on relevant business cases and questions.

    From a technology view, a Data Lake is a repository which offers storage for large quantities and varieties of both unstructured, semi-structured and structured data derived from all possible sources. It can be formed by multiple underlying databases which store these different structured forms of data in both SQL and NoSQL technologies.

    For the semi-structured/unstructured side of data which is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For the semi-structured/structured data, SQL technologies are the way to go.

    In this blog post I will describe the semi-structured/structured, relational appearance of the Data Lake in the form of a SQL Server database: The Relational Data Lake.

    Extract Load (Transform)
    Data in a Data Lake is in raw form. Transformations will not be performed during loading, and relationships and constraints between tables will not be created, which is the default for transactional replication and keeps the loading process as lean and fast as possible. Because of the lack of transformations, movement of the data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading of data into the Data Lake easier, faster and much more suitable to perform using replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data that is added to the Data Lake. Latency of data is kept as low as possible; preferably data is loaded in near real-time: data should stream into the lake continuously.

    Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations like translations from technical object and column names to meaningful descriptions which end users understand or other lightweight transformations can be performed in new structures (like SQL views) that are created inside the Data Lake.

    Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it will be future proof for solutions that will require this data in a later moment in time or for users that will discover the data.

    Accessing data
    Data is made accessible through structures which can either be accessed directly, or indirectly through the exposure as OData Feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with any tool or technology that is best suited to the task at any moment in time, for example using Power BI tooling like Excel PowerPivot/PowerQuery.

    We normally create SQL Views in which security rules and the required transformations are applied.

    The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.

    Master Data
    Success of the Data Lake depends on good master data. When end users discover new raw data from the Data Lake they need to be able to combine it with high quality master data to get proper insights. Therefore a master data hub is a must have when a Data Lake is created. This hub should just be a database with master data structures in it; master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent from the other BI solutions, as master data isn’t part of these solutions but is only used as a data source. It should be sourced independently too, preferably using master data tooling or tools like SSIS. Just like with data from the Data Lake, master data should also only be accessed through structures which can also be exposed as OData Feeds.

    Next to the purpose of combining master data with data from the Data Lake, the master data can be used as source for other BI solutions like Data Warehouses. In there, the master data structures are often used as Data Warehouse Dimensions. To prevent the unnecessary duplicate loading of master data in the Data Warehouse that already exists in the master data hub, it can be a good choice to leave the master data out of the Data Warehouse Dimensions. Only the business keys are stored which can be used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load and master data is stored in a single centralized data store.

    The entire Data Lake architecture with all the described components is shown in the model below. From bottom to top the highlights are:

    • Extract/Load data from the sources to the Data Lake, preferably in near real-time.
    • The Data Lake can consist of multiple SQL (and NoSQL) databases.
    • Transformations and authorizations are handled in views.
    • The Data Lake acts as hub for other BI solutions like Data Warehouses and Cubes.
    • The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.
    • Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).
    • The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.

    20141211JK_Data Lake BI Architecture

    Data Lake Challenges
    Setting up a Data Lake comes with many challenges, especially on the aspect of data governance. For example, it’s easy to create any view in the Data Lake and lose control over who gets access to what data. From a business perspective it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective, wrong conclusions can be drawn by users who get insights from the raw data; therefore the Data Warehouse should still be offered as a clean, trusted data structure for decision makers and a data source for conformed reports and dashboards.

    The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and Cubes, which will stay as important as they are now for many years to come. But considering the increasing amount and variety of data, the more powerful self-service ETL and data modeling tooling that keeps appearing, and the ever shorter required time-to-market of near real-time data from source to user, the Data Lake offers a future-proof data store and hub that enables the answering of yet undefined questions and gives users personal data discovery and shaping possibilities.

    Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.

  • Implement SSAS MD cell security using dimension security with blazing performance

    SQL Server Analysis Services (SSAS) Multidimensional (MD) is a great product, and in my opinion it’s still the only real option to go for when building complex enterprise BI solutions. It’s still very fast when implemented correctly and it’s mature and therefore very stable.


    The only real downside is cell security, which, in my opinion, is useless. It makes performance drop dramatically because it evaluates security cell by cell. I have seen reports that run in a few seconds without cell security take 20 minutes with cell security implemented! Try to explain that to your customer… It’s obvious that you can’t.


    Quite a few workarounds have existed for a while:


    A. Make your measures invisible and create MDX calculations that either show or hide the measure value based on a dummy dimension. Drawbacks are that the measure is only hidden and not really secured, and that you need to create and maintain dummy dimensions/attributes.
    B. SSAS offers us the “Measures Dimension”, which gives you the possibility to secure measures like dimension members. Great, this is just what we need, but implementing it and creating MDX calculations based on the secured measures will give you errors for SSAS roles that do not have access to these measures. This is caused by the fact that the MDX script is executed after the security has been applied. So if a user that doesn’t have access to a measure (set by dimension security) tries to connect to the cube while the MDX script contains a reference to this secured measure, an error is raised.
    C. Create a hidden dimension on which you apply dimension security, as described here. Unfortunately this doesn’t work for measures.


    For a few years I’ve always implemented a workaround that uses the measures dimension. To prevent errors in the MDX script as described by option B above, I’ve added IsError() checks around all my calculation parts that could raise an error. For example, a simple calculation like Quantity * Price, where the price measure could be secured, looks like this:


    CREATE MEMBER CURRENTCUBE.[Measures].[Sales Amount] AS
      IIf (
        IsError ( [Measures].[Price] ),
        NULL,
        [Measures].[Quantity] * [Measures].[Price]
      ),
      NON_EMPTY_BEHAVIOR =
      {
        IIf (
          IsError ( [Measures].[Price] ),
          NULL,
          [Measures].[Price]
        )
      } ;


    This calculation would not raise an error but return NULL if a user doesn’t have permission on the Price measure. Quite straightforward; the only trick here is the IsError() check in the NON_EMPTY_BEHAVIOR. Of course you would have to do this in the script view of your SSAS calculations tab. In the form view this will look a bit strange, but it doesn’t cause any errors: 




    Using this approach you are able to implement measure security using dimension security, while you are still able to create MDX calculations with NON_EMPTY_BEHAVIOR based on these secured measures. This made the report I talked about before run in 5 seconds instead of 20 minutes. I’ve used this approach for quite some years now, and it has always been sufficient.


    At the moment I’m working on a very big enterprise cube that contains almost 100 dimensions, 25 measure groups and millions of records. I’ve also implemented measure security like this, and after a while I noticed some really heavy queries (reports) took quite some time, for example 30 seconds or 1 minute. Reason enough to re-think this approach. When I was at the SQL Server Days in Belgium last year, I discussed this approach with Chris Webb. Of course he could understand what I was doing here and we agreed to email about this approach later on. Chris emailed me a blog post of his from some time ago, in which he explains how to create secured calculations while using dimension security, by using named sets and SCOPE statements:




    CREATE SET myset1 AS
      IIf (
        IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ),
        { },
        { Measures.Test }
      ) ;

    SCOPE ( myset1 ) ;
        This = Measures.[Internet Sales Amount] ;
    END SCOPE ;

    CREATE SET myset2 AS
      IIf (
        IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ),
        { Measures.[Internet Tax Amount] },
        { Measures.[Internet Sales Amount], Measures.[Internet Tax Amount] }
      ) ;

    SCOPE ( myset2 ) ;
        This = Measures.CurrentMember * 2 ;
    END SCOPE ;


    I did not know about this approach, so I tried it out. To be able to get good test results I used a very heavy query that hopefully nobody would ever try: a calculation over all sales data (millions of records) against all customers, all products and all time.  Unfortunately, the results were not very pleasing yet:


    1. Calculation with IIF/IsError checks in both the calculation part and the non empty part: ran in 50 seconds.


    2. Calculation with named set/scope approach: ran longer than 5 minutes, after that I stopped the query.


    So Chris and I emailed again, and I was wondering if the lack of NON_EMPTY_BEHAVIOR (NEB) could be the cause of the slow performance of the named set/scope approach. Chris said that since SSAS 2008 the usage of NEB was not necessary anymore, something I’ve heard about before. I always had the idea that adding NEB did make a difference in some cases, so I never stopped using it. So I kind of merged Chris’s approach and mine, adding the NEB using an IsError() function, and I replaced the StrToMember check with a direct reference to the measure in the named sets, because in my opinion that part was not necessary:

    IsError ( StrToMember ( "Measures.[Internet Sales Amount]" ) ) became: Measures.[Internet Sales Amount].


    The result was almost unbelievable, the query now took not 5 minutes, not 50 seconds but only 2 seconds!! So adding NEB still can make a huge difference sometimes! After finding out about this, I contacted Chris again and he was also very surprised by the result.


    I’ve created an MDX calculation template with some explanation added as comments. The big performance gain is achieved because the IsError() check doesn’t need to be executed every time the calculation is executed because it’s already executed when the static named set is created (one single time), afterwards this value just seems to be reused. I think the same applies for the IsError() check in the NEB, it seems it’s only executed once and is reused. Anyway, the result is, your MDX calculations are just as fast with security applied as without. Something that was not possible for a long time for me and I guess everybody else too. Just use the template below as a standard for creating your calculations and benefit from this solution:


    --Calculation Template:

    ----------1: CHECKED CALCULATION: The secured "end product" used in other calculations----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED]
     AS NULL
      , VISIBLE = 0 ;

    ----------2: CHECK NAMED SET: Actual permission check performed here----------
    CREATE SET [Quantity_CHECK] AS
      IIf (
       IsError ( [Measures].[Quantity] )
       , { }
       , { [Measures].[Quantity_CHECKED] }
      ) ;

    ----------3: SCOPE: Assignment of either the measure or "nothing" to CHECKED calculation (1)----------
    SCOPE ( [Quantity_CHECK] ) ;
        This = [Measures].[Quantity] ;
    END SCOPE ;

    ----------4: Second secure calculation created here----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Price_CHECKED]
     AS NULL
      , VISIBLE = 0 ;

    CREATE SET [Price_CHECK] AS
      IIf (
       IsError ( [Measures].[Price] )
       , { }
       , { [Measures].[Price_CHECKED] }
      ) ;

    SCOPE ( [Price_CHECK] ) ;
        This = [Measures].[Price] ;
    END SCOPE ;

    ----------5: Calculation based on the secure calculations.
    ----------Non_Empty_Behavior set with IIf/IsError on the measure. Using calculations inside NEB is not possible----------
    CREATE MEMBER CURRENTCUBE.[Measures].[Sales Amount] AS
      [Measures].[Quantity_CHECKED] * [Measures].[Price_CHECKED]
      , NON_EMPTY_BEHAVIOR =
       IIf (
        IsError ( [Measures].[Price] )
        , NULL
        , { [Measures].[Price] }
       )
      , VISIBLE = 1 ;



    Your calculations tab in SSAS will look like:



    When I was investigating the inner workings of the first part of the script, I added some comments to make it easy for myself to remember. I guess it’s also useful for anyone who wants to know what’s really happening here:  

    CREATE MEMBER CURRENTCUBE.[Measures].[Quantity_CHECKED] --> Calculation is initially NULL and will be filled based on user rights later on.

    AS NULL,

    VISIBLE = 0; --> Make it invisible as this is a pure technical calculation that should not be used by end users. End users can use the measure this calculation is based on, if they have permission (set by dimension security on the Measures dimension).


    CREATE SET [Quantity_CHECK] AS --> Named Set will be used to perform the actual check for user rights, as this is a static named set this check will only be executed one time, at initialization, and will be reused.  

    IIf(IsError([Measures].[Quantity]) --> An IsError on the measure that needs to be checked will raise an error if a particular user has no rights on it because in that case the measure will simply not exist.

    , {} --> If an error was raised by the IsError function set the value of this set to nothing: {}

    , {[Measures].[Quantity_CHECKED]}); --> If no error was raised the user has rights to access the measure, in this case set the value of the set to the required CHECKED calculation created in the previous step.

    SCOPE([Quantity_CHECK]);  --> If in the Scope of the CHECK calculation, either the CHECK calculation can be {} (Nothing) or it can be the CHECKED calculation (based on user rights).  

    This=[Measures].[Quantity]; --> Assign the measure to This. This can be the CHECK calculation if the user has rights which will pass the measure through to the initial CHECKED calculation or it can be {}/Nothing which will pass the value of the measure to nothing instead of to the CHECKED calculation.

    END SCOPE; --> So this SCOPE statement either passes the measure to the CHECKED calculation or to nothing (the empty set {}).

  • SSIS Denali CTP3 – What’s new?

    Last week Microsoft released CTP3 of SQL Server Integration Services (SSIS), code name: Denali. In this blog post I will look into the new key features and some of the minor improvements in SSIS Denali.

    1. Development

    Shared Connection Managers
    Connection Managers can now be shared on SSIS project level.
    You can create them in the solution explorer, in the folder Connection Managers:

    Once created, they will appear automatically in all your SSIS packages. The names are in bold so you can recognize them between your package-level connection managers:

    You can also create Shared Cache Connection Managers for your cached lookups. This will be very useful for lookups that are performed multiple times, for example when you look up dimension tables multiple times from your fact table ETL. You can just convert the local cache connection manager by right clicking it:

    What happened to the Shared Data Sources we knew from prior SSIS versions? Well they are no more. Shared Data Sources only lived at design time and not at runtime so they were not really useful anyway. Shared Connection Managers do live at runtime and offer even more possibilities as I will show later on, so they replace the old Shared Data Sources.

    Data Flow - Column mappings
    SSIS always mapped columns from source to transformations or destinations with the help of lineage ids. Every column had a unique metadata ID that was known by all components in the data flow. If something changed in the source, this would break the lineage ids and raise error messages like: The external metadata column collection is out of synchronization with the data source columns.
    To fix this error you would re-map all broken lineage ids with the “Restore Invalid Column References Editor”.
    In Denali lineage-ids are no longer used. Mappings are done on column names, which is great because you can now use auto map on column names and even copy/paste pieces of another data flow and connect them by mapping the corresponding column names.

    Data Flow - Flexible order of authoring
    This improvement helps you edit data flow components even when they don’t have an input attached. Theoretically you can build your data flow backwards; start with the destination and track back to the source.

    Data Flow - Groupings
    You can now group data flow components. Select the components you wish to group, right click and select Group:

    The result is some sort of a data flow sequence container:

    By clicking the arrow it will collapse:

    Data flow groups are 100% eye candy; you can’t set any properties on them.

    Data Flow - Data Quality Services Cleansing transformation
    With this transformation you can apply data quality rules in the data flow. This is done by using a Knowledge Base which can be created by yourself or downloaded from the Windows Azure Marketplace. For example you could apply a rule that checks if a given postal code column is valid for a particular town column in your record.

    Data Flow - Data Tap
    In Denali, we have the possibility to attach a “tap” at a data flow path (arrow). This tap captures all data coming through and dumps it in CSV files.

    2. Configuration

    SSIS Denali is fundamentally different from its predecessors when it comes to the concept and usage of configurations. SSIS package configurations are obsolete (they will still be available if you really want to use them ;-)) and parameters and environments are the new kids in town.

    SSIS Parameters look a lot like SSIS variables but there are some differences. There are two types of these parameters:
    1. Package Parameters:
    Look at Package Parameters as C# parameters, which are passed as input to a C# function (= your package). You can set them when executing (calling) a package, and the lifecycle of the parameters is limited to the SSIS package itself.

    The difference with SSIS variables? You can set the parameters while executing a package with SSMS or an Execute Package Task.

    Define your Package Parameters at the Package Parameters tab:

    2. Project Parameters:
    Identical to Package Parameters, except for the scope, these Parameters exist on project level and can be referenced throughout all SSIS packages in your project. Package Parameters only live inside a single SSIS package.

    You can define Project Parameters in the solution explorer within your SSIS project:

    Both Package and Project Parameters can be referenced from your SSIS packages, you recognize them by their prefix, $Package or $Project:

    Setting parameters in an Execute Package Task is achieved by the new Parameter bindings tab:

    It’s also possible to parameterize SSIS tasks on the Control Flow by right clicking them and choose Parameterize:

    Loads of capabilities here! I now realize that I have created a workaround for Package Parameters with my SSIS Package design pattern for loading a data warehouse where I (mis)used Package Configurations as Package Parameters. Creating a new package design pattern for Denali definitely goes on my TODO list!

    Environments are a collection of SSIS package settings that can be defined on the SSIS Server. At runtime, the environment will override these settings in the SSIS packages. You can create multiple environments, and when you run your SSIS packages you can select which environment they should use. It’s also possible to let multiple SSIS projects run under one environment, so there is flexibility all around.

    To make you understand the principle of Environments right away I have created a simple example that you will commonly use: Create two Environments, one with development settings and one with production settings.

    I have deployed a Test SSIS project to SSMS which contains one SSIS package with one Shared Connection Manager. Notice the new Integration Services folder structure:

    Next right click Environments and choose Create Environment:

    Type Development as name for the Environment and click OK:

    Now double click the Development Environment:

    Click on the Variables tab and create an Environment Variable which will overwrite the Shared Connection Manager. Type in the connection string for the development server as the Value and click OK:

    Next create another Environment with the name Production and also create an Environment Variable with the name SCM_EnvironmentVar. The only difference between these two variables should be the value of the variable; it contains the production server connection string instead of the development value.
    You now have two Environments with one Environment Variable for the Shared Connection Manager each:

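    By the way, everything the GUI does here can also be scripted. A rough T-SQL equivalent of the steps above, assuming the project was deployed to a catalog folder called Test and using obviously fake connection strings:

    EXEC [SSISDB].[catalog].[create_environment]
         @folder_name = N'Test', @environment_name = N'Development';
    EXEC [SSISDB].[catalog].[create_environment]
         @folder_name = N'Test', @environment_name = N'Production';

    -- One variable per environment; only the value differs.
    EXEC [SSISDB].[catalog].[create_environment_variable]
         @folder_name      = N'Test',
         @environment_name = N'Development',
         @variable_name    = N'SCM_EnvironmentVar',
         @data_type        = N'String',
         @sensitive        = 0,
         @value            = N'Data Source=DEVSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;',
         @description      = N'ConnectionString for the Shared Connection Manager';

    EXEC [SSISDB].[catalog].[create_environment_variable]
         @folder_name      = N'Test',
         @environment_name = N'Production',
         @variable_name    = N'SCM_EnvironmentVar',
         @data_type        = N'String',
         @sensitive        = 0,
         @value            = N'Data Source=PRODSERVER;Initial Catalog=SourceDB;Integrated Security=SSPI;',
         @description      = N'ConnectionString for the Shared Connection Manager';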
    We now need to reference the Development and Production Environments in the Test project so they can be used. Right click the Test project and choose Configure:

    Go to the references page and click Add..

    Add both Environments:

    Both environments are now referenced from the Test SSIS project and are available. Let’s configure the Shared Connection Manager so it uses the value specified in one of the environments. Click on the Parameters page, then the Connection Managers tab, and then the ellipsis (…) button next to the ConnectionString property of the Shared Connection Manager:

    Now select Use Environment Variable and select SCM_EnvironmentVar. Click OK:

    The name of our Environment Variable is shown in the Value box of the ConnectionString property:

    We have now successfully attached the Environment Variable to override the ConnectionString property with a value from either the Development or the Production Environment. Executing the package under one of these Environments is very easy. To do so, right-click the package and choose Run:

    The Run Package dialog box appears and you instantly see a message that reminds you to assign a value to the ConnectionString property of connection manager SharedConnectionManager:

    At the bottom of the dialog box you can select the Environment your package should use. Select the .\Development Environment and click OK:

    In the overview report you see that the package has successfully run under the Development Environment:

    If you change the package to use the Production Environment and you run the package again, you’ll get the following execution information:

    Pretty straightforward and easy to use. The big difference? All the configuration work is done in SQL Server Management Studio, exactly the place where it should be done. In the current SSIS versions most of this work was done within Visual Studio, and configurations were edited directly in tables. So unfriendly! Big improvement, great work MS!
    In this example I have set a property of a Shared Connection Manager but of course it is possible to set Package or Project Parameters with Environment Variables.
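    And just like the environments themselves, referencing an environment, binding the connection manager and running the package under that environment can be scripted. A sketch of the T-SQL involved (again assuming a catalog folder called Test and a package called Package.dtsx, which are just placeholders):

    DECLARE @reference_id BIGINT, @execution_id BIGINT;

    -- Reference the Development environment from the Test project.
    EXEC [SSISDB].[catalog].[create_environment_reference]
         @folder_name      = N'Test',
         @project_name     = N'Test',
         @environment_name = N'Development',
         @reference_type   = N'R',      -- relative: environment lives in the same folder
         @reference_id     = @reference_id OUTPUT;

    -- Bind the Shared Connection Manager's ConnectionString to the environment variable.
    -- Connection manager properties are addressed as 'CM.<name>.<property>';
    -- value_type 'R' means the value is referenced from an environment variable.
    EXEC [SSISDB].[catalog].[set_object_parameter_value]
         @object_type     = 20,         -- 20 = project level
         @folder_name     = N'Test',
         @project_name    = N'Test',
         @parameter_name  = N'CM.SharedConnectionManager.ConnectionString',
         @parameter_value = N'SCM_EnvironmentVar',
         @value_type      = 'R';

    -- Run the package under the Development environment by passing the reference.
    EXEC [SSISDB].[catalog].[create_execution]
         @folder_name  = N'Test',
         @project_name = N'Test',
         @package_name = N'Package.dtsx',
         @reference_id = @reference_id,
         @execution_id = @execution_id OUTPUT;

    EXEC [SSISDB].[catalog].[start_execution] @execution_id;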

    3. Deployment

    We have had the option to deploy SSIS packages from Visual Studio for a long time thanks to BIDS Helper. Microsoft finally added this functionality out of the box in Denali. They even did more: you can now also build your projects from within Visual Studio!

    There are two ways to deploy (push) your projects to the new SSIS Server: directly from Visual Studio, or by using an Integration Services Project Deployment File. Both use the new Integration Services Deployment Wizard.
    Deploying directly from Visual Studio is done by right-clicking your project and choosing Deploy; the Deployment Wizard will then pop straight up:

    The other way is to build your project first. Building was already possible in SSIS versions prior to Denali, but it did not really serve a useful purpose.

    In Denali this is different: building your project generates an .ispac file, which contains your entire project and can be used by the Deployment Wizard. Again a very nice new feature.

    Handing over a new release of an SSIS solution to a customer’s IT department for deployment can now be done in a very nice, professional manner. .ispac files can also be opened by Visual Studio to import your SSIS project.
    Building your project can be done by right clicking your project and choosing Build:

    The output window displays the following:

    When looking at the bin\Development folder we see the new Test.ispac file:


    Double-clicking this file (or choosing Deploy in Visual Studio) will start the new Integration Services Deployment Wizard:

    You can now select the project that you wish to deploy. You can either use the .ispac project deployment file or deploy an existing project that is already located in an Integration Services catalog. The last option is useful when you wish to deploy, for example, a TEST SSIS solution that is already located in a catalog to an ACCEPTANCE environment in another catalog.

    The project is loaded and validated:

    Next you need to select the destination, which has to be an SSIS Server:

    Review your selections and click on Deploy to start the actual deployment:

    The results show a successful deployment. Notice that the protection level has been changed. What happened here?
    The wizard clears the protection level, as we know it from Visual Studio, to prevent “cannot decrypt password” errors.

    Instead of using protection levels, passwords are stored in clear text inside the package; the entire package (with the passwords in it) is then stored encrypted in the SSIS Server tables you are deploying to.

    The project has now been deployed to the server:

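    The wizard is not the only option, by the way; the same push can be scripted against the catalog. A minimal sketch, assuming the .ispac sits at a local path of my choosing and the target catalog folder is called Test:

    DECLARE @project_stream VARBINARY(MAX), @operation_id BIGINT;

    -- Read the .ispac from disk (path is a placeholder).
    SELECT @project_stream = BulkColumn
    FROM OPENROWSET(BULK N'C:\Projects\Test\bin\Development\Test.ispac', SINGLE_BLOB) AS ispac;

    -- Push it into the catalog.
    EXEC [SSISDB].[catalog].[deploy_project]
         @folder_name    = N'Test',
         @project_name   = N'Test',
         @project_stream = @project_stream,
         @operation_id   = @operation_id OUTPUT;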
    When you right-click the Test project and choose Versions you are able to see the current version of your project:

    If you deploy a newer version later on, the existing version(s) stay on the server and you can easily roll back to a previous version if you’ve made a bad deployment.
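    Those versions can also be inspected and restored with T-SQL; a sketch (the folder and project names are assumed, and the LSN value is obviously something you would look up first):

    -- List the versions the catalog keeps for the project.
    SELECT object_version_lsn, object_name, created_by, created_time
    FROM   [SSISDB].[catalog].[object_versions]
    WHERE  object_name = N'Test'
    ORDER BY created_time DESC;

    -- Roll back to a specific version by its LSN (value taken from the query above).
    EXEC [SSISDB].[catalog].[restore_project]
         @folder_name        = N'Test',
         @project_name       = N'Test',
         @object_version_lsn = 12345;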

    The diagram below shows the entire deployment life cycle (source: TechNet):

    4. Management

    The new SSIS Server is the central storage and administration point for your SSIS solutions. It is not a separate standalone server that you need to manage; basically it is a database with stored procedures.

    You now have an Integration Services node available in SSMS when you connect to the database engine:

    Under the Integration Services node you will find your SSISDB catalog, which holds all your SSIS solutions with their packages, environments, etc. The physical storage of these objects is handled by a SQL Server database with the same name as the catalog:

    This database also contains the stored procedures that hold all the programming code for the SSIS Server:

    With the help of these stored procedures you can manage your SSIS Server: for example your parameter values, connection managers, and overriding properties by using environments.
    Next to the configuration functionality I discussed earlier you can also implement security to control access to the catalog, both at folder level and at package level. When you right-click a folder or package in the catalog and choose Properties you’ll get the following window where you can manage security access:

    Finally you get an out-of-the-box reporting dashboard, built on the logging functionality that SSIS offers out of the box. On this dashboard you’ll find the execution history of your package and its sub-packages, the parameters that were used, specific messages, and a performance report over time.
    All this information is logged automatically whenever a package runs on the server. The information is very detailed; you can even get the row counts between SSIS data flow transformations!
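    Under the covers this information sits in catalog views that you can query yourself; for example (the execution_id filter is a placeholder):

    -- Recent executions and their status (7 = succeeded, 4 = failed, 2 = running, ...).
    SELECT execution_id, folder_name, project_name, package_name, status, start_time, end_time
    FROM   [SSISDB].[catalog].[executions]
    ORDER BY start_time DESC;

    -- Row counts per data flow path for one execution.
    SELECT package_name, task_name, dataflow_path_name,
           source_component_name, destination_component_name, rows_sent
    FROM   [SSISDB].[catalog].[execution_data_statistics]
    WHERE  execution_id = 12345;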

    When running a package you are able to select a logging level on the Advanced tab:
    You can choose from (the sketch after this list shows the matching LOGGING_LEVEL values when you start an execution from T-SQL):
    - None: logging is turned off for performance reasons
    - Basic: error and warning logging
    - Performance: detailed trace information
    - Verbose: diagnostics and fault debugging
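    When you start an execution from T-SQL instead of the Run dialog, the logging level is just a system parameter on the execution; a small sketch with placeholder names:

    DECLARE @execution_id BIGINT;

    EXEC [SSISDB].[catalog].[create_execution]
         @folder_name  = N'Test',
         @project_name = N'Test',
         @package_name = N'Package.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- LOGGING_LEVEL: 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose.
    EXEC [SSISDB].[catalog].[set_execution_parameter_value]
         @execution_id    = @execution_id,
         @object_type     = 50,           -- 50 = system parameter
         @parameter_name  = N'LOGGING_LEVEL',
         @parameter_value = 3;            -- Verbose

    EXEC [SSISDB].[catalog].[start_execution] @execution_id;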

    When I run my Test package I’ll get the following execution dashboard. There are some hyperlinks that navigate to more detailed reports.

    So there is no need to create your own logging framework anymore; it’s all out of the box!

  • SSIS - Connect to Oracle on a 64-bit machine (Updated for SSIS 2008 R2)

    We recently had a few customers where a connection to Oracle on a 64-bit machine was necessary. A quick search on the internet showed that this could be a big problem; I found all kinds of blog and forum posts from developers complaining about it. A lot of developers will recognize the following error message:

    Test connection failed because of an error in initializing provider. Oracle client and networking components were not found. These components are supplied by Oracle Corporation and are part of the Oracle Version 7.3.3 or later client software installation.
    Provider is unable to function until these components are installed.

    After a lot of searching, trying and debugging I think I found the right way to do it!


    Because BIDS is a 32-bit application, on 32-bit as well as 64-bit machines, it cannot see the 64-bit driver for Oracle. Because of this, connecting to Oracle from BIDS on a 64-bit machine will never work when you only install the 64-bit Oracle client.

    Another problem is the "Microsoft Provider for Oracle": this driver only exists in a 32-bit version, and Microsoft has no plans to create a 64-bit one in the near future.

    The last problem I know of is in the Oracle client itself; it seems that a connection will never work with the Instant Client, so always use the full client.
    There are also a lot of problems with the 10g client, one of them being that it can't handle the "(x86)" in the path of SQL Server. So using the 10g client is not an option!


    • Download the Oracle 11G full client.
    • Install the 32-bit AND the 64-bit version of the 11G full client (Installation Type: Administrator) and reboot the server afterwards. The 32-bit version is needed for development from BIDS, which is 32-bit; the 64-bit version is needed for production with the SQLAgent, which is 64-bit.
    • Configure both Oracle clients (32-bit and 64-bit) by editing the files tnsnames.ora and sqlnet.ora. Try to do this with an Oracle DBA or, even better, let him/her do it.
    • Use the "Oracle Provider for OLE DB" from SSIS; don't use the "Microsoft Provider for Oracle" because a 64-bit version of it does not exist.
    • Schedule your packages with the SQLAgent.

    Background information

    • Visual Studio (BI Dev Studio) is a 32-bit application.
    • SQL Server Management Studio is a 32-bit application.
    • dtexecui.exe is a 32-bit application.
    • dtexec.exe has both 32-bit and 64-bit versions.
    • There are x64 and x86 versions of the Oracle provider available.
    • SQLAgent is a 64-bit process.

    My advice to BI consultants is to get an Oracle DBA or professional to install and configure the two full clients (32-bit and 64-bit). Tell the DBA to download the biggest client available; this way you are sure they pick the right one ;-)

    Testing whether the clients have been installed and configured correctly can be done with the Windows ODBC Data Source Administrator:
    Administrative Tools...
    Data Sources (ODBC)


    It seems that, unfortunately, some additional steps are necessary for SQL Server 2008 R2 installations:

    1. Open REGEDIT (Start… Run… REGEDIT) on the server and search for the following entry (for the 32-bit driver): HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC\MTxOCI
    Make sure the following values are entered:


    2. Next, search for the following entry (for the 64-bit driver): HKEY_LOCAL_MACHINE\Software\Wow6432Node\Microsoft\MSDTC\MTxOCI
    Make sure the same values as above are entered.

    3. Reboot your server.

  • Replication Services as ETL extraction tool

    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool:
    “Snapshot Replication can also be useful in BI environments: if you don’t need a near real-time copy of the database, you can choose this form of replication. Besides being an alternative to Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehouse environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and whether it really works as well for this purpose as I expect, but I want to try this out soon!”

    Well, I have tried it out and I must say it worked well. I was able to let Replication Services do the work in a fraction of the time it would have cost me to do the same in SSIS.
    What I did was the following:

    1. Configure snapshot replication for some AdventureWorks tables; this was quite simple and straightforward.
    2. Create an SSIS package that executes the snapshot replication on demand and waits for its completion.
      This is something you can’t do with out-of-the-box functionality. While configuring the snapshot replication two SQL Agent Jobs are created: one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous, which means that when you execute them they immediately report back whether the job started successfully or not; they do not wait for completion and report the result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues.

    Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it.

    Configure snapshot replication

    The first step is to create a publication on the database you want to replicate.
    Connect to SQL Server Management Studio, right-click Replication and choose New.. Publication…

    The New Publication Wizard appears, click Next

    Choose your “source” database and click Next

    Choose Snapshot publication and click Next

    You can now select tables and other objects that you want to publish

    Expand Tables and select the tables that are needed in your ETL process

    In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example.

    It’s possible to filter out rows and/or columns. In this example I did not apply any filters.

    Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on.

    Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.

    In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!

    On the next screen choose to create the publication at the end of the wizard

    Give the publication a name (SnapshotTest) and complete the wizard

    The publication is created and the articles (tables in this case) are added

    Now that the publication has been created successfully, it’s time to create a new subscription for this publication.

    Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions

    The New Subscription Wizard appears

    Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)

    You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor.

    Of course we need a database for the subscription, and fortunately the Wizard can create it for you. Choose New database

    Give the database the desired name, set the desired options and click OK

    You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.

    You now need to set the security settings for the Distribution Agent. Click on the …. button

    Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here

    Click Next

    Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on.

    Initialize the subscription at first synchronization

    Select the first box to create the subscription when finishing this wizard

    Complete the wizard by clicking Finish

    The subscription will be created

    In SSMS you see that a new database has been created: the subscriber. There are no tables or other objects in the database yet because the replication jobs have not run yet.

    Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:

    Rename this job to “CreateSnapshot”

    Now search for the job that distributes the snapshot:

    Rename this job to “DistributeSnapshot”

    Create an SSIS package that executes the snapshot replication

    We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs, and after the DistributeSnapshot job has started the package needs to wait until it is finished before the package execution completes.
    The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only starts the job and reports back whether it started successfully or not; it does not report whether the job actually completed with success or failure. This is because these jobs are asynchronous.

    The SSIS package I’ve created does the following:

    1. It runs the CreateSnapshot job
    2. It checks every 5 seconds whether the job has completed, using a For Loop (roughly the T-SQL sketched below)
    3. When the CreateSnapshot job is completed it starts the DistributeSnapshot job
    4. And again it waits until the snapshot is delivered before the package finishes successfully
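    To give an idea of the logic, the start-and-wait pattern in these steps roughly boils down to T-SQL along these lines (a sketch; in the package this is spread over tasks inside a For Loop container rather than a single WHILE loop, but the idea is the same):

    -- Start the CreateSnapshot job and poll msdb until it is no longer executing.
    EXEC msdb.dbo.sp_start_job @job_name = N'CreateSnapshot';

    WHILE EXISTS (
        SELECT 1
        FROM msdb.dbo.sysjobactivity ja
        INNER JOIN msdb.dbo.sysjobs j ON j.job_id = ja.job_id
        WHERE j.name = N'CreateSnapshot'
          AND ja.start_execution_date IS NOT NULL
          AND ja.stop_execution_date IS NULL )
    BEGIN
        WAITFOR DELAY '00:00:05';   -- check every 5 seconds
    END

    -- ...and then repeat the same start-and-wait pattern for the DistributeSnapshot job.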


    Quite simple, and the package is ready to use as a standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:


    Download the SSIS package here (SSIS 2008)


    In this example I only replicated 5 tables, so I could have created an SSIS package that does the same in approximately the same amount of time. But if I had to replicate all 70+ AdventureWorks tables, replication would save me a lot of time and boring work! With Replication Services you also benefit from the fact that schema changes are applied automatically, which means your entire extract phase won’t break. Because a snapshot is created using the bcp utility (bulk copy) it is also quite fast, so performance should be good.

    A disadvantage of using snapshot replication as an extraction tool is the limitation on source systems: only SQL Server and Oracle databases can act as a publisher.

    So if you plan to build an extract phase for your ETL process that involves a lot of tables, think about Replication Services; it could save you a lot of time, and thanks to the Extract SSIS package I’ve created you can fit it perfectly into your usual SSIS ETL process.

  • Replication Services in a BI environment

    In this blog post I will explain the principles of SQL Server Replication Services without going into too much detail, and I will take a look at the BI capabilities that Replication Services could offer in my opinion.

    SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and to maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformation; they do not offer capabilities to transform data or apply business rules, like ETL tools do.
    The only “transformation” Replication Services offers is filtering records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE clauses like this:
    SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60
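    When you script a publication instead of clicking it together, such a filter ends up in the article definition. A rough sketch with made-up publication and table names (SSMS can generate the complete script, including the supporting filter objects, for you):

    -- Add a horizontally filtered article to an existing publication (placeholder names).
    EXEC sp_addarticle
         @publication   = N'MyPublication',
         @article       = N'MyTable',
         @source_owner  = N'dbo',
         @source_object = N'MyTable',
         @filter_clause = N'[DateTime] >= getdate() - 60';

    -- Columns can be filtered out of an article with sp_articlecolumn.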

    There are three types of replication:

    Transactional Replication

    Transactional replication components and data flow

    This type replicates data at the transaction level. The Log Reader Agent reads directly from the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor); this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber.
    The Distribution Agent can either run at scheduled intervals or continuously, which offers near real-time replication of data!

    So, for example, when a user executes an UPDATE statement on one or more records in the Publisher database, the transaction (not the data itself) is copied to the Distribution Database and then also executed on the Subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the Publisher are replicated in small batches (near real-time); when it runs at scheduled intervals it executes larger batches of transactions, but the idea is the same.

    Snapshot Replication

    Snapshot replication components and data flow
    This type of replication makes an initial copy of the database objects that need to be replicated, including both the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber; Transactional Replication needs an initial snapshot of the replicated Publisher tables/objects to run its cloned transactions on and maintain consistency.

    The Snapshot Agent copies the schemas of the tables that will be replicated to files stored in the snapshot folder, which is a normal folder on the file system. When all the schemas are ready, the data itself is copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (bcp) files. Next, the Distribution Agent moves the snapshot to the Subscriber; if necessary it applies schema changes first and copies the data afterwards. The application of schema changes to the Subscriber is a nice feature: when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s).

    Merge Replication
    Merge replication is typically used in server-to-client environments, for example when Subscribers need to receive data, make changes offline, and later synchronize those changes with the Publisher and other Subscribers, like mobile devices that need to synchronize once in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further.

    Replication Services in a BI environment
    Transactional Replication can be very useful in BI environments. In my opinion you never want users to run custom (SSRS) reports or PowerPivot solutions directly against your production database; it can slow down the system and cause deadlocks, which in turn cause errors. Transactional Replication can offer a read-only, near real-time copy of the database for reporting purposes with minimal overhead on the source system.

    Snapshot Replication can also be useful in BI environments: if you don’t need a near real-time copy of the database, you can choose this form of replication. Besides being an alternative to Transactional Replication, it can be used to stage data so it can be transformed and moved into the data warehouse environment afterwards.
    In many solutions I have seen developers create multiple SSIS packages that simply copy data from one or more source systems to a staging database that serves as the source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically, so I think it offers enough features here. I don’t know how the performance will be and whether it really works as well for this purpose as I expect, but I want to try this out soon!

    I got a question regarding the supported Replication Services features in the different editions of SQL Server (Standard, Enterprise, etc.). There is a nice table on MSDN that shows this!

  • SSIS Denali as part of “Enterprise Information Management”

    When watching the SQL PASS session “What’s Coming Next in SSIS?” by Steve Swartz, the Group Program Manager for the SSIS team, an interesting question came up:

    Why is SSIS thought of to be BI, when we use it so frequently for other sorts of data problems?

    The answer from Steve was that he breaks the world of data work into three parts:

    • Processing of inputs
    • BI
    • Enterprise Information Management
      All the work you have to do when you have a lot of data to make it useful and clean and to get it to the right place. This covers master data management, data quality work, data integration and lineage analysis to keep track of where the data came from. All of these are part of Enterprise Information Management.

    Next, Steve said that Microsoft is developing SSIS as part of a large push in all of these areas in the next release of SQL Server. So SSIS will be, next to a BI tool, part of Enterprise Information Management.

    I'm interested in the different ways people use SSIS; I've basically used it for ETL, data migrations and processing inputs. In which ways do you use SSIS?
