
Alberto Ferrari

  • DrillThrough Actions and (semi) Security in SSAS OLAP cubes

    Using SSAS 2005/2008, there is no way to apply security to DRILLTHROUGH actions, so we cannot decide whether a specific user is authorized to perform a specific action. This is "by design": since DRILLTHROUGH queries are initiated on the client side, we can decide whether a user can perform DRILLTHROUGH at all but, once we let him do it, he can query anything he wants.

    Nevertheless, if we are not concerned with security but only with user experience, there is a simple trick to define which users can see which actions. Leveraging the Condition box, we can write an MDX condition that uses the UserName function (which returns the currently connected user) and activates the action only for specific users, as in this action, which will be active only for "SQLBI\Alberto":

    [Figure: the action definition, with the Condition box holding the MDX test on UserName]
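
    As a minimal sketch of such a condition (the account name is obviously just an example), the Condition box simply holds a Boolean MDX expression:

    // Action condition: enable the DRILLTHROUGH action only for this user
    UserName = "SQLBI\Alberto"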

    Clearly, since the condition can be any valid MDX expression, we can implement more data-driven logic by creating a measure group that stores action names and users, so that the user name is not hard-coded inside the action.

    What should be very clear is that this workaround is intended to make the user experience with the OLAP cube better (actions are activated only for the users who want to use them) but has nothing at all to do with security. If we enable a user to initiate DRILLTHROUGH actions, he will always be able to query anything. That said, if our customer uses just Excel or a similar client and there are no serious security concerns, then this solution might help him navigate the cube and activate only the required actions.

  • Does SSIS know how to sort a flow of data? Well… no

    Today I ran into a nice misfeature of SSIS that is worth sharing.

    If I use SSIS to sort a flow of data containing both lowercase and uppercase strings, I would expect a standard behaviour or, at least, something with some "common sense". As I am going to show, the final result makes no sense at all and can lead to incorrect computations.

    Let us start with some information I remember from the old days of ASCII. Uppercase letters appear before the lowercase ones in the standard ASCII sequence and hence are sorted before them. Is this still true in the 2009 world, with Unicode and all our new and fancy ways of storing strings? It seems so, if you run this simple VB script:

    If "A" >= "a" Then
        MsgBox("A is greater or equal than a")
    Else
        MsgBox("A is less than a")
    End If

    You will get the correct result: uppercase comes BEFORE lowercase characters. So, since the SSIS Sort component does not provide any "consider/ignore case" checkbox, it has two options for sorting strings:

    • It ignores casing and sorts data like SQL does (see the sketch below).
    • It adheres to the standard sorting sequence and sorts uppercase BEFORE lowercase.
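
    Just to make the two options concrete, here is a small T-SQL sketch of mine (not part of the original post) showing both behaviours on SQL Server itself: under a case-insensitive collation the two values are ties, while a binary collation follows code-point order and puts uppercase first.

    -- Case-insensitive collation: 'ALBERTO' and 'alberto' compare as equal
    SELECT Name
    FROM (SELECT Name = 'ALBERTO' UNION ALL SELECT Name = 'alberto') AS T
    ORDER BY Name COLLATE SQL_Latin1_General_CP1_CI_AS

    -- Binary collation: pure code-point order, so uppercase sorts BEFORE lowercase
    SELECT Name
    FROM (SELECT Name = 'ALBERTO' UNION ALL SELECT Name = 'alberto') AS T
    ORDER BY Name COLLATE Latin1_General_BIN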

    Well, armed with all this powerful knowledge, we can build a simple package that receives three rows from this query:

    SELECT Name = 'ALBERTO', Surname = 'FERRARI'
    UNION ALL
    SELECT Name = 'alberto', Surname = 'ferrari'
    UNION ALL
    SELECT Name = 'alberto', Surname = 'FERRARI'

    Then it sends them into a Sort on Name, Surname, with a data viewer added after the Sort, just to check what the component does:

    [Figure: the test package, with a data viewer attached after the Sort component]

    I was amazed when I looked at the data viewer:

    [Figure: the data viewer output, with the lowercase rows sorted before the uppercase ones]

    SSIS believes that lowercase strings sort BEFORE uppercase ones and produces the rows in the reverse of the expected order. This is pretty important because, if downstream of the Sort you trust the data to be sorted, your system will fail, as was the case with some code I received to test just today.

    The final lesson is pretty simple: do not trust common sense. For some (very) obscure reason the SSIS team decided to provide "Yet Another Sorting Method" to the IT world, generating some more confusion in the head of the poor SSIS programmer.

  • European PASS 2009

    Guys, I'm excited to be a speaker at the European PASS Conference 2009 (April 22-24, 2009 - Neuss, Germany).

    I will present three sessions, on Slowly Changing Dimension Handling, OLAP partitioning with SSIS and scripts and, most interesting (at least for me), the SQLBI Methodology with Marco Russo.

    If you are going to attend the meeting and want to share a beer and some chat, you’ll be welcome!

    See you.

    Alberto

  • SSAS: materialized reference dimensions might produce incorrect results

    Reference dimensions let you create a relationship between a measure group and a dimension using an intermediate dimension to act as a bridge between them. In the Adventure Works demo, for example, they are used to link reseller sales and orders to the geography dimension.

    [Figure: the Dimension Usage tab of Adventure Works, showing the referenced relationship to the Geography dimension]

    Now, something that is less evident is how SSAS handles this specific kind of relationship. Let us look further into the definition of the relationship:

    [Figure: the referenced relationship definition, with the Materialize option]

    If the relationship is not materialized (something SSAS will complain about), it will be resolved at query time, with slower performance. Nevertheless, everything will work correctly.

    If, on the other hand, we follow the SSAS suggestion and decide to materialize it, we need to understand what is going on under the covers, to avoid a very subtle misfeature of SSAS.

    When the relationship is materialized, SSAS resolves it at processing time: during partition processing (not dimension processing, I have written the right word: partition!) it issues a more complex query that JOINs the fact table with the intermediate dimension, to gather the key of the final dimension from the intermediate one. This leads to slower processing, but that is something we already know. What is important to point out is that the relationship is resolved during partition processing, not dimension processing.
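
    To give an idea of what "more complex" means, the processing query conceptually turns into something like the following T-SQL sketch (mine, heavily simplified, using the AdventureWorks DW tables; the SQL actually generated by SSAS is more involved):

    -- Partition processing with a materialized reference relationship:
    -- the key of the final dimension (Geography) is fetched by joining the
    -- fact table with the intermediate dimension (Reseller) at processing time
    SELECT
        f.SalesAmount,
        f.ResellerKey,
        r.GeographyKey      -- resolved here, and frozen until the partition is reprocessed
    FROM dbo.FactResellerSales f
    JOIN dbo.DimReseller r
        ON r.ResellerKey = f.ResellerKey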

    The real problem comes when we need to update the relationship. We might think (at least this is what I thought) that issuing a ProcessUpdate on the intermediate dimension would refresh the relationship and perform all the necessary updates. This is not true. Since the materialized relationship is resolved during partition processing, no dimension processing will refresh it. You can try to ProcessUpdate the intermediate or the final dimension: no way, the relationship still uses the data that was there when the partition was processed the first time. If we want to refresh the relationship, the only way is to reprocess the partition; at that point the new data will be correctly available.

    This might be a very big problem if, for very large cubes, we decide to reprocess only the newest partitions, rebuilding only the needed aggregations for the older ones. If the older partitions are not reprocessed, they will keep using the older version of the relationship. We will end up with an inconsistent situation where some partitions use one version of the relationship and others use a different one, depending on when they were last processed.

    Clearly, if we rely on non-materialized relationships, everything will work fine, since the relationship relies only on dimensional data and does not need to be resolved during partition processing. In fact, issuing a ProcessUpdate on the intermediate dimension will immediately refresh the non-materialized relationship, as it is supposed to do.

    I think BOL should be much clearer on this: materializing a relationship is not just a matter of query speed; it has subtle consequences that, at their extreme, might lead to incorrect results from the cube.

  • Measure Tool – Making SSAS measures physical to apply security

    The security model of SSAS prevents us from applying security restrictions to calculated members. We can hide physical members, but we have no means of hiding calculated members from a role. By denying access to the underlying physical measures, the user will not be able to see the result of the calculated measure; nevertheless, he will still see the metadata definition of the member and, if he tries to add it to a query, he will get an error.

    Even if this approach is effective, it leads to a very poor user experience: it would be much better to completely hide the calculated measure from any user who does not have access to it.

    A solution to this situation might be found by fooling SSAS: if we create an empty physical measure and then use MDX code to override its value with our calculation, then we will be able to apply security to the physical measure and still define its value as if it were a calculated member.

    An example is much easier to understand than any theoretical explanation: let us suppose that we want to make the Gross Profit calculated measure of Adventure Works a physical one.

    Adapting the technique shown in this post by Marco Russo, we might be tempted to create a measure group containing an empty physical measure (GrossProfit) without linking the measure group to any dimension. Sadly, SSAS refuses to deploy a cube that contains a measure group not related to any dimension, so we need to create a physical measure group linked to at least one dimension in order to make this technique work.

    We can define a new view:

    CREATE VIEW MeasureTool AS
         SELECT 
            GrossProfit = CAST (0 AS REAL),
            TimeKey
         FROM
            DimTime

    We then add the view to our DSV and create a measure group based on it, relating it to the time dimension through TimeKey. Note that we chose DimTime just because it is small: any very small dimension will do, as we are not really using this relationship in any way; it is only a technicality needed to make the OLAP cube deployable.

    If we deploy the cube at this point, the new physical measure will always contain zero, which is not what we want. Nevertheless, a very simple MDX SCOPE assignment solves the problem and performs the magic:

    SCOPE (Measures.GrossProfit);
        THIS = [Measures].[Sales Amount] - [Measures].[Total Product Cost];
    END SCOPE;

    Now the GrossProfit measure performs the correct computation but it is a physical measure, so we can apply security on it and hide it from all the users who are not allowed to see it.
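
    As a sketch of that final step (one possible way to do it, not covered in the post): in the role that must not see the measure, dimension data security can be applied to the cube's Measures dimension, using a denied member set such as the following, so GrossProfit disappears from the metadata for that role while staying visible to everyone else.

    -- Hypothetical denied set for the role (Dimension Data security on the Measures dimension)
    { [Measures].[GrossProfit] }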

    The physical space used by the cube grows but, using a small dimension, the growth is negligible.

  • SQLBI: Yet Another DWH Methodology?

    For several years now, Marco Russo and I have enjoyed developing BI solutions and teaching our customers how to build them. We started with the Kimball methodology, then tried Inmon's, and finally a mix of both, tailored to our needs and those of our customers. We tested the method on some projects and, in the end, came up with a complete methodology that we now adopt as "our" standard.

    When we felt confident that our methodology was mature enough to be described, we decided to write a first whitepaper introducing it. The goal of publishing it on the web is to discuss it with everybody interested in the development of BI solutions.

    In the paper we do not spend too much time on theoretical discussions about what a fact or a dimension is; our focus is where our customers are: "on the market". We do not talk about generic tools: we use the Microsoft BI suite and our efforts go in the direction of getting the best out of Microsoft tools.

    We decided to share our experience with other BI experts by publishing the paper, and we'd like to get comments and feedback on it. If you want to drop a line with your thoughts, feel free to do it in the SQLBI Methodology forum.

  • SSIS: living in a parallel world? Not yet...

    Today I ran into a nice bug (misfeature?) of SSIS 2008 and I'd like to share it with you. Let's start from the beginning.

    One of the keys to making something fast is to parallelize it. In TableDifference I read from two inputs, compare them and produce several outputs for new, deleted, updated rows and so on. It is the classical "producer/consumer" paradigm, which works fine in parallel, with a consumer thread that compares rows and separate threads that produce them.

    While porting the component to SQL 2008 I discovered that my code did not work as expected. It reads all the rows, then terminates abnormally without producing a single row of output. It took me a couple of hours and several coffees to discover the change in the way SSIS handles asynchronous components that was causing my bug.

    In order to make the problem easier to understand (for Microsoft Connect too), I reproduced the same behaviour with a much simpler script component. Imagine a script transformation that needs to do some very complex work on each row. You will end up with something like this:

        public override void Input0_ProcessInputRow(Input0Buffer Row) {
            Output0Buffer.AddRow();
            Output0Buffer.LineID = VeryComplexFunction(Row);
            if (Row.EndOfRowset()) {
                Output0Buffer.SetEndOfRowset();
            }
        }

    But what if you have a 64-CPU computer? You could imagine detaching many threads, each one handling just one very complex function; this way you would get massively parallel computation. Wow! The solution to get maximum speed would be this:

        public override void Input0_ProcessInputRow(Input0Buffer Row) {
            Thread T = new Thread(SendAsyncRow);
            T.Start(Row.EndOfRowset());
        }
    
        public void SendAsyncRow(Object O) {
            Output0Buffer.AddRow();
            Output0Buffer.LineID = VeryComplexFunction();
            if ((bool)O == true) {
                Output0Buffer.SetEndOfRowset();
            }
        }

    You detach a separate thread for each row; the thread computes the very complex function and sends the result to the output buffer as soon as it is ready. In order to make SSIS understand when everything is finished, you signal SetEndOfRowset in the last thread.

    Don't start crying: this code will never work correctly, since SetEndOfRowset is signalled for the last row read and not for the last row produced, but it is just pseudocode to illustrate the topic, so don't blame me for its inconsistencies. :) Let's go on with the main topic.

    Under SSIS 2005 this code (in VB, of course) would work fine, as the engine detects the end of processing when it receives the EndOfRowset on the outputs. In SSIS 2008 the engine detects the end of processing as soon as the last call to ProcessInputRow has finished. So, in our case, after the last thread has been detached, SSIS believes that processing is finished and starts cleaning up the component, completely ignoring both the separate threads still running and our willingness not to terminate the component, because we know we still need some more time.

    The result? As with any other parallelism bug, on the same inputs you will sometimes get a crash, sometimes finish correctly, sometimes get an error... the nightmare of any programmer.

    If you want to try it yourself, cut and paste this code into a script transformation in a data flow that processes some rows (10-20 rows are enough):

    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
    using Microsoft.SqlServer.Dts.Runtime.Wrapper;
    
    using System.Threading;
    
    [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
    public class ScriptMain : UserComponent {
    
        bool useThreads = false;
    
        /// <summary>
        /// 
        /// Processes input rows
        /// 
        /// </summary>
        /// <param name="Row"></param>
        public override void Input0_ProcessInputRow(Input0Buffer Row) {
            if (useThreads) {
                // Async mode: detach one thread per row, passing whether this is the last row
                Thread T = new Thread(SendAsyncRow);
                T.Start(Row.EndOfRowset());
            } else {
                // Sync mode: do the work inline
                SendRow();
            }
        }
    
        /// <summary>
        /// 
        /// Sends a row in sync mode
        /// 
        /// </summary>
        private void SendRow() {
            // Thread.Sleep(100);
            Output0Buffer.AddRow();
            Output0Buffer.LineID = VeryComplexFunction();
        }
    
        /// <summary>
        /// 
        /// Sends a row in async mode
        /// 
        /// </summary>
        /// <param name="O"></param>
        public void SendAsyncRow(Object O) {
            Thread.Sleep(100);
            Output0Buffer.AddRow();
            Output0Buffer.LineID = VeryComplexFunction();
            if ((bool)O == true) {
                Output0Buffer.SetEndOfRowset();
            }
        }
    
        /// <summary>
        /// 
        /// Not very complex... but it's a demo!
        /// 
        /// </summary>
        /// <returns></returns>
        private int VeryComplexFunction() {
            return 10;
        }
    }

    Using serial mode (useThreads set to false) everything works fine. Using parallel mode (useThreads set to true) SSIS will crash and produce no output at all. The bug in TableDifference was very similar: I needed to wait for the termination of the consumer thread inside the ProcessInput function.

    It is not the first time I have discovered inconsistencies in the way SSIS handles parallelism: in order to make it easier to write code (in 2008 mode there is no need to call SetEndOfRowset, and your program will run fine if you forget it), the SSIS architects introduced a behavior that makes better, correct code crash. If you believe this is a problem, vote here; anyway... be aware of it, it might be useful in the future. :)

  • Reading zip files with SSIS

    Playing with SSIS, I normally have to read flat files that come in zip format. The standard procedure here is to launch WinZip, decompress the file, read it with a flat file adapter and then delete the uncompressed file. This behaviour generates a lot of useless I/O on temporary disks to decompress a file that will soon be deleted.

    Another way of solving the problem is to have a look at the DotNetZip library. A very smart programmer wrote a library that makes opening zip files from .NET very easy. In version 1.6 he also added a great function: OpenReader. You can open a zip file, look at its entries and open a stream reader on one of them.

    Using this technique you can open a stream reader directly into the zip file and then process the stream using a VB source component in your data flow package.

    And... yes, you cannot use the flat file adapter. You will need to parse the stream and read strings from it, but you will end up with very little I/O compared to the unzip / read / delete solution. As I/O is very expensive, this technique may speed up your package significantly.
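
    A rough C# sketch of the idea (mine, based on the DotNetZip documentation; names may differ slightly between versions, so treat it as a starting point rather than a finished source component):

    using System.IO;
    using Ionic.Zip;   // DotNetZip

    // Stream the entries of a zip file line by line, without ever writing
    // the uncompressed flat file to disk.
    public static void ReadZippedFlatFile(string zipPath)
    {
        using (ZipFile zip = ZipFile.Read(zipPath))
        {
            foreach (ZipEntry entry in zip)
            {
                using (Stream entryStream = entry.OpenReader())
                using (StreamReader reader = new StreamReader(entryStream))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // parse the line and add a row to the output buffer here
                    }
                }
            }
        }
    }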

  • TableDifference version 2.0 has been released

    TableDifference is an SSIS custom component designed to simplify the management of slowly changing dimensions and – in general – to check the differences between two tables or data flows with SSIS.

    Its main advantages over the standard Microsoft SCD component are speed and ease of use: the component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.

    The project is freeware, full source code is available at www.sqlbi.com.

    Note to version 1.x users

    The internal structure of TableDifference's metadata is very different from the previous one. The component is able to read metadata from the previous version and rewrite it in the new format as soon as it is opened inside BIDS, so you will be able to reload all the previous metadata without loss of functionality.

    However, when writing the new metadata, the component changes the lineage IDs of all the output columns so, when you open a package that contains a previous version of TableDifference, you will need to open its designer to check that everything worked fine and then correct the subsequent flows by double-clicking the components that use TableDifference outputs. I have converted a lot of packages without any problem but, if you encounter something strange, do not hesitate to contact me.

    Introduction

    One question that arises very often in data warehouse programming is "what are the differences between these two tables/flows of data?" This is often the case in SCD management: you receive several million customer rows and have to decide what changes need to be applied since your last successful load.

    Using SSIS we have the SCD component; it works, but it is so slow that in a production environment it is almost always convenient to create an "ad hoc" solution to handle the SCD.

    We decided to write an SSIS component that has two inputs, one for the “old” and one for the “new” data, compares all the rows that come in from the old and the new flow and sends them to different outputs, namely:

    • Unchanged rows (are the same in both inputs)
    • Deleted rows (appear in old but not in new)
    • New rows (appear in new but not in old)
    • Updated rows (appear in both flows but something is changed)

    The following picture illustrates the component functionality better than a thousand words.

    [Figure: the TableDifference component, with its two inputs and its outputs for unchanged, deleted, new and updated rows]

    As not every update has to be handled the same way, it is possible to tag each column with an integer; the number of update outputs is determined by the number of distinct tags, one output for each. In this way you can tag the historical attributes with "10", other columns with "20", and decide what kind of operation to carry out for each different kind of update.

    The inputs MUST be sorted and must have a collection of fields (keys) that lets the component decide when two rows from the inputs represent the same row. This is easily accomplished by SQL Server with a simple ORDER BY and a convenient index; moreover, an SCD normally maintains an index on the business key, so the sorting requirement is easily satisfied and does not represent a problem.
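
    For instance, a hypothetical "old" input for a customer SCD could simply be the following query (the table and column names are just examples), sorted on the business key:

    -- Hypothetical "old" input: current rows of the SCD, sorted on the business key
    SELECT CustomerCode, AccountNumber, AddressLine1
    FROM Dim_Customers
    ORDER BY CustomerCode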

    The main structure

    The structure of the component is pretty simple:

    • Old Flow Input: it should be attached to a convenient query that returns all the current rows from the SCD
    • New Flow Input: it can be connected to the flow where the new structure of the SCD has been computed.

    Each input has a buffer where all the incoming rows are directed and maintained.

    A separate thread starts as soon as data from both buffers is available and checks for the differences between them, sending the rows to the appropriate output and then removing the buffered rows from the buffers.

    The buffers do not need to hold all the data from the SCD; data is removed as soon as it can be compared with a corresponding row. The buffer, even for a table of several million records, should stay pretty small (but read the document to the end… some tricks may be useful to avoid excessive memory consumption).

    The outputs

    The component has three standard outputs for new, deleted and unchanged rows. The number of updated-rows outputs is determined by the user via the component editor: one output is provided for each different update ID entered by the user. These outputs are named "UpdateID" followed by the update ID defined by the user.

    Of course, you can easily change the name of the output to something more interesting like “Updated historical attribute” and we encourage you to do so!

    The collection of output fields is computed by the component as the intersection of the two inputs: if a column appears in only one input it is not managed; if it appears in both, it will be compared and output.

    Installing TableDifference

    Installation is very simple:

    • copy TableDifference.DLL into "<Program Files>\Microsoft SQL Server\90\DTS\PipelineComponents"

    • add the DLL to the GAC using "GACUTIL -i TableDifference.DLL"

    No installer for the moment… sorry.

    Note for Vista users: you need to run these commands from a command prompt opened with administrator privileges to perform the tasks.

    Using TableDifference

    After the component is installed, you should add it to the toolbox as with every SSIS component, and then you are ready to use it.

    After both inputs have been attached, double-clicking the component opens the component editor, which shows a single window like this:

    [Figure: the TableDifference component editor]

    A few words about what is happening just before the component editor pops up:

    1. TableDifference analyzes all the columns in both inputs and compares their names. If the name of two columns and their corresponding types are identical, TableDifference adds them to the available columns to manage.
    2. If the flows are sorted, their sort columns will be marked as key fields, using the same order in which they appear in the sort.
    3. All other columns are assigned a standard Update ID of 10 and are managed as comparable columns.

    Using the component editor you need to provide this information for the columns:

    • Check Option: you can choose the column type between:
      • Key field: these columns will be used to detect when two rows from the inputs represent the same row. Beware that the inputs must be sorted by these columns
      • Compare: these columns will be compared one by one to detect differences
      • Prefer NEW: these columns will be copied from the NEW input directly into the output, no check
      • Prefer OLD: these columns will be copied from the OLD input directly into the output, no check
    • KeyOrder: if a column is of type "Key Field", this is the position of the field in the ORDER BY clause of your query. Beware that the component does not check for the correct sorting sequence; it is up to you to provide correct information.
    • Update ID: each different update ID creates a different output flow. If you need to detect when a change appears in specific columns, you can use different update IDs. Beware that the lowest update ID wins, i.e. if AccountNumber has an update ID of 10 and AddressLine1 has an update ID of 20, then AccountNumber will be checked first and, if a change is detected, the row will go to update output 10, regardless of whether AddressLine1 also differs.

    Clicking "OK" is enough for the component to generate the outputs and define their metadata.

    In version 2.0 there are three new panels:

    Outputs panel

    In this panel you can choose which output to enable. If you are not interested, for example, in the unchanged output, then you can deselect it from this panel to avoid warnings for unused columns.

    [Figure: the Outputs panel]

    You can also rename outputs and give them a concise description. Renaming is very useful as it makes the data flow clearer.

    Output Details

    This panel lets you select the columns for each output. You cannot add columns, but you can disable columns for outputs that do not need them.

    [Figure: the Output Details panel, with only the customer key selected for the deleted output]

    The upper combo box lets you choose an output; in the grid you can select or deselect any column. This feature is useful as it avoids warnings for unused columns. In the picture, the deleted output will receive only the customer key, as it makes no use of any other information.

    Always remember that removing useless output columns improves the performance of the component.

    Misc Options

    This panel contains miscellaneous options.

    [Figure: the Misc Options panel]

    Here you can define, for string comparisons:

    1. The culture ID to use to perform the comparison. If not specified TableDifference will use the culture ID of the running task. The default is “empty”.
    2. Whether to ignore case during string comparisons. The default is unchecked, so TableDifference performs case-sensitive comparisons.

    Warnings

    This panel lists any unused columns from the inputs. As you might recall, if two columns do not match in name and type, TableDifference will ignore them. This might be an error, but the decision is up to you. By checking the Warnings panel you can verify that TableDifference is working with all the columns you need it to compare.

    FlowSync

    The component works fine and does not consume memory only if the input from both flows comes in at a synchronized speed: if one input is much faster than the other, the component will start buffering data and consuming memory, which will be freed only when the buffer starts to shrink.

    Using the component to compare two tables with 5 million records, we had several memory problems, because data came in from one input much faster than from the other and, once the difference (and hence the buffer on one input) reached 1.5 million records, the whole memory of the DTExec process (in a 32-bit virtual address space) was filled.

    As both inputs came from a simple "SELECT * FROM" and the speed of both sources was the same, we discovered that – for some obscure reason – SSIS prefers one input over the other and does not give enough time to both, resulting in memory consumption by the component.

    FlowSync is a component that makes two or more flows of data in an SSIS data flow run at the same speed, by pausing one flow if the others run too slowly. It was created as a convenient companion to TableDifference to resolve the memory problems, but it can be used on its own whenever you want flows to run at the same speed.

    It uses semaphores to handle synchronization, so no CPU is ever wasted: when the faster flow is paused, all the CPU is free for other (more useful) work.

    You can find the source code, executable and description of FlowSync at www.sqlbi.com. We normally use FlowSync when we need to compare more than half a million rows, while we use TableDifference without FlowSync for all the smaller tables in a project.

  • Table Difference 2.0 - Call for beta

    The new version 2.0 of the SSIS SCD handling component TableDifference is in beta; I am looking for beta testers and some help in writing an installer for it. You will find all the information at www.sqlbi.eu.

    The final release of the component will be, as usual, freely available with sources as soon as it has been tested enough.

  • SSIS Multicast and Trash or Separate and Union?

    Today I was optimizing a package that handles a very common scenario: process some rows and send all of them to one flow but, for some of them, perform additional processing.

    A simple Multicast and a Conditional Split solve the problem; the picture explains it better than my words:

    [Figure: the Multicast solution, where the Conditional Split trashes the rows that do not need the extra processing]

    Now, I was wondering why we should duplicate one million rows into two million and then trash half of them. I thought a better solution would be to split the rows first and duplicate only the half million that need the extra processing, bringing them together again with the other half million through a Union All. So I tried this solution, which does exactly the same thing but does not trash anything. We are no longer in the age of consumerism and we do not like to trash anything, do we?

    [Figure: the Conditional Split and Union All version of the package]

    Well, even if I thought it would be a better solution, it is not. The first one (with one million rows of 1 KB each) takes approximately 13 seconds to run; the second one, with the same data, takes 16 seconds, that is, 23% slower than before.

    So, the conclusion is that SSIS still lives in the age of consumerism: the Union All takes a long time to process and makes trashing the better solution. There are good explanations for this (Union All does a lot more work than Multicast does), but I did not expect a 23% degradation.

    The lesson is: it is better to trash than to try to recycle, at least with data in SSIS. :)

  • Visio and SQL Extended Properties: a tool to marry them

    Visio database models have column descriptions, and SQL Server has extended properties in which to store them but, even though the two products share the same producer, it seems that Visio is unable to save column descriptions as extended properties, so that other tools could gather the same information easily and produce decent documentation of a database.

    After some browsing on the web, I was unable to find a tool that lets Visio save table and column descriptions in SQL Server, so I decided it was time to code one myself.

    First you need to generate the DDL script of the database with DDL script comments turned ON for all the columns (this can be configured in the driver options), as in the following picture:

    [Figure: the Visio driver options, with DDL script comments enabled for all columns]

    In the DDL file, Visio adds a header of comments to each table, like this:

    /* Create new table "Sales.Dim_Customers".                                         */
    /* "Sales.Dim_Customers" : Contains all the customer with various attributes used  */
    /* to analyze them. Does not contain (bla bla bla)                                 */
    /* "ID_Customer" : Primary key of the table                                        */
    /* "CustomerCode" : Customer code as seen in the OLTP database                     */
    /* "ID_Geography" : Key in Dim_Geography. It is not shown to the end user          */
    /* "MaritalStatus" : Description of the marital status (Single / Married)          */
    /* "Gender" : Description of the gender (Male, Female)                             */
    ... ... ...
    create table "Sales.Dim_Customers" ( 
    "ID_Customer" int identity not null,
    "CustomerCode" nvarchar(15) not null,
    "ID_Geography" int not null,
    "MaritalStatus" nvarchar(20) not null,
    "Gender" nvarchar(20) not null)

     

    I have written a simple parser (GenerateExtendedAttrib) that looks for comments in this form in a DDL file (it starts from "Create new table", then checks for column names and so on) and finally generates the sp_addextendedproperty calls that define all the table and column descriptions.

    The code fragment above generates this:

    EXEC sys.sp_addextendedproperty 
        @name=N'MS_Description', 
        @value=N'Contains all the customer with various attributes used to analyze...' , 
        @level0type=N'SCHEMA',
        @level0name=N'Sales', 
        @level1type=N'TABLE',
        @level1name=N'Dim_Customers';
    EXEC sys.sp_addextendedproperty 
        @name=N'MS_Description', 
        @value=N'Primary key of the table' , 
        @level0type=N'SCHEMA',
        @level0name=N'Sales', 
        @level1type=N'TABLE',
        @level1name=N'Dim_Customers', 
        @level2type=N'COLUMN',
        @level2name=N'ID_Customer';

    You can call the utility with "/Append" to have the property definitions appended to the original file (which, in turn, can be run to create the database), or you can call it without parameters; in this case the code is written to the console and you can redirect it wherever you want.

    The tool can be downloaded at www.sqlbi.eu and is provided with sources (it is indeed a very simple yet effective gadget) so you can adapt it to whatever your needs are.

    Have fun and, if you make any change to the code that might be useful, send me a note so I can update the public release.

  • SQL: Merge two "history tracking" tables

    If you have two columns in a table that "remember" their old values in two different historical tables, as is the case in AdventureWorks with EmployeePayHistory and EmployeeDepartmentHistory, you may need to merge (or just wonder how to merge) these two tables into a single historical table that keeps track of both columns.

    This has been my "problem of the day" and, after some thought, I came up with a solution pattern that I'd like to share with you, both to see if you have a better solution and to have the code at hand when I need it later. :)

    Before using the real tables, I made up a simple test case with a Products table with only two fields and a couple of variation tables (VariationA and VariationB):

    --
    --  Sample table Products, only two columns A and B. The current value of the row is (D, 4)
    --
    CREATE TABLE Products (
        A CHAR (1),
        B INT
    )
    INSERT INTO products (A, B) VALUES ('D', 4)
    
    --
    --  Variations for column A
    --
    --      ------------------
    --      Year    Old Value
    --      ------------------
    --      2001        A
    --      2002        B
    --      2003        C
    --      2005        D
    --
    CREATE TABLE VariationA (   
        Position INT,
        OldValue VARCHAR (5),
        ChangeDate DATETIME )
    
    INSERT VariationA (Position, OldValue, ChangeDate) VALUES (1, 'A', '01/01/2001')
    INSERT VariationA (Position, OldValue, ChangeDate) VALUES (2, 'B', '01/01/2002')
    INSERT VariationA (Position, OldValue, ChangeDate) VALUES (3, 'C', '01/01/2003')
    INSERT VariationA (Position, OldValue, ChangeDate) VALUES (4, 'D', '01/01/2005')
    
    --
    --  Variations for column B
    --
    --      ------------------
    --      Year    Old Value
    --      ------------------
    --      2000        1
    --      2003        2
    --      2004        3
    --
    CREATE TABLE VariationB (   
        Position INT,
        OldValue INT,
        ChangeDate DATETIME )
    
    INSERT VariationB (Position, OldValue, ChangeDate) VALUES (1, 1, '01/01/2000')
    INSERT VariationB (Position, OldValue, ChangeDate) VALUES (2, 2, '01/01/2003')
    INSERT VariationB (Position, OldValue, ChangeDate) VALUES (3, 3, '01/01/2004')
    

     

    The query that merges all these variations into a single table is pretty easy, but it took me some time to discover. My solution is to detect what the value of "B" was when a variation in "A" happened, and vice versa. As the stored values are "old" values, we know that the value of A at a certain date is the value stored in the first variation of A AFTER that date. The only special case is the current time: if no variation record is found, then the value to use is the current value of the record in the Products table. The same is obviously true for B.

    I ended up with this pattern query:

    --
    --  AllEvents contains all the events from both VariationA and VariationB,
    --  and will detect, for each variation that happened, what was
    --  the value of the other column at that time, building in this way the
    --  merged variation list.
    --  The final SELECT will return the sorted and DISTINCTed result. 
    --  The final COALESCE is needed because, if a value is NULL, it
    --  means that it should contain the "current" value of the column
    --
    WITH AllEvents AS (
        SELECT 
            OldValueOfA  = Events.OldValueOfA,    -- Old value of A
            OldValueOfB  = Events.OldValueOfB,    -- Old value of B
            ChangeDate   = Events.ChangeDate      -- Date of change
        FROM 
                (SELECT 
                    OldValueOfA = OldValue, 
                    OldValueOfB = (SELECT TOP 1 OldValue 
                                     FROM VariationB V
                                    WHERE V.ChangeDate >= VariationA.ChangeDate
                                    ORDER BY ChangeDate), 
                    ChangeDate 
                FROM VariationA 
            UNION ALL 
                SELECT 
                    OldValueOfA = (SELECT TOP 1 OldValue 
                                     FROM VariationA V
                                    WHERE V.ChangeDate >= VariationB.ChangeDate
                                    ORDER BY ChangeDate), 
                    OldValueOfB = OldValue, 
                    ChangeDate 
                FROM VariationB
            ) Events
        )
    SELECT DISTINCT
        ChangeDate   = ChangeDate,
        OldValueOfA  = COALESCE (OldValueOfA, (SELECT A FROM Products)),
        OldValueOfB  = COALESCE (OldValueOfB, (SELECT B FROM Products))
    FROM
        AllEvents
    ORDER By ChangeDate

     

    Clearly, I could have written some VB code in SSIS to solve the same problem. Nevertheless, having a pattern query available is useful because it can easily be adapted to any real-world situation with only SQL Server Management Studio at hand. The following code is the implementation of the pattern for AdventureWorks, merging the department and pay history of employees:

    WITH AllEvents AS (
        SELECT 
            EmployeeID      = Events.EmployeeID,
            OldValueOfRate  = Events.OldValueOfRate,  
            OldValueOfDep   = Events.OldValueOfDep,   
            ChangeDate      = Events.ChangeDate     
        FROM 
                (SELECT 
                    EmployeeID      = EmployeeID,
                    OldValueOfRate  = Rate, 
                    OldValueOfDep   = (SELECT TOP 1 DepartmentID
                                         FROM HumanResources.EmployeeDepartmentHistory V
                                        WHERE V.StartDate >= PayHistory.RateChangeDate
                                          AND V.EmployeeID = PayHistory.EmployeeID
                                        ORDER BY V.StartDate), 
                    ChangeDate      = RateChangeDate 
                FROM HumanResources.EmployeePayHistory PayHistory
            UNION ALL 
                SELECT 
                    EmployeeID      = EmployeeID,
                    OldValueOfRate  = (SELECT TOP 1 Rate
                                         FROM HumanResources.EmployeePayHistory V
                                        WHERE V.RateChangeDate >= DepHistory.StartDate
                                          AND V.EmployeeID = DepHistory.EmployeeID
                                        ORDER BY V.RateChangeDate), 
                    OldValueOfDep   = DepHistory.DepartmentID, 
                    ChangeDate      = StartDate
                FROM HumanResources.EmployeeDepartmentHistory DepHistory
            ) Events
        )
    SELECT DISTINCT
        EmployeeID     = EmployeeID,
        ChangeDate     = ChangeDate,
        OldValueOfDep  = COALESCE (
                             OldValueOfDep,
                             (SELECT DepartmentID 
                               FROM HumanResources.EmployeeDepartmentHistory DepHistory
                              WHERE EndDate IS NULL 
                                AND DepHistory.EmployeeID = AllEvents.EmployeeID)),
        OldValueOfRate = COALESCE (
                             OldValueOfRate,
                             (SELECT TOP 1 Rate
                               FROM HumanResources.EmployeePayHistory PayHistory
                              WHERE PayHistory.EmployeeID = AllEvents.EmployeeID
                              ORDER BY PayHistory.RateChangeDate DESC))
                        
    FROM
        AllEvents
    ORDER By EmployeeID, ChangeDate
    

     

    The code can easily be checked by filtering on a single EmployeeID (4 is a good candidate for this check). Please note that - in this case - the "current value" is kept in the history tables, so the final COALESCE is a bit more intricate, but still pretty easy both to write and to understand.

    If you have a better pattern for this kind of situation, or any comments on it... I'll be glad to read them.

  • Dimensional modeling with Ranged Dimensions

    A ranged dimension is a dimension that is used to have a discrete view of a continuous measure.

    A good example is the analysis of the amount sold per order. In AdventureWorks we have, for each line of an order, the amount and quantity sold. We would like to divide orders into three groups (HIGH, MEDIUM, LOW) based on the order total. This is a very frequent kind of analysis, used to determine how much money customers are willing to spend in a single order, and it may be very useful to check whether there is any relationship between the characteristics of a customer and his/her average spending capability.

    In SQL it is very easy to get this analysis:

    WITH
        OrderTotals AS (
            SELECT 
                SalesOrderId = SalesOrderId, 
                OrderValue   = SUM (OrderQty * UnitPrice)
            FROM 
                Sales.SalesOrderDetail
            GROUP BY
                SalesOrderId),
        OrdersRanged AS (
            SELECT
                SalesOrderId,
                OrderValue,
                OrderRange = CASE
                    WHEN OrderValue <= 1000  THEN 'LOW'
                    WHEN OrderValue <= 10000 THEN 'MEDIUM'
                    ELSE                          'HIGH'
                END
            FROM
                OrderTotals)
    SELECT
        OrderRange  = OrderRange, 
        OrderNumber = COUNT (*),
        OrderValue  = SUM (OrderValue)
    FROM
        OrdersRanged
    GROUP BY
        OrderRange
    

     

    The query, executed on AdventureWorks, leads to this result:

    OrderRange   OrderNumber   OrderValue
    HIGH               1,827   75,745,964.3559
    MEDIUM            12,574   31,075,706.5634
    LOW               17,064    3,552,218.3941

    This SQL query is correct, but what we really want is to create a new OrderRange dimension in our BI solution, to let the user further analyze the characteristics of the orders. The granularity of this attribute is at the order level, so we have to add an ID_OrderRange column to the orders fact table.

    If we define the Dim_OrderRange dimension, we will get something like this:

    • ID_OrderRange: primary key, usually INT IDENTITY
    • MinimumValue: left (lower) boundary of the range
    • MaximumValue: right (upper) boundary of the range
    • OrderRange: textual description of the range

    At the end of the processing, our relational database will look like this:

    [Figure: the relational model, with the fact table joined to Dim_OrderRange through ID_OrderRange]

    This solution is good when we have a very clear definition of what the ranges are during design time. This often does not happen in the real world. More likely, we will let the user define ranges through a table in the configuration database and we will define several attributes and hierarchies in the dimension to analyze how orders are ranged. All these attributes and hierarchies should be derived from the configuration database and processed during the ETL phase.

    Ranges are dimensions with a high rate of change over time and, normally, they can be defined only after the data is available because, before then, users do not have a clear understanding of what kind of ranges will be useful.

    This leads to a very annoying problem, i.e. the ranges can be determined only at cube processing time and not at ETL time. In other words, we are not able to compute once and for all the ID_OrderRange key used to join the dimension to the fact table, because the ranges change over time as the users change their minds. Please note that in this case it is the dimension that changes its meaning, not the facts.

    Computing the ranges at cube processing time would lead us to define queries for the fact table that are complicated and needlessly consume processor time. So the challenge is to avoid this situation by changing the dimensional model.

    The solution that we recommend is to define ranged dimensions with a different surrogate key. Even if we are not able to define the exact ranges, we can normally define the maximum granularity that will be used to define ranges. In the example the granularity can be 1,000, but it can be more cleverly set to 100 in order to gain some flexibility at the lowest level of the range. Supposing that 100,000 is the maximum value of an order (at least from the DSS analyst point of view), we will have a maximum range of values that starts at 100 and ends at 100,000 with a step of 100, leading to a range of only 1,000 different values. This will be the Dim OrderRange dimension.

    Once we have defined the granularity, we can redefine Dim_OrderRange, assigning to ID_OrderRange a simple INTEGER value that represents FLOOR (OrderValue / 100). The same value, computed for each row of the orders fact table, will be the ID used to join the fact table to the dimension.
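
    A minimal T-SQL sketch of the idea (mine, reusing the AdventureWorks query shown earlier, with the hypothetical granularity of 100; range boundaries are approximated to the bucket edges):

    -- Fact side: the surrogate key is computed directly from the order total,
    -- so it never changes when the users redefine the ranges
    WITH OrderTotals AS (
        SELECT
            SalesOrderId = SalesOrderId,
            OrderValue   = SUM (OrderQty * UnitPrice)
        FROM
            Sales.SalesOrderDetail
        GROUP BY
            SalesOrderId)
    SELECT
        SalesOrderId,
        OrderValue,
        ID_OrderRange = CAST (FLOOR (OrderValue / 100) AS INT)
    FROM
        OrderTotals

    -- Dimension side: one row per bucket up to the maximum order value (100,000),
    -- with the user-facing attributes derived from the bucket boundaries;
    -- regenerating this table is all it takes to change the ranges
    SELECT
        ID_OrderRange = Number,
        MinimumValue  = Number * 100,
        MaximumValue  = (Number + 1) * 100,
        OrderRange    = CASE
            WHEN (Number + 1) * 100 <= 1000  THEN 'LOW'
            WHEN (Number + 1) * 100 <= 10000 THEN 'MEDIUM'
            ELSE                                  'HIGH'
        END
    FROM
        (SELECT TOP (1000) Number = ROW_NUMBER () OVER (ORDER BY (SELECT NULL)) - 1
           FROM Sales.SalesOrderDetail) AS Numbers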

    The dimension will contain all the values from 0 up to the maximum order value divided by the granularity, so it will be larger but will still contain only a few thousand rows. In our example the table grew from 3 rows to 1,000 but, from the SSAS point of view, it is still a very small dimension. Moreover, as the distinct values of the attributes are still three, aggregations will work very well, leading to optimal performance.

    The situation, with ranged dimensions, will be this:

    [Figure: the relational model using the ranged Dim_OrderRange, whose surrogate key is derived from the order value]

    Using the ranged dimension pattern, you will be able to change all the attributes of the ranged dimension without having to worry about the fact table: it remains valid unless you decide to change the granularity.

    Granularity is not subject to change because it can normally be set to a value clearly defined by the user, and there is very rarely any need to update it.

    Clearly, ranged dimensions are a variation on standard dimensions, because their surrogate key has a clear meaning while surrogate keys, in Kimball's methodology, should not have any meaning at all. Nevertheless, we believe that in specific situations (and ranged dimensions are among them) the "no meaning at all" constraint can be relaxed in order to get maximum flexibility in the final solution the customer will use.

    Another very common situation where ranged dimensions are useful is the ranging of time durations. When handling time, you can easily set the granularity to days, months, years or seconds, depending on the minimum and maximum values that the duration can take. Once that is done, you can compute the duration using the chosen granularity and define any ranged dimension on top of it.

  • How are GUIDs sorted by SQL Server?

    Today I ran into a nice problem with SQL Server's fancy way of sorting GUIDs.

    I know, there should be no reason at all to sort GUID columns but, if what you need is to run two tables side by side to check whether they are aligned (which is the task of TableDifference) and a GUID is the best key you have, then you will end up sorting GUIDs. The question now is: how do you sort GUIDs? The fact is that SQL Server and .NET give different answers. Let's have a look at it in more detail.

    I was using .NET Guids to read data from SQL Server and then using Compare to check the sort order, assuming that a uniqueidentifier in SQL is sorted the same way a Guid is sorted in .NET.

    This is not true: I discovered here that you should use SqlGuid instead. Let's not spend too many words on whether having two sort algorithms is clever or not; what is frightening about that article is that it says SqlGuid uses only the last six bytes to sort data. If it worked that way, we would end up with a sorting algorithm that does not really sort. That seemed too stupid, so I double-checked and ran into this, which says something more interesting about how GUIDs are sorted. Anyway, the algorithm is still not clear. Will I end up with a completely sorted list if I issue an ORDER BY on a uniqueidentifier column? And how will the data be sorted?
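
    As a small aside, a C# sketch of mine (not from the linked articles) shows the two orderings disagreeing on the same pair of values, which is exactly why SqlGuid must be used when the data arrives sorted by SQL Server:

    using System;
    using System.Data.SqlTypes;

    class GuidSortCheck
    {
        static void Main()
        {
            // One value differs in the first byte group, the other in the last six bytes
            Guid g1 = new Guid("01000000-0000-0000-0000-000000000000");
            Guid g2 = new Guid("00000000-0000-0000-0000-010000000000");

            // System.Guid compares the byte groups left to right, while SqlGuid
            // weighs the last six bytes most heavily (the SQL Server order),
            // so the two comparisons return opposite signs
            Console.WriteLine(g1.CompareTo(g2));
            Console.WriteLine(new SqlGuid(g1).CompareTo(new SqlGuid(g2)));
        }
    }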

    After all, the best way is always to try it out, so I ended up with this very simple and funny query (which you can use to show friends that bits are still with us, even in 2007!):

    With UIDs As (--                         0 1 2 3  4 5  6 7  8 9  A B C D E F
                Select ID =  1, UID = cast ('00000000-0000-0000-0000-010000000000' as uniqueidentifier)
        Union   Select ID =  2, UID = cast ('00000000-0000-0000-0000-000100000000' as uniqueidentifier)
        Union   Select ID =  3, UID = cast ('00000000-0000-0000-0000-000001000000' as uniqueidentifier)
        Union   Select ID =  4, UID = cast ('00000000-0000-0000-0000-000000010000' as uniqueidentifier)
        Union   Select ID =  5, UID = cast ('00000000-0000-0000-0000-000000000100' as uniqueidentifier)
        Union   Select ID =  6, UID = cast ('00000000-0000-0000-0000-000000000001' as uniqueidentifier)
        Union   Select ID =  7, UID = cast ('00000000-0000-0000-0100-000000000000' as uniqueidentifier)
        Union   Select ID =  8, UID = cast ('00000000-0000-0000-0010-000000000000' as uniqueidentifier)
        Union   Select ID =  9, UID = cast ('00000000-0000-0001-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 10, UID = cast ('00000000-0000-0100-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 11, UID = cast ('00000000-0001-0000-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 12, UID = cast ('00000000-0100-0000-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 13, UID = cast ('00000001-0000-0000-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 14, UID = cast ('00000100-0000-0000-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 15, UID = cast ('00010000-0000-0000-0000-000000000000' as uniqueidentifier)
        Union   Select ID = 16, UID = cast ('01000000-0000-0000-0000-000000000000' as uniqueidentifier)
    )
    Select * From UIDs Order By UID, ID
    

     

    The result is simple: the IDs come out completely reversed, even though they appear as the second argument of the ORDER BY. This means that the evaluation is carried out in byte groups this way:

    • 0..3 are evaluated in left to right order and are the least important, then
    • 4..5 are evaluated in left to right order, then
    • 6..7 are evaluated in left to right order, then
    • 8..9 are evaluated in right to left order, then
    • A..F are evaluated in right to left order and are the most important

    Well, I know it is not very important, but it is funny and maybe it will be useful some day. :) The really important thing is that ORDER BY on a GUID works: all the bytes are evaluated and sorted, even if in a strange way.
