
Alberto Ferrari

  • PowerPivot, Stocks Exchange and the Moving Average

    In this post I want to analyze a data model that performs some basic computations over stocks and uses one of the most commonly needed formulas there (and in many other business scenarios): a moving average over a specified period of time. Moving averages can be modeled very easily in PowerPivot, but some care is needed and, in this post, I try to provide a solid background on how to compute them.

    In general, a moving average over the period T1..T2 is the average of all the values between T1 and T2. Sounds easy, so why do I feel the need to write a post about it? Because in stock fixings, as in sales and many other scenarios, the moving average cannot be computed that easily. If the period is, say, 10 days, then it can contain one or more holidays. During holidays there is no data; this does not mean that the value does not exist during holidays, it means that holidays should not be used to compute the average. Moreover, different periods contain a different number of vacation days, which makes things a little bit more complicated.

    The business scenario is clear, so let us describe the data model. We basically have one table, called Fixing, which contains the value of stocks for the last 10 years:

    image

    There are a lot of columns here, since this is the data structure provided by my trading system, but for the sake of this example we are interested in only three of them:

    • Stock: the name of the stock, e.g. Microsoft.
    • Date: the date of the value.
    • Close: the final price of the stock on that date.

    As a very rough rule of thumb, if the price of the stock is higher than its moving average over a period of time, then the price will increase further; if it is below the moving average, it will decrease further. Normally two moving averages are used in conjunction, and the points where the stock changes direction are detected by the intersection of the two moving averages.

    Thus, we want to use PowerPivot to produce a chart like the following one, which represents the last three years of the Microsoft stock. When the red line crosses the green one going down, it is time to sell; when it crosses going up, it is time to buy. If you use this rule to trade… don’t blame me for any loss: I am only trying to explain why moving averages are pretty useful in the stock market.

    image

    Now, let us get back to business: how do we compute the moving average? In a previous post, I spoke about how to compute the difference in working days between two dates. This scenario is very similar, since the moving average over 50 days is, in fact, the average over the last 50 “working” days of the trading market, i.e. the dates for which the value of the stock exists in our table. Thus, the problem shifts to “how do I compute the date that is 50 working days before the current one?”.

    The answer is pretty easy: first of all we need to assign to each date in the table an incremental number, counting the number of dates before the current one. Clearly, since we have many different stocks in the table, we must count only the dates for which the price of the current stock exists. Let us add a calculated column to the Fixing table with this formula:

    DayNumber = COUNTROWS (
        FILTER (
            Fixing,
            Fixing[Date] <= EARLIER (Fixing[Date])
         && Fixing[Stock] = EARLIER (Fixing[Stock])
        )
    )

    Now each row in the Fixing table has a number that uniquely identifies each date for a stock. Let us add a new calculated column to hold the date that represents “50 working days before now”:

    FirstDateOfRange50 = CALCULATE (
        VALUES (Fixing[Date]),
        FILTER (
            Fixing,
            Fixing[DayNumber] = EARLIER (Fixing[DayNumber]) - 50
         && Fixing[Stock] = EARLIER (Fixing[Stock])
        )
    )

    Basically, we use CALCULATE to get the VALUES of the Date column in a filter context that shows only the one row whose DayNumber equals the current day number minus 50. Now that we have the first date of the range, knowing that the last date of the range is always the current date, we can compute the moving average over 50 days with this formula:

    =CALCULATE (
        AVERAGE (Fixing[Close]),
        FILTER (
            Fixing,
            Fixing[Date] >= EARLIER (Fixing[FirstDateOfRange50])
         && Fixing[Date] <= EARLIER (Fixing[Date])
         && Fixing[Stock] = EARLIER (Fixing[Stock])
        )
    )

    The key here is the FILTER expression, which creates a filter context containing exactly the 50 rows of the Fixing table that are needed to compute our moving average. Repeating the same procedure for the 200-day average, you end up with this table inside PowerPivot:

    image
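
    For reference, the 200-day columns follow exactly the same pattern. Here is a sketch; the column names FirstDateOfRange200 and MovingAverage200 are mine, since the post does not show them:

    FirstDateOfRange200 = CALCULATE (
        VALUES (Fixing[Date]),
        FILTER (
            Fixing,
            Fixing[DayNumber] = EARLIER (Fixing[DayNumber]) - 200
         && Fixing[Stock] = EARLIER (Fixing[Stock])
        )
    )

    MovingAverage200 = CALCULATE (
        AVERAGE (Fixing[Close]),
        FILTER (
            Fixing,
            Fixing[Date] >= EARLIER (Fixing[FirstDateOfRange200])
         && Fixing[Date] <= EARLIER (Fixing[Date])
         && Fixing[Stock] = EARLIER (Fixing[Stock])
        )
    )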

    Et voilà, the hard part of the work is done. All that remains is to add a PivotChart, as in the following figure, where we have selected Apple instead of Microsoft:

    image

    As you can see from the figure, I added a Calendar table to the data model, with the necessary relationship, to be able to add the year to the chart. This time, however, I did not use the Calendar table to perform the working-days computation, since the addition of the stock name makes the model a bit more complex (a date might be “working” for one stock but not for another) and the DAX formulas would need to be more complex too.

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops that Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • Self-Service BI: BI to Excel or Excel to BI?

    I spend most of my time developing complex BI solutions and, doing that, I always talk with DBAs, developers, CEOs, data analysts and the many kinds of people that work around a BI solution. Nevertheless, in these last months, publishing posts about PowerPivot and teaching the PowerPivot course around Europe, I have been keeping in touch with the vast number of people who really use Excel to solve their daily analysis problems and, unsurprisingly, I am learning that I am very ignorant about Excel’s capabilities.

    This last post of mine received many interesting comments from Excel users and the previous one got some useful comments too. Many other posts are scheduled for the next weeks, so I think I will get more interesting feedback. Now, I have no problem admitting that I am not an Excel guru and, probably, I will never become one. That said, it is clear that a good knowledge of PowerPivot (something I am supposed to have) is not enough to create a true PowerPivot solution. A very good knowledge of Excel is necessary too, to create all the complex functions needed to automate steps in the building of intermediate tables, reports and all the fancy features of a good Excel solution.

    Thus, the question I am wondering about today is: will BI professionals need to learn all the ins and outs of Excel (i.e. BI to Excel), or are Excel gurus going to learn all the complexities of building a BI solution (i.e. Excel to BI)? The truth is probably somewhere in the middle. Nevertheless, I tend to prefer the second scenario, where Excel power users learn PowerPivot and some fundamentals of BI development. The rationale behind this opinion of mine (apart from my laziness in learning Excel) is that it will be easier for Excel gurus to learn BI development than for BI professionals to learn how to leverage all of the Excel functionalities.

    Excel power users already have a strong knowledge of their real-world data scenario and will not need to read the many books about data modeling to start playing with PowerPivot. Instead, working by trial and error, they will create simple data models that will help them create some first reports and, one step after the other, they will learn how to build a complete BI solution, at least for their specific needs. On the other hand, whenever I try to write some VBA code to automate a process in Excel, I get lost in the vastness of its features and functions and soon find myself spending more time writing code than doing data modeling, where my time is better spent. Thus, I believe that the Self-Service BI road will be full of Excel people learning the basics of BI, with very few BI professionals learning Excel. But… my opinion might be biased by the fact that I find BI easier, since I already know the topic. :)

    Nevertheless, if you are serious about introducing the PowerPivot technology in your company, keep in mind that you will need both strong Excel and strong PowerPivot skills. Excel can no longer be considered simply a “BI client”. In the PowerPivot era, Excel, with all of its programmability and functionality, is becoming more and more an active part of the BI solution.

  • Working days computation in PowerPivot

    A very frequent requirement, when performing business analysis, is to count the number of working days in a certain period of time. Creating a data model with PowerPivot that gracefully handles working days is pretty easy; nevertheless, it is worth a post.

    As you might already know, PowerPivot needs a calendar table to compute most of its time intelligence functions and this table needs to satisfy some important conditions:

    • There should be no holes in the calendar, i.e. all the dates of a year should be present.
    • The key of the table should be of DATETIME type.

    Building this table is really straightforward using Excel: start filling a column with dates (use alt-dragging to fill it very quickly) and then add some very simple formulas to the worksheet, to obtain a table like the following one. Clearly, the number of dates that should be included in the table depends on your data; just remember that the table should always start on the 1st of January and end on the 31st of December, with all the necessary years included. Leaving a hole means getting wrong data later.

    image
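
    For reference, the helper columns can be computed with Excel formulas along these lines. This is only a sketch: the exact column set is the one shown in the figure, and the WEEKDAY convention with return type 2 (Monday = 1 … Sunday = 7, so that Saturday = 6 and Sunday = 7) is an assumption of mine that matches the WorkingDays formula shown later:

    • Year =YEAR ([@Date])
    • MonthNumber =MONTH ([@Date])
    • MonthName =TEXT ([@Date], "mmmm")
    • WeekDay =WEEKDAY ([@Date], 2)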

    Before loading the table inside PowerPivot, it is necessary to add some info about working days and vacations, due to the title of this post. :) To do this, let us create a new Excel table containing the holidays, with a description of the non-working reason, like the following one:

    image

    You can add as many vacation days as you need to this table (and, if you wonder whether I consider holding a PowerPivot Workshop vacation… well, I surely enjoy it: it is very close to vacation!). Then, using the Excel VLOOKUP function, we can bring the information about vacations directly inside the Calendar table, which will now contain some new columns:

    image

    The interesting formulas are:

    • WorkingDays =IF (OR ([@WeekDay] = 6, [@WeekDay] = 7), 0, IF (ISNA (VLOOKUP ([@Date], HolidaysTable[#Data], 2, FALSE)), 1, 0))
    • Holiday =IF([@WorkingDays]=0,"Holiday","Working Day")
    • HolidayReason =IF (ISNA (VLOOKUP ([@Date], HolidaysTable[#Data], 2, FALSE)), [@Holiday], VLOOKUP ([@Date], HolidaysTable[#Data], 2, FALSE))

    Here we state that a day is a working day if it is not a Saturday or a Sunday and it does not appear in the holidays table. The other formulas are very simple and define a string representation of the holiday and of the holiday reason, as you can see from the figure. Please note that the WorkingDays column is an integer, holding 0 or 1. Moreover, I decided to name it plural instead of singular… this is not a mistake, it will make sense in the next figure.

    If you now create a PivotTable using only the new Calendar table and put Year on columns, MonthName on rows and the WorkingDays column as a value, you get this first nice result, which shows the number of working days for each month. Because the value is computed by simply summing the column, it works even with different period selections, with or without holes.

    image
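
    Behind the scenes, the PivotTable is just summing the column; the equivalent explicit measure is a one-liner (the measure name NumOfWorkingDays is my choice, not shown in the post):

    NumOfWorkingDays = SUM (Calendar[WorkingDays])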

    I have already loaded into PowerPivot some tables coming from the AdventureWorks database; in particular, I have the SalesOrderHeader table, which contains the OrderDate. Thus, I can create a relationship based on the order date, as you can see in the following figure:

    image

    Having done this, you can use the WorkingDays column to compute, for example, the average amount sold per working day, using this formula:

    AmountPerWorkingDay = SUM (SalesOrderDetail[LineTotal]) / SUM (Calendar[WorkingDays])

    Which provides an interesting report:

    image

    Well… this report looks interesting, yet it is not working as we would expect, since the average for 2001 is not 44,092.64. The issue here is that the sum of working days in 2001 is 257, but not all the months contain values for sales, yet their working days are counted in the denominator. If we really strive for perfection (which we do), then it is necessary to leverage some basic DAX knowledge and change the formula of AmountPerWorkingDay to filter out the months where there are no sales. A more correct formula is the following:

    =  SUM (SalesOrderDetail[LineTotal]) 
        / 
        SUMX (VALUES (Calendar[MonthNumber]),
            IF (
                CALCULATE (SUM (SalesOrderDetail[LineTotal])) = 0,
                0,
                CALCULATE (SUM (Calendar[WorkingDays]))
            )
        )

    The complete explanation is a bit long; here we basically iterate over months and, if a month does not contain values for sales, we remove it from the count of working days. Take some time to study it, then move on. :) The result is:

    image

    Which now shows the correct result, and a nice trend in the values, which grow year over year. The AdventureWorks business is going in the right direction!

    Nevertheless, as has been correctly noted by @Reuvain in the comments, the Grand Total column is still incorrect, since we have solved the issue with months but not with years. The value of zero for January 2001 is counted when computing the average of the January Grand Total. To solve this issue we need to iterate over years too, as in this formula (please note the outer and inner loops, over years and months):

    =  SUM (SalesOrderDetail[LineTotal]) 
        / 
        SUMX (VALUES (Calendar[Year]),
            SUMX (VALUES (Calendar[MonthNumber]),
                IF (
                    CALCULATE (SUM (SalesOrderDetail[LineTotal])) = 0,
                    0,
                    CALCULATE (SUM (Calendar[WorkingDays]))
                )
            )
        )

    Using this new formula, the values are correct in the Grand Total column too:

    image

    Now that we can decide whether a day is a working day or not, there is another very interesting computation that I would like to show: computing the difference, in working days, between two dates. Computing the plain delta between two dates is a matter of a simple subtraction in PowerPivot, since dates are stored internally as floating point numbers. But how many working days exist between two dates is a different problem, which requires some DAX acrobatics.

    Before moving on, note that we need at least two dates to compute a delta. Luckily, the SalesOrderHeader table contains both the OrderDate and the ShipDate, so it might be interesting to calculate the difference, in working days, between these two dates. Now, time to think about the algorithm.
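
    As a side note, the plain calendar-day delta mentioned above is just a subtraction in a calculated column; a minimal sketch, where the column name DeltaDays is mine and you might want to wrap the expression in INT () to force an integer result:

    DeltaDays = SalesOrderHeader[ShipDate] - SalesOrderHeader[OrderDate]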

    The main idea is to assign to each date an incremental number that increases by 1 only on working days and remains the same over holidays. It is much easier to understand by looking at the next figure, at the value of WorkingDayNumber in the box.

    image

    You can see that the WorkingDayNumber column contains the value 3 from January 4 to January 6, since the last two days are vacation days. If we had such a value, then to compute the delta in working days between two dates it would be enough to perform a simple subtraction between the two WorkingDayNumbers. Now, the issue is how to compute such a value.

    It is not easy if you are not familiar with the way PowerPivot formulas need to be expressed but, believe me, after some time with PowerPivot the formula will come to your mind in a snap. The idea is to define a calculated column that counts the number of dates before the current one, filtering only working days. Or, said in other words: sum 1 for working days and 0 for holidays, over all the dates that come before the current one. Guess what? We already have a column that contains 0 for vacations and 1 for working days: it is the WorkingDays column, so we just need to sum it up for all the dates before the current one.

    Here is the formula:

    WorkingDayNumber = CALCULATE (SUM(Calendar[WorkingDays]), Calendar[Date] <= EARLIER(Calendar[Date]))

    Take your time to study and understand it… it is something that requires you to have already touched the Karma of PowerPivot. Once you have digested it, it is time to bring this information inside the SalesOrderHeader table. In fact, the SalesOrderHeader table contains two dates (ShipDate and OrderDate), but we need to compute the difference between the related WorkingDayNumbers, which are not part of the SalesOrderHeader table.

    Since there is a relationship between SalesOrderHeader and the Calendar table, the WorkingDayNumber for the OrderDate can easily be retrieved using RELATED. The value for the ShipDate, on the other hand, needs a slightly more complex DAX formula, since we need to search the Calendar table for the WorkingDayNumber of the date that equals the ShipDate. Here are the formulas:

    • WorkingDayOrder = RELATED (Calendar[WorkingDayNumber])
    • WorkingDayShip = CALCULATE (VALUES (Calendar[WorkingDayNumber]), Calendar[Date] = EARLIER (SalesOrderHeader[ShipDate]))

    And now, the final result is simply computed by subtracting the WorkingDayOrder from the WorkingDayShip, which returns the number of working days that passed from the date of the order to the date of the shipment. Sounds easy? No, it is not very easy… yet it can be done and, after having studied it, you will find it easy. PowerPivot requires you to change your mind and think the DAX way, so roll up your sleeves and start studying! :)
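
    For the record, the final calculated column is then a single subtraction; a sketch, where DeltaWorkingDays is my name for it:

    DeltaWorkingDays = SalesOrderHeader[WorkingDayShip] - SalesOrderHeader[WorkingDayOrder]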

    This is just another example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops that Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • PowerPivot Workshop: new announcement and early bird expiring soon #ppws #PowerPivot

    As always, I am a bit later than Marco in delivering news. :) Nevertheless, I am very excited to announce the new date for the Frankfurt workshop on PowerPivot: February 21-22, 2011. Save the date and find all the relevant information on www.powerpivotworkshop.com, where you can also register a seat for the workshop at the early bird rate.

    Moreover, the early bird deadline for the London date is quickly approaching: it will expire on January 17. Thus, hurry up and don’t miss the opportunity to save some money and learn all the ins and outs of PowerPivot directly from us.

  • End of Early Bird for the London PowerPivot Workshop is approaching

    Hurry up! On January 17 the early bird rate for the London PowerPivot Workshop will end. You have a few days left to save £150 off the regular price. Don’t miss the opportunity to attend the best PowerPivot workshop in Europe while saving some money too. :)

  • Dynamic Ranking with Excel and PowerPivot

    Ranking is useful and, in our book, Marco and I provide a lot of information about how to perform ranking with PowerPivot. Nevertheless, there is an interesting scenario where ranking can be performed without complex DAX formulas, with just some creative Excel usage. I would like to describe it here.

    Let us start with some words about the scenario: we want to rank products based on sales in a year (e.g. 2002) and see how the top 10 of these products performed in the following or preceding years.

    Let us start by loading into PowerPivot some tables from the AdventureWorksDW database: DimProduct, FactInternetSales and DimTime. We can then use a PivotTable to see sales in 2002 (using the slicer) and, leveraging the sorting option of the PivotTable, sort by SalesAmount, putting the best performers on top.

    image

    Doing this, we have solved the first part of the scenario, i.e. we have identified the best products of 2002; yet what we want to do is analyze how they performed during subsequent years. To do so, we must first assign a ranking number and/or category to each product, and then reload this table inside PowerPivot, so that we can use the ranking category as a new slicer in a new PivotTable.

    There is no way to reload the result of a PivotTable inside PowerPivot directly. Nevertheless, by simply copying the result of the PivotTable into an Excel range, we can format that range as a table and link it into PowerPivot. In the following picture I did exactly this, using very simple Excel formulas to compute the Rank (i.e. ROW() - 4) and the RankCategory (if the rank is within the top 10, then “BEST 10”, else “OTHERS”); a sketch of the two formulas follows the figure:

    image
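
    For reference, the two helper formulas might be written like this; the cell anchor C5 and the inclusive top-10 threshold are my assumptions about the worksheet layout, not something shown in the post:

    • Rank: =ROW () - 4
    • RankCategory: =IF (C5 <= 10, "BEST 10", "OTHERS"), filled down, assuming the Rank of the same row sits in column C starting at row 5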

    Now, before we format the range as a table, we need to face a small issue: if we later decide to change the slicer to 2001, or to change the filters in any way, the number of products returned by the PivotTable will change, resulting in fewer or more rows, depending on the filter. Since we want to assign a rank to ALL of the products, we want the PivotTable to always return all the products, regardless of the filter. This can be easily accomplished using the PivotTable Options of Excel, asking it to show even rows with no data:

    image

    Now the PivotTable will always return all the products. We can complete our Excel range with the correct formulas, format it as a table and call it RankedProducts. Once done, we can use the Linked Table feature of PowerPivot and load it inside the VertiPaq engine:

    image

    The work is almost done. We still need to create a relationship between RankedProducts and DimProduct, as shown in the next figure:

    image

    Now everything is ready for our report: just create a new PivotTable, put RankCategory on the slicer, the product name on rows, the year on columns and SalesAmount on values. The resulting report satisfies our initial request: products are filtered based on their ranking in 2002, but the values shown pertain to all the years available in the database:

    image

    The interesting part of this scenario is that if we now want to change the ranking filter using a different time selection (in the following example we have ranked based on December 2001), it is enough to use the Update All button in the PowerPivot tab of the ribbon to reload the new RankedProducts table and get a completely different ranking. The very same technique can be used to create custom sets and many other interesting analyses, which is something I leave to your imagination. :)

    image

    As you can see, all of this has been done without ever writing a single line of DAX. Does this mean that DAX is not useful? Not at all: using DAX we can get much more powerful reports. Nevertheless, this is a kind of real self-service BI that can be authored by an Excel power user with some basic knowledge of PowerPivot and a creative mind.

    This is just a very simple example of what PowerPivot can do for your reporting needs. If you want to touch the real Karma of PowerPivot, don’t miss the opportunity to follow one of the workshops that Marco Russo and I are bringing all over Europe in the next few months.

    You can find more info on www.powerpivotworkshop.com. Hope to see you there!

  • PowerPivot Workshop in London #ppws

    As you might have read in Marco’s post, the PowerPivot Workshop that Marco Russo and I have prepared is now starting its European tour. We will be in London on February 7-8, 2011 and will then touch most of the European countries during the next few months.

    We delivered the workshop for the first time in Amsterdam and it was a huge success; it is now time to start the roadshow and meet people who want to learn PowerPivot all over Europe: I am really excited about this! You will find updated information and workshop dates on the www.powerpivotworkshop.com site, where you can also suggest new cities where you would like us to be in the future.

    I really hope to see all of you soon, so that together we can start the new era of Self-Service BI in the best way.

  • PowerPivot: editing measures when you reach 45

    I have always been used to small fonts but now, as I am getting older, I’d better admit that a larger font is much more relaxing.

    Editing PowerPivot measures has always been a pain, since all you have available is a small text box, and I hate to admit that I have been leveraging ZoomIt for a long time to edit measures.

    Today I ran into a great Windows feature that I did not know about: Ctrl+mouse wheel inside a textbox increases the font size of the text box. It seems to work with most textboxes in Windows, even if I use it mostly inside PowerPivot, to be able to read at a reasonable size. :)

    image

    This is the final effect… great, as years go by!

  • Visio Forward Engineer 1.1

    You know, I love Visio for developing simple databases, and I have written a Visio add-in, called Visio Forward Engineer, that performs forward engineering of data models. Now I have found some time to update the project to version 1.1. You can download the add-in from the SQLBI website here.

    The most important fix is that it can now be installed on computers without Visual Studio. Moreover, there are some features and limitations that I would like to share, just to answer the many questions I receive by e-mail about the project:

    • The tool is free and provided with full source code.
    • There is no support for calculated columns… sorry, but it cannot be added.
    • The primary key index is always created as CLUSTERED, due to some limitations in the Visio library.
    • If you use a dot in the table name, as in “accounting.customers”, the add-in will automatically create the schema accounting and add the table to that schema. I find this simple feature very useful.
    • You can now add some SQL code to the generated script, useful for the creation of views or other SQL objects. The default name of this code file is the same as the Visio file, with “SQL Code” appended at the end, as you can see in the figure.

    image

    Happy downloading. :)

  • PowerPivot and Many to Many Relationships

    If you have already read this post from Marco Russo and have understood everything, then this new post of mine will be useless. But if you are like me, i.e. you have read it and have broken your head trying to understand the final (really smart!) formula, then this new post might help you understand it better. Thus, this post does not contain anything new; it only shows graphically how to make many-to-many relationships work fast with PowerPivot.

    As an example, I am going to use a database structure that I am currently working with, i.e. a database used to perform audience analysis for a TV broadcasting company. The database structure is shown in the next figure:

    image

    The fact table says that an individual is watching a network at a specific point in time on a single date. The red box contains the many-to-many relationship (the evil one, from the performance point of view). An individual belongs to many categories (she is a woman, she is less than 30 years old and so on; each characteristic of the individual makes her belong to a category) and the user wants to browse the data using targets. A target is nothing but a boolean expression of categories (a target might be “women less than 30 years old”, thus being an AND of categories). So, the bridge table links individuals to targets through a many-to-many relationship.

    Now, let us see the same figure with an indication of the number of rows expected for a (very small, indeed) period of time:

    image

    The fact table contains a lot of rows, but it is in the range that PowerPivot can handle without any problem. After all, aggregating several million rows is not a problem for a great engine like PowerPivot. The problem is that the user will place a filter on the Targets table, and we need to propagate that filter to the Audience table, which is something that does not happen magically.

    In fact, the relationship between the bridge and the Targets table indicates that, when a filter is imposed on the Targets table, the bridge is filtered to show only the values that are in relationship with the selected targets. The other relationships are not affected by the filter, so the execution context sees all the individuals and all the audience rows. We need to write a DAX formula to make PowerPivot follow the relationships.

    My first trial, with this structure, was the naïve one: using SUMX, I iterate over the bridge table and then, for each row, I use CALCULATE to consolidate the row context in the bridge into a filter context that propagates over the relationships. Thus, the first formula looks like this:

    SUMX (
        TargetsForIndividuals,
        CALCULATE (…)
    )

    Now, this formula works, but it is not fast. Indeed, it is incredibly slow. The reason is that if we filter only one target, SUMX will iterate over (more or less) 1/140 of the bridge table, so it will perform an iteration over about 10,000 rows. For each row it will open a new row context that CALCULATE will consolidate into a filter context. Since each row in the bridge corresponds to only one row in the Individuals table, the relationship, along with the filter context, will make only one individual visible. Now, if we compute a formula over the Audience table, it will be computed for that individual only. By simple math we know that, on average, each individual has +/- 900 rows in the Audience table. Thus, the computation for each individual is very fast, but there will be 10,000 of those computations, making the full process very slow. The following figure is a graphical representation of the algorithm:

    image

    In order to make this formula fast, we need to reduce the number of iterations over the bridge table. The idea is that if we are able to filter all the individuals belonging to a target in a single step, then we can use that filter to reduce the size of the Individuals table which, in turn, will filter the Audience table. In other words, we want a single step where we filter the Individuals table, using TargetsForIndividuals as a sort of bitmap mask.

    With this algorithm in mind, it is now easy to read this formula (yep, I understand that writing it is harder, but reading it is at least affordable):
     
    CALCULATE (
        …,
        FILTER (
            VALUES (Individuals),
            CALCULATE (COUNTROWS (TargetsForIndividuals)) > 0
        )
    )

    Let us read it. We use CALCULATE; the ellipsis stands for the actual formula, which is not interesting for this post. The only thing to note is that the formula will work on the Audience table, performing some kind of calculation.

    CALCULATE creates a new filter context where we impose a filter over Individuals, using FILTER. The Individuals table is not yet constrained by the filter context created by CALCULATE, so VALUES (Individuals) inside FILTER returns all the individuals. But then the FILTER call returns only the individuals for which COUNTROWS (TargetsForIndividuals) is greater than 0. At this point, we need to remember that the bridge table (TargetsForIndividuals) shows only the rows (10,000) of the individuals who belong to the specific target, due to the filter present on the Targets table. Thus, the FILTER call will return only the individuals who belong to the target, again 10,000 rows.

    CALCULATE then creates a filter context where the Individuals table shows only those 10,000 rows and, due to the presence of the relationship, the Audience table gets filtered too, showing only the rows that belong to individuals who, in turn, belong to the specified target. Let us see this in the next figure:

    image

    Now the situation is completely different. The formula does not perform any iteration: it does a one-step filter of the Individuals table and then a single pass over the Audience table, scanning the nine million rows at once. Needless to say, this formula runs at blazing speed.

    If you are interested in this kind of game, this is one of the topics of my upcoming PASS Summit session and, needless to say, of the great PowerPivot workshop that SQLBI will bring to the Netherlands on December 1 and 2. I really hope to see you at both events! :)

  • Banding with PowerPivot

    Banding is one of the most common requirements when you need to analyze continuous values, like the retail price of a product, which may change over time due to discounts and price variations. I have already spoken about banding with regard to SSAS here; now I would like to spend some words on banding with PowerPivot. This topic is covered in much more detail in the upcoming book I have written with Marco Russo: “Microsoft PowerPivot for Excel 2010: Give Your Data Meaning”.

    To show the examples, I am using the fact reseller sales table in AdventureWorks and this banding table:

    image

    Now, let us take a look at the various options that we have available in PowerPivot to perform banding.

    The first option is the naïve one. You can add a calculated column with a long formula that computes the price band:

    = IF (
        FactResellerSales[DiscountedPrice] <= 5,
        "01 LOW", 
        IF (
            FactResellerSales[DiscountedPrice] <=30, 
            "02 MEDIUM",
            IF (
                FactResellerSales[DiscountedPrice] <=100, 
                "03 MEDIUM", 
                IF (
                    FactResellerSales[DiscountedPrice] <= 500, 
                    "04 HIGH", 
                    "05 VERY HIGH"))))

    It is quick and dirty, but it works. Clearly, even if I expect that this approach will be used by many Excel users, I don’t think this formula is worth a post. :) Moreover, this approach requires you to update the formula whenever you want to change the banding table (which is hardcoded inside the formula); thus, in search of a better, data-driven approach, I am going to show some better solutions.

    Nevertheless, even if I don’t like this formula, it is worth noting that with PowerPivot you have a quick and dirty option. If you (as I do) extensively use PowerPivot to perform fast prototyping of complex BI solutions, then you will be able to show banding to end users with truly minimal effort. Then, when the demo is finished, it is time to search for something better.

    A second approach is to rely on the PowerPivot engine to create a relationship between the FactResellerSales table and the band table. Unfortunately, you cannot create such a relationship, since the banding table, in its current format, does not have a valid key: it contains ranges, not values that can be the target of a relationship. Nevertheless, by performing a simple change in the data model, you can follow the same approach used with SSAS and expand the banding table so that it gains a structure which can be used to create the relationship.

    The new band table format might be like this:

    image

    Now Price is a valid key, and you can create a relationship between the FactResellerSales table and this table using the price. Clearly, since this table is far too long to be filled by hand, you will need a simple VBA script that starts from the previous table and creates the expanded one, where each band is repeated for each single price. The code is not complex and might look like this:

    Dim Row As ListRow
    Dim newRow As ListRow
    Dim MinValue As Integer
    Dim MaxValue As Integer
    Dim Value As Integer
    Dim PriceBandsExpanded As ListObject

    ' The expanded table is assumed to exist already, empty, on the active sheet
    Set PriceBandsExpanded = ActiveSheet.ListObjects("PriceBandsExpanded")

    For Each Row In ActiveSheet.ListObjects("PriceBands").ListRows
        MinValue = Row.Range(1, 2).Value        ' MinPrice column of the band
        MaxValue = Row.Range(1, 3).Value - 1    ' MaxPrice column, upper bound excluded
        ' Repeat the band name once for every single price in the range
        For Value = MinValue To MaxValue
            Set newRow = PriceBandsExpanded.ListRows.Add
            newRow.Range(1, 1) = Row.Range(1, 1)    ' band name
            newRow.Range(1, 2) = Value              ' price, the relationship key
        Next
    Next

    This solution works in a data-driven way, is very simple to implement and solves the banding problem. Nevertheless, it requires one extra step (the creation of the expanded table) every time you update the configuration sheet. Moreover, it requires a bit of VBA programming knowledge, which might not be in the hands of the average Excel user. Thus, it is better to find a pure PowerPivot solution to banding.

    The key point here is that we want to create a relationship based on BETWEEN, while PowerPivot only supports relationships based on identity of values. Thus, to overcome this limitation of PowerPivot, it is necessary to avoid its relationship engine and leverage the full power of DAX to create a calculated column that mimics the BETWEEN relationship.

    Let us revert to our original table and suppose that we load it as a linked table inside PowerPivot, calling it PriceBands. We do not create any relationship between FactResellerSales and PriceBands; instead, we can leverage the FILTER function to find, for each row of the sales table, the correct price band. My first trial was this:

    BandName = MAXX (
        FILTER (
            PriceBands,
            FactResellerSales[DiscountedPrice] >= PriceBands[MinPrice]
         && FactResellerSales[DiscountedPrice] < PriceBands[MaxPrice]
        ),
        PriceBands[PriceBand]
    )

    In this formula we leverage FILTER to scan the PriceBands table and find the right row. Now, since FILTER returns a table (which, we know, contains only one row), we use MAXX to transform this table into a scalar value, which should contain the price band name. As simple as this solution seems, it will not work, because the MAXX function in PowerPivot does not work with strings: it only computes the MAX of numbers or dates. If you wonder why, as I do, you’d better know that this is for “Excel compatibility”, thus it is by (wrong) design. The very creative reader might now think of a new solution: add an identity column to the PriceBands table, compute the MAX of that identity value (which is now a number, so MAXX will work) and then use it to set up a relationship with the PriceBands table. Believe me, it will work fine (I tested it) but, clearly, it is not very elegant.
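
    For reference, the inelegant workaround just described might look like the following sketch; the numeric ID column on PriceBands and the BandKey name are my assumptions, not part of the original post:

    BandKey = MAXX (
        FILTER (
            PriceBands,
            FactResellerSales[DiscountedPrice] >= PriceBands[MinPrice]
         && FactResellerSales[DiscountedPrice] < PriceBands[MaxPrice]
        ),
        PriceBands[ID]
    )

    The numeric BandKey could then be used as the key of a regular relationship towards the PriceBands table.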

    The elegant solution is to leverage the CALCULATE function. By using CALCULATE you will end up with this formula:

    BandName = CALCULATE(
        VALUES (PriceBands[PriceBand]),
        FILTER (
            PriceBands, 
            FactResellerSales[DiscountedPrice] >= PriceBands[MinPrice]
         && FactResellerSales[DiscountedPrice] < PriceBands[MaxPrice]
        )
    )

    The key is to use CALCULATE to create a new execution context where the PriceBands table is filtered so that it contains only one row, using the same filter expression used before. Then, when PowerPivot evaluates the VALUES (PriceBands[PriceBand]) expression to compute the distinct values of the PriceBand column, it performs the calculation in a context where only one row exists, so the returned value will be the right band name. If, due to some error in the configuration, the VALUES call returns more than one row, PowerPivot will raise an error; otherwise, if the resulting table contains one row only, the value is automatically converted into a scalar and the formula correctly computes the band name.

    As often happens with PowerPivot, the correct solution requires you to use and understand CALCULATE, and to have a creative mind in the search for non-trivial solutions to the problem. Nevertheless, the final result is a very compact, elegant and fast formula that works in a completely data-driven way and lets the user change the configuration sheet, refresh the data and test several bandings at a glance.

  • Final Update of the Rough Cuts of our PowerPivot book available!

    The upcoming book Microsoft PowerPivot for Excel 2010: Give Your Data Meaning, which I wrote with Marco, is in its final stage. You can see the final update through the “Rough Cuts” initiative by O’Reilly.

    The book will be available in bookstores in the first days of October.

  • Visio Forward Engineer Addin for Office 2010

    Most of my database models are written with Visio. I don’t want to start a digression on whether Visio is good or not for building a simple data model: Visio is enough for my modeling needs, and customers love its colours and the ability to open the model with Office when I need to discuss it with them. When I have finished modeling, I generate the database and everything works fine.

    Nevertheless, Microsoft seems not to like the forward engineering capabilities of Visio. The last release that supports forward engineering is the Enterprise Architect version of Visio 2003, which requires Visual Studio 2005 to be installed on the box. Since I am really tired of installing old releases of Visio just to perform forward engineering (moreover, the 2003 release does not support the new data types), I decided it was time to follow the standard approach to Microsoft products: “if Microsoft does not help, do not ask, do it yourself”.

    Thus, I wrote an Office add-in that performs forward engineering of a Visio database model to SQL Server. It does not support any other database driver and has some big limitations, since the library that should let programmers access the underlying Visio data model is undocumented and full of unimplemented interfaces. Thus, I needed to collect information over the web, searching for people who had tried the same before me. Nevertheless, for a standard data model it works fine and saves me the need to install old software on new computers. :)

    It has two basic functions, available through a new ribbon:

    image

    • Validate Data Model simply asks the model to validate itself. I have not found a way to give feedback on whether the model has been correctly validated or not; it is up to you to check whether any error appears in the output window… nevertheless, better than nothing.
    • Forward Engineer opens a very simple dialog box from which you can choose the database name and some options, and then generates the SQL script. I personally love the option to open the script directly after generation, so that SSMS pops up and I can create the database.

    image

    The add-in is still in beta; it needs Office 2010 and has been written with Visual Studio 2010 on .NET 4.0. If you are interested in testing it, you can download the first beta here.

    The nice part of the story is that, without any previous knowledge of Office programming and no knowledge at all of the Visio internals, it took me roughly one day to go from the idea (and some hints found on the web) to the working add-in, and it has been a nice arena for me to try VS 2010… I wonder why Microsoft still refuses to add this feature to Visio; they have much better programmers than me, after all I am a BI guy.

    Comments welcome, of course. :)

  • Remote Execution of SSIS Packages

    Having the need to execute a package on a remote server, I discovered that DtExec has no option to launch a package remotely (something that everybody apart from me already knew…).

    So… time to write some code. I wrote a WCF service that executes a package, hosted it in a Windows service (it can be hosted in IIS too, I just like services more than I like web interfaces) and developed a test program that uses WCF to communicate with the server and execute the package. All this stuff has been really easy to write (which makes me wonder why Microsoft did not provide such a tool with the standard SQL Server installation).

    Using the WCF callback methods, I succeeded in returning to the caller full details of the log and of the events produced by the package, so that I can run a package remotely and still monitor, from the client, what happens on the server side. Moreover, since I needed to call the package from inside an application, I implemented a nice progress bar that shows the user the progress of the package, by inspecting the package events, so that he knows whether or not he can have a coffee during processing.

    In the end, the solution is a very nice example of what can be done using WCF as a communication medium and, since I don’t think I’m the only one with such a need, I am sharing the sources with the web. If you are interested in getting the code, with some basic documentation, follow the link to the source code and have fun.

    BTW: the test application can be used without the need to understand all the code; it is basically a DTEXEC replacement that runs packages remotely… something that can be useful to anybody developing SSIS code.

  • SqlBulkCopy Performance Analysis

    In the endless search for optimal performance in bulk loading data, I have written a paper which describes how to reach top performance using the SqlBulkCopy class in .NET, trying to find the best combination of its parameter settings to load heaps and clustered tables, with and without indexes, performing all the loads in a highly parallelized environment.

    In order to run the tests, I have written a simple solution that implements a producer/consumer pattern, where many producers and consumers can run in parallel, and that can be used to perform detailed tests by simply switching some parameters on and off. It is not a user-friendly program, just a tool that you can use to test different bulk insert operations. Nevertheless, I found it very useful for understanding how bulk insert works and how to optimize it.

    The paper does not go as deep as the Data Loading Performance Guide from Microsoft does in describing the internals of SQL Server and bulk loading. Nevertheless, the Microsoft paper describes some esoteric hardware configurations that are pretty difficult to find in the real world, while my paper is much closer to a real user experience. I think that reading both might help any ETL developer better understand how to boost the performance of his code. Mine is easier, Microsoft’s is far more complete.

    If you are interested in the matter, please follow the link to the paper.

    I am very interested in feedback and, if you find time to run tests on your specific hardware, in any results you find that differ in some way from mine. Feel free to contact me with both, so that I can add different results to the paper and increase the whitepaper’s completeness.
