A place for my thoughts and experiences on SQL Server, Business Intelligence and .NET
As you may have already learned from my old post or Adam’s or Kalen’s posts, there will be two SQL Rally in North Europe.
In the Stockholm SQL Rally, with my friend Thomas Kejser, I’ll be delivering a pre-con on Data Warehouse Modeling:
Data warehouses play a central role in any BI solution. It's the back end upon which everything in years to come will be created. For this reason, it must be rock solid and yet flexible at the same time. To develop such a data warehouse, you must have a clear idea of its architecture, a thorough understanding of the concepts of Measures and Dimensions, and a proven engineered way to build it so that quality and stability can go hand-in-hand with cost reduction and scalability. In this workshop, Thomas Kejser and Davide Mauri will share all the information they learned since they started working with data warehouses, giving you the guidance and tips you need to start your BI project in the best way possible―avoiding errors, making implementation effective and efficient, paving the way for a winning Agile approach, and helping you define how your team should work so that your BI solution will stand the test of time.
- Data warehouse architecture and justification
- Agile methodology
- Dimensional modeling, including Kimball vs. Inmon, SCD1/SCD2/SCD3, Junk and Degenerate Dimensions, and Huge Dimensions
- Best practices, naming conventions, and lessons learned
- Loading the data warehouse, including loading Dimensions, loading Facts (Full Load, Incremental Load, Partitioned Load)
- Data warehouses and Big Data (Hadoop)
- Unit testing
- Tracking historical changes and managing large sizes
With all the Self-Service BI hype, Data Warehouse is become more and more central every day, since if everyone will be able to analyze data using self-service tools, it’s better for him/her to rely on correct, uniform and coherent data. Already 50 people registered from the workshop and seats are limited so don’t miss this unique opportunity to attend to this workshop that is really a unique combination of years and years of experience!
See you there!
Some days ago my SolidQ colleague Luca pointed me to the followin SQLCAT article
When To Break Down Complex Queries
that, as part of the reorganization of SQL CAt website, has emerged again from the depth of Internet.
The article is very interesting and also allows me to underline again the importance of having a way to tell the optimizer when it should *not* expand subqueries into the main query. In this way the usage of #temp tables could be avoided in many cases, allowing for the usage of views or inline UDF instead of stored procedures.
I opened the connect item back in 2010, so it’s time to bring more attention to it in order to have it in the product ASAP, and this article give me the option to do that.
Please vote for it:
Now that data volumes are increasing every day, more and more I find myself fighting with bad query plans due to bad cardinality estimation. When you are joining several big tables, you can be sure you’ll have to break your query in several pieces, to be sure to have good performance. Last time I had to do this was….the day before yesterday!
I think it’s really time for an extended NOEXPAND hint. If you’re with me, vote vote and vote!
Next months look full of interesting conferences and I’ll be speaking in several of those. I’ll be speaking at the following ones:
17 – 18 September 2013, Germany
SQL Saturday 260 Milano
8 October 2013, Italy
SQL PASS Nordic
4 – 6 November 2013, Sweden
SQL PASS Amsterdam
6 – 8 November 2013, Nederland
SQL Saturday 257 Verona
9 November 2013, Italy
Advanced Data Warehouse Workshop
20 November 2013, Stockholm
SQL Saturday 264 Ancona
13 December 2013, Italy
I’ll deliver session on Data Quality & Master Data Management, Techniques for the Agile Business Intelligence, Automating ETL using Metadata (and BIML) and, at SQL PASS Nordic I’ll also deliver, again with my friend Thomas Kejser, a very nice workshop:
Data Warehouse Modeling – Making the Right Choices http://www.sqlpass.org/sqlrally/2013/nordic/Agenda/PreconferenceSeminars.aspx#ThomasKejserDavideMauri
It’s very likely that I’ll be missing the PASS Summit this year, but at some point I also have to do some work here and now…:)
If you’re going to be to one of those conferences, please come and say “Hi”: it’s always great to meet people all around the world!
I’m really happy to be able to announce that we’re going to have a SQL Saturday in Milano! We’re going to have a great list of speaker and exceptional content, as you can see from the list of proposed sessions:
we had proposals from Klaus Aschenbrenner, Silvano Coriani and many other high-level speakers. But the most special session we’re going to have is the one delivered by Fabian Pascal one of the most knowledgeable and respected people on the Database Modeling topic!
He’s going to deliver a very interesting session:
Denormalization for performance: a costly illusion
which is something is really not easy to find elsewhere. This kind of modeling session can really change the way you see the world and, most of all, are rarely free, especially if done by such kind of person.
This is surely a SQL Sat you don’t want to miss! Plus, you also have the opportunity to make your wife/girlfriend happy, bringing her in one of the most (if not THE most) well known Fashion Cities! If you’re staying for more than one night you can stay just near one of the well known street, Corso Como, where you can just walk down right to the Dome, surrounded by nightlife and cool shops.
We’re going to organize something EPIC! Don’t miss it!
In the last months I’ve been playing with Hekaton a while, since a customer of ours is evaluating its usage for one of its flagship products.
Here’s I’d like to share with you what I’ve found doing this test. I think it is really interesting for two main aspect
- they show that Hekaton is not magic and still good best practices and good hardware must be in place
- Hekaton can be really fast, competing – in performances – with some NoSQL databases (but I’ll discuss about this in another post)
For the test I’ve used a VM running on VMWare Workstation, with the guest having 2 vCPU and 4GB of Ram. Disk is pre-allocated vDisks stored on a Samsung 840 Pro SSD.
Surely it’s not a server, but we wanted to try Hekaton on (somehow) cheap machines in order to understand which kind of performance we can have also on (somehow) cheap hardware.
The test was a simple insert of 10,000 entities with the following structure
Here’s the results:
SO = Schema_Only
SAD = Schema_And_Data
NC = Native Compilation
We didn’t test the usage of Bulk Insert ‘cause we were interested in understanding how fast we can be on singleton inserts. Insert was done calling a stored procedure made of only one INSERT command.
As visible the usage of Hekaton at its full potential (SO+NC) improved the performance of near three times. Not really bad! (This kind of performance are fast as some NoSQL DB in my tests!)
Dropping the Native Compilation increased the elapsed time a little bit, but that’s the price you have to pay if you want, for example, to use SEQUENCES or other features not actually supported by Native Compiled procedures. (Using SQL Server 2014 CTP1 right now)
The real surprise was the fact that using the Schema_And_Data option for an Hekaton table basically removed all performance gains, making the difference between Hekaton tables and “classic” tables almost invisible.
I really wanted to understand why since I was not really expecting this.
Analyzing wait stats turned out that we where hitting a bottleneck at the transaction log level:
Here’s the wait stats using SO:
and here’s the SAD (by all means!) wait stats
Now, everything it’s much more clear. Putting Log and Data on the same disk is STILL not a good idea. So what are the thing we’ve learned in these tests?
Well, first of all, Hekaton can really change the game as it’s performance are really impressive and can really compete with some of the fastest NoSQL databases (CouchBase for example).As said before I’ll do a post on that in future, but before I want to do additional tests and wait for Hekaton to RTM status.
As a second point, it shows that there is no magic going around here and one can’t simple hope to solve it’s performance problems simply enabling Hekaton. A good system architecture is still needed and even good understanding of the engine is a must (as always) in order to find the bottleneck and fix the problem.
Unfortunately I wasn’t able to do additional test to show that the bottleneck can be removed due to time and hardware restrictions but I hope to be able to them in the near future so that I can update this post. In the meantime if someone wants to try to execute the same test, just drop me an email and I’ll share the very simple tool I created with you.
As soon as I had the possibility to test HDInsight on Azure I promptly started using it. Nothing really exciting, just creating the cluster and do some tests following the official tutorial you can find here:
I’m using my MSDN MVP subscription for which I have 1500 hours of “Compute Services”. Well, be careful to keep your HDInsight cluster turned on. Even if you don’t use it it will consume resources. The the resource created behind the scene is a “LargeSKU” VM, as the detailed usage report that you can download says.
And, wow, look! In just less then a week it has consumed ALL my available hours, and thus my CC started to be drained. Non that much, luckily, just a hundred of bucks (which is also not so few in this times of crisis) and even more luckily I’m very careful to monitor Azure expenses often since I still not trust the pay-per-use system so much to just leave it alone and without constant supervision. And as soon as I discovered that there was something strange going on I shout the HDInsight cluster down immediately.
Unfortunately the expenses are reported under the generic “Compute Hours – Cloud Services” summary so without downloading the detailed billing report and analyzing it with Excel was impossible to me to understand that the resources consumption came from the HDInsight cluster.
That’s why took me two days to understand the problem (at the beginning I thought the problem was my website) but on May 24th I finally understood what was happening.
To be honest this is the only case where I had such bad surprise (I’m also using VMs, Web Roles and SQL azure and I never consumed more then what I expected so far so I’m not blaming MS at all here), but this is a lesson learned that I want to share with the community, hoping to help someone to avoid even worse surprises.
The feedback to MS I’d like to give is that it would be very good if, at least for the “preview” features, there could be a specific settings to decide how much resources they could use before they are automatically shut down.
And the conclusion, on my side, is that, at least for now, playing with Hadoop on this area of the Big Data universe is better done on-premise on the spare machine I have in the office.
Of course what happened could also totally be my fault, even if I just did everything following the tutorial, but if someone with more experience on Azure HDInsight would like to leave a feedback I’ll be more than happy the ear it.
So…keep your eyes open and your money safe
Another feedback to MS: it would be very very very nice if the billing details can be accessed via OData. It would be a perfect match with PowerPivot capabilities!
In the last days I had to work quite a lot with extended events in order to deeply monitor SQL Server performance. One interesting request that came out while implementing the monitoring infrastructure, was the possibility to monitor the performance of a set of stored procedures, vital for the correct handling of an online booking procedure.
The challenge was to give a sort of real-time monitor of procedure performances so that one can then create alert and/or do some stream-analysis to keep response time always under the desired amount of time.
Here’s how you can do it using Extended Events, monitoring, for example, the execution of procedure uspGetManagerEmployees in AdventureWorks2012:
CREATE EVENT SESSION [monitor_procedure_performance] ON SERVER
ADD EVENT sqlserver.rpc_completed
ADD EVENT sqlserver.module_end
ADD TARGET package0.ring_buffer
Once done, it’s all about decoding the XML that Extended Events returns:
Analyze XEvent data
IF (OBJECT_ID('tempdb..#t') IS NOT NULL) DROP TABLE #t
IF (OBJECT_ID('tempdb..#r') IS NOT NULL) DROP TABLE #r
cast(target_data as xml) xdoc
sys.dm_xe_session_targets t on t.event_session_address = s.address
s.name = 'monitor_procedure_performance'
with cte as
event_number = ROW_NUMBER() over (order by T.x),
event_timestamp = T.x.value('@timestamp', 'datetimeoffset'),
T.x.query('.') as event_data
cte2 as (
--data_field = T2.x.value('local-name(.)', 'varchar(100)'),
data_name = T2.x.value('@name', 'varchar(100)'),
data_value = T2.x.value('value', 'varchar(100)'),
data_text = T2.x.value('text', 'varchar(max)')
cte3 as (
data_name in ('collect_system_time', 'object_name', 'cpu_time', 'duration', 'logical_reads', 'row_count', 'database_name', 'database_id')
(max(data_value) for data_name in (database_id, database_name, object_name, cpu_time, duration, logical_reads, row_count)) T
--SELECT * FROM #t
SELECT * FROM #r
execution_date = cast(event_timestamp as date),
execution_hour = datepart(hour, event_timestamp),
execution_minute = datepart(minute, event_timestamp),
duration_msec = avg(cast(duration as int)) / 1000.
cast(event_timestamp as date), datepart(hour, event_timestamp), datepart(minute, event_timestamp), [object_name]
That’s it. You can now keep monitored execution times of your procedure.
One very useful usage of Extended Events is the ability to trap SQL Server error without the need to have a server trace running (which, btw, is deprecated), with the additional feature of being able to query the data as soon as it comes in. This means that we a solution to monitor and trap errors as soon as they happen can be easily created, in order to help developers to fix problems as soon as they are detected. This is really, really, really helpful especially in very big applications, where the code base is quite old and there is no-one really knowing everything of the solution.
To start a Extended Events sessions in order to trap SQL Server errors with severity greater than 10, just run the following script:
CREATE EVENT SESSION [error_trap] ON SERVER
ADD EVENT sqlserver.error_reported
ADD TARGET package0.event_file
SET filename=N'D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\XEvents\error_trap.xel'
ALTER EVENT SESSION [error_trap] ON SERVER
STATE = START;
The problem with Exended Events is that they only talks XML which is surely flexible and extensible, but not at all confortable to be queried. That’s why I prefer to turn it into something relational. Using the xml nodes function and pivoting data can make the trick:
IF (OBJECT_ID('tempdb..#e') IS NOT NULL) DROP TABLE #e
WITH cte AS
CAST(event_data AS XML) AS event_data
sys.fn_xe_file_target_read_file('D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\XEvents\error_trap*.xel', NULL, NULL, NULL)
event_number = ROW_NUMBER() OVER (ORDER BY T.x)
, event_name = T.x.value('@name', 'varchar(100)')
, event_timestamp = T.x.value('@timestamp', 'datetimeoffset')
SELECT * INTO #e FROM cte2
WITH cte3 AS
--data_field = T2.x.value('local-name(.)', 'varchar(100)'),
data_name = T2.x.value('@name', 'varchar(100)'),
data_value = T2.x.value('value', 'varchar(max)'),
data_text = T2.x.value('text', 'varchar(max)')
data_name IN ('error_number', 'severity', 'message', 'database_name', 'database_id', 'client_hostname', 'client_app_name', 'collect_system_time', 'username')
(MAX(data_value) FOR data_name IN ([error_number], [severity], [message], database_name, database_id, username, client_hostname, client_app_name, collect_system_time)) T
[severity] > 10
and voilà, now all errors can be easily identified:
It’s now very easy to create reports and dashboards to monitor the system in (near) real time. And this is vital when you run a 24x7 online business.
Today I was using a 3rd Party Data Flow component that has several Custom Properties for which I need to change some of their values at runtime using a DTS Expression.
To mimic the situation let’s use the “Percentage Sampling” that has two Custom Properties:
Of the two Custom Properties only some (one in this case) are available also outside that dataflow, so that they can be targeted by a DTS Expression. Such properties are listed under the “Misc.” section of Data Flow properties
and also in the Property Expression Editor window:
Now, what if you need to make also the “hidden” custom properties available for DTS Expression usage? As you may have noticed, the SamplingSeed is not exposed outside the Data Flow. I tried to search the web, but after several minutes I gave up since I wasn’t able to find anything that could help. I then started to look into the SSIS object model and I found a nice property named expressionType in the IDTSCustomProperty interface that tells to the engine if the property value can be specified using DTS Expression or not:
If the value is set to Notify than the usage of DTS Expression is possible otherwise, if the property is set to None, as the name implies, DTS Expression cannot be used.
So all you need to do is to open the .dtsx file, look for the component you want to touch and its properties
and add (if not exists) or change the expressionType attribute to Notify:
Save the file and voilà, the property is now available for DTS Expression usage!
Now, just be aware that is a sort of an hack, so double check it works for you. On the 3rd party components we’re using it works like a charm, and it just saved that day since without the ability to change some properties at run time, the deployment of our package in production could have been a nightmare.
I tested this approach both on SQL Server 2012 and SQL Server 2008 and in both cases I hadn’t had any problems.
Hope this helps someone , enjoy!
In the last days I needed a quick and efficient way to monitor the performance of a job and of each step over time. There’s a lot of interesting data in MSDB database and the various sysjob tables, but the stored data is not really easy to use outside the Management Studio. Date and Time are not stored as date/time data type, there is no easy way to correlate the performance of a step with the related job execution (for example is not that easy to return all the steps of a job executed, say, yesterday at 10:00).
So I decided to write some views to make the querying easier and then the next move was to create a Tabular Model with Power Pivot on Excel to make also analysis easier.
Since this is a somehow recurring task that I find myself doing during my job – not often, but when I have to do it now I have something much more powerful in my hands – I decided to put it also con CodePlex so that that little work can be shared with the community.
You can find the scripts with the objects (3 UDFs and 4 Views) that must be created inside MSDB and the Power Pivot Excel 2010 model here:
I’m really really pleased to announce that it’s possible to register to the Data Warehouse Workshop that I and Thomas Kejser developed togheter. Several months ago we decided to join forces in order to create a workshop that would contain not only the theoretical stuff, but also the experience we both have and all the best practices and lesson learned that can make the difference between a success and a failure when building a Data Warehouse.
The first sheduled date is 7 February in Kista (Sweden):
and until 30th November there is the Super Early Bird to save more the 100€ (150$).
The workshop will be very similar to the one I delivered at PASS Summit summit, with some extra technical stuff since it’s one hour longer. In addition to that for this first version both me and Thomas will be present, so it’s a great change to make sure you super-charge your DW/BI project with insights that aren’t available anywhere else!
If you’re into the BI field and you live in Europe, don’t miss this opportunity!
The topic is really not new but since it’s the second time in few days that I had to explain it different customers, I think it’s worth to make a post out of it.
Many times developers would like to insert a new row in a table or, if the row already exists, update it with new data. MySQL has a specific statement for this action, called REPLACE:
or the INSERT …. ON DUPLICATE KEY UPDATE option:
With SQL Server you can do the very same using a more standard way, using the MERGE statement, with the support of Row Constructors.
Let’s say that you have this table:
CREATE TABLE dbo.MyTargetTable
id INT NOT NULL PRIMARY KEY IDENTITY,
alternate_key VARCHAR(50) UNIQUE,
INSERT [dbo].[MyTargetTable] VALUES
('GUQNH', 10, 100, 1000, 10000, 100000),
('UJAHL', 20, 200, 2000, 20000, 200000),
('YKXVW', 30, 300, 3000, 30000, 300000),
('SXMOJ', 40, 400, 4000, 40000, 400000),
('JTPGM', 50, 500, 5000, 50000, 500000),
('ZITKS', 60, 600, 6000, 60000, 600000),
('GGEYD', 70, 700, 7000, 70000, 700000),
('UFXMS', 80, 800, 8000, 80000, 800000),
('BNGGP', 90, 900, 9000, 90000, 900000),
('AMUKO', 100, 1000, 10000, 100000, 1000000)
If you want to insert or update a row, you can just do that:
(SELECT * FROM (VALUES ('ZITKS', 61, 601, 6001, 60001, 600001)) Dummy(alternate_key, col_1, col_2, col_3, col_4, col_5)) S
T.alternate_key = S.alternate_key
NOT MATCHED THEN
INSERT VALUES (alternate_key, col_1, col_2, col_3, col_4, col_5)
MATCHED AND T.col_1 != S.col_1 THEN
T.col_1 = S.col_1,
T.col_2 = S.col_2,
T.col_3 = S.col_3,
T.col_4 = S.col_4,
T.col_5 = S.col_5
If you want to insert/update more than one row at once, you can super-charge the idea using Table-Value Parameters, that you can just send from your .NET application.
Easy, powerful and effective
Yesterday has been published my Q&A interview on my Pre-Conference Workshop at SQL Pass 2012. It’s a good way to understand what will be discussed in the workshop, so if you’re interested or you’re into the Data Warehouse / Business Intelligence field and want to understand how the Agile approach can help you, you can read it here:
I’ve received a buch of emails from PASS Summit “First Timers” that are also somehow new to SQL Server (for “somehow” I mean people with less than 6 month experience but with some basic knowledge of SQL Server engine) or are catching up from SQL Server 2000. The common question regards the session one should not miss to
- have a broad view of the entire SQL Server platform
- have some insight into some specific areas of SQL Server
Given that I’m on (semi-)vacantion and that I have more free time (not true, I have to prepare slides & demos for several conferences, PASS Summit - Building the Agile Data Warehouse with SQL Server 2012 - and PASS 24H - Agile Data Warehousing with SQL Server 2012 - among them…but let’s pretend it to be true), I’ve decided to make a post to answer to this common questions. Of course this is my personal point of view and given the fact that the number and quality of session that will be delivered at PASS Summit is so high that is very difficoult to make a choice, fell free to jump into the discussion and leave your feedback or – even better – answer with another post. I’m sure it will be very helpful to all the SQL Server beginners out there.
I’ve imposed to myself to choose 6 session at maximum for each Track. Why 6? Because it’s the maximum number of session you can follow in one day, and given that all the session will be on the Summit DVD, they are the answer to the following question: “If I have one day to spend in training, which session I should watch?”.
Of course a Summit is not like a Course so a lot of very basics concept of well-established technologies won’t be found here. Analysis Services, Integration Services, MDX are not part of the Summit this time (at least for the basic part of them).
Enough with that, let’s start with the session list ideal to have a good Overview of all the SQL Server Platform:
for what concern Database Development, I recommend the following sessions
For Business Intelligence Information Delivery
and for Business Intelligence Architecture & Development
Beside the listed sessions, First Timers should also take a look the the page PASS set up for them:
See you at PASS Summit!